We are currently organizing the data and plan to release it, together with a leaderboard, within the next three months.
MovieBench Dataset. MovieBench categorizes the movie annotations into three hierarchical data levels, representing different granularities of information: 1) Movie level provides a broad overview of the film; 2) Scene level provides mid-level scene consistency information; 3) Shot level emphasizes specific moments with detailed descriptions.
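One way to picture the three levels is as nested records, with a movie containing scenes and each scene containing shots. The sketch below is only an illustration of that nesting: the class names, field names, placeholder title, and summaries are assumptions made for this example and are not the released MovieBench annotation format. The two shot descriptions are taken from the character-based plots shown on this page.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Shot:
    # Shot level: a specific moment with a detailed description.
    # Field names are hypothetical, not the official schema.
    description: str
    characters: List[str] = field(default_factory=list)


@dataclass
class Scene:
    # Scene level: mid-level information shared by the shots of one scene.
    summary: str
    shots: List[Shot] = field(default_factory=list)


@dataclass
class Movie:
    # Movie level: a broad overview of the film.
    title: str
    overview: str
    scenes: List[Scene] = field(default_factory=list)

    def characters(self) -> List[str]:
        """Collect character names across all shots, in first-seen order."""
        seen: List[str] = []
        for scene in self.scenes:
            for shot in scene.shots:
                for name in shot.characters:
                    if name not in seen:
                        seen.append(name)
        return seen


# Placeholder title, overview, and scene summary; only the shot
# descriptions come from the plots listed on this page.
movie = Movie(
    title="(placeholder title)",
    overview="(placeholder movie-level overview)",
    scenes=[
        Scene(
            summary="(placeholder scene-level summary)",
            shots=[
                Shot(
                    description="Harry Potter is walking through a dark "
                    "suburban neighborhood at night.",
                    characters=["Harry Potter"],
                ),
                Shot(
                    description="Hermione Granger hides behind a tree in a "
                    "dense forest, appearing tense and alert.",
                    characters=["Hermione Granger"],
                ),
            ],
        )
    ],
)

print(movie.characters())
```

Keeping character names at the shot level and aggregating them upward, as `characters()` does here, is one simple way a generation method could track which identities must stay consistent across a scene or a whole movie.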
With the hierarchical data MovieBench provides, existing methods can partially reconstruct a coherent storyline, maintaining consistent character identities and relationships.
Character-based Plot | Original Movie | DreamVideo | Magic-Me |
Harry Potter suddenly emerges from under a sheet in a dark room, pretending to be asleep. | |||
Harry Potter is clearing the table while Dudley and Aunt Petunia are seated, enjoying a meal. | |||
Harry Potter is walking through a dark suburban neighborhood at night. | |||
Harry Potter is inside a bus, observing his surroundings. He looks upward, examining something above him. | |||
Harry, Ron, Hermione, and Sirius are in a train compartment. They are having a serious conversation, with Sirius providing guidance or insight to the younger trio. | |||
Professor Sybil Trelawney is engaged in an animated discussion, gesturing dramatically. Her expressive behavior suggests a lively moment, possibly involving a prediction or revelation. | |||
Harry and Hermione stand in a snowy forest. Harry looks down miserably, conveying a sense of sadness or contemplation, while Hermione stands beside him, providing silent support. | |||
Hermione Granger hides behind a tree in a dense forest, appearing tense and alert. |
Character-based Plot | Original Movie | StoryDiffusion + CogVideoX (Open-Source) | StoryDiffusion + Kling 1.5 (Closed Product) |
Harry Potter suddenly emerges from under a sheet in a dark room, pretending to be asleep. | |||
Aunt Petunia greets her nephew Dudley in a warmly lit room, indicating a family gathering. | |||
Ron Weasley and Hermione Granger are looking for a place to sit on the train. They discover a compartment with a man asleep inside and decide to enter. | |||
The scene depicts horseless carriages moving through a dark, rainy night towards Hogwarts Castle. The mood is mysterious and ominous, with the carriages illuminated by faint lights. | |||
Hagrid, the large giant, sits beside a long table, unintentionally bumping into it and attempting to stand up. | |||
Professor Sybil Trelawney is teaching a class on divination, engaging students with her mystical and dramatic presentation style in a dimly lit, ornate classroom. | |||
In a forest setting, students gather around as Harry Potter steps forward while others, including Ron Weasley and Hermione Granger, step back. The scene conveys tension and anticipation. | |||
Ron Weasley steps forward reluctantly in a line of students, indicating an upcoming event or decision. | |||
Harry Potter is closely examining a crystal ball, possibly in a Divination class. | |||
The scene depicts a full moon emerging from behind clouds, creating a tense and eerie mood. The landscape is barely visible under the moonlight. | |||
In a dark forest, Harry Potter uses his wand to cast a spell, emitting a bright light that repels the Dementors, showcasing his bravery and magical prowess. |
1) GT: The original video is sourced from the movie Juno, covering the time range 00:10:20.997 to 00:10:28.000.
2) Source Image Conditional: Using the initial frame of the original video along with audio as conditions for audio-driven talking human generation (Hallo2 was used for the demo).
3) Text Conditional: Using generated images (StoryDiffusion) and audio (Hallo2) as conditions for audio-driven talking human generation.
1) GT: The original video is sourced from the movie Les Miserables, covering the time ranges 01:48:22.566 to 01:48:31.745 and 00:13:08.141 to 00:13:24.689.
2) Source Image Conditional: Using the initial frame of the original video along with audio as conditions for audio-driven talking human generation (Hallo2 was used for the demo).
3) Text Conditional: Using generated images (StoryDiffusion) and audio (Hallo2) as conditions for audio-driven talking human generation.
MovieBench reconsiders and explores new challenges in long video generation, such as multi-object consistency and multi-view coherence.
@misc{wu2024moviebenchhierarchicalmovielevel,
title={MovieBench: A Hierarchical Movie Level Dataset for Long Video Generation},
author={Weijia Wu and Mingyu Liu and Zeyu Zhu and Xi Xia and Haoen Feng and Wen Wang and Kevin Qinghong Lin and Chunhua Shen and Mike Zheng Shou},
year={2024},
eprint={2411.15262},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2411.15262},
}