MovieBench: A Hierarchical Movie Level Dataset for Long Video Generation

Show Lab, National University of Singapore | Zhejiang University

Script to Synchronized Movie with Audio Generation

Data: MovieBench (ours), with movie scripts and a character bank (character names, images, and audio). Text-to-storyboard model: StoryDiffusion. Storyboard + subtitles to video model: Kling AI. Audio-driven talking human generation also uses Kling AI; in this demo, the characters' audio has not been customized.

Dataset Download

We are currently organizing the corresponding data and plan to release it, together with a leaderboard, within the next three months.

Dataset

Network Structure

MovieBench Dataset. MovieBench categorizes the movie annotations into three hierarchical data levels, representing different granularities of information: 1) Movie level provides a broad overview of the film; 2) Scene level provides mid-level scene consistency information; 3) Shot level emphasizes specific moments with detailed descriptions.
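The three levels described above nest naturally: a movie contains scenes, and a scene contains shots. The following is a minimal sketch of that nesting, not the released annotation format. All class and field names (`Movie`, `Scene`, `Shot`, `overview`, `summary`, `description`, and the IDs) are assumptions made for illustration; only the three-level structure and the example shot description come from this page.

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Tuple


@dataclass
class Shot:
    """Shot level: a specific moment with a detailed description."""
    shot_id: str          # hypothetical identifier
    description: str


@dataclass
class Scene:
    """Scene level: mid-level information shared by the shots in a scene."""
    scene_id: str         # hypothetical identifier
    summary: str
    shots: List[Shot] = field(default_factory=list)


@dataclass
class Movie:
    """Movie level: a broad overview of the film."""
    title: str
    overview: str
    scenes: List[Scene] = field(default_factory=list)

    def iter_shots(self) -> Iterator[Tuple[Scene, Shot]]:
        """Yield every shot together with the scene that contains it."""
        for scene in self.scenes:
            for shot in scene.shots:
                yield scene, shot


# Illustrative instance; the IDs, overview, and summary are placeholders.
movie = Movie(
    title="Example Movie",
    overview="Placeholder movie-level overview.",
    scenes=[
        Scene(
            scene_id="scene_0001",
            summary="Placeholder scene-level summary.",
            shots=[
                Shot(
                    shot_id="shot_0001",
                    description=(
                        "Harry Potter is walking through a dark suburban "
                        "neighborhood at night."
                    ),
                ),
            ],
        ),
    ],
)

shot_pairs = list(movie.iter_shots())
```

Keeping the scene alongside each shot in `iter_shots` means a generator conditioned on shot text can still access the scene-level context that is meant to keep consecutive shots consistent.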

Task 1: Text to Keyframe/Storyboard Generation


With the hierarchical data that MovieBench provides, existing methods can partially reconstruct a coherent storyline while maintaining consistent character identities and relationships.

Task 2: Identity-Customized Long Video Generation








Character-based Plot | Original Movie | DreamVideo | Magic-Me
Harry Potter suddenly emerges from under a sheet in a dark room, pretending to be asleep.
Harry Potter is clearing the table while Dudley and Aunt Petunia are seated, enjoying a meal.
Harry Potter is walking through a dark suburban neighborhood at night.
Harry Potter is inside a bus, observing his surroundings. He looks upward, examining something above him.
Harry, Ron, Hermione, and Sirius are in a train compartment. They are having a serious conversation, with Sirius providing guidance or insight to the younger trio.
Professor Sybil Trelawney is engaged in an animated discussion, gesturing dramatically. Her expressive behavior suggests a lively moment, possibly involving a prediction or revelation.
Harry and Hermione stand in a snowy forest. Harry looks down miserably, conveying a sense of sadness or contemplation, while Hermione stands beside him, providing silent support.
Hermione Granger hides behind a tree in a dense forest, appearing tense and alert.

Task 3: Keyframe-conditioned Video Generation


Character-based Plot | Original Movie | StoryDiffusion + CogVideoX (Open-Source) | StoryDiffusion + Kling 1.5 (Closed Product)
Harry Potter suddenly emerges from under a sheet in a dark room, pretending to be asleep.
Aunt Petunia greets her nephew Dudley in a warmly lit room, indicating a family gathering.
Ron Weasley and Hermione Granger are looking for a place to sit on the train. They discover a compartment with a man asleep inside and decide to enter.
The scene depicts horseless carriages moving through a dark, rainy night towards Hogwarts Castle. The mood is mysterious and ominous, with the carriages illuminated by faint lights.
Hagrid, the large giant, sits beside a long table, unintentionally bumping into it and attempting to stand up.
Professor Sybil Trelawney is teaching a class on divination, engaging students with her mystical and dramatic presentation style in a dimly lit, ornate classroom.
In a forest setting, students gather around as Harry Potter steps forward while others, including Ron Weasley and Hermione Granger, step back. The scene conveys tension and anticipation.
Ron Weasley steps forward reluctantly in a line of students, indicating an upcoming event or decision.
Harry Potter is closely examining a crystal ball, possibly in a Divination class.
The scene depicts a full moon emerging from behind clouds, creating a tense and eerie mood. The landscape is barely visible under the moonlight.
In a dark forest, Harry Potter uses his wand to cast a spell, emitting a bright light that repels the Dementors, showcasing his bravery and magical prowess.

Task 4: Audio-driven Talking Human Generation

1004_Juno_00.10.20.997-00.10.28.000

1) GT: The original video is sourced from the movie Juno, covering the time range 00:10:20.997 to 00:10:28.000.

2) Source Image Conditional: Using the initial frame of the original video along with audio as conditions for audio-driven talking human generation (Hallo2 was used for the demo).

3) Text Conditional: Using images generated by StoryDiffusion along with audio as conditions for audio-driven talking human generation (Hallo2 was used for the demo).
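The clip identifier `1004_Juno_00.10.20.997-00.10.28.000` appears to encode an index, the movie title, and the start and end timestamps given in item 1. The sketch below parses an identifier of that shape and computes the clip duration. The naming convention is inferred from this single example and is an assumption, and `parse_clip_id` is a hypothetical helper, not part of any released tooling. Identifiers without timestamps, such as `1027_Les_Miserables`, are rejected with a `ValueError`.

```python
import re
from datetime import timedelta

# Assumed pattern: <index>_<title>_<HH.MM.SS.mmm>-<HH.MM.SS.mmm>
CLIP_ID = re.compile(
    r"^(?P<index>\d+)_(?P<title>.+)_"
    r"(?P<start>\d{2}\.\d{2}\.\d{2}\.\d{3})-"
    r"(?P<end>\d{2}\.\d{2}\.\d{2}\.\d{3})$"
)


def _to_timedelta(stamp: str) -> timedelta:
    """Convert 'HH.MM.SS.mmm' into a timedelta."""
    hours, minutes, seconds, millis = (int(part) for part in stamp.split("."))
    return timedelta(
        hours=hours, minutes=minutes, seconds=seconds, milliseconds=millis
    )


def parse_clip_id(clip_id: str):
    """Split a clip identifier into (index, title, start, end)."""
    match = CLIP_ID.match(clip_id)
    if match is None:
        raise ValueError(f"unrecognized clip identifier: {clip_id!r}")
    return (
        match["index"],
        match["title"],
        _to_timedelta(match["start"]),
        _to_timedelta(match["end"]),
    )


index, title, start, end = parse_clip_id("1004_Juno_00.10.20.997-00.10.28.000")
duration_seconds = (end - start).total_seconds()
print(title, duration_seconds)  # prints: Juno 7.003
```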

1027_Les_Miserables

1) GT: The original video is sourced from the movie Les Miserables, covering the time ranges 01:48:22.566 to 01:48:31.745 and 00:13:08.141 to 00:13:24.689.

2) Source Image Conditional: Using the initial frame of the original video along with audio as conditions for audio-driven talking human generation (Hallo2 was used for the demo).

3) Text Conditional: Using images generated by StoryDiffusion along with audio as conditions for audio-driven talking human generation (Hallo2 was used for the demo).

Challenges in Current Movie Generation Models: Bridging the Gap to True Long-Form Video Generation

Bad Case

MovieBench reconsiders and explores new challenges in long video generation, such as multi-object consistency and multi-view coherence.

BibTeX

@misc{wu2024moviebenchhierarchicalmovielevel,
      title={MovieBench: A Hierarchical Movie Level Dataset for Long Video Generation}, 
      author={Weijia Wu and Mingyu Liu and Zeyu Zhu and Xi Xia and Haoen Feng and Wen Wang and Kevin Qinghong Lin and Chunhua Shen and Mike Zheng Shou},
      year={2024},
      eprint={2411.15262},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2411.15262}, 
}