MagicVideo-V2: Multi-Stage High-Aesthetic Video Generation

Weimin Wang*,

Jiawei Liu*,

Zhijie Lin,

Jiangqiao Yan,

Jie Wu,

ByteDance Inc. *Equal Contribution

Text-to-Video Examples

"A beautiful woman, with a pink and platinum-colored ombre mohawk, facing the camera, wearing a composition of bubble wrap, cyberpunk jacket."

"A fat rabbit wearing a purple robe walking through a fantasy landscape."

"A girl is writing something on a book. Oil painting style."

"A girl with a hairband performing a song with her guitar on a warm evening at a local market, children's story book."

"A group of mongooses scuttle about, set against a desert backdrop, bathed in bright and warm earth tones."

"A lone traveller walks in a misty forest."

"A medieval witch making a poison."

"A monkey making latte art."

"A panda standing on a surfboard, in the ocean in sunset, 4k, high resolution."

"A polar bear is playing guitar."

"A strong American cowboy with dark skin stands in front of a chair."

"A young, beautiful girl in a pink dress is playing piano gracefully."

"An old-fashioned windmill surrounded by flowers, 3D design."

"At a tranquil lake, a white swan gracefully glides on the surface, its reflection dancing on the water, seen in a medium shot."

"Hulk wearing virtual reality goggles, 4k, high resolution."

"Ironman flying over a burning city, very detailed surroundings, cities are blazing, shiny iron man suit, realistic, 4k ultra high defi."

"A giant dragon sitting in a snow covered landscape, breathing fire."

"A large blob of exploding splashing rainbow paint, with an apple emerging, 8k."

"A panda taking a selfie."

"A walking figure made out of water."

"An elephant wearing a birthday hat walking on the beach."

"Flag of the US on top of a tall white mountain."

"Robot emerging from a large column of billowing black smoke, high quality."

"Teddy bears holding hands, walking down rainy 5th ave."

Comparisons with Other Methods

MagicVideo-V2		SVD-XT		Pika 1.0
MagicVideo-V2		SVD-XT		Gen-2

"Traveler walking alone in the misty forest at sunset."

"LEGO, standing Darth Vader super mario."

"1910s sitcom of everyday life and routines in society."

"Ironman flying over a burning city, very detailed surroundings, cities are blazing, shiny iron man suit, realistic, 4k ultra high defi."

"Muppet walking down the street in a red shirt, cinematic, 8k."

"A little boy is riding a bike on a park path, the wheels crunching on the gravel."

"In the swamp, a crocodile stealthily surfaces, revealing only its eyes and the tip of its nose as it moves forward."

"A fox dressed in suit dancing in park."

"A fat rabbit wearing a purple robe walking through a fantasy landscape."

"A panda standing on a surfboard, in the ocean in sunset, 4k, high resolution."

"Burning chicken running around on fire."

"Flying through an intense battle between pirate ships in a stormy ocean."

Abstract

The growing demand for high-fidelity video generation from textual descriptions has catalyzed significant research in this field. In this work, we introduce MagicVideo-V2, which integrates a text-to-image model, a video motion generator, a reference image embedding module, and a frame interpolation module into an end-to-end video generation pipeline. Benefiting from these architecture designs, MagicVideo-V2 can generate aesthetically pleasing, high-resolution video with remarkable fidelity and smoothness. In a large-scale user evaluation, it demonstrates superior performance over leading text-to-video systems such as Runway, Pika 1.0, Morph, Moon Valley, and the Stable Video Diffusion model.

Pipeline

Overview of MagicVideo-V2. The T2I module creates a 1024×1024 image that encapsulates the described scene. Subsequently, the I2V module animates this still image, generating a sequence of 600×600×32 frames, with the latent noise prior ensuring continuity from the initial frame. The V2V module enhances these frames to a 1048×1048 resolution while refining the video content. Finally, the interpolation module extends the sequence to 94 frames, yielding a 1048×1048 resolution video that exhibits both high aesthetic quality and temporal smoothness.
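The stage-by-stage shapes in the caption can be traced with a short sketch. This is only bookkeeping of the resolutions and frame counts stated above, not the authors' implementation: the class and function names are hypothetical, and no model is run. The interpolation factor is also an assumption. The caption does not say how 32 frames become 94, but inserting two new frames into each of the 31 gaps between consecutive frames gives 32 + 31 × 2 = 94, which matches the stated count.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ClipShape:
    """Spatial size and frame count of a stage's output (hypothetical helper)."""
    height: int
    width: int
    frames: int


def t2i_stage() -> ClipShape:
    # T2I: a single 1024x1024 reference image.
    return ClipShape(1024, 1024, 1)


def i2v_stage(reference: ClipShape) -> ClipShape:
    # I2V: animates the still image into 32 frames at 600x600.
    assert reference.frames == 1, "I2V expects a single reference image"
    return ClipShape(600, 600, 32)


def v2v_stage(clip: ClipShape) -> ClipShape:
    # V2V: raises the resolution to 1048x1048, frame count unchanged.
    return ClipShape(1048, 1048, clip.frames)


def interpolation_stage(clip: ClipShape, inserted_per_gap: int = 2) -> ClipShape:
    # ASSUMPTION: two frames inserted per gap; the source only states the
    # final count of 94 frames, not the interpolation factor.
    gaps = clip.frames - 1
    return ClipShape(clip.height, clip.width, clip.frames + gaps * inserted_per_gap)


def trace_pipeline() -> dict:
    """Return the output shape of every stage, in pipeline order."""
    image = t2i_stage()
    low_res = i2v_stage(image)
    high_res = v2v_stage(low_res)
    final = interpolation_stage(high_res)
    return {"T2I": image, "I2V": low_res, "V2V": high_res, "interpolation": final}


if __name__ == "__main__":
    for name, shape in trace_pipeline().items():
        print(f"{name}: {shape.height}x{shape.width}, {shape.frames} frame(s)")
```

Keeping each stage as a separate function mirrors the modular layout described in the caption, so a different interpolation factor or resolution can be tried by changing one stage.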

Human evaluations

Human side-by-side evaluations comparing MagicVideo-V2 with other state-of-the-art text-to-video generation methods, indicating a strong preference for MagicVideo-V2.

The distribution of human evaluators' preferences, showing a dominant inclination towards MagicVideo-V2 over other state-of-the-art T2V methods. Green, gray, and pink bars represent trials where MagicVideo-V2 was judged better, equivalent, or inferior, respectively.
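A preference distribution of this kind can be computed from per-trial judgments roughly as follows. This is a minimal sketch under stated assumptions: the label strings, the function name, and the input format are hypothetical, and the votes in the usage example are made-up placeholders, not the study's data.

```python
from collections import Counter

# Hypothetical label names for the three outcomes described in the caption.
LABELS = ("better", "equivalent", "inferior")


def preference_distribution(votes):
    """Return the fraction of trials per label for one side-by-side comparison."""
    votes = list(votes)
    if not votes:
        raise ValueError("at least one vote is required")
    counts = Counter(votes)
    unknown = set(counts) - set(LABELS)
    if unknown:
        raise ValueError(f"unexpected labels: {sorted(unknown)}")
    # Counter returns 0 for labels that never occur, so every label is reported.
    return {label: counts[label] / len(votes) for label in LABELS}


if __name__ == "__main__":
    # Placeholder votes for illustration only; NOT results from the evaluation.
    toy_votes = ["better", "better", "equivalent", "inferior"]
    print(preference_distribution(toy_votes))
```

Each fraction corresponds to the length of one colored bar segment for a given competing method.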

BibTeX

@misc{magicvideov2,
      title={MagicVideo-V2: Multi-Stage High-Aesthetic Video Generation},
      author={Wang, Weimin and Liu, Jiawei and Lin, Zhijie and Yan, Jiangqiao and Chen, Shuo and Low, Chetwin and Hoang, Tuyen and Wu, Jie and Liew, Jun Hao and Yan, Hanshu and Zhou, Daquan and Feng, Jiashi},
      year={2024},
      eprint={2401.04468},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}