🚀 Motion Consistency Model: Accelerating Video Diffusion with Disentangled Motion-Appearance Distillation
Yuanhao Zhai1, Kevin Lin2, Zhengyuan Yang2, Linjie Li2, Jianfeng Wang2, Chung-Ching Lin2, David Doermann1, Junsong Yuan1, Lijuan Wang2
1State University of New York at Buffalo | 2Microsoft
NeurIPS 2024
TL;DR: Our motion consistency model not only accelerates the sampling process of text-to-video diffusion models, but can also benefit from an additional high-quality image dataset to improve the frame quality of generated videos.
[09/2024] MCM was accepted to NeurIPS 2024!
[07/2024] Released the learnable head parameters at this box link.
[06/2024] Our MCM achieves strong performance (using 4 sampling steps) on the ChronoMagic-Bench! Check out the leaderboard here.
[06/2024] Training code, pre-trained checkpoint, Gradio demo, and Colab demo released.
[06/2024] Paper and project page released.
Instead of installing diffusers, peft, and open_clip from the official repos, we use our modified versions specified in the requirements.txt file. This is particularly important for diffusers and open_clip, due to the former's current limited support for video diffusion model LoRA loading, and the latter's distributed training dependency.
To set up the environment, run the following commands:
pip install torch==2.1.2 torchvision==0.16.2 torchaudio==2.1.2 --index-url https://download.pytorch.org/whl/cu118 # please modify the cuda version according to your env
pip install -r requirements.txt
pip install scipy==1.11.1
pip install https://github.com/podgorskiy/dnnlib/releases/download/0.0.1/dnnlib-0.0.1-py3-none-any.whl
Please prepare the video and optional image datasets in the webdataset format.
Specifically, please wrap the video/image files and their corresponding .json format metadata into .tar files. Here is an example structure of the video .tar file:
.
├── video_0.json
├── video_0.mp4
...
├── video_n.json
└── video_n.mp4
The .json files contain video/image captions as key-value pairs, for example: {"caption": "World map in gray - world map with animated circles and binary numbers"}.
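The packing step described above can be sketched with Python's standard tarfile module. This is a minimal illustration, not the authors' tooling: the helper names write_shard and check_shard, the shard file name, and the placeholder video bytes are all hypothetical, and the check only covers video shards with .mp4 files.

```python
import io
import json
import tarfile
import tempfile
from pathlib import Path


def write_shard(tar_path, samples):
    """Pack (key, video_bytes, caption) triples into one .tar shard.

    Hypothetical helper: each sample becomes <key>.json plus <key>.mp4,
    matching the example structure shown above.
    """
    with tarfile.open(tar_path, "w") as tar:
        for key, video_bytes, caption in samples:
            meta = json.dumps({"caption": caption}).encode("utf-8")
            # Files that share the same base name are read back as one sample.
            for name, payload in ((f"{key}.json", meta), (f"{key}.mp4", video_bytes)):
                info = tarfile.TarInfo(name=name)
                info.size = len(payload)
                tar.addfile(info, io.BytesIO(payload))


def check_shard(tar_path):
    """Return sorted sample keys; raise if a video or its caption is missing."""
    videos, captions = set(), set()
    with tarfile.open(tar_path, "r") as tar:
        for member in tar:
            stem, _, ext = member.name.rpartition(".")
            if ext == "mp4":
                videos.add(stem)
            elif ext == "json":
                meta = json.load(tar.extractfile(member))
                if "caption" in meta:
                    captions.add(stem)
    unpaired = videos ^ captions  # keys present on only one side
    if unpaired:
        raise ValueError(f"samples without a video/caption pair: {sorted(unpaired)}")
    return sorted(videos)


if __name__ == "__main__":
    # Placeholder bytes stand in for real encoded .mp4 content.
    with tempfile.TemporaryDirectory() as tmp:
        shard = Path(tmp) / "videos-000000.tar"
        write_shard(shard, [
            ("video_0", b"placeholder-mp4-bytes", "World map in gray"),
            ("video_1", b"placeholder-mp4-bytes", "A second caption"),
        ])
        print(check_shard(shard))
```

Running check_shard on an existing shard before training is a cheap way to catch videos that lost their caption file during packing.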
We provide our generated anime, realistic, and 3D cartoon style image datasets here (coming soon). Due to dataset agreements, we cannot publicly release the WebVid and LAION-aes datasets.
We provide a script, scripts/download.py, to download the DINOv2 and CLIP checkpoints.
python scripts/download.py
Please input your wandb API key in utils/wandb.py to enable wandb logging.
If you do not use wandb, please remove wandb from the --report_to argument in the training command.
We leverage accelerate for distributed training, and we support two different base text-to-video diffusion models: ModelScopeT2V and AnimateDiff. For both models, we train LoRA instead of fine-tuning all parameters.
For ModelScopeT2V, our code supports pure video diffusion distillation training, and frame quality improvement training.
By default, the training script requires 8 GPUs, each with 80GB of GPU memory, to fit a batch size of 4. The minimum GPU memory requirement is 32GB for a batch size of 1. Please adjust the --train_batch_size argument accordingly for different GPU memory sizes.
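When lowering --train_batch_size to fit smaller GPUs, it can help to see how the effective batch size changes. The sketch below is only arithmetic under two assumptions that this README does not confirm: that the batch size of 4 is a per-GPU value, and that the training scripts support gradient accumulation. Both function names are hypothetical.

```python
import math


def effective_batch_size(num_gpus, per_gpu_batch_size, grad_accum_steps=1):
    """Samples contributing to one optimizer step (assumes per-GPU batch size)."""
    return num_gpus * per_gpu_batch_size * grad_accum_steps


def accum_steps_to_match(target, num_gpus, per_gpu_batch_size):
    """Smallest accumulation step count that reaches at least `target` samples."""
    return math.ceil(target / (num_gpus * per_gpu_batch_size))


# Default setup from the text: 8 GPUs, batch size 4 (assumed to be per GPU).
default = effective_batch_size(num_gpus=8, per_gpu_batch_size=4)
print(default)  # 32

# On 32GB GPUs with batch size 1, accumulation (if available) could compensate.
print(accum_steps_to_match(default, num_gpus=8, per_gpu_batch_size=1))  # 4
```

Matching the effective batch size this way does not guarantee identical training behavior; it is only a starting point when the default hardware is not available.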
Before running the scripts, please modify the data path in the environment variables defined at the top of each script.
Diffusion distillation
We provide the training script in scripts/modelscopet2v_distillation.sh.
bash scripts/modelscopet2v_distillation.sh
Frame quality improvement
We provide the training script in scripts/modelscopet2v_improvement.sh. Before running, please assign the IMAGE_DATA_PATH in the script.
bash scripts/modelscopet2v_improvement.sh
Due to the higher resolution requirement, training MCM with the AnimateDiff base model requires at least 70GB of GPU memory to fit a single batch.
We provide the diffusion distillation training script in scripts/animatediff_distillation.sh.
bash scripts/animatediff_distillation.sh
We provide our pre-trained checkpoint here, Gradio demo here, and Colab demo here. demo.py showcases how to run our MCM on a local machine.
Feel free to try out our MCM!
We provide our pre-trained checkpoint here.
For research and debugging purposes, we also provide intermediate parameters and states at this box link. The folder (~1.12GB) includes the model weights, discriminator weights, scheduler states, optimizer states, and learnable head weights.
Some of our implementations are borrowed from the great repos below.
@article{zhai2024motion,
title={Motion Consistency Model: Accelerating Video Diffusion with Disentangled
Motion-Appearance Distillation},
author={Zhai, Yuanhao and Lin, Kevin and Yang, Zhengyuan and Li, Linjie and Wang, Jianfeng and Lin, Chung-Ching and Doermann, David and Yuan, Junsong and Wang, Lijuan},
year={2024},
journal={arXiv preprint arXiv:2406.06890},
website={https://yhzhai.github.io/mcm/},
}