This repository is the MindSpore implementation of AnimateDiff.
- Text-to-video generation with AnimateDiff v2, supporting 16 frames @ 512x512 resolution on Ascend 910B and 16 frames @ 256x256 resolution on GPU 3090
- MotionLoRA inference
- Motion Module Training
- Motion LoRA Training
- AnimateDiff v3 Inference
- AnimateDiff v3 Training
- SDXL support
pip install -r requirements.txt
In case the decord package is not available, try pip install eva-decord.
For EulerOS, instructions for installing ffmpeg and decord are as follows.
1. Install ffmpeg 4, referring to https://ffmpeg.org/releases
wget https://ffmpeg.org/releases/ffmpeg-4.0.1.tar.bz2 --no-check-certificate
tar -xvf ffmpeg-4.0.1.tar.bz2
mv ffmpeg-4.0.1 ffmpeg
cd ffmpeg
./configure --enable-shared # --enable-shared is needed for sharing libavcodec with decord
make -j 64
make install
2. Install decord, referring to https://github.com/dmlc/decord?tab=readme-ov-file#install-from-source
git clone --recursive https://github.com/dmlc/decord
cd decord
rm -rf build && mkdir build && cd build  # remove any previous build directory first
cmake .. -DUSE_CUDA=0 -DCMAKE_BUILD_TYPE=Release
make -j 64
make install
cd ../python
python3 setup.py install --user
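To verify the decord installation, a minimal Python check such as the one below can be used (the video path is a placeholder):
# Sanity-check the decord installation; replace the placeholder path with any local video file.
from decord import VideoReader

vr = VideoReader("path/to/any_video.mp4")
print(f"decoded {len(vr)} frames; first frame shape: {vr[0].shape}")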
First, download the torch pretrained weights, referring to the torch AnimateDiff checkpoints.
- Convert SD dreambooth model
To download the ToonYou-Beta3 dreambooth model, please refer to the civitai website, or use the following command:
wget https://civitai.com/api/download/models/78755 -P models/torch_ckpts/ --content-disposition --no-check-certificate
After downloading this dreambooth checkpoint under animatediff/models/torch_ckpts/, convert it using:
cd ../examples/stable_diffusion_v2
python tools/model_conversion/convert_weights.py --source ../animatediff/models/torch_ckpts/toonyou_beta3.safetensors --target models/toonyou_beta3.ckpt --model sdv1 --source_version pt
In addition, please download the RealisticVision V5.1 dreambooth checkpoint and convert it in the same way.
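After conversion, you can optionally sanity-check that a converted checkpoint loads as a MindSpore parameter dict. A minimal Python sketch (the checkpoint path follows the --target used above):
# Optional sanity check for a converted checkpoint; the path follows the --target above.
import mindspore as ms

param_dict = ms.load_checkpoint("models/toonyou_beta3.ckpt")
print(f"loaded {len(param_dict)} parameters, e.g. {list(param_dict)[:3]}")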
- Convert Motion Module
cd ../examples/animatediff/tools
python motion_module_convert.py --src ../torch_ckpts/mm_sd_v15_v2.ckpt --tar ../models/motion_module
To convert the AnimateDiff v3 motion module checkpoint, run:
cd ../examples/animatediff/tools
python motion_module_convert.py -v v3 --src ../torch_ckpts/v3_sd15_mm.ckpt --tar ../models/motion_module
- Convert Motion LoRA
cd ../examples/animatediff/tools
python motion_lora_convert.py --src ../torch_ckpts/.ckpt --tar ../models/motion_lora
- Convert Domain Adapter LoRA
cd ../examples/animatediff/tools
python domain_adapter_lora_convert.py --src ../torch_ckpts/v3_sd15_adapter.ckpt --tar ../models/domain_adapter_lora
- Convert SparseCtrl Encoder
cd ../examples/animatediff/tools
python sparsectrl_encoder_convert.py --src ../torch_ckpts/v3_sd15_sparsectrl_{}.ckpt --tar ../models/sparsectrl_encoder
The full tree of expected checkpoints is shown below:
models
├── domain_adapter_lora
│ └── v3_sd15_adapter.ckpt
├── dreambooth_lora
│ ├── realisticVisionV51_v51VAE.ckpt
│ └── toonyou_beta3.ckpt
├── motion_lora
│ └── v2_lora_ZoomIn.ckpt
├── motion_module
│ ├── mm_sd_v15.ckpt
│ ├── mm_sd_v15_v2.ckpt
│ └── v3_sd15_mm.ckpt
├── sparsectrl_encoder
│ ├── v3_sd15_sparsectrl_rgb.ckpt
│ └── v3_sd15_sparsectrl_scribble.ckpt
└── stable_diffusion
└── sd_v1.5-d0ab7146.ckpt
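A short Python sketch like the one below can confirm that the expected checkpoints are all in place (it assumes it is run from the directory containing the models folder shown above):
# Verify the checkpoint tree above; assumes the script is run next to the models/ folder.
from pathlib import Path

root = Path("models")
expected = [
    "domain_adapter_lora/v3_sd15_adapter.ckpt",
    "dreambooth_lora/realisticVisionV51_v51VAE.ckpt",
    "dreambooth_lora/toonyou_beta3.ckpt",
    "motion_lora/v2_lora_ZoomIn.ckpt",
    "motion_module/mm_sd_v15.ckpt",
    "motion_module/mm_sd_v15_v2.ckpt",
    "motion_module/v3_sd15_mm.ckpt",
    "sparsectrl_encoder/v3_sd15_sparsectrl_rgb.ckpt",
    "sparsectrl_encoder/v3_sd15_sparsectrl_scribble.ckpt",
    "stable_diffusion/sd_v1.5-d0ab7146.ckpt",
]
missing = [p for p in expected if not (root / p).exists()]
print("all checkpoints found" if not missing else f"missing: {missing}")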
- Running on Ascend 910*:
# download demo images
bash scripts/download_demo_images.sh
# under general T2V setting
python text_to_video.py --config configs/prompts/v3/v3-1-T2V.yaml
# image animation (on RealisticVision)
python text_to_video.py --config configs/prompts/v3/v3-2-animation-RealisticVision.yaml
# sketch-to-animation and storyboarding (on RealisticVision)
python text_to_video.py --config configs/prompts/v3/v3-3-sketch-RealisticVision.yaml
Results:
Image animation (RealisticVision): input image vs. generated animation.
Sketch-to-animation and storyboarding: input scribble(s) vs. generated output.
- Running on GPU:
Please append --device_target GPU to the end of the commands above.
If you use a checkpoint converted from torch for inference, please also append --vae_fp16=False to the command above.
- Running on Ascend 910*:
python text_to_video.py --config configs/prompts/v2/1-ToonYou.yaml --L 16 --H 512 --W 512
By default, DDIM sampling is used, and the sampling speed is 1.07s/iter.
Results:
- Running on GPU:
python text_to_video.py --config configs/prompts/v2/1-ToonYou.yaml --L 16 --H 256 --W 256 --device_target GPU
If you use a checkpoint converted from torch for inference, please also append --vae_fp16=False to the command above.
- Running on Ascend 910*:
python text_to_video.py --config configs/prompts/v2/1-ToonYou-MotionLoRA.yaml --L 16 --H 512 --W 512
By default, DDIM sampling is used, and the sampling speed is 1.07s/iter.
Results using the Zoom-In Motion LoRA:
- Running on GPU:
python text_to_video.py --config configs/prompts/v2/1-ToonYou-MotionLoRA.yaml --L 16 --H 256 --W 256 --device_target GPU
python train.py --config configs/training/image_finetune.yaml
For 910B, please set the following environment variable before running training:
export MS_ASCEND_CHECK_OVERFLOW_MODE="INFNAN_MODE"
python train.py --config configs/training/mmv2_train.yaml
For 910B, please set the following environment variable before running training:
export MS_ASCEND_CHECK_OVERFLOW_MODE="INFNAN_MODE"
You may change arguments such as the data path, output directory, and learning rate in the yaml config file. You can also override them with command-line arguments; see args_train.py or run python train.py --help.
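For example, the yaml config can also be edited programmatically before launching training. A minimal sketch, assuming standard yaml configs (the keys shown are placeholders; check the actual file or args_train.py for the exact names):
# Illustrative only: the keys below (data_path, output_path) are placeholders;
# check the actual yaml / args_train.py for the exact argument names.
import yaml

with open("configs/training/mmv2_train.yaml") as f:
    cfg = yaml.safe_load(f)
cfg["data_path"] = "/path/to/video_dataset"    # placeholder value
cfg["output_path"] = "outputs/mmv2_exp1"       # placeholder value
with open("configs/training/mmv2_train_custom.yaml", "w") as f:
    yaml.safe_dump(cfg, f)
# then launch: python train.py --config configs/training/mmv2_train_custom.yaml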
- Evaluation
To infer with the trained model, run
python text_to_video.py --config configs/prompts/v2/base_video.yaml \
--motion_module_path {path to saved checkpoint} \
--prompt {text prompt}
You can also create a new config yaml, based on configs/prompts/v2/base_video.yaml, to specify the prompts to test and the motion module path.
Here are some generation results after MM training on 512x512 resolution and 16-frame data.
Min-SNR weighting can be used to improve diffusion training convergence. You can enable it by appending --snr_gamma=5.0 to the training command.
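For reference, Min-SNR-γ caps the per-timestep loss weight at min(SNR(t), γ)/SNR(t) for epsilon-prediction. A minimal sketch of the weights, assuming a standard SD scaled-linear beta schedule (the repo's actual schedule may differ):
# Min-SNR-gamma loss weights for epsilon-prediction; gamma=5.0 matches --snr_gamma=5.0.
# The scaled-linear beta schedule below is an assumption, not necessarily this repo's exact schedule.
import numpy as np

betas = np.linspace(0.00085 ** 0.5, 0.012 ** 0.5, 1000) ** 2
alphas_cumprod = np.cumprod(1.0 - betas)
snr = alphas_cumprod / (1.0 - alphas_cumprod)   # SNR(t) = alpha_bar_t / (1 - alpha_bar_t)
gamma = 5.0
loss_weight = np.minimum(snr, gamma) / snr      # w(t) = min(SNR(t), gamma) / SNR(t)
print(loss_weight[:3], loss_weight[-3:])        # < 1 at low-noise (early) steps, 1.0 at high-noise steps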
python train.py --config configs/training/mmv2_lora.yaml
For 910B, please set the following environment variable before running training:
export MS_ASCEND_CHECK_OVERFLOW_MODE="INFNAN_MODE"
- Evaluation
To infer with the trained model, run
python text_to_video.py --config configs/prompts/v2/base_video.yaml \
--motion_lora_path {path to saved checkpoint} \
--prompt {text prompt}
Here are some generation results after Motion LoRA fine-tuning on 512x512 resolution and 16-frame data.
Please add --device_target GPU to the above training commands and adjust image_size, num_frames, and train_batch_size to fit your device memory. Below is an example for a 3090.
# reduce num frames and batch size to avoid OOM in 3090
python train.py --config configs/training/mmv2_train.yaml --data_path ../videocomposer/datasets/webvid5 --image_size 256 --num_frames=4 --device_target GPU --train_batch_size=1
| Model | Context | Scheduler | Steps | Resolution | Frames | Speed (step/s) | Time (s/video) |
|---|---|---|---|---|---|---|---|
| AnimateDiff v2 | D910*x1-MS2.2.10 | DDIM | 30 | 512x512 | 16 | 1.2 | 25 |
Context: {Ascend chip}-{number of NPUs}-{mindspore version}.
| Model | Context | Task | Local BS x Grad. Accu. | Resolution | Frames | Step Time (s/step) |
|---|---|---|---|---|---|---|
| AnimateDiff v2 | D910*x1-MS2.2.10 | MM training | 1x1 | 512x512 | 16 | 1.29 |
| AnimateDiff v2 | D910*x1-MS2.2.10 | Motion LoRA | 1x1 | 512x512 | 16 | 1.26 |
| AnimateDiff v2 | D910*x1-MS2.2.10 | MM training w/ Embed. cached | 1x1 | 512x512 | 16 | 0.75 |
| AnimateDiff v2 | D910*x1-MS2.2.10 | Motion LoRA w/ Embed. cached | 1x1 | 512x512 | 16 | 0.71 |
Context: {Ascend chip}-{number of NPUs}-{mindspore version}.
MM training: Motion Module training
Embed. cached: The video embedding (VAE-encoder outputs) and text embedding are pre-computed and stored before diffusion training.
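A conceptual sketch of this caching step is shown below; the dataset, vae, text_encoder, and tokenize objects are illustrative placeholders, not this repo's actual API:
# Conceptual sketch of embedding caching; all object names here are illustrative placeholders.
import os
import numpy as np

def cache_embeddings(dataset, vae, text_encoder, tokenize, out_dir="cached_embeddings"):
    os.makedirs(out_dir, exist_ok=True)
    for idx, (video_frames, caption) in enumerate(dataset):
        # Encode frames to VAE latents once, instead of re-running the VAE encoder every training step.
        video_latents = np.stack([np.asarray(vae.encode(frame)) for frame in video_frames])
        text_emb = np.asarray(text_encoder(tokenize(caption)))
        np.savez(os.path.join(out_dir, f"sample_{idx:06d}.npz"),
                 video_latents=video_latents, text_emb=text_emb)
# During diffusion training, the dataloader then reads these .npz files directly,
# skipping the VAE and text encoders and reducing the per-step time (see the table above).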