Stable Video Diffusion image-to-video fine-tuning

The train_image_to_video_svd.py script shows how to fine-tune Stable Video Diffusion (SVD) on your own dataset.

🚨 This script is experimental. It fine-tunes the whole model, and the model often overfits and runs into issues like catastrophic forgetting. It is recommended to try different hyperparameters to get the best results on your dataset. 🚨

Running locally with Paddle

Installing the dependencies

Before running the scripts, make sure to install the library's training dependencies:

Important

To make sure you can successfully run the latest versions of the example scripts, we highly recommend installing from source and keeping the install up to date as we update the example scripts frequently and install some example-specific requirements. To do this, execute the following steps in a new virtual environment:

git clone https://github.com/PaddlePaddle/PaddleMIX
cd PaddleMIX/ppdiffusers
pip install -e .

Then cd into the examples/stable_video_diffusion folder and run:

pip install -r requirements_svd.txt

Video Data Processing

We will use BDD100K as an example for training data processing. Note that BDD100K is a driving video/image dataset, but a driving dataset is not required; any videos can be used for training. Please refer to the DummyDataset data-reading logic. In short, you only need to specify --train_data_dir and --valid_data_path, and arrange your videos in the following file structure (a minimal sketch of the reading logic is shown after the tree below):

self.base_folder
    ├── video_name1
    │   ├── video_frame1
    │   ├── video_frame2
    │   └── ...
    ├── video_name2
    │   ├── video_frame1
    │   └── ...
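
The actual data-reading logic lives in the training script's DummyDataset; the sketch below only illustrates the same idea with hypothetical names (FrameFolderDataset, num_frames, width, height), assuming every subfolder of base_folder holds the ordered frames of one video:

import os
import random

import numpy as np
from PIL import Image
from paddle.io import Dataset


class FrameFolderDataset(Dataset):
    """Hypothetical sketch of the DummyDataset idea: one subfolder per video,
    one image file per frame, read as a fixed-length clip."""

    def __init__(self, base_folder, num_frames=14, width=448, height=256):
        super().__init__()
        self.base_folder = base_folder
        self.num_frames = num_frames
        self.size = (width, height)
        # Each subdirectory of base_folder is treated as one video.
        self.video_dirs = [
            os.path.join(base_folder, d)
            for d in sorted(os.listdir(base_folder))
            if os.path.isdir(os.path.join(base_folder, d))
        ]

    def __len__(self):
        return len(self.video_dirs)

    def __getitem__(self, idx):
        video_dir = self.video_dirs[idx]
        # Frames are assumed to sort into temporal order; each video is
        # assumed to contain at least num_frames frames.
        frame_files = sorted(os.listdir(video_dir))
        # Pick a random contiguous clip of num_frames frames.
        start = random.randint(0, max(0, len(frame_files) - self.num_frames))
        clip_files = frame_files[start : start + self.num_frames]
        frames = []
        for name in clip_files:
            img = Image.open(os.path.join(video_dir, name)).convert("RGB")
            img = img.resize(self.size)
            # Scale pixels to [-1, 1] and use channel-first layout (C, H, W).
            arr = np.asarray(img, dtype=np.float32) / 127.5 - 1.0
            frames.append(arr.transpose(2, 0, 1))
        # Shape: (num_frames, 3, height, width)
        return {"pixel_values": np.stack(frames)}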

Data Preparation

Execute the following command to download and extract the processed dataset.

wget https://paddlenlp.bj.bcebos.com/models/community/westfish/lvdm_datasets/sky_timelapse_lvdm.zip && unzip sky_timelapse_lvdm.zip
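
If you would rather train on your own videos, they first need to be split into frame folders matching the layout above. The following is only an illustrative sketch (the extract_frames helper, the raw_videos input folder, and the bdd100k output folder are hypothetical), using OpenCV:

import os

import cv2  # pip install opencv-python


def extract_frames(video_path, output_root):
    """Hypothetical helper: dump every frame of one video into its own folder."""
    name = os.path.splitext(os.path.basename(video_path))[0]
    out_dir = os.path.join(output_root, name)
    os.makedirs(out_dir, exist_ok=True)

    cap = cv2.VideoCapture(video_path)
    idx = 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        # Zero-padded names keep the frames in temporal order when sorted.
        cv2.imwrite(os.path.join(out_dir, f"frame_{idx:06d}.jpg"), frame)
        idx += 1
    cap.release()


# Example: turn every video in ./raw_videos into a frame folder under ./bdd100k
for fname in os.listdir("raw_videos"):
    if fname.endswith((".mp4", ".mov", ".avi")):
        extract_frames(os.path.join("raw_videos", fname), "bdd100k")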

Training

Single machine, single GPU, V100 32G

export MODEL_NAME="stabilityai/stable-video-diffusion-img2vid-xt"
export DATASET_NAME="bdd100k"
export OUTPUT_DIR="sdv_train_output"
export VALID_DATA="valid_image"
export GLOG_minloglevel=2

python train_image_to_video_svd.py \
    --pretrained_model_name_or_path=$MODEL_NAME \
    --per_gpu_batch_size=1 --gradient_accumulation_steps=1 \
    --max_train_steps=50000 \
    --checkpointing_steps=1000 --checkpoints_total_limit=10 \
    --learning_rate=1e-5 --lr_warmup_steps=0 \
    --seed=123 \
    --mixed_precision="fp16" \
    --validation_steps=200 \
    --output_dir=$OUTPUT_DIR \
    --train_data_dir=$DATASET_NAME \
    --valid_data_path=$VALID_DATA \
    --width=448 --height=256 --enable_xformers_memory_efficient_attention --gradient_checkpointing

Single machine, single GPU

export MODEL_NAME="stabilityai/stable-video-diffusion-img2vid-xt"
export DATASET_NAME="bdd100k"
export OUTPUT_DIR="sdv_train_output"
export VALID_DATA="valid_image"
export GLOG_minloglevel=2
export FLAGS_conv_workspace_size_limit=4096

python train_image_to_video_svd.py \
    --pretrained_model_name_or_path=$MODEL_NAME \
    --per_gpu_batch_size=1 --gradient_accumulation_steps=1 \
    --max_train_steps=50000 \
    --width=512 \
    --height=320 \
    --checkpointing_steps=1000 --checkpoints_total_limit=1 \
    --learning_rate=1e-5 --lr_warmup_steps=0 \
    --seed=123 \
    --mixed_precision="fp16" \
    --validation_steps=200 \
    --output_dir=$OUTPUT_DIR \
    --train_data_dir=$DATASET_NAME \
    --valid_data_path=$VALID_DATA

Single machine, multiple GPUs

export MODEL_NAME="stabilityai/stable-video-diffusion-img2vid-xt"
export DATASET_NAME="bdd100k"
export OUTPUT_DIR="sdv_train_output"
export VALID_DATA="valid_image"
export GLOG_minloglevel=2
export FLAGS_conv_workspace_size_limit=4096

python -m paddle.distributed.launch --gpus "0,1,2,3" train_image_to_video_svd.py \
    --pretrained_model_name_or_path=$MODEL_NAME \
    --per_gpu_batch_size=1 --gradient_accumulation_steps=1 \
    --max_train_steps=50000 \
    --width=512 \
    --height=320 \
    --checkpointing_steps=1000 --checkpoints_total_limit=10 \
    --learning_rate=1e-5 --lr_warmup_steps=0 \
    --seed=123 \
    --mixed_precision="fp16" \
    --validation_steps=200 \
    --output_dir=$OUTPUT_DIR \
    --train_data_dir=$DATASET_NAME \
    --valid_data_path=$VALID_DATA

Notes:

  • "bf16" only supported on NVIDIA A100.

Inference

import paddle

from ppdiffusers.pipelines.stable_video_diffusion import StableVideoDiffusionPipeline
from ppdiffusers.utils import load_image, export_to_video

# Load the fine-tuned pipeline (or the base model) in half precision
pipe = StableVideoDiffusionPipeline.from_pretrained(
    "your-stable-video-diffusion-img2vid-model-path-or-id",
    paddle_dtype=paddle.float16
)

# Load the conditioning image
# image = load_image("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/svd/rocket.png?download=")
image = load_image("rocket.png")
image = image.resize((1024, 576))

# Fix the seed for reproducible generation
generator = paddle.Generator().manual_seed(42)
frames = pipe(image, decode_chunk_size=8, generator=generator).frames[0]

# Write the generated frames to an MP4 at 7 fps
export_to_video(frames, "generated.mp4", fps=7)
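
To reproduce something close to the settings listed in the Comparison section below, the conditioning arguments can be passed straight to the pipeline call. This is a sketch that reuses the pipe and image objects from the snippet above:

# Match the settings listed in the Comparison section below.
generator = paddle.Generator().manual_seed(111)
frames = pipe(
    image,
    height=320,
    width=512,
    decode_chunk_size=8,
    motion_bucket_id=127,    # controls the amount of motion
    fps=7,                   # frames-per-second conditioning signal
    noise_aug_strength=0.0,  # noise added to the conditioning image
    generator=generator,
).frames[0]
export_to_video(frames, "comparison.mp4", fps=7)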

Comparison

Settings: size=(512, 320), motion_bucket_id=127, fps=7, noise_aug_strength=0.00, seed=111
[Comparison images: four examples, each showing the init image, the result before fine-tuning, and the result after fine-tuning.]

References