Dysen-VDM: Empowering Dynamics-aware Text-to-Video Diffusion with LLMs

Hao Fei, Shengqiong Wu, Wei Ji, Hanwang Zhang, and Tat-Seng Chua

This repository contains the source code for the CVPR 2024 paper Dysen-VDM.

Framework architecture

⚙️ Setting environments

Install Environment via Anaconda

conda create -n dysen_vdm python=3.8.5
conda activate dysen_vdm
pip install -r requirements.txt

Download Datasets

Put all the data in the dataset folder.

Pre-training corpus
- WebVid
  - WebVid is a large-scale dataset of videos with textual descriptions, where the videos are diverse and rich in their content.
  - It contains 10.7M video-caption pairs, of which we use only 3M for pre-training the VDM.
  - Download the dataset from the official website and save it in dataset/webvid.

Text-to-video in-domain data
- UCF-101
  - UCF-101 is composed of diverse human actions and contains 101 classes, where each class label denotes a specific movement.
  - Download the dataset from the official website and save it in dataset/ucf101.
- MSR-VTT
  - MSR-VTT (Microsoft Research Video to Text) is a large-scale text-video pair dataset. It consists of 10,000 video clips from 20 categories, and each video clip is annotated with 20 English sentences by Amazon Mechanical Turk workers.
  - Download the dataset from the official website and save it in dataset/msrvtt.
- ActivityNet
  - Each video in ActivityNet is paired with descriptions covering multiple actions (at least 3), which allows describing multiple complex events that occur.
  - Download the dataset from the official website and save it in dataset/activityNet.
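
The folder layout described above can be created before downloading anything. The following is a minimal sketch: `make_dataset_dirs` is a hypothetical helper that is not part of this repository, and only the folder names are taken from the list above.

```shell
# Hypothetical helper (not part of the repository): create one sub-folder
# per dataset under the given root, which defaults to "dataset".
make_dataset_dirs() {
  root="${1:-dataset}"
  for name in webvid ucf101 msrvtt activityNet; do
    mkdir -p "${root}/${name}"
  done
}

# Create the layout in the current directory and list it.
make_dataset_dirs dataset
ls dataset
```

Run it from the project root so that the dataset folder sits next to shellscripts.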

💫 Pre-training Dysen-VDM

We first pre-train the Dysen-VDM system on the WebVid text-video pair data in dataset/webvid.

Step 1: Pre-train the video autoencoder of VDM

bash shellscripts/train_vdm_autoencoder.sh

Properly set up PROJ_ROOT, DATADIR, EXPERIMENT_NAME and CONFIG, where EXPERIMENT_NAME = webvid.
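
A quick check that none of these variables was left empty can save a failed launch. The sketch below is only an illustration: `require_vars` is a hypothetical helper, the values are placeholders, and whether the training script reads these variables from the environment or expects them to be edited inside the script is not stated here, so adapt it accordingly.

```shell
# Hypothetical helper (not part of the repository): return non-zero when
# any of the named variables is empty or unset.
require_vars() {
  missing=0
  for name in "$@"; do
    # Look up the value of the variable whose name is stored in $name.
    eval "value=\${${name}:-}"
    if [ -z "${value}" ]; then
      echo "missing variable: ${name}" >&2
      missing=1
    fi
  done
  return "${missing}"
}

# Placeholder values; replace them with your own paths.
PROJ_ROOT="$(pwd)"
DATADIR="${PROJ_ROOT}/dataset/webvid"
EXPERIMENT_NAME="webvid"
CONFIG="path/to/your/config.yaml"  # placeholder, not a real file in the repo

if require_vars PROJ_ROOT DATADIR EXPERIMENT_NAME CONFIG; then
  echo "all variables set for ${EXPERIMENT_NAME}"
  # Launch only after the check passes:
  # bash shellscripts/train_vdm_autoencoder.sh
fi
```

The same check applies to the later steps by passing their variable names instead.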

Step 2: Pre-train the backbone VDM for text-conditioned video generation

bash shellscripts/run_train_vdm.sh

Properly set up PROJ_ROOT, DATADIR, AEPATH, EXPERIMENT_NAME and CONFIG, where EXPERIMENT_NAME = webvid.

This step uses the gold dynamic scene graph (DSG) of each video to update the recurrent graph Transformer in the 3D-UNet. Parse the DSG annotations in advance with the tools in dysen/DSG.

Step 3: (Post-)Train the overall Dysen-VDM with dynamic scene managing

bash shellscripts/run_train_dysen_vdm.sh

Properly set up EXPERIMENT_NAME, RESUME, DATADIR, CKPT_PATH and VDM_MODEL, where EXPERIMENT_NAME = webvid.
The in-context learning (ICL) process within dysen is optimized with reinforcement learning (RL). If you use RL for the Imagination Rationality optimization, the gold DSG of each video is needed. Parse the DSG annotations in advance with the tools in dysen/DSG.

🧩 Fine-tuning Dysen-VDM on in-domain data

We further update Dysen-VDM on the in-domain training set:

bash shellscripts/run_train_dysen_vdm.sh

Properly set up EXPERIMENT_NAME, RESUME, DATADIR, CKPT_PATH and VDM_MODEL, where EXPERIMENT_NAME = activityNet | msrvtt | ucf101.
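
To fine-tune on all three in-domain datasets in sequence, a small wrapper loop can help. This is a sketch under a loud assumption: it presumes the training script picks up EXPERIMENT_NAME and DATADIR from the environment, which is not confirmed here. If the script hard-codes them, edit the script for each run instead. `finetune_all` is a hypothetical name, and RESUME, CKPT_PATH and VDM_MODEL still have to be set separately.

```shell
# Hypothetical wrapper (not part of the repository).
# Assumption: the training script reads EXPERIMENT_NAME and DATADIR from
# the environment; RESUME, CKPT_PATH and VDM_MODEL must be set elsewhere.
finetune_all() {
  script="${1:-shellscripts/run_train_dysen_vdm.sh}"
  for name in activityNet msrvtt ucf101; do
    echo "fine-tuning on ${name}"
    # Prefix assignments are passed only to this one invocation.
    EXPERIMENT_NAME="${name}" DATADIR="dataset/${name}" bash "${script}" || return 1
  done
}

# Usage, from the project root:
#   finetune_all
```

The loop stops at the first failing run, so a broken configuration does not consume time on the remaining datasets.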

💫 Evaluating

Measure the performance of Dysen-VDM on a dataset:

bash shellscripts/run_eval_dysen_vdm.sh

Properly set up DATACONFIG, PREDCITPATH, GOLDPATH, EXPERIMENT_NAME, and RESDIR.

💫 Inference

Generate videos from text with the trained Dysen-VDM:

bash shellscripts/run_sample_vdm_text2video.sh

Contact

For any questions or feedback, feel free to contact Hao Fei.

Citation

If you find Dysen-VDM useful in your research or applications, please kindly cite:

@inproceedings{fei2024dysen,
  title={Dysen-VDM: Empowering Dynamics-aware Text-to-Video Diffusion with LLMs},
  author={Hao Fei and Shengqiong Wu and Wei Ji and Hanwang Zhang and Tat-Seng Chua},
  booktitle={Proceedings of the CVPR},
  pages={961--970},
  year={2024}
}

License Notices

This repository is under BSD 3-Clause License. Dysen-VDM is a research project intended for non-commercial use only. One must NOT use the code of Dysen-VDM for any illegal, harmful, violent, racist, or sexual purposes. One is strictly prohibited from engaging in any activity that will potentially violate these guidelines. Any potential commercial use of this code should be approved by the authors.
