Website | Leaderboard | Slack
Welcome to the Text-Guided Video Editing (TGVE) competition of LOVEU Workshop @ CVPR 2023!
This repository contains the data, baseline code and submission guideline for the LOVEU-TGVE competition. If you have any questions, please feel free to reach out to us at [email protected].
Leveraging AI for video editing has the potential to unleash creativity for artists across all skill levels. The rapidly-advancing field of Text-Guided Video Editing (TGVE) aims to make this possible. Recent works in this field include Tune-A-Video, Gen-2, and Dreamix.
In this competition track, we provide a standard set of videos and prompts. As a researcher, you will develop a model that takes a video and a prompt for how to edit it, and your model will produce an edited video. For instance, you might be given a video of “people playing basketball in a gym,” and your model will edit the video to “dinosaurs playing basketball on the moon.”
With this competition, we aim to offer a place where researchers can rigorously compare video editing methods. After the competition ends, we hope the LOVEU-TGVE-2023 dataset can provide a standardized way of comparing AI video editing algorithms.
- May 1, 2023: The competition data and baseline code become available.
- May 8, 2023: The leaderboard and submission instructions become available.
- June 5, 2023: Deadline for submitting your generated videos.
- June 18, 2023: LOVEU 2023 Workshop. Presentations by winner and runner-up.
We conducted a survey of text-guided video editing papers and found the following patterns in how they evaluate their work:
- Input: 10 to 100 videos, with ~3 editing prompts per video
- Human evaluation to compare the generated videos to a baseline
We follow a similar protocol in our LOVEU-TGVE-2023 dataset. Our dataset consists of 76 videos, each with 4 editing prompts. All videos are Creative Commons licensed. Each video contains either 32 or 128 frames at a resolution of 480x480.
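For reference, here is a minimal, unofficial sketch of loading one video into an array, assuming its frames are stored as sequentially numbered JPGs (the same convention as the submission format below); adapt it to however you store the raw clips.

```python
import glob

import numpy as np
from PIL import Image

def load_frames(frame_dir: str) -> np.ndarray:
    """Load sequentially numbered JPG frames into a (T, H, W, 3) uint8 array."""
    paths = sorted(glob.glob(f"{frame_dir}/*.jpg"))
    frames = [np.asarray(Image.open(p).convert("RGB")) for p in paths]
    return np.stack(frames)  # e.g. (32, 480, 480, 3) for a 32-frame video
```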
Please ensure that you complete all the necessary details and upload your edited videos and report on the LOVEU-TGVE Registration & Submission Form prior to June 5, 2023.
NOTE:
- Each team should register only once. Registering multiple times using different accounts is not permitted.
- The Google form can be edited multiple times. Each submission will overwrite the previous one.
- Only the latest submission will be sent to human evaluation.
Kindly upload a zip file named `YOUR-TEAM-NAME_videos.zip` to the Edited Videos portal in the Google form.
The zip file must include edited videos for ALL prompts in the LOVEU-TGVE-2023 dataset.
Please adhere to the following format and folder structure when saving your edited videos (a small sketch for checking this layout locally follows the tree).
YOUR-TEAM-NAME_videos.zip
├── DAVIS_480p
│ ├── stunt
│ │ ├── style
│ │ │ ├── 00000.jpg
│ │ │ ├── 00001.jpg
│ │ │ ├── ...
│ │ ├── object
│ │ │ ├── 00000.jpg
│ │ │ ├── 00001.jpg
│ │ │ ├── ...
│ │ ├── background
│ │ │ ├── 00000.jpg
│ │ │ ├── 00001.jpg
│ │ │ ├── ...
│ │ ├── multiple
│ │ │ ├── 00000.jpg
│ │ │ ├── 00001.jpg
│ │ │ ├── ...
│ ├── gold-fish
│ ├── drift-turn
│ ├── ...
├── youtube_480p
├── videvo_480p
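Before uploading, it may help to verify your folder against this layout. The following is a minimal, unofficial sketch (not part of the provided tooling) that reports any video/edit folder with no JPG frames; the edit names are taken from the tree above.

```python
import os

# Expected per-video edit types, as shown in the folder structure above.
EDIT_TYPES = ["style", "object", "background", "multiple"]
DATASETS = ["DAVIS_480p", "youtube_480p", "videvo_480p"]

def check_submission(root: str) -> None:
    """Print every video/edit folder that is missing or contains no JPG frames."""
    for dataset in DATASETS:
        dataset_dir = os.path.join(root, dataset)
        if not os.path.isdir(dataset_dir):
            print(f"missing dataset folder: {dataset}")
            continue
        for video in sorted(os.listdir(dataset_dir)):
            for edit in EDIT_TYPES:
                edit_dir = os.path.join(dataset_dir, video, edit)
                frames = os.listdir(edit_dir) if os.path.isdir(edit_dir) else []
                if not any(f.endswith(".jpg") for f in frames):
                    print(f"no frames in {dataset}/{video}/{edit}")

check_submission("YOUR-TEAM-NAME_videos")
```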
Use CVPR style (double column, 3-6 pages) or NeurIPS style (single column, 6-10 pages), inclusive of any references. Please clearly explain:
- Your data, supervision, and any pre-trained models
- Pertinent hyperparameters such as classifier-free guidance scale
- If you used prompt engineering, please describe your approach
Please name your report `YOUR-TEAM-NAME_report.pdf` and submit it to the Report portal in the Google form.
After submission, your edited videos will be automatically evaluated on our server using CLIP score and PickScore. The scores will be posted on the leaderboard on the competition website, which is refreshed every 24 hours. You can also compute these automatic metrics locally with the evaluation code provided in this repository.
After all submissions are uploaded, we will run a human evaluation of all submitted videos. Specifically, human labelers will compare each submitted video to the baseline video edited with the Tune-A-Video model. Labelers will evaluate videos on the following criteria:
- Text alignment: How well does the generated video match the caption?
- Structure: How well does the generated video preserve the structure of the original video?
- Quality: Aesthetically, how good is this video?
We will choose a winner and a runner-up based on the human evaluation results.
Tune-A-Video: One-Shot Tuning of Image Diffusion Models for Text-to-Video Generation
Jay Zhangjie Wu, Yixiao Ge, Xintao Wang, Stan Weixian Lei, Yuchao Gu, Yufei Shi, Wynne Hsu, Ying Shan, Xiaohu Qie, Mike Zheng Shou
git clone https://github.com/showlab/loveu-tgve-2023.git
cd loveu-tgve-2023
pip install -r requirements.txt
Installing xformers is highly recommended for better memory efficiency and speed on GPUs. To enable xformers, set `enable_xformers_memory_efficient_attention=True` (default).
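For reference, this flag in the training config typically maps to the standard diffusers switch shown below; the snippet is an illustration only (it uses the stock 2D UNet, whereas the baseline uses its own inflated UNet), not part of the competition code.

```python
from diffusers import UNet2DConditionModel

# Illustration: load a UNet from the Stable Diffusion checkpoint and turn on
# xformers memory-efficient attention (requires `pip install xformers`).
unet = UNet2DConditionModel.from_pretrained(
    "checkpoints/stable-diffusion-v1-4", subfolder="unet"
)
unet.enable_xformers_memory_efficient_attention()
```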
Stable Diffusion is a latent text-to-image diffusion model capable of generating photo-realistic images given any text input. The pre-trained Stable Diffusion models can be downloaded from HuggingFace (e.g., Stable Diffusion v1-4, v1-5, v2-1).
git lfs install
git clone https://huggingface.co/CompVis/stable-diffusion-v1-4 checkpoints/stable-diffusion-v1-4
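Alternatively, if you prefer not to use git-lfs, the same checkpoint can be fetched programmatically; a sketch assuming a recent version of huggingface_hub (for the local_dir argument):

```python
from huggingface_hub import snapshot_download

# Download the Stable Diffusion v1-4 weights into the expected checkpoints folder.
snapshot_download(
    repo_id="CompVis/stable-diffusion-v1-4",
    local_dir="checkpoints/stable-diffusion-v1-4",
)
```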
Download the loveu-tgve-2023.zip and unpack it to the ./data folder:
unzip loveu-tgve-2023.zip -d ./data
Modify the required paths in scripts/create_configs.py
and run:
python scripts/create_configs.py
To fine-tune the text-to-image diffusion models for text-to-video generation, run this command:
accelerate launch train_tuneavideo.py --config="configs/loveu-tgve-2023/DAVIS_480p/gold-fish.yaml"
To run training on all videos in the dataset:
CONFIG_PATH=./configs/loveu-tgve-2023
for config_file in $(find $CONFIG_PATH -name "*.yaml"); do
  accelerate launch train_tuneavideo.py --config=$config_file
done
Tips:
- Fine-tuning a 32-frame video (480x480) requires approximately 300 to 500 steps, taking around 10 to 15 minutes on one A100 GPU (40GB).
- Fine-tuning a 128-frame video (480x480) requires more than 40GB of VRAM and can be run on one A100 GPU (80GB). If your VRAM is limited, you may split the video into 32-frame clips for fine-tuning (a sketch of such a split follows these tips).
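Below is a minimal, unofficial sketch of such a split, assuming the video is stored as sequentially numbered JPG frames in a single folder; the paths are placeholders, so adapt them to your local layout.

```python
import os
import shutil

def split_into_clips(frame_dir: str, out_dir: str, clip_len: int = 32) -> None:
    """Copy sequentially numbered JPG frames into clip_0, clip_1, ... subfolders of clip_len frames."""
    frames = sorted(f for f in os.listdir(frame_dir) if f.endswith(".jpg"))
    for start in range(0, len(frames), clip_len):
        clip_dir = os.path.join(out_dir, f"clip_{start // clip_len}")
        os.makedirs(clip_dir, exist_ok=True)
        for name in frames[start:start + clip_len]:
            shutil.copy(os.path.join(frame_dir, name), os.path.join(clip_dir, name))

# Example (placeholder paths): split a 128-frame video into four 32-frame clips.
split_into_clips("PATH_TO_128_FRAME_VIDEO", "PATH_TO_OUTPUT_CLIPS", clip_len=32)
```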
Once the training is done, run inference:
python test_tuneavideo.py --config="configs/loveu-tgve-2023/DAVIS_480p/gold-fish.yaml" --fp16
Convert GIF to JPG
ffmpeg -r 1 -i input.gif %05d.jpg
from PIL import Image

# Iterate over the frames of the GIF and save each one as a zero-padded JPG.
gif = Image.open("input.gif")
frame_index = 0
while True:
    try:
        gif.seek(frame_index)
    except EOFError:
        break
    image = gif.convert('RGB')
    image.save("{:05d}.jpg".format(frame_index))
    frame_index += 1
In addition to human evaluation, we employ CLIP score and PickScore as automatic metrics to measure the quality of generated videos (an illustrative local sketch follows the list):
- CLIP score for frame consistency: we compute CLIP image embeddings on all frames of the output video and report the average cosine similarity between all pairs of video frames.
- CLIP score for textual alignment: we compute the average CLIP score between all frames of the output video and the corresponding edited prompt.
- PickScore for human preference: we compute the average PickScore between all frames of the output video and the corresponding edited prompt.
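Purely as an illustration of the two CLIP-based metrics above (the official implementation is scripts/run_eval.py below, and its model choice and pre/post-processing may differ), here is a minimal sketch using the Hugging Face transformers CLIP API:

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Note: this model choice is illustrative; the official script may use a different CLIP variant.
model = CLIPModel.from_pretrained("openai/clip-vit-large-patch14")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14")

@torch.no_grad()
def clip_scores(frame_paths, prompt):
    """Return (frame_consistency, text_alignment) for one edited video."""
    images = [Image.open(p).convert("RGB") for p in frame_paths]
    inputs = processor(text=[prompt], images=images, return_tensors="pt", padding=True)
    image_emb = model.get_image_features(pixel_values=inputs["pixel_values"])
    text_emb = model.get_text_features(input_ids=inputs["input_ids"],
                                       attention_mask=inputs["attention_mask"])
    image_emb = image_emb / image_emb.norm(dim=-1, keepdim=True)
    text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)

    # Frame consistency: average cosine similarity over all pairs of distinct frames.
    sim = image_emb @ image_emb.T
    n = sim.shape[0]
    frame_consistency = (sim.sum() - n) / (n * (n - 1))

    # Text alignment: average image-text cosine similarity over all frames.
    text_alignment = (image_emb @ text_emb.T).mean()
    return frame_consistency.item(), text_alignment.item()
```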
The evaluation code is provided in scripts/run_eval.py. See the Submission section for the format and structure of your submission folder.
python scripts/run_eval.py --submission_path="PATH_TO_YOUR_SUBMISSION_FOLDER" --metric="clip_score_text"
Demo video: loveu-tgve-2023.mp4
@misc{wu2023cvpr,
title={CVPR 2023 Text Guided Video Editing Competition},
author={Jay Zhangjie Wu and Xiuyu Li and Difei Gao and Zhen Dong and Jinbin Bai and Aishani Singh and Xiaoyu Xiang and Youzeng Li and Zuwei Huang and Yuanxi Sun and Rui He and Feng Hu and Junhua Hu and Hai Huang and Hanyu Zhu and Xu Cheng and Jie Tang and Mike Zheng Shou and Kurt Keutzer and Forrest Iandola},
year={2023},
eprint={2310.16003},
archivePrefix={arXiv},
primaryClass={cs.CV}
}
@inproceedings{wu2023tune,
title={Tune-a-video: One-shot tuning of image diffusion models for text-to-video generation},
author={Wu, Jay Zhangjie and Ge, Yixiao and Wang, Xintao and Lei, Stan Weixian and Gu, Yuchao and Shi, Yufei and Hsu, Wynne and Shan, Ying and Qie, Xiaohu and Shou, Mike Zheng},
booktitle={Proceedings of the IEEE/CVF International Conference on Computer Vision},
pages={7623--7633},
year={2023}
}