
Commit

Merge pull request #197 from hpcaitech/celaraze-main
Celaraze main
Sze-qq authored Mar 23, 2024
2 parents 9385014 + 0626dca commit dfdaeae
Showing 7 changed files with 606 additions and 248 deletions.
110 changes: 73 additions & 37 deletions README.md

Large diffs are not rendered by default.

423 changes: 212 additions & 211 deletions docs/README_zh.md → docs/zh_CN/README.md

Large diffs are not rendered by default.

65 changes: 65 additions & 0 deletions docs/zh_CN/acceleration.md
@@ -0,0 +1,65 @@
# Acceleration

Open-Sora aims to provide a high-speed training framework for diffusion models. We can achieve a **55%** training speedup when training on 64-frame 512x512 videos. Our framework also supports training on **1-minute 1080p videos**.

## Accelerated Transformer

Open-Sora improves training speed with the following techniques (a config sketch enabling them follows the list):

- Kernel optimizations, including [flash attention](https://github.com/Dao-AILab/flash-attention), fused layernorm kernels, and kernels compiled by ColossalAI.
- Hybrid parallelism, including ZeRO.
- Gradient checkpointing for larger batch sizes.
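
A minimal sketch of how these options are switched on in a training config (the field names mirror the training config shown in `docs/zh_CN/structure.md`; the exact values here are illustrative):

```python
# Hypothetical excerpt of a training config enabling the acceleration options
dtype = "bf16"                     # mixed precision
grad_checkpoint = True             # gradient checkpointing for larger batch sizes
plugin = "zero2"                   # ZeRO-2 plugin ("zero2-seq" adds sequence parallelism)
sp_size = 1                        # sequence parallelism degree (1 = disabled)

model = dict(
    type="STDiT-XL/2",
    enable_flashattn=True,         # flash attention kernel
    enable_layernorm_kernel=True,  # fused layernorm kernel
)
```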

Our training speed on images is comparable to [OpenDiT](https://github.com/NUS-HPC-AI-Lab/OpenDiT), a project that accelerates DiT training. The training speed was measured on 8 H800 GPUs with a batch size of 128 and an image size of 256x256.

| Model    | Throughput (img/s/GPU) | Throughput (tokens/s/GPU) |
|----------|------------------------|---------------------------|
| DiT | 100 | 26k |
| OpenDiT | 175 | 45k |
| OpenSora | 175 | 45k |

## Efficient STDiT

Our STDiT adopts spatial-temporal attention to model video data. Compared with full attention in DiT, our STDiT becomes more efficient as the number of frames increases. Our current framework only supports sequence parallelism for very long sequences.

The training speed was measured on 8 H800 GPUs with the acceleration techniques applied; GC denotes gradient checkpointing. Both models use T5 conditioning as in PixArt.

| Model            | Setting        | Throughput (sample/s/GPU) | Throughput (tokens/s/GPU) |
|------------------|----------------|---------------------------|---------------------------|
| DiT | 16x256 (4k) | 7.20 | 29k |
| STDiT | 16x256 (4k) | 7.00 | 28k |
| DiT | 16x512 (16k) | 0.85 | 14k |
| STDiT | 16x512 (16k) | 1.45 | 23k |
| DiT (GC) | 64x512 (65k) | 0.08 | 5k |
| STDiT (GC) | 64x512 (65k) | 0.40 | 25k |
| STDiT (GC, sp=2) | 360x512 (370k) | 0.10 | 18k |

With 4x temporal downsampling by the Video-VAE, a 24fps video has 450 frames. The speed gap between STDiT (28k tokens/s) and DiT on images (up to 45k tokens/s) mainly comes from T5 and VAE encoding, as well as the temporal attention.

## Accelerated Encoders (T5, VAE)

During training, text is encoded by T5 and videos are encoded by the VAE. There are usually two ways to accelerate training:

1. Preprocess the text and video data in advance and save it to disk.
2. Encode the text and video data on the fly during training, and speed up the encoding process.

For option 1, the 120 text tokens of one sample take 1 MB of disk space, while a 64x64x64 latent may take 4 MB. For a training dataset of 10M video clips, the total disk space required is 50 TB. Our storage system is not yet ready for data at this scale.
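
The arithmetic behind the 50 TB estimate (restating the per-sample sizes above):

```python
# Disk footprint of option 1 for a 10M-clip dataset
text_mb, latent_mb = 1, 4  # ~1 MB of T5 tokens + ~4 MB of 64x64x64 latents per sample
clips = 10_000_000
total_tb = clips * (text_mb + latent_mb) / 1_000_000  # MB -> TB (decimal units)
print(total_tb)  # 50.0
```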

For option 2, we improve T5's speed and memory usage. Following [OpenDiT](https://github.com/NUS-HPC-AI-Lab/OpenDiT), we find that the VAE consumes a large amount of GPU memory, so we split the batch into smaller micro-batches for VAE encoding (sketched below). With these two techniques, we can greatly speed up training.
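
A minimal sketch of the micro-batching trick for VAE encoding (my own illustration, assuming a generic encode callable; in the configs this corresponds to the `micro_batch_size` option of `VideoAutoencoderKL`):

```python
import torch


@torch.no_grad()
def encode_in_micro_batches(vae_encode, frames: torch.Tensor, micro_batch_size: int = 128) -> torch.Tensor:
    """Encode a large batch of frames with a 2D VAE in smaller chunks.

    vae_encode: callable mapping (b, c, h, w) pixels to (b, c', h', w') latents
    frames:     stacked video frames of shape (b, c, h, w)
    """
    latents = []
    for chunk in torch.split(frames, micro_batch_size, dim=0):
        latents.append(vae_encode(chunk))  # peak memory scales with the chunk, not the full batch
    return torch.cat(latents, dim=0)
```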

The training speed was measured on 8 H800 GPUs with STDiT.

| Acceleration | Setting       | Throughput (img/s/GPU) | Throughput (tokens/s/GPU) |
|--------------|---------------|------------------------|---------------------------|
| Baseline | 16x256 (4k) | 6.16 | 25k |
| w. faster T5 | 16x256 (4k) | 7.00 | 29k |
| Baseline | 64x512 (65k) | 0.94 | 15k |
| w. both | 64x512 (65k) | 1.45 | 23k |
File renamed without changes.
31 changes: 31 additions & 0 deletions docs/zh_CN/datasets.md
@@ -0,0 +1,31 @@
# Datasets

## Datasets in Use

### HD-VG-130M

[HD-VG-130M](https://github.com/daooshee/HD-VG-130M?tab=readme-ov-file) contains 130M text-video pairs, with captions generated by BLIP-2. We find the scene cuts and caption quality relatively poor. It consists of 20 splits; for OpenSora 1.0, we use the first split. We plan to use the entire dataset and re-process it.

### Inter4k

[Inter4k](https://github.com/alexandrosstergiou/Inter4K) is a dataset of 1K video clips at 4K resolution. It was proposed for super-resolution tasks, and we use it for HQ training. The processed videos can be found [here](README.md#数据处理).

### Pexels.com

[Pexels.com](https://www.pexels.com/) is a website that provides free stock photos and videos. We collected 19K video clips from this website for high-quality training. The processed videos can be found [here](README.md#数据处理).

## Dataset Watchlist

We are also keeping an eye on the following datasets and may use them in the future, depending on our storage space and their quality.

| Name              | Size         | Description                   |
|-------------------|--------------|-------------------------------|
| Panda-70M | 70M videos | High quality video-text pairs |
| WebVid-10M | 10M videos | Low quality |
| InternVid-10M-FLT | 10M videos | |
| EGO4D | 3670 hours | |
| OpenDV-YouTube | 1700 hours | |
| VidProM | 6.69M videos | |
47 changes: 47 additions & 0 deletions docs/zh_CN/report_v1.md
@@ -0,0 +1,47 @@
# Open-Sora v1 Report

OpenAI's Sora is amazing at generating one-minute high-quality videos, yet it reveals almost no details about its method. To make AI more "open", we are dedicated to building an open-source version of Sora. This report describes our first attempt to train a transformer-based video diffusion model.

## Efficiency in choosing the architecture

To lower the computational cost, we want to utilize existing VAE models. Sora uses a spatial-temporal VAE to reduce the temporal dimension. However, we found no open-source high-quality spatial-temporal VAE model: [MAGVIT](https://github.com/google-research/magvit)'s 4x4x4 VAE is not open-sourced, while [VideoGPT](https://wilson1yan.github.io/videogpt/index.html)'s 2x4x4 VAE shows low quality in our experiments. Thus, we decided to use a 2D VAE (from [Stability-AI](https://huggingface.co/stabilityai/sd-vae-ft-mse-original)) in our first version.

Video training involves a large number of tokens. For a 24fps one-minute video, we have 1440 frames. With 4x VAE downsampling and 2x patch-size downsampling, we get 1440x1024 ≈ 1.5M tokens. Full attention over 1.5M tokens would incur a huge computational cost, so we use spatial-temporal attention to reduce the cost, following [Latte](https://github.com/Vchitect/Latte).
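
The token count works out as follows (a restatement of the figures above; the 1024 spatial tokens per frame correspond to a 256x256 frame under the stated 4x VAE and 2x patch-size downsampling):

```python
# Token budget of a 24fps, 1-minute clip (numbers from the paragraph above)
fps, seconds = 24, 60
frames = fps * seconds                  # 1440 frames
tokens_per_side = 256 // 4 // 2         # 4x VAE downsampling, then 2x patch size -> 32
spatial_tokens = tokens_per_side ** 2   # 1024 tokens per frame
total_tokens = frames * spatial_tokens  # 1,474,560 ~= 1.5M tokens
```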

As shown in the figure, we insert a temporal attention right after each spatial attention in STDiT (ST stands for spatial-temporal). This is similar to variant 3 in Latte's paper, although we do not control for a similar number of parameters across these variants. While Latte's paper claims their variant is better than variant 3, our experiments on 16x256x256 videos show that, with the same number of iterations, the performance ranks as: DiT (full) > STDiT (Sequential) > STDiT (Parallel) ≈ Latte. Thus, we choose STDiT (Sequential) for its efficiency. A speed benchmark is provided [here](/docs/acceleration.md#efficient-stdit).

![Architecture Comparison](https://i0.imgs.ovh/2024/03/15/eLk9D.png)

To focus on video generation, we hope to build on a powerful image generation model. [PixArt-α](https://github.com/PixArt-alpha/PixArt-alpha) is an efficiently trained, high-quality image generation model with a T5-conditioned DiT structure. We initialize our model with PixArt-α and initialize the projection layer of each inserted temporal attention to zero. This initialization preserves the model's image-generation ability at the beginning of training, which Latte's architecture cannot do. The inserted attention increases the parameter count from 580M to 724M.
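
A minimal PyTorch sketch of the sequential spatial-then-temporal block and the zero-initialized projection described above (my own simplification, not the actual STDiT implementation; T5 cross-attention, the MLP, and timestep modulation are omitted):

```python
import torch
from torch import nn


class STBlockSketch(nn.Module):
    """Spatial attention followed by temporal attention on (B, T, S, C) tokens."""

    def __init__(self, dim: int, num_heads: int = 16):
        super().__init__()
        self.spatial_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.temporal_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.temporal_proj = nn.Linear(dim, dim)
        # Zero-init the temporal output projection: at initialization the block
        # reduces to the pretrained spatial (image) model.
        nn.init.zeros_(self.temporal_proj.weight)
        nn.init.zeros_(self.temporal_proj.bias)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B, T, S, C = x.shape
        xs = x.reshape(B * T, S, C)                        # attend within each frame
        xs = xs + self.spatial_attn(xs, xs, xs)[0]
        xt = xs.reshape(B, T, S, C).permute(0, 2, 1, 3).reshape(B * S, T, C)  # attend across frames
        xt = xt + self.temporal_proj(self.temporal_attn(xt, xt, xt)[0])
        return xt.reshape(B, S, T, C).permute(0, 2, 1, 3)  # back to (B, T, S, C)
```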

![Architecture](https://i0.imgs.ovh/2024/03/16/erC1d.png)

Drawing from the success of PixArt-α and Stable Video Diffusion, we also adopt a progressive training strategy: 16x256x256 on a 366K-clip pretraining dataset, then 16x256x256, 16x512x512, and 64x512x512 on a 20K-clip dataset. With scaled position embeddings, this strategy greatly reduces the computational cost.
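
The "scaled position embedding" refers to rescaling the positional grid so a model pretrained at one resolution or frame interval can be reused at another. A minimal sketch of the idea (my own illustration, not the actual Open-Sora code; its `space_scale`/`time_scale` config options play a similar role, though the exact convention may differ):

```python
import numpy as np


def sincos_pos_embed(embed_dim: int, num_pos: int, scale: float = 1.0) -> np.ndarray:
    """1D sinusoidal position embedding with rescaled positions."""
    pos = np.arange(num_pos, dtype=np.float64) / scale  # rescale positions
    omega = 1.0 / 10000 ** (np.arange(embed_dim // 2) / (embed_dim / 2))
    out = np.outer(pos, omega)                          # (num_pos, embed_dim // 2)
    return np.concatenate([np.sin(out), np.cos(out)], axis=1)


# Going from 256x256 (32 tokens per side after downsampling) to 512x512 (64 per side):
# with scale = 2, the 64 new positions map back onto the 0..31 range seen during
# pretraining, so the pretrained weights stay compatible with the larger grid.
emb = sincos_pos_embed(embed_dim=1152, num_pos=64, scale=2.0)
```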

We also tried using a 3D patch embedder in DiT. However, with 2x downsampling on the temporal dimension, the generated videos had low quality, so we leave temporal downsampling to a temporal VAE in our next version. For now, we sample every 3 frames for 16-frame training and every 2 frames for 64-frame training.

## Data is the key to high quality

We find that the quantity and quality of data have a great impact on the quality of generated videos, even greater than the model architecture and training strategy. At this time, we have only prepared the first split (366K video clips) of [HD-VG-130M](https://github.com/daooshee/HD-VG-130M). The quality of these videos varies greatly, and the captions are not that accurate. Thus, we further collected 20K relatively high-quality videos from [Pexels](https://www.pexels.com/), which provides freely licensed videos. We label each video with LLaVA, an image captioning model, using three frames and a designed prompt; with the designed prompt, LLaVA can generate good-quality captions.

![Caption](https://i0.imgs.ovh/2024/03/16/eXdvC.png)

As we place more emphasis on data quality, we plan to collect more data and build a video preprocessing pipeline in our next version.

## Training Details

With a limited training budget, we made only a few explorations. We found a learning rate of 1e-4 too large and scaled it down to 2e-5. When training with a large batch size, we found `fp16` less stable than `bf16`, sometimes leading to generation failure, so we switched to `bf16` for training on 64x512x512. For other hyper-parameters, we follow previous works.

## Loss curves

16x256x256 Pretraining Loss Curve

![16x256x256 Pretraining Loss Curve](https://i0.imgs.ovh/2024/03/16/erXQj.png)

16x256x256 HQ Training Loss Curve

![16x256x256 HQ Training Loss Curve](https://i0.imgs.ovh/2024/03/16/ernXv.png)

16x512x512 HQ Training Loss Curve

![16x512x512 HQ Training Loss Curve](https://i0.imgs.ovh/2024/03/16/erHBe.png)
178 changes: 178 additions & 0 deletions docs/zh_CN/structure.md
@@ -0,0 +1,178 @@
# Repo & Config Structure

## Repo Structure

```plaintext
Open-Sora
├── README.md
├── docs
│ ├── acceleration.md -> Acceleration & Speed benchmark
│ ├── command.md -> Commands for training & inference
│ ├── datasets.md -> Datasets used in this project
│ ├── structure.md -> This file
│ └── report_v1.md -> Report for Open-Sora v1
├── scripts
│ ├── train.py -> diffusion training script
│ └── inference.py -> diffusion inference script
├── configs -> Configs for training & inference
├── opensora
│ ├── __init__.py
│ ├── registry.py -> Registry helper
│   ├── acceleration -> Acceleration related code
│   ├── dataset -> Dataset related code
│   ├── models
│   │   ├── layers -> Common layers
│   │   ├── vae -> VAE as image encoder
│   │   ├── text_encoder -> Text encoder
│   │   │   ├── classes.py -> Class id encoder (inference only)
│   │   │   ├── clip.py -> CLIP encoder
│   │   │   └── t5.py -> T5 encoder
│   │   ├── dit
│   │   ├── latte
│   │   ├── pixart
│   │   └── stdit -> Our STDiT related code
│   ├── schedulers -> Diffusion schedulers
│   │   ├── iddpm -> IDDPM for training and inference
│   │ └── dpms -> DPM-Solver for fast inference
│ └── utils
└── tools -> Tools for data processing and more
```

## Configs

Our config files follow [MMEngine](https://github.com/open-mmlab/mmengine). MMEngine reads a config file (a `.py` file) and parses it into a dictionary-like object.

```plaintext
Open-Sora
└── configs -> Configs for training & inference
├── opensora -> STDiT related configs
│ ├── inference
│ │ ├── 16x256x256.py -> Sample videos 16 frames 256x256
│ │ ├── 16x512x512.py -> Sample videos 16 frames 512x512
│ │ └── 64x512x512.py -> Sample videos 64 frames 512x512
│ └── train
│ ├── 16x256x256.py -> Train on videos 16 frames 256x256
│ ├── 16x512x512.py -> Train on videos 16 frames 512x512
│ └── 64x512x512.py -> Train on videos 64 frames 512x512
├── dit -> DiT related configs
│ ├── inference
│ │ ├── 1x256x256-class.py -> Sample images with ckpts from DiT
│ │ ├── 1x256x256.py -> Sample images with clip condition
│ │ └── 16x256x256.py -> Sample videos
│ └── train
│   ├── 1x256x256.py -> Train on images with clip condition
│   └── 16x256x256.py -> Train on videos
├── latte -> Latte related configs
└── pixart -> PixArt related configs
```
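
A minimal sketch of the MMEngine-style loading pattern these configs rely on (assuming the `mmengine` package is installed; Open-Sora's own loading logic lives in [config_utils.py](/opensora/utils/config_utils.py)):

```python
from mmengine.config import Config

# Parse a .py config into a dictionary-like object
cfg = Config.fromfile("configs/opensora/train/16x256x256.py")
print(cfg.model["type"])                # nested dicts support attribute and key access
cfg.merge_from_dict({"batch_size": 8})  # programmatic override, e.g. from CLI arguments
```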

## Inference config demos

To change the inference settings, you can directly modify the corresponding config file, or pass command-line arguments to override fields in the config file (see [config_utils.py](/opensora/utils/config_utils.py)). To change the sampling prompts, modify the `.txt` file passed to the `--prompt_path` argument.

```plaintext
--prompt_path ./assets/texts/t2v_samples.txt -> prompt_path
--ckpt-path ./path/to/your/ckpt.pth -> model["from_pretrained"]
```

The explanation of each field is provided below.

```python
# Define sampling size
num_frames = 64 # number of frames
fps = 24 // 2 # frames per second (divided by 2 for frame_interval=2)
image_size = (512, 512) # image size (height, width)

# Define model
model = dict(
    type="STDiT-XL/2",  # Select model type (STDiT-XL/2, DiT-XL/2, etc.)
    space_scale=1.0,  # (Optional) Space positional encoding scale (new height / old height)
    time_scale=2 / 3,  # (Optional) Time positional encoding scale (new frame_interval / old frame_interval)
    enable_flashattn=True,  # (Optional) Speed up training and inference with flash attention
    enable_layernorm_kernel=True,  # (Optional) Speed up training and inference with fused kernel
    from_pretrained="PRETRAINED_MODEL",  # (Optional) Load from pretrained model
    no_temporal_pos_emb=True,  # (Optional) Disable temporal positional encoding (for image)
)
vae = dict(
    type="VideoAutoencoderKL",  # Select VAE type
    from_pretrained="stabilityai/sd-vae-ft-ema",  # Load from pretrained VAE
    micro_batch_size=128,  # VAE micro batch size to save memory
)
text_encoder = dict(
    type="t5",  # Select text encoder type (t5, clip)
    from_pretrained="./pretrained_models/t5_ckpts",  # Load from pretrained text encoder
    model_max_length=120,  # Maximum length of input text
)
scheduler = dict(
    type="iddpm",  # Select scheduler type (iddpm, dpm-solver)
    num_sampling_steps=100,  # Number of sampling steps
    cfg_scale=7.0,  # Scale for classifier-free guidance
)
dtype = "fp16" # Computation type (fp16, fp32, bf16)

# Other settings
batch_size = 1 # batch size
seed = 42 # random seed
prompt_path = "./assets/texts/t2v_samples.txt" # path to prompt file
save_dir = "./samples" # path to save samples
```

## Training config demos

```python
# Define sampling size
num_frames = 64
frame_interval = 2 # sample every 2 frames
image_size = (512, 512)

# Define dataset
root = None # root path to the dataset
data_path = "CSV_PATH" # path to the csv file
use_image_transform = False # True if training on images
num_workers = 4 # number of workers for dataloader

# Define acceleration
dtype = "bf16" # Computation type (fp16, bf16)
grad_checkpoint = True # Use gradient checkpointing
plugin = "zero2" # Plugin for distributed training (zero2, zero2-seq)
sp_size = 1 # Sequence parallelism size (1 for no sequence parallelism)

# Define model
model = dict(
    type="STDiT-XL/2",
    space_scale=1.0,
    time_scale=2 / 3,
    from_pretrained="YOUR_PRETRAINED_MODEL",
    enable_flashattn=True,  # Enable flash attention
    enable_layernorm_kernel=True,  # Enable layernorm kernel
)
vae = dict(
    type="VideoAutoencoderKL",
    from_pretrained="stabilityai/sd-vae-ft-ema",
    micro_batch_size=128,
)
text_encoder = dict(
    type="t5",
    from_pretrained="./pretrained_models/t5_ckpts",
    model_max_length=120,
    shardformer=True,  # Enable shardformer for T5 acceleration
)
scheduler = dict(
    type="iddpm",
    timestep_respacing="",  # Default 1000 timesteps
)

# Others
seed = 42
outputs = "outputs" # path to save checkpoints
wandb = False # Use wandb for logging

epochs = 1000 # number of epochs (just large enough, kill when satisfied)
log_every = 10
ckpt_every = 250
load = None # path to resume training

batch_size = 4
lr = 2e-5
grad_clip = 1.0 # gradient clipping
```
