From d0a9e21dad88a39a5201d715b8dc891f55f95b36 Mon Sep 17 00:00:00 2001 From: zR <2448370773@qq.com> Date: Wed, 7 Aug 2024 19:55:41 +0800 Subject: [PATCH] Update GPU Memory Cost to 24GB (#90) update GPU memory to 23.9GB --- Model_License => MODEL_LICENSE | 0 README.md | 125 +++++++++++------ README_zh.md | 127 +++++++++++------- inference/cli_demo.py | 23 ++-- .../gradio_web_demo.py | 3 +- .../{web_demo.py => streamlit_web_demo.py} | 6 +- requirements.txt | 7 +- sat/README.md | 32 ++++- sat/README_zh.md | 23 +++- 9 files changed, 230 insertions(+), 116 deletions(-) rename Model_License => MODEL_LICENSE (100%) rename gradio_demo.py => inference/gradio_web_demo.py (99%) rename inference/{web_demo.py => streamlit_web_demo.py} (97%) diff --git a/Model_License b/MODEL_LICENSE similarity index 100% rename from Model_License rename to MODEL_LICENSE diff --git a/README.md b/README.md index 4b7f3af1..b85f813c 100644 --- a/README.md +++ b/README.md @@ -20,20 +20,50 @@ ## Update and News -- 🔥 **News**: ``2024/8/6``: We have also open-sourced **3D Causal VAE** used in **CogVideoX-2B**, which can reconstruct +- 🔥 **News**: ```2024/8/7```: CogVideoX has been integrated into `diffusers` version 0.30.0. Inference can now be performed + on a single 3090 GPU. For more details, please refer to the [code](inference/cli_demo.py). +- 🔥 **News**: ```2024/8/6```: We have also open-sourced **3D Causal VAE** used in **CogVideoX-2B**, which can reconstruct the video almost losslessly. -- 🔥 **News**: ``2024/8/6``: We have open-sourced **CogVideoX-2B**,the first model in the CogVideoX series of video +- 🔥 **News**: ```2024/8/6```: We have open-sourced **CogVideoX-2B**,the first model in the CogVideoX series of video generation models. 
-- 🌱 **Source**: ```2022/5/19```: We have open-sourced **CogVideo** (now you can see in `CogVideo` branch),the **first** open-sourced pretrained text-to-video model, and you can check [ICLR'23 CogVideo Paper](https://arxiv.org/abs/2205.15868) for technical details. +- 🌱 **Source**: ```2022/5/19```: We have open-sourced **CogVideo** (now you can see in `CogVideo` branch),the **first** + open-sourced pretrained text-to-video model, and you can + check [ICLR'23 CogVideo Paper](https://arxiv.org/abs/2205.15868) for technical details. **More powerful models with larger parameter sizes are on the way~ Stay tuned!** +## Table of Contents + +Jump to a specific section: + +- [Quick Start](#Quick-Start) + - [SAT](#sat) + - [Diffusers](#Diffusers) +- [CogVideoX-2B Video Works](#cogvideox-2b-gallery) +- [Introduction to the CogVideoX Model](#Model-Introduction) +- [Full Project Structure](#project-structure) + - [Inference](#inference) + - [SAT](#sat) + - [Tools](#tools) +- [Introduction to CogVideo(ICLR'23) Model](#cogvideoiclr23) +- [Citations](#Citation) +- [Open Source Project Plan](#Open-Source-Project-Plan) +- [Model License](#Model-License) + ## Quick Start +### Prompt Optimization + +Before running the model, please refer to [this guide](inference/convert_demo.py) to see how we use the GLM-4 model to +optimize the prompt. This is crucial because the model is trained with long prompts, and a good prompt directly affects +the quality of the generated video. + ### SAT -Follow instructions in [sat_demo](sat/README.md): Contains the inference code and fine-tuning code of SAT weights. It is recommended to improve based on the CogVideoX model structure. Innovative researchers use this code to better perform rapid stacking and development. - (18 GB for inference, 40GB for lora finetune) +Follow instructions in [sat_demo](sat/README.md): Contains the inference code and fine-tuning code of SAT weights. It is +recommended to improve based on the CogVideoX model structure. 
Innovative researchers use this code to better perform +rapid iteration and development. +(18 GB for inference, 40GB for LoRA fine-tuning) ### Diffusers @@ -41,8 +71,9 @@ Follow instructions in [sat_demo](sat/README.md): Contains the inference code an pip install -r requirements.txt ``` -Then follow [diffusers_demo](inference/cli_demo.py): A more detailed explanation of the inference code, mentioning the significance of common parameters. - (36GB for inference, smaller memory and fine-tuned code are under development) +Then follow [diffusers_demo](inference/cli_demo.py): A more detailed explanation of the inference code, mentioning the +significance of common parameters. +(24GB for inference; fine-tuning code is under development) ## CogVideoX-2B Gallery @@ -77,14 +108,14 @@ along with related basic information: | Model Name | CogVideoX-2B | |-------------------------------------------|-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------| | Prompt Language | English | -| GPU Memory Required for Inference (FP16) | 18GB if using [SAT](https://github.com/THUDM/SwissArmyTransformer); 36GB if using diffusers (will be optimized before the PR is merged) | +| Single GPU Inference (FP16) | 18GB using [SAT](https://github.com/THUDM/SwissArmyTransformer)
23.9GB using diffusers | +| Multi GPUs Inference (FP16) | 20GB minimum per GPU using diffusers | | GPU Memory Required for Fine-tuning(bs=1) | 40GB | | Prompt Max Length | 226 Tokens | | Video Length | 6 seconds | | Frames Per Second | 8 frames | | Resolution | 720 * 480 | | Quantized Inference | Not Supported | -| Multi-card Inference | Not Supported | | Download Link (HF diffusers Model) | 🤗 [Huggingface](https://huggingface.co/THUDM/CogVideoX-2B) [🤖 ModelScope](https://modelscope.cn/models/ZhipuAI/CogVideoX-2b) [💫 WiseModel](https://wisemodel.cn/models/ZhipuAI/CogVideoX-2b) | | Download Link (SAT Model) | [SAT](./sat/README.md) | @@ -95,16 +126,25 @@ of the **CogVideoX** open-source model. ### Inference -+ [diffusers_demo](inference/cli_demo.py): A more detailed explanation of the inference code, mentioning the significance of common parameters. -+ [diffusers_vae_demo](inference/cli_vae_demo.py): Executing the VAE inference code alone currently requires 71GB of memory, but it will be optimized in the future. -+ [convert_demo](inference/convert_demo.py): How to convert user input into a format suitable for CogVideoX. Because CogVideoX is trained on long caption, we need to convert the input text to be consistent with the training distribution using a LLM. By default, the script uses GLM4, but it can also be replaced with any other LLM such as GPT, Gemini, etc. -+ [gradio_demo](gradio_demo.py): A simple gradio web UI demonstrating how to use the CogVideoX-2B model to generate videos. ++ [diffusers_demo](inference/cli_demo.py): A more detailed explanation of the inference code, mentioning the + significance of common parameters. ++ [diffusers_vae_demo](inference/cli_vae_demo.py): Executing the VAE inference code alone currently requires 71GB of + memory, but it will be optimized in the future. ++ [convert_demo](inference/convert_demo.py): How to convert user input into a format suitable for CogVideoX. 
Because + CogVideoX is trained on long caption, we need to convert the input text to be consistent with the training + distribution using a LLM. By default, the script uses GLM4, but it can also be replaced with any other LLM such as + GPT, Gemini, etc. ++ [gradio_web_demo](inference/gradio_web_demo.py): A simple gradio web UI demonstrating how to use the CogVideoX-2B + model to generate + videos.
-+ [web_demo](inference/web_demo.py): A simple streamlit web application demonstrating how to use the CogVideoX-2B model to generate videos. ++ [streamlit_web_demo](inference/streamlit_web_demo.py): A simple streamlit web application demonstrating how to use the + CogVideoX-2B model + to generate videos.
@@ -112,40 +152,25 @@ of the **CogVideoX** open-source model. ### sat -+ [sat_demo](sat/README.md): Contains the inference code and fine-tuning code of SAT weights. It is recommended to improve based on the CogVideoX model structure. Innovative researchers use this code to better perform rapid stacking and development. ++ [sat_demo](sat/README.md): Contains the inference code and fine-tuning code of SAT weights. It is recommended to + improve based on the CogVideoX model structure. Innovative researchers use this code to better perform rapid stacking + and development. ### Tools This folder contains some tools for model conversion / caption generation, etc. -+ [convert_weight_sat2hf](tools/convert_weight_sat2hf.py): Convert SAT model weights to Huggingface model weights. ++ [convert_weight_sat2hf](tools/convert_weight_sat2hf.py): Convert SAT model weights to Huggingface model weights. + [caption_demo](tools/caption): Caption tool, a model that understands videos and outputs them in text. -## Project Plan - -- [x] Open source CogVideoX model - - [x] Open source 3D Causal VAE used in CogVideoX. - - [x] CogVideoX model inference example (CLI / Web Demo) - - [x] CogVideoX online experience demo (Huggingface Space) - - [x] CogVideoX open source model API interface example (Huggingface) - - [x] CogVideoX model fine-tuning example (SAT) - - [ ] CogVideoX model fine-tuning example (Huggingface / SAT) - - [ ] Open source CogVideoX-Pro (adapted for CogVideoX-2B suite) - - [x] Release CogVideoX technical report - -We welcome your contributions. You can click [here](resources/contribute.md) for more information. - -## Model License - -The code in this repository is released under the [Apache 2.0 License](LICENSE). - -The model weights and implementation code are released under the [CogVideoX LICENSE](MODEL_LICENSE). 
- ## CogVideo(ICLR'23) -The official repo for the paper: [CogVideo: Large-scale Pretraining for Text-to-Video Generation via Transformers](https://arxiv.org/abs/2205.15868) is on the [CogVideo branch](https://github.com/THUDM/CogVideo/tree/CogVideo) + +The official repo for the +paper: [CogVideo: Large-scale Pretraining for Text-to-Video Generation via Transformers](https://arxiv.org/abs/2205.15868) +is on the [CogVideo branch](https://github.com/THUDM/CogVideo/tree/CogVideo) **CogVideo is able to generate relatively high-frame-rate videos.** -A 4-second clip of 32 frames is shown below. +A 4-second clip of 32 frames is shown below. ![High-frame-rate sample](https://raw.githubusercontent.com/THUDM/CogVideo/CogVideo/assets/appendix-sample-highframerate.png) @@ -155,8 +180,8 @@ A 4-second clip of 32 frames is shown below.
-The demo for CogVideo is at [https://models.aminer.cn/cogvideo](https://models.aminer.cn/cogvideo/), where you can get hands-on practice on text-to-video generation. *The original input is in Chinese.* - +The demo for CogVideo is at [https://models.aminer.cn/cogvideo](https://models.aminer.cn/cogvideo/), where you can get +hands-on practice on text-to-video generation. *The original input is in Chinese.* ## Citation @@ -175,3 +200,23 @@ The demo for CogVideo is at [https://models.aminer.cn/cogvideo](https://models.a year={2022} } ``` + +## Open Source Project Plan + +- [x] Open source CogVideoX model + - [x] Open source 3D Causal VAE used in CogVideoX. + - [x] CogVideoX model inference example (CLI / Web Demo) + - [x] CogVideoX online experience demo (Huggingface Space) + - [x] CogVideoX open source model API interface example (Huggingface) + - [x] CogVideoX model fine-tuning example (SAT) + - [ ] CogVideoX model fine-tuning example (Huggingface / SAT) + - [ ] Open source CogVideoX-Pro (adapted for CogVideoX-2B suite) + - [x] Release CogVideoX technical report + +We welcome your contributions. You can click [here](resources/contribute.md) for more information. + +## Model License + +The code in this repository is released under the [Apache 2.0 License](LICENSE). + +The model weights and implementation code are released under the [CogVideoX LICENSE](MODEL_LICENSE). 
diff --git a/README_zh.md b/README_zh.md index 70419c99..bf97f15b 100644 --- a/README_zh.md +++ b/README_zh.md @@ -21,18 +21,43 @@ ## 项目更新 -- 🔥 **News**: ``2024/8/6``: 我们开源 **3D Causal VAE**,用于 **CogVideoX-2B**,可以几乎无损地重构视频。 -- 🔥 **News**: ``2024/8/6``: 我们开源 CogVideoX 系列视频生成模型的第一个模型, **CogVideoX-2B**。 -- 🌱 **Source**: ```2022/5/19```: 我们开源了 CogVideo 视频生成模型(现在你可以在 `CogVideo` 分支中看到),这是首个开源的基于 Transformer 的大型文本生成视频模型,您可以访问 [ICLR'23 论文](https://arxiv.org/abs/2205.15868) 查看技术细节。 -**性能更强,参数量更大的模型正在到来的路上~,欢迎关注** - +- 🔥 **News**: ```2024/8/7```: CogVideoX 已经合并入 `diffusers` 0.30.0版本,单张3090可以推理,详情请见[代码](inference/cli_demo.py)。 +- 🔥 **News**: ```2024/8/6```: 我们开源 **3D Causal VAE**,用于 **CogVideoX-2B**,可以几乎无损地重构视频。 +- 🔥 **News**: ```2024/8/6```: 我们开源 CogVideoX 系列视频生成模型的第一个模型, **CogVideoX-2B**。 +- 🌱 **Source**: ```2022/5/19```: 我们开源了 CogVideo 视频生成模型(现在你可以在 `CogVideo` 分支中看到),这是首个开源的基于 + Transformer 的大型文本生成视频模型,您可以访问 [ICLR'23 论文](https://arxiv.org/abs/2205.15868) 查看技术细节。 + **性能更强,参数量更大的模型正在到来的路上~,欢迎关注** + +## 目录 + +跳转到指定部分: + +- [快速开始](#快速开始) + - [SAT](#sat) + - [Diffusers](#Diffusers) +- [CogVideoX-2B 视频作品](#cogvideox-2b-视频作品) +- [CogVideoX模型介绍](#模型介绍) +- [完整项目代码结构](#完整项目代码结构) + - [Inference](#inference) + - [SAT](#sat) + - [Tools](#tools) +- [开源项目规划](#开源项目规划) +- [模型协议](#模型协议) +- [CogVideo(ICLR'23)模型介绍](#cogvideoiclr23) +- [引用](#引用) ## 快速开始 +### 提示词优化 + +在开始运行模型之前,请参考[这里](inference/convert_demo.py) 查看我们是怎么使用GLM-4大模型对模型进行优化的,这很重要, +由于模型是在长提示词下训练的,一额好的直接影响了视频生成的质量。 + ### SAT -查看sat文件夹下的[sat_demo](sat/README.md):包含了 SAT 权重的推理代码和微调代码,推荐基于此代码进行 CogVideoX 模型结构的改进,研究者使用该代码可以更好的进行快速的迭代和开发。 - (18 GB 推理, 40GB lora微调) +查看sat文件夹下的[sat_demo](sat/README.md):包含了 SAT 权重的推理代码和微调代码,推荐基于此代码进行 CogVideoX +模型结构的改进,研究者使用该代码可以更好的进行快速的迭代和开发。 +(18 GB 推理, 40GB lora微调) ### Diffusers @@ -40,7 +65,7 @@ pip install -r requirements.txt ``` -查看[diffusers_demo](inference/cli_demo.py):包含对推理代码更详细的解释,包括各种关键的参数。(36GB 推理,显存优化以及微调代码正在开发) +查看[diffusers_demo](inference/cli_demo.py):包含对推理代码更详细的解释,包括各种关键的参数。(24GB 
推理,微调代码正在开发) ## CogVideoX-2B 视频作品 @@ -70,21 +95,21 @@ CogVideoX是 [清影](https://chatglm.cn/video?fr=osm_cogvideox) 同源的开源 下表战展示目前我们提供的视频生成模型列表,以及相关基础信息: -| 模型名字 | CogVideoX-2B | -|---------------------|--------------------------------------------------------------------------------------------------------------------------------------| -| 提示词语言 | English | -| 推理显存消耗 (FP-16) | 36GB using diffusers (will be optimized before the PR is merged) and 18GB using [SAT](https://github.com/THUDM/SwissArmyTransformer) | -| 微调显存消耗 (bs=1) | 42GB | -| 提示词长度上限 | 226 Tokens | -| 视频长度 | 6 seconds | -| 帧率(每秒) | 8 frames | -| 视频分辨率 | 720 * 480 | -| 量化推理 | 不支持 | -| 多卡推理 | 不支持 | -| 下载地址 (Diffusers 模型) | 🤗 [Huggingface](https://huggingface.co/THUDM/CogVideoX-2B) [🤖 ModelScope](https://modelscope.cn/models/ZhipuAI/CogVideoX-2b) | -| 下载地址 (SAT 模型) | [SAT](./sat/README_zh.md) | - -## 项目结构 +| 模型名 | CogVideoX-2B | +|---------------------|-------------------------------------------------------------------------------------------------------------------------------| +| 提示词语言 | English | +| 单GPU推理 (FP-16) 显存消耗 | 18GB using [SAT](https://github.com/THUDM/SwissArmyTransformer)
23.9GB using diffusers | +| 多GPU推理 (FP-16) 显存消耗 | 20GB minimum per GPU using diffusers | +| 微调显存消耗 (bs=1) | 42GB | +| 提示词长度上限 | 226 Tokens | +| 视频长度 | 6 seconds | +| 帧率(每秒) | 8 frames | +| 视频分辨率 | 720 * 480 | +| 量化推理 | 不支持 | +| 下载地址 (Diffusers 模型) | 🤗 [Huggingface](https://huggingface.co/THUDM/CogVideoX-2B) [🤖 ModelScope](https://modelscope.cn/models/ZhipuAI/CogVideoX-2b) | +| 下载地址 (SAT 模型) | [SAT](./sat/README_zh.md) | + +## 完整项目代码结构 本开源仓库将带领开发者快速上手 **CogVideoX** 开源模型的基础调用方式、微调示例。 @@ -92,14 +117,15 @@ CogVideoX是 [清影](https://chatglm.cn/video?fr=osm_cogvideox) 同源的开源 + [diffusers_demo](inference/cli_demo.py): 更详细的推理代码讲解,常见参数的意义,在这里都会提及。 + [diffusers_vae_demo](inference/cli_vae_demo.py): 单独执行VAE的推理代码,目前需要71GB显存,将来会优化。 -+ [convert_demo](inference/convert_demo.py): 如何将用户的输入转换成适合 CogVideoX的长输入。因为CogVideoX是在长文本上训练的,所以我们需要把输入文本的分布通过LLM转换为和训练一致的长文本。脚本中默认使用GLM4,也可以替换为GPT、Gemini等任意大语言模型。 -+ [gradio_demo](gradio_demo.py): 一个简单的gradio网页应用,展示如何使用 CogVideoX-2B 模型生成视频。 ++ [convert_demo](inference/convert_demo.py): 如何将用户的输入转换成适合 + CogVideoX的长输入。因为CogVideoX是在长文本上训练的,所以我们需要把输入文本的分布通过LLM转换为和训练一致的长文本。脚本中默认使用GLM4,也可以替换为GPT、Gemini等任意大语言模型。 ++ [gradio_web_demo](inference/gradio_web_demo.py): 一个简单的gradio网页应用,展示如何使用 CogVideoX-2B 模型生成视频。
-+ [web_demo](inference/web_demo.py): 一个简单的streamlit网页应用,展示如何使用 CogVideoX-2B 模型生成视频。 ++ [streamlit_web_demo](inference/streamlit_web_demo.py): 一个简单的streamlit网页应用,展示如何使用 CogVideoX-2B 模型生成视频。
@@ -117,27 +143,10 @@ CogVideoX是 [清影](https://chatglm.cn/video?fr=osm_cogvideox) 同源的开源 + [convert_weight_sat2hf](tools/convert_weight_sat2hf.py): 将 SAT 模型权重转换为 Huggingface 模型权重。 + [caption_demo](tools/caption/README_zh.md): Caption 工具,对视频理解并用文字输出的模型。 -## 项目规划 - -- [x] CogVideoX 模型开源 - - [x] CogVideoX 模型推理示例 (CLI / Web Demo) - - [x] CogVideoX 在线体验示例 (Huggingface Space) - - [x] CogVideoX 开源模型API接口示例 (Huggingface) - - [x] CogVideoX 模型微调示例 (SAT) - - [ ] CogVideoX 模型微调示例 (Huggingface / SAT) - - [ ] CogVideoX-Pro 开源(适配 CogVideoX-2B 套件) - - [ ] CogVideoX 技术报告公开 - -我们欢迎您的贡献,您可以点击[这里](resources/contribute_zh.md)查看更多信息。 - -## 模型协议 - -本仓库代码使用 [Apache 2.0 协议](LICENSE) 发布。 - -本模型权重和模型实现代码根据 [CogVideoX LICENSE](MODEL_LICENSE) 许可证发布。 +## CogVideo(ICLR'23) -## CogVideo(ICLR'23) - [CogVideo: Large-scale Pretraining for Text-to-Video Generation via Transformers](https://arxiv.org/abs/2205.15868) 的官方repo位于[CogVideo branch](https://github.com/THUDM/CogVideo/tree/CogVideo)。 +[CogVideo: Large-scale Pretraining for Text-to-Video Generation via Transformers](https://arxiv.org/abs/2205.15868) +的官方repo位于[CogVideo branch](https://github.com/THUDM/CogVideo/tree/CogVideo)。 **CogVideo可以生成高帧率视频,下面展示了一个32帧的4秒视频。** @@ -150,11 +159,12 @@ CogVideoX是 [清影](https://chatglm.cn/video?fr=osm_cogvideox) 同源的开源
-CogVideo的demo网站在[https://models.aminer.cn/cogvideo](https://models.aminer.cn/cogvideo/)。您可以在这里体验文本到视频生成。*原始输入为中文。* +CogVideo的demo网站在[https://models.aminer.cn/cogvideo](https://models.aminer.cn/cogvideo/)。您可以在这里体验文本到视频生成。 +*原始输入为中文。* ## 引用 -🌟 如果您发现我们的工作有所帮助,欢迎引用我们的文章,留下宝贵的stars +🌟 如果您发现我们的工作有所帮助,欢迎引用我们的文章,留下宝贵的stars ``` @article{yang2024cogvideox, @@ -168,4 +178,23 @@ CogVideo的demo网站在[https://models.aminer.cn/cogvideo](https://models.amine journal={arXiv preprint arXiv:2205.15868}, year={2022} } -``` \ No newline at end of file +``` + +## 开源项目规划 + +- [x] CogVideoX 模型开源 + - [x] CogVideoX 模型推理示例 (CLI / Web Demo) + - [x] CogVideoX 在线体验示例 (Huggingface Space) + - [x] CogVideoX 开源模型API接口示例 (Huggingface) + - [x] CogVideoX 模型微调示例 (SAT) + - [ ] CogVideoX 模型微调示例 (Huggingface / SAT) + - [ ] CogVideoX-Pro 开源(适配 CogVideoX-2B 套件) + - [X] CogVideoX 技术报告公开 + +我们欢迎您的贡献,您可以点击[这里](resources/contribute_zh.md)查看更多信息。 + +## 模型协议 + +本仓库代码使用 [Apache 2.0 协议](LICENSE) 发布。 + +本模型权重和模型实现代码根据 [CogVideoX LICENSE](MODEL_LICENSE) 许可证发布。 diff --git a/inference/cli_demo.py b/inference/cli_demo.py index c480d439..d069f022 100644 --- a/inference/cli_demo.py +++ b/inference/cli_demo.py @@ -22,7 +22,7 @@ def export_to_video_imageio( - video_frames: Union[List[np.ndarray], List[PIL.Image.Image]], output_video_path: str = None, fps: int = 8 + video_frames: Union[List[np.ndarray], List[PIL.Image.Image]], output_video_path: str = None, fps: int = 8 ) -> str: """ Export the video frames to a video file using imageio lib to Avoid "green screen" issue (for example CogVideoX) @@ -38,14 +38,14 @@ def export_to_video_imageio( def generate_video( - prompt: str, - model_path: str, - output_path: str = "./output.mp4", - num_inference_steps: int = 50, - guidance_scale: float = 6.0, - num_videos_per_prompt: int = 1, - device: str = "cuda", - dtype: torch.dtype = torch.float16, + prompt: str, + model_path: str, + output_path: str = "./output.mp4", + num_inference_steps: int = 50, + guidance_scale: float = 6.0, + 
num_videos_per_prompt: int = 1, + device: str = "cuda", + dtype: torch.dtype = torch.float16, ): """ Generates a video based on the given prompt and saves it to the specified path. @@ -62,7 +62,10 @@ def generate_video( """ # Load the pre-trained CogVideoX pipeline with the specified precision (float16) and move it to the specified device - pipe = CogVideoXPipeline.from_pretrained(model_path, torch_dtype=dtype).to(device) + # For multi-GPU inference (2 or more GPUs, each with more than 20GB of memory), pass device_map="balanced" + # to from_pretrained and remove the `pipe.enable_model_cpu_offload()` call below. + pipe = CogVideoXPipeline.from_pretrained(model_path, torch_dtype=dtype) + pipe.enable_model_cpu_offload() # Encode the prompt to get the prompt embeddings prompt_embeds, _ = pipe.encode_prompt( diff --git a/gradio_demo.py b/inference/gradio_web_demo.py similarity index 99% rename from gradio_demo.py rename to inference/gradio_web_demo.py index ea0b0200..4b4cad08 100644 --- a/gradio_demo.py +++ b/inference/gradio_web_demo.py @@ -16,7 +16,8 @@ dtype = torch.bfloat16 device = "cuda" if torch.cuda.is_available() else "cpu" -pipe = CogVideoXPipeline.from_pretrained("THUDM/CogVideoX-2b", torch_dtype=dtype).to(device) +pipe = CogVideoXPipeline.from_pretrained("THUDM/CogVideoX-2b", torch_dtype=dtype) +pipe.enable_model_cpu_offload() sys_prompt = """You are part of a team of bots that creates videos. You work with an assistant bot that will draw anything you say in square brackets. diff --git a/inference/web_demo.py b/inference/streamlit_web_demo.py similarity index 97% rename from inference/web_demo.py rename to inference/streamlit_web_demo.py index 8695975e..342d85b4 100644 --- a/inference/web_demo.py +++ b/inference/streamlit_web_demo.py @@ -39,7 +39,9 @@ def load_model(model_path: str, dtype: torch.dtype, device: str) -> CogVideoXPip Returns: - CogVideoXPipeline: Loaded model pipeline. 
""" - return CogVideoXPipeline.from_pretrained(model_path, torch_dtype=dtype).to(device) + pipe = CogVideoXPipeline.from_pretrained(model_path, torch_dtype=dtype) + pipe.enable_model_cpu_offload() + return pipe # Define a function to generate video based on the provided prompt and model path @@ -76,7 +78,7 @@ def generate_video( device=device, dtype=dtype, ) - + pipe.enable_model_cpu_offload() # Generate video video = pipe( num_inference_steps=num_inference_steps, diff --git a/requirements.txt b/requirements.txt index bc64475f..55b376eb 100644 --- a/requirements.txt +++ b/requirements.txt @@ -1,8 +1,9 @@ -git+https://github.com/huggingface/diffusers.git@d1c575ad7ee0390c2735f50cc59a79aae666567a#egg=diffusers -SwissArmyTransformer +diffusers>=0.30.0 +SwissArmyTransformer==0.4.11 # Inference torch==2.4.0 torchvision==0.19.0 -streamlit==1.37.0 +gradio==4.40.0 # For HF gradio demo +streamlit==1.37.0 # For web demo opencv-python==4.10 imageio-ffmpeg==0.5.1 openai==1.38.0 diff --git a/sat/README.md b/sat/README.md index a2e69d6b..7325be0a 100644 --- a/sat/README.md +++ b/sat/README.md @@ -1,6 +1,7 @@ # SAT CogVideoX-2B -This folder contains the inference code using [SAT](https://github.com/THUDM/SwissArmyTransformer) weights and the fine-tuning code for SAT weights. +This folder contains the inference code using [SAT](https://github.com/THUDM/SwissArmyTransformer) weights and the +fine-tuning code for SAT weights. This code is the framework used by the team to train the model. It has few comments and requires careful study. @@ -41,12 +42,27 @@ Then unzip, the model structure should look like this: Next, clone the T5 model, which is not used for training and fine-tuning, but must be used. -```shell -git lfs install -git clone https://huggingface.co/google/t5-v1_1-xxl.git +``` +git clone https://huggingface.co/THUDM/CogVideoX-2b.git +mkdir t5-v1_1-xxl +mv CogVideoX-2b/text_encoder/* CogVideoX-2b/tokenizer/* t5-v1_1-xxl ``` -**We don't need the tf_model.h5** file. 
This file can be deleted. +By following the above approach, you will obtain the T5 files in safetensors format, which ensures that no errors occur +when they are loaded by DeepSpeed during fine-tuning. + +``` +├── added_tokens.json +├── config.json +├── model-00001-of-00002.safetensors +├── model-00002-of-00002.safetensors +├── model.safetensors.index.json +├── special_tokens_map.json +├── spiece.model +└── tokenizer_config.json + +0 directories, 8 files +``` 3. Modify the file `configs/cogvideox_2b_infer.yaml`. @@ -101,6 +117,9 @@ bash inference.sh ### Preparing the Environment +Please note that SAT currently needs to be installed from source for fine-tuning to work properly. We will address this +issue in future stable releases. + ``` git clone https://github.com/THUDM/SwissArmyTransformer.git cd SwissArmyTransformer @@ -130,7 +149,8 @@ For style fine-tuning, please prepare at least 50 videos and labels with similar ### Modifying the Configuration File -We support both `Lora` and `full-parameter fine-tuning` methods. Please note that both fine-tuning methods only apply to the `transformer` part. The `VAE part` is not modified. `T5` is only used as an Encoder. +We support both `Lora` and `full-parameter fine-tuning` methods. Please note that both fine-tuning methods only apply to +the `transformer` part. The `VAE part` is not modified. `T5` is only used as an Encoder. the `configs/cogvideox_2b_sft.yaml` (for full fine-tuning) as follows. 
diff --git a/sat/README_zh.md b/sat/README_zh.md index e2d9be93..61f00f60 100644 --- a/sat/README_zh.md +++ b/sat/README_zh.md @@ -41,13 +41,24 @@ unzip transformer.zip 接着,克隆 T5 模型,该模型不用做训练和微调,但是必须使用。 -```shell -git lfs install -git clone https://huggingface.co/google/t5-v1_1-xxl.git ``` +git clone https://huggingface.co/THUDM/CogVideoX-2b.git +mkdir t5-v1_1-xxl +mv CogVideoX-2b/text_encoder/* CogVideoX-2b/tokenizer/* t5-v1_1-xxl +``` +通过上述方案,你将会得到一个 safetensor 格式的T5文件,确保在 Deepspeed微调过程中读入的时候不会报错。 +``` +├── added_tokens.json +├── config.json +├── model-00001-of-00002.safetensors +├── model-00002-of-00002.safetensors +├── model.safetensors.index.json +├── special_tokens_map.json +├── spiece.model +└── tokenizer_config.json -**我们不需要使用tf_model.h5**文件。该文件可以删除。 - +0 directories, 8 files +``` 3. 修改`configs/cogvideox_2b_infer.yaml`中的文件。 ```yaml @@ -101,6 +112,8 @@ bash inference.sh ### 准备环境 +请注意,目前,SAT需要从源码安装,才能正常微调, 我们将会在未来的稳定版本解决这个问题。 + ``` git clone https://github.com/THUDM/SwissArmyTransformer.git cd SwissArmyTransformer