add cogvideo performance on 12 GPUs
xibosun committed Oct 18, 2024
1 parent 95b919a commit d17491c
Showing 2 changed files with 17 additions and 12 deletions.
14 changes: 8 additions & 6 deletions docs/performance/cogvideo.md
@@ -1,26 +1,28 @@
## CogVideo Performance
[Chinese Version](./cogvideo_zh.md)

CogVideo is a text-to-video model. xDiT currently integrates USP techniques (including Ulysses Attention and Ring Attention) and CFG parallelism to improve inference speed, while work on PipeFusion is ongoing. We analyzed the performance gap between single-GPU CogVideoX inference based on the `diffusers` library and our parallel versions when generating a 49-frame (6-second) 720x480 video. Since the different parallel methods can be combined freely, yielding different performance, we systematically benchmarked xDiT's acceleration on 1 to 12 L40 (PCIe) GPUs.
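
To make the composition concrete, the sketch below shows the one constraint that governs how these methods combine: the product of the Ulysses, Ring, and CFG degrees must equal the number of GPUs. The configurations listed are illustrative assumptions, not the exact sweep reported here; see the xDiT repository for the actual launcher and argument names.

```python
# Minimal sketch of how xDiT's parallel dimensions compose (illustrative only).

def world_size(ulysses_degree: int, ring_degree: int, cfg_degree: int) -> int:
    """GPUs required: USP (Ulysses x Ring) multiplied by CFG parallelism."""
    return ulysses_degree * ring_degree * cfg_degree

# A few example configurations for a 1-12 GPU sweep (assumed for illustration;
# CFG parallelism splits the conditional/unconditional branches, so its degree is 1 or 2).
configs = [
    (1, 1, 1),  # single-GPU baseline
    (2, 1, 1),  # Ulysses attention alone
    (1, 2, 1),  # Ring attention alone
    (1, 1, 2),  # CFG parallelism alone
    (3, 2, 2),  # combined: 12 GPUs in total
]
for u, r, c in configs:
    print(f"ulysses={u} ring={r} cfg={c} -> {world_size(u, r, c)} GPU(s)")
```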

As shown in the figures, for the base model CogVideoX-2b, significant reductions in inference latency were observed whether using Ulysses Attention, Ring Attention, or Classifier-Free Guidance (CFG) parallelism. Notably, thanks to its lower communication overhead, CFG parallelism outperforms the other two techniques. Combining sequence parallelism with CFG parallelism improves inference efficiency further, and latency keeps decreasing as the degree of parallelism grows. In the optimal configuration, xDiT achieves a 4.29x speedup over single-GPU inference, reducing each iteration to just 0.49 seconds. Given CogVideoX's default 50 iterations, a 6-second video can be generated end to end in a total of 24.5 seconds.
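
A quick check of the arithmetic behind these figures; the single-GPU per-iteration time is not reported directly and is derived here from the speedup, so it is approximate.

```python
# Reproduce the CogVideoX-2b timing claims from the reported numbers.
iterations = 50          # CogVideoX's default number of denoising steps
parallel_iter_s = 0.49   # best reported per-iteration latency with xDiT
speedup = 4.29           # reported speedup over single-GPU inference

end_to_end_s = iterations * parallel_iter_s
single_gpu_iter_s = parallel_iter_s * speedup  # implied baseline, approximate

print(f"end-to-end: {end_to_end_s:.1f} s")                           # 24.5 s
print(f"implied single-GPU iteration: ~{single_gpu_iter_s:.2f} s")   # ~2.10 s
```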

<div align="center">
<img src="https://raw.githubusercontent.com/xdit-project/xdit_assets/main/performance/cogvideo/cogvideo-l40-2b.png"
alt="latency-cogvideo-l40-2b">
</div>

For the more complex CogVideoX-5b model, whose larger parameter count improves video quality and visual effects at a significantly higher computational cost, all methods maintain performance trends similar to those on CogVideoX-2b, and the speedup of the parallel versions improves further. Compared with single-GPU inference, xDiT achieves up to a 7.75x speedup, reducing end-to-end video generation time to around 40 seconds.
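
The same back-of-the-envelope reasoning gives the implied single-GPU cost for CogVideoX-5b; the result is approximate, since the 40-second figure is itself rounded.

```python
# Implied single-GPU generation time for CogVideoX-5b from the reported figures.
parallel_end_to_end_s = 40.0   # reported end-to-end time with xDiT (rounded)
speedup = 7.75                 # reported best speedup over a single GPU

single_gpu_s = parallel_end_to_end_s * speedup
print(f"implied single-GPU end-to-end: ~{single_gpu_s:.0f} s "
      f"(~{single_gpu_s / 60:.1f} min)")   # ~310 s, ~5.2 min
```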

<div align="center">
<img src="https://raw.githubusercontent.com/xdit-project/xdit_assets/main/performance/cogvideo/cogvideo-l40-5b.png"
alt="latency-cogvideo-l40-5b">
</div>

On systems equipped with A100 GPUs, xDiT demonstrates similar acceleration on CogVideoX-2b and CogVideoX-5b, as shown in the two figures below.

<div align="center">
<img src="https://raw.githubusercontent.com/xdit-project/xdit_assets/main/performance/cogvideo/cogvideo-a100-2b.png"
alt="latency-cogvideo-a100-5b">
</div>
<div align="center">
<img src="https://raw.githubusercontent.com/xdit-project/xdit_assets/main/performance/cogvideo/cogvideo-a100-5b.png"
alt="latency-cogvideo-a100-5b">
</div>
15 changes: 9 additions & 6 deletions docs/performance/cogvideo_zh.md
@@ -1,24 +1,27 @@
## CogVideo Performance

CogVideo is a text-to-video model. xDiT currently integrates USP techniques (including Ulysses Attention and Ring Attention) and CFG parallelism to improve inference speed, while work on PipeFusion is ongoing. We analyzed the performance gap between single-GPU CogVideoX inference based on the `diffusers` library and our parallel versions when generating a 49-frame (6-second) 720x480 video. Since the different parallel methods can be combined freely, yielding different performance, we systematically benchmarked xDiT's acceleration on 1 to 12 L40 (PCIe) GPUs.

As shown in the figures, for the base model CogVideoX-2b, significant reductions in inference latency were observed whether using Ulysses Attention, Ring Attention, or Classifier-Free Guidance (CFG) parallelism. Notably, thanks to its lower communication overhead, CFG parallelism outperforms the other two techniques. Combining sequence parallelism with CFG parallelism improves inference efficiency further, and latency keeps decreasing as the degree of parallelism grows. In the optimal configuration, xDiT achieves a 4.29x speedup over single-GPU inference, reducing each iteration to just 0.49 seconds. Given CogVideoX's default 50 iterations, a 6-second video can be generated end to end in a total of 24.5 seconds.

<div align="center">
<img src="https://raw.githubusercontent.com/xdit-project/xdit_assets/main/performance/cogvideo/cogvideo-l40-2b.png"
alt="latency-cogvideo-l40-2b">
</div>

For the more complex CogVideoX-5b model, whose larger parameter count improves video quality and visual effects at a significantly higher computational cost, all methods maintain performance trends similar to those on CogVideoX-2b, and the speedup of the parallel versions improves further. Compared with single-GPU inference, xDiT achieves up to a 7.75x speedup, reducing end-to-end video generation time to around 40 seconds.

<div align="center">
<img src="https://raw.githubusercontent.com/xdit-project/xdit_assets/main/performance/cogvideo/cogvideo-l40-5b.png"
alt="latency-cogvideo-l40-5b">
</div>

On systems equipped with A100 GPUs, xDiT demonstrates similar acceleration on CogVideoX-2b and CogVideoX-5b, as shown in the two figures below.

<div align="center">
<img src="https://raw.githubusercontent.com/xdit-project/xdit_assets/main/performance/cogvideo/cogvideo-a100-5b.png"
alt="latency-cogvideo-a100-2b">
</div>

<div align="center">
<img src="https://raw.githubusercontent.com/xdit-project/xdit_assets/main/performance/cogvideo/cogvideo-a100-5b.png"
alt="latency-cogvideo-a100-5b">
</div>
