diff --git a/blogs/deepspeed-ulysses/README.md b/blogs/deepspeed-ulysses/README.md
index 6c5cc63cbdfa..314787dc1abe 100644
--- a/blogs/deepspeed-ulysses/README.md
+++ b/blogs/deepspeed-ulysses/README.md
@@ -10,6 +10,17 @@
+To cite DeepSpeed-Ulysses, please cite our [arxiv report](https://arxiv.org/abs/2309.14509):
+
+```
+@article{jacobs2023deepspeed,
+  title={DeepSpeed Ulysses: System Optimizations for Enabling Training of Extreme Long Sequence Transformer Models},
+  author={Sam Ade Jacobs and Masahiro Tanaka and Chengming Zhang and Minjia Zhang and Shuaiwen Leon Song and Samyam Rajbhandari and Yuxiong He},
+  journal={arXiv preprint arXiv:2309.14509},
+  year={2023},
+}
+```
+
## Introduction

Training large models with long sequences is becoming very important
@@ -193,7 +204,7 @@ scaling not just to large sequence lengths but also to large models.

## Evaluation

-We evaluate DeepSpeed-Ulysses on GPT,
+We evaluate DeepSpeed-Ulysses (Ulysses) on GPT,
a foundation model for many NLP tasks on up to 64 A100 GPUs with 40GB memory.
Our evaluations are four-fold: i) sequence length scalability,
ii) throughput for dense attention and comparison with existing system, and
@@ -212,7 +223,7 @@
maintains similar computation throughput across different sequence length
at appropriate GPU count.
*Figure 2: DeepSpeed sequence parallelism strong scalability evaluation
at different sequence length and GPU count.*

@@ -220,66 +231,89 @@ at different sequence length and GPU count.*

### Dense Attention Evaluation

-Next, we evaluate DeepSpeed sequence parallelism on 30 billion parameter
-dense attention model and benchmark against Megatron sequence
-parallelism on 64 A100 GPUs. The results of these evaluations are shown
-in Figures 3.
-
-We compare DeepSpeed sequence parallelism with Megatron-LM for a 30B
-model running various sequence lengths. For our evaluation we chose the
-sequence parallelism degree and global batch size that produced the best
-performance (measured as throughput or TFLOPs) for both DeepSpeed
-sequence parallelism and Megatron-LM, this we call optimal (batch
-size-sequence length) configurations. For DeepSpeed sequence
-parallelism, we always use a ZeRO parallelism degree of 64.
-
-Figure 3 shows that DeepSpeed sequence parallelism consistently
-outperforms Megatron-LM for the sequence length that can be run with
-both. In addition, DeepSpeed sequence parallelism can run longer
-sequence than Megatron-LM. DeepSpeed sequence parallelism performance
-advantages are two folds: (1) DeepSpeed sequence parallelism in
-combination with ZeRO-3 fits more sample than Megatron-LM because of
-memory optimization leading to higher throughput (2) DeepSpeed sequence
-parallelism benefits from efficient all-to-all communication relative to
-*all-gather* communication as applied in Megatron-LM sequence
-parallelism.
+Next, we evaluate Ulysses on 7 billion (7B) and 30 billion (30B) parameter
+GPT dense attention models and compare against Megatron-LM sequence
+parallelism (Megatron LM) and Colossal-AI sequence parallelism (ColAI-SP)
+on 32 and 64 A100 GPUs, respectively. The results of these evaluations are
+shown in Figures 3 and 4.
+
+We compare Ulysses with Megatron-LM and ColAI-SP for 7B and 30B models
+running various sequence lengths. We chose the sequence parallelism degree
+and micro-batch size that produced the best performance (measured as
+TFLOPs) for each of the three methods; we call these the optimal
+(batch size, sequence length) configurations. For Ulysses, we always use
+ZeRO-3 parallelism degrees of 32 and 64 for the 7B and 30B models,
+respectively.
+
+Figures 3 and 4 show that Ulysses consistently outperforms Megatron-LM
+and ColAI-SP at the sequence lengths that all three methods can run. In
+addition, Ulysses can run longer sequences than the two existing methods.
+Ulysses' performance advantage is twofold: (1) Ulysses, in combination
+with ZeRO-3 parameter sharding across both the data and sequence parallel
+groups, fits more samples than Megatron-LM and ColAI-SP because of this
+memory optimization, leading to higher throughput; (2) Ulysses benefits
+from efficient *all-to-all* communication relative to the *all-gather*,
+*reduce-scatter*, and *ring-style* P2P communication used in Megatron-LM
+and ColAI-SP sequence parallelism. However, for dense attention at long
+sequence lengths, throughput is primarily determined by the local
+attention computation due to the quadratic computational complexity of
+attention; the performance gap between Ulysses and the two existing
+methods therefore narrows at the sequence lengths that all three can run.

-*Figure 3: Evaluation of DeepSpeed and Megatron LM sequence parallelism on 30B
-parameter model with dense attention.*
+*Figure 3: Evaluation of Ulysses vs Megatron LM vs ColAI-SP on GPT-7B
+parameter model with dense attention (32 GPUs).*

+*Figure 4: Evaluation of Ulysses vs Megatron LM vs ColAI-SP on GPT-30B
+parameter model with dense attention (64 GPUs).*
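
As a rough back-of-the-envelope illustration of the two effects above, the sketch below compares the per-GPU communication volume of the *all-to-all* used by Ulysses (which shrinks as M/P for an aggregate message of size M over P GPUs) with the roughly constant ~M volume of *all-gather*/*reduce-scatter*, and contrasts both with the per-GPU dense attention FLOPs. The hidden size, GPU count, and sequence lengths are assumed values for illustration only, not the exact benchmark configurations.

```python
# Illustrative sketch only, not a benchmark: rough per-GPU communication
# volume and per-GPU dense-attention FLOPs under sequence parallelism.
# All sizes below are assumptions for illustration.

def per_gpu_comm_bytes(seq_len, hidden, gpus, bytes_per_elem=2):
    """Per-GPU communication volume for one activation tensor of shape
    [seq_len, hidden] (aggregate message M = seq_len * hidden elements,
    assuming 2-byte fp16/bf16 activations)."""
    m = seq_len * hidden * bytes_per_elem
    all_to_all = m / gpus   # Ulysses: per-link volume shrinks as M/P
    gather_scatter = m      # all-gather/reduce-scatter: ~M, independent of P
    return all_to_all, gather_scatter

def per_gpu_attention_flops(seq_len, hidden, gpus):
    """Rough FLOPs of one dense self-attention layer (QK^T plus the
    attention-weighted sum over V), split evenly across `gpus` ranks."""
    return 4 * seq_len ** 2 * hidden / gpus

if __name__ == "__main__":
    hidden, gpus = 8192, 64  # assumed sizes, not the benchmark configuration
    for seq_len in (32_768, 131_072, 262_144):
        a2a, ag = per_gpu_comm_bytes(seq_len, hidden, gpus)
        flops = per_gpu_attention_flops(seq_len, hidden, gpus)
        print(f"seq={seq_len:>7}: all-to-all {a2a / 2**20:9.1f} MiB/GPU, "
              f"all-gather {ag / 2**20:9.1f} MiB/GPU, "
              f"attention {flops / 1e12:9.1f} TFLOPs/GPU")
```

Because the attention FLOPs grow quadratically with sequence length while the communication volume grows only linearly, the local attention kernel eventually dominates the step time, which is consistent with the narrowing dense-attention gap discussed above.
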
### Sparse Attention Evaluation

-Similarly, we evaluate DeepSpeed sequence parallelism on 30 billion
-parameter sparse attention model and benchmark against Megatron sequence
-parallelism. Results of our evaluation are shown in Figure 4. We observe
-similar trends with sparse attention as dense attention experiments. We
-observe more than 2X throughput performance of DeepSpeed sequence
-parallelism compared to Megatron-LM. For memory saving, DeepSpeed
-sequence parallelism leveraging ZeRO-3 scales to 4X longer sequence
-lengths than Megatron-LM.
-
-DeepSpeed sequence parallelism outperforms Megatron-LM for sequence
-length that can be run with both. In fact, the current DeepSpeed
-throughput is bottlenecked by the local sparse attention implementation,
-and as a result DeepSpeed throughput decreases as the sequence length
-increases. We expect this gap in performance between DeepSpeed and
-Megatron to increase further for larger sequence lengths as we improve
-the performance of the local sparse attention implementation in future.
+Similarly, we evaluate Ulysses on 7 billion and 30 billion parameter sparse
+attention models and benchmark against Megatron-LM sequence parallelism.
+Because there is no public implementation of block sparse attention for
+ColAI-SP, the sparse attention evaluation compares only against Megatron-LM.
+Results of our evaluation are shown in Figures 5 and 6. We observe similar
+trends with sparse attention as in the dense attention experiments: Ulysses
+achieves more than 2x higher throughput than Megatron-LM. In terms of memory
+savings, Ulysses, leveraging ZeRO-3, scales to 4x longer sequence lengths
+than Megatron-LM.
+
+Ulysses outperforms Megatron-LM for sequence lengths that can be run with
+both. In fact, the current Ulysses throughput is bottlenecked by the local
+sparse attention implementation, and as a result Ulysses throughput decreases
+as the sequence length increases. We expect this gap in performance between
+our method and Megatron-LM to widen further for larger sequence lengths as we
+improve the performance of the local sparse attention implementation in the
+future. A noteworthy observation is that the narrowing performance gap
+between Ulysses and Megatron-LM seen in the dense attention evaluation is
+less pronounced with sparse attention, because the attention computation is
+less dominant in sparse attention than in dense attention.

+*Figure 5: Evaluation of Ulysses and Megatron LM sequence parallelism on GPT-7B
+parameter model with block sparse attention (32 GPUs).*

-*Figure 4: Evaluation of DeepSpeed and Megatron LM sequence parallelism on 30B
-parameter model with block sparse attention.*
+*Figure 6: Evaluation of Ulysses and Megatron LM sequence parallelism on GPT-30B
+parameter model with block sparse attention (64 GPUs).*
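
To make the last observation concrete, the sketch below contrasts dense attention FLOPs with a simple block-sparse pattern that keeps a fixed number of key/value blocks per query block (for example, a local window plus a few global blocks). The block size, blocks-per-row, and hidden size are assumed values chosen for illustration; they are not the exact sparsity layout or model configuration used in these benchmarks.

```python
# Illustrative sketch only: FLOPs of dense vs. block-sparse attention under
# an assumed fixed-blocks-per-row sparsity pattern. Not the exact layout or
# model configuration used in the benchmarks above.

def dense_attention_flops(seq_len, hidden):
    """Rough FLOPs of dense self-attention: QK^T plus scores @ V."""
    return 4 * seq_len ** 2 * hidden

def block_sparse_attention_flops(seq_len, hidden, block=128, blocks_per_row=16):
    """Rough FLOPs when each query block attends to a fixed number of
    key/value blocks, so the cost grows ~linearly with sequence length."""
    query_blocks = seq_len // block
    kept_blocks = query_blocks * min(blocks_per_row, query_blocks)
    return 4 * kept_blocks * block * block * hidden

hidden = 8192  # assumed hidden size, for illustration only
for seq_len in (32_768, 131_072, 262_144):
    dense = dense_attention_flops(seq_len, hidden)
    sparse = block_sparse_attention_flops(seq_len, hidden)
    print(f"seq={seq_len:>7}: dense {dense / 1e12:8.1f} TFLOPs, "
          f"block-sparse {sparse / 1e12:8.1f} TFLOPs "
          f"({dense / sparse:6.1f}x reduction)")
```

Under such a pattern the attention cost grows roughly linearly with sequence length instead of quadratically, so the attention computation stays a smaller share of the step time and the communication and memory advantages of Ulysses remain visible even at long sequence lengths.
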
### Convergence Study

-Lastly, Figure 5 shows convergence of a 1.3 billion GPT model at 32K
+Lastly, Figure 7 shows convergence of a 1.3 billion parameter GPT model at 32K
sequence length on 8 A100 GPUs with sequence parallelism degree set at 4
for both DeepSpeed and Megatron-LM sequence parallelism. For DeepSpeed
sequence parallelism, we evaluate convergence with different ZeRO
@@ -289,9 +323,9 @@ there is no (negative) impact on quality of trained models, this
-assertion is validated through experiments and is shown in Figure 5.
+assertion is validated through experiments and is shown in Figure 7.

-*Figure 5: Convergence evaluation of DeepSpeed sequence parallelism with different
+*Figure 7: Convergence evaluation of DeepSpeed sequence parallelism with different
ZeRO memory optimization stages.*
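
The only DeepSpeed configuration knob varied between these convergence runs is the ZeRO stage. Below is a minimal sketch of that part of a standard DeepSpeed config, written as a Python dictionary; the batch size and precision entries are placeholders rather than the exact experimental settings.

```python
# Minimal sketch of the ZeRO knob varied in the convergence study above.
# Batch size and precision values are placeholders, not the exact settings
# used in the experiments.
ds_config = {
    "train_micro_batch_size_per_gpu": 1,   # placeholder
    "bf16": {"enabled": True},             # placeholder precision choice
    "zero_optimization": {
        "stage": 3,                        # swap in 1, 2, or 3 to cover the sweep
    },
}
# The sequence parallelism degree (4 in this study) is configured on the
# Megatron-DeepSpeed launcher side, not in this dictionary.
```
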
diff --git a/blogs/deepspeed-ulysses/media/convgZ.png b/blogs/deepspeed-ulysses/media/convgZ.png
new file mode 100644
index 000000000000..324f47cd61bd
Binary files /dev/null and b/blogs/deepspeed-ulysses/media/convgZ.png differ
diff --git a/blogs/deepspeed-ulysses/media/dense1B1Mscale.png b/blogs/deepspeed-ulysses/media/dense1B1Mscale.png
new file mode 100644
index 000000000000..eb886f879247
Binary files /dev/null and b/blogs/deepspeed-ulysses/media/dense1B1Mscale.png differ
diff --git a/blogs/deepspeed-ulysses/media/dense30B.png b/blogs/deepspeed-ulysses/media/dense30B.png
new file mode 100644
index 000000000000..d2eef04b73cc
Binary files /dev/null and b/blogs/deepspeed-ulysses/media/dense30B.png differ
diff --git a/blogs/deepspeed-ulysses/media/dense7B.png b/blogs/deepspeed-ulysses/media/dense7B.png
new file mode 100644
index 000000000000..042269276a6b
Binary files /dev/null and b/blogs/deepspeed-ulysses/media/dense7B.png differ
diff --git a/blogs/deepspeed-ulysses/media/sparse30B.png b/blogs/deepspeed-ulysses/media/sparse30B.png
new file mode 100644
index 000000000000..2637d353d0c6
Binary files /dev/null and b/blogs/deepspeed-ulysses/media/sparse30B.png differ
diff --git a/blogs/deepspeed-ulysses/media/sparse7B.png b/blogs/deepspeed-ulysses/media/sparse7B.png
new file mode 100644
index 000000000000..2d9c9ad69420
Binary files /dev/null and b/blogs/deepspeed-ulysses/media/sparse7B.png differ