
add lm_head and embed_out tensor parallel #3962

Merged: 4 commits merged into microsoft:master on Oct 9, 2023

Conversation

Yejing-Lai (Contributor)

This PR adds tensor parallelism for the lm_head and embed_out layers. It applies to models whose last layer is named lm_head or embed_out, such as LLaMA, GPT-J, BLOOM, OPT, and so on.

RezaYazdaniAminabadi (Contributor) commented Jul 14, 2023

Hi @Yejing-Lai

Thanks for the nice PR adding parallelism to the last embedding linear layer. I was just wondering how much of a performance impact this has, considering the communication overhead. Also, you can split the weight in different ways: either across the hidden_dim, which requires an all-reduce (what you did here), or across the embedding dimension, which then needs an all-gather. Either way has its own tradeoffs: with the embedding-dim split we can reduce the amount of computation per rank when the vocabulary is very large, while with sharding across the hidden dimension we reduce the amount of communication in this case. I think having both of these options makes sense, and we can decide later which one to enable.
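
For illustration, here is a minimal sketch of the two layouts described above, assuming an lm_head weight of shape [vocab_size, hidden_size]; the function names and signatures are illustrative and not part of this PR:

import torch
import torch.distributed as dist

def lm_head_hidden_split(hidden_states, weight, rank, world_size, group):
    # Shard along the hidden dimension: each rank multiplies its slice of the
    # activations with its slice of the weight, and the partial products are summed.
    shard = hidden_states.shape[-1] // world_size
    partial = torch.matmul(hidden_states[..., rank * shard:(rank + 1) * shard],
                           weight[:, rank * shard:(rank + 1) * shard].t())
    dist.all_reduce(partial, group=group)   # sum partial logits across ranks
    return partial                          # [batch, seq, vocab_size]

def lm_head_vocab_split(hidden_states, weight, rank, world_size, group):
    # Shard along the vocab (embedding) dimension: each rank produces logits
    # for its slice of the vocabulary, then the slices are concatenated.
    shard = weight.shape[0] // world_size
    local = torch.matmul(hidden_states, weight[rank * shard:(rank + 1) * shard].t())
    gathered = [torch.empty_like(local) for _ in range(world_size)]
    dist.all_gather(gathered, local, group=group)
    return torch.cat(gathered, dim=-1)      # [batch, seq, vocab_size]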

Yejing-Lai (Contributor, Author)

Hi @RezaYazdaniAminabadi. We observed an end-to-end performance gain on SPR after adding lm_head/embed_out TP. I think the amount of computation saved by splitting the weight along hidden_dim or embedding_dim is the same: the weight is distributed equally across ranks either way, so each rank performs (hidden_size x vocab_size) / world_size multiply-accumulates per token instead of the full hidden_size x vocab_size. Thanks~

output = torch.matmul(input[:, :, self.rank * input_shard:(self.rank + 1) * input_shard],
                      self.weight.transpose(-1, -2))
if self.mp_group is not None:
    dist.all_reduce(output, group=self.mp_group)
Collaborator:

Should use inference_all_reduce since #3919 is merged.
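
For context, applying that suggestion to the snippet above would look roughly like this (a sketch, assuming dist here is deepspeed.comm, where inference_all_reduce was added in #3919):

# swap the generic collective for the inference-optimized one
if self.mp_group is not None:
    dist.inference_all_reduce(output, group=self.mp_group)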

def forward(self, input):
    assert input.shape[-1] % self.world_size == 0, \
        'Please ensure that input.shape[-1] is divisible by self.world_size'
    input_shard = input.shape[-1] // self.world_size
Collaborator:

We can actually make this work when input.shape[-1] is not divisible by world_size. We also need to consider the relationship between this PR and the follow-up PR #4011.
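
One way the divisibility requirement could be relaxed, sketched purely as an illustration rather than the approach taken in #4011, is to compute per-rank bounds so the remainder is spread over the first ranks:

def shard_bounds(dim_size, rank, world_size):
    # Uneven sharding: the first (dim_size % world_size) ranks get one extra element.
    base, rem = divmod(dim_size, world_size)
    start = rank * base + min(rank, rem)
    end = start + base + (1 if rank < rem else 0)
    return start, end

# In forward(), slice with per-rank bounds instead of a fixed shard size,
# provided the weight was sharded with the same bounds at load time:
#   start, end = shard_bounds(input.shape[-1], self.rank, self.world_size)
#   output = torch.matmul(input[..., start:end], self.weight.transpose(-1, -2))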

delock (Collaborator) commented Jul 24, 2023

Dividing along the hidden dim would be a simpler solution. It is consistent with the other layers, and we can reuse inference_all_reduce to further reduce communication latency. A low communication size also makes it more friendly to scale-out scenarios.

delock (Collaborator) commented Aug 16, 2023

Hi @RezaYazdaniAminabadi, is this PR still under review, or does the embedding parallel method still need more consideration?

if hasattr(replaced_module, "lm_head") and hasattr(replaced_module.lm_head,
                                                   "weight") and not replaced_module.lm_head.weight.is_meta:
    replaced_module = replace_fn(replaced_module, ("lm_head", ), 0, "lm_head")
elif hasattr(replaced_module, "embed_out") and hasattr(replaced_module.embed_out,
                                                       "weight") and not replaced_module.embed_out.weight.is_meta:
    replaced_module = replace_fn(replaced_module, ("embed_out", ), 0, "embed_out")
Contributor:

Hi @RezaYazdaniAminabadi @delock, I refactored some code here so that it no longer integrates so deeply with the existing replace logic. It now replaces the lm_head or embed_out Linear after all other replace/load logic has finished, which is more decoupled and cleaner.

dc3671 (Contributor) commented Aug 16, 2023

About the performance: in our simple test it gives around a 10% end-to-end improvement in per-token latency for BLOOM-176B.

dc3671 (Contributor) commented Aug 16, 2023

Also, if we need to make this feature optional, I think we need to add an item to DeepSpeed's config. Maybe this needs further discussion?
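
For example, such a switch might be exposed through the inference config along these lines; note that the key name "replace_lm_head" is purely hypothetical and not an existing DeepSpeed option:

# Hypothetical sketch: "replace_lm_head" is an invented key, not a real DeepSpeed
# config entry; the surrounding fields follow the usual inference config layout.
inference_config = {
    "dtype": "fp16",
    "tensor_parallel": {"tp_size": 2},
    "replace_lm_head": True,  # invented: opt in to lm_head/embed_out tensor parallelism
}
# engine = deepspeed.init_inference(model, config=inference_config)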

RezaYazdaniAminabadi (Contributor) left a comment

All LGTM, thanks everyone for making this part of inference faster :)

dc3671 (Contributor) commented Oct 7, 2023

Hi @tjruwase, I have fixed the merge conflict.

dc3671 (Contributor) commented Oct 9, 2023

Hi @tjruwase, the CI problems are fixed.

@tjruwase added this pull request to the merge queue on Oct 9, 2023
Merged via the queue into microsoft:master with commit 6763e2d on Oct 9, 2023
15 checks passed
mauryaavinash95 pushed a commit to mauryaavinash95/DeepSpeed that referenced this pull request Oct 9, 2023
* add lm_head and embed_out tensor parallel

* fix load lm_head.weight name issue

* replace all_reduce with inference_all_reduce

* refactor lm_head tensor parallel

---------

Co-authored-by: Chen, Zhenhuan <[email protected]>
baodii pushed a commit to baodii/DeepSpeed that referenced this pull request Nov 7, 2023
jianan-gu added a commit to jianan-gu/DeepSpeedSYCLSupport that referenced this pull request Jan 5, 2024