
Enabled high-performance Automatic Tensor Parallelism (auto TP) for the Qwen2-MoE and DeepSeek-V2 models on multiple GPUs/HPUs #6964

Open: gyou2021 wants to merge 5 commits into master
Conversation

@gyou2021 (Contributor) commented on Jan 21, 2025

Reduced the number of routed-expert AllReduce operations per MoE layer to ONE for the Qwen2-MoE and DeepSeek-V2 models. The outputs of all selected routed experts in a layer are now gathered across GPU/HPU cards with a single AllReduce operation, instead of gathering each selected routed expert individually (i.e., once per selected routed expert). This change greatly improves performance.
In addition to modifying auto_tp.py, the following files should be updated: modeling_qwen2_moe.py and modeling_deepseek_v2.py. Add the following code after the weighted sum of the selected experts' outputs in each MoE layer:
# Reduce the accumulated partial expert outputs across tensor-parallel ranks once per MoE layer.
if is_deepspeed_available():
    from deepspeed import comm as dist
    if dist.is_initialized():
        dist.all_reduce(final_hidden_states, op=dist.ReduceOp.SUM)
Note: final_hidden_states is the weighted sum of the selected experts' outputs in the MoE layer.
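For context, here is a minimal sketch of where that snippet would sit inside an MoE block's forward pass under tensor parallelism. This is a simplified illustration, not the actual modeling_qwen2_moe.py or modeling_deepseek_v2.py code; the routing logic, the names gate, experts, and top_k, and the is_deepspeed_available import path are assumptions for illustration only.

```python
# Minimal sketch (assumptions, not the actual HF modeling code): each expert's
# weights are sharded across TP ranks, so expert(x) returns a partial result
# that must be summed across ranks.
import torch
from transformers.integrations import is_deepspeed_available  # import path may vary by transformers version


def moe_block_forward(hidden_states, gate, experts, top_k):
    # hidden_states: [num_tokens, hidden_size]
    router_logits = gate(hidden_states)                                   # [num_tokens, num_experts]
    probs = torch.softmax(router_logits, dim=-1, dtype=torch.float)
    routing_weights, selected_experts = torch.topk(probs, top_k, dim=-1)  # [num_tokens, top_k]
    routing_weights = routing_weights.to(hidden_states.dtype)

    final_hidden_states = torch.zeros_like(hidden_states)
    for expert_idx, expert in enumerate(experts):
        token_idx, k_idx = torch.where(selected_experts == expert_idx)
        if token_idx.numel() == 0:
            continue
        expert_out = expert(hidden_states[token_idx])                     # partial output on this TP rank
        weight = routing_weights[token_idx, k_idx].unsqueeze(-1)
        # Accumulate the weighted partial outputs locally; no communication here.
        final_hidden_states.index_add_(0, token_idx, expert_out * weight)

    # Single AllReduce per MoE layer: sum the accumulated partial results across
    # TP ranks once, instead of reducing each selected expert's output separately.
    if is_deepspeed_available():
        from deepspeed import comm as dist
        if dist.is_initialized():
            dist.all_reduce(final_hidden_states, op=dist.ReduceOp.SUM)
    return final_hidden_states
```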

@delock (Collaborator) commented on Jan 21, 2025

Hi @gyou2021, there is another PR for DeepSeek autotp at this link: #6937. What is the relationship between your PR and that previous PR in terms of functionality?

@Yejing-Lai can you take a look at this PR?

@gyou2021 (Contributor, Author) replied:
> Hi @gyou2021, there is another PR for DeepSeek autotp at this link: #6937. What is the relationship between your PR and that previous PR in terms of functionality?
>
> @Yejing-Lai can you take a look at this PR?

The difference lies in how the weighted sums of the routed experts' outputs are gathered in each MoE layer. In my understanding, #6937 gathers each selected routed expert's output individually, so the number of gather operations is proportional to the number of selected routed experts per layer. In this PR, the result is gathered once per layer, regardless of the number of selected routed experts.
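To make the difference concrete, here is a small illustrative sketch (not code from either PR) contrasting the two reduction strategies. The names expert_outputs and weights are hypothetical: the partial outputs of the selected experts on one TP rank and their routing weights.

```python
from deepspeed import comm as dist  # assumes deepspeed.comm has been initialized


def reduce_per_expert(expert_outputs, weights):
    # Per-expert reduction (my understanding of #6937): one all_reduce for each
    # selected expert, i.e. top_k communication calls per MoE layer.
    final = None
    for out, w in zip(expert_outputs, weights):
        dist.all_reduce(out, op=dist.ReduceOp.SUM)
        final = out * w if final is None else final + out * w
    return final


def reduce_per_layer(expert_outputs, weights):
    # Per-layer reduction (this PR): accumulate the weighted partial outputs
    # locally, then issue exactly one all_reduce per MoE layer.
    final = sum(out * w for out, w in zip(expert_outputs, weights))
    dist.all_reduce(final, op=dist.ReduceOp.SUM)
    return final
```

Both variants produce the same result, since the AllReduce sum and the weighted sum are linear and commute, but the second issues a single collective per layer instead of one per selected expert.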
