[BUG] Expert gradient scaling problem with ZeRO optimizer #6545

Open
wyooyw opened this issue Sep 17, 2024 · 1 comment
Assignees
Labels
bug Something isn't working training

Comments

@wyooyw
Contributor

wyooyw commented Sep 17, 2024

Describe the bug

When training an MoE model with the ZeRO optimizer, the gradient of the expert weights is ep_size times larger than the true gradient.

Related issue & PR
Issue [#5618] already described this bug (it is the second bug in that issue), but that issue has been closed, so I am opening a new one here.
PR [#5259] fixed the bug in the bf16 optimizer. The ZeRO optimizer needs the same fix.

To Reproduce

1. Prepare two models (model1 and model2) with the same input data and initial parameters. Both use the ZeRO stage 1 (or 2) optimizer. Model1 uses ep=1; model2 uses ep=2.
2. Run one forward and backward pass on both models.
3. Dump the gradients of the expert weights from both models.
4. Observe that the gradient of the expert weights in model2 is ep_size times that of model1.
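One way to check step 4 on dumped gradients is to estimate the scalar factor between the two dumps. The sketch below is only illustrative: `estimate_scale` is a hypothetical helper (not part of DeepSpeed), and the gradient values are made-up stand-ins for real flattened expert-weight gradients.

```python
def estimate_scale(reference, candidate):
    """Least-squares scalar s that minimises ||candidate - s * reference||."""
    num = sum(r * c for r, c in zip(reference, candidate))
    den = sum(r * r for r in reference)
    if den == 0.0:
        raise ValueError("reference gradient is all zeros")
    return num / den


# Made-up flattened expert-weight gradients standing in for real dumps.
grad_ep1 = [0.10, -0.20, 0.05, 0.40]        # model1, ep=1
grad_ep2 = [2 * g for g in grad_ep1]        # model2, ep=2, as the report describes

scale = estimate_scale(grad_ep1, grad_ep2)
print(f"estimated scale between ep=2 and ep=1 gradients: {scale:.3f}")
```

If the bug is present, the estimated scale should be close to ep_size; if the gradients match, it should be close to 1.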

Expected behavior
The gradient should be the same under different ep_size values.

@wyooyw wyooyw added bug Something isn't working training labels Sep 17, 2024
@wyooyw wyooyw changed the title [BUG]Expert Grad Scaling Problem With Zero Optimizer [BUG] Expert gradient scaling problem with ZeRO optimizer Sep 17, 2024
@wyooyw
Contributor Author

wyooyw commented Sep 17, 2024

I fixed the bug in PR [#6546]. The PR has not been merged yet.

@tohtana tohtana self-assigned this Sep 17, 2024
github-merge-queue bot pushed a commit that referenced this issue Oct 23, 2024
Fix [#6545]

work:
- expert gradient average: divide by dp_world_size instead of edp_world_size
- unit test: make sure models with different dp/ep have the same expert gradient
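The effect of changing the divisor can be illustrated with a toy model of the reduction. This is a simplified sketch, not DeepSpeed's implementation: it assumes each expert replica accumulates contributions from the ep_size ranks in its expert-parallel group, and that the replicas are then summed across the expert-data-parallel group (edp_world_size = dp_world_size / ep_size). The function name and the scalar "gradients" are hypothetical.

```python
def reduce_expert_grad(per_rank_contrib, ep_size, divisor):
    """Toy reduction of one expert's gradient.

    per_rank_contrib[r] is the contribution from tokens that originated on
    data-parallel rank r, computed under that rank's local mean loss.
    """
    dp_world_size = len(per_rank_contrib)
    if dp_world_size % ep_size != 0:
        raise ValueError("dp_world_size must be divisible by ep_size")
    # Assumption: each replica already holds the sum over its ep group.
    replica_grads = [
        sum(per_rank_contrib[start:start + ep_size])
        for start in range(0, dp_world_size, ep_size)
    ]
    # Sum across the edp replicas, then average with the chosen divisor.
    return sum(replica_grads) / divisor


contrib = [1.0, 2.0, 3.0, 6.0]            # made-up per-rank contributions
dp_world_size = len(contrib)              # 4
ep_size = 2
edp_world_size = dp_world_size // ep_size # 2

true_grad = sum(contrib) / dp_world_size
before_fix = reduce_expert_grad(contrib, ep_size, edp_world_size)
after_fix = reduce_expert_grad(contrib, ep_size, dp_world_size)

print(true_grad, before_fix, after_fix)
```

Under these assumptions, dividing by edp_world_size leaves the result ep_size times larger than the average over all data-parallel ranks, while dividing by dp_world_size recovers that average.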

---------

Co-authored-by: wangyiou <[email protected]>
Co-authored-by: Masahiro Tanaka <[email protected]>
Co-authored-by: Logan Adams <[email protected]>