Describe the bug
When using the ZeRO optimizer to train an MoE model, the gradient of the expert weights is ep_size times larger than the true gradient.
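For intuition, here is a toy, non-distributed sketch of where the factor plausibly comes from. This is my assumption about the reduction-group arithmetic, not code from DeepSpeed:

```python
import torch

# Toy numbers: 4 ranks total, expert parallelism of 2.
world_size, ep_size = 4, 2
expert_dp_size = world_size // ep_size  # ranks holding a replica of each expert

# Pretend every rank contributes the same per-token gradient.
per_rank_grad = torch.ones(3)

# ep_size = 1: the expert is an ordinary parameter, averaged over all ranks.
baseline = per_rank_grad * world_size / world_size        # == per_rank_grad

# ep_size = 2: after the all-to-all, each expert replica already accumulates
# tokens from ep_size ranks; averaging over only the expert_dp_size replicas
# leaves that factor in place.
local_grad = per_rank_grad * ep_size
reduced = local_grad * expert_dp_size / expert_dp_size    # == local_grad

print((reduced / baseline).unique())  # tensor([2.]) == ep_size
```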
Related issue & PR
Issue [#5618] described this bug (the second bug in that issue), but it has since been closed, so I am opening a new issue here.
PR [#5259] fixed the bug in the bf16 optimizer. The ZeRO optimizer needs the same fix.
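I have not traced the exact ZeRO code path, but a minimal sketch of the kind of normalization such a fix would apply, written as a generic reduction loop, might look like the following. The `allreduce` attribute test, `ep_size`, and the group handles are assumptions for illustration, not the actual DeepSpeed internals:

```python
import torch.distributed as dist

def reduce_gradients(params, dp_group, expert_dp_group, ep_size):
    """Sketch: average gradients, adding the missing 1/ep_size for experts.

    An expert replica's local gradient already sums token contributions from
    every rank in its expert-parallel group (tokens arrive via all-to-all),
    so averaging it over the expert data-parallel group alone leaves it
    ep_size times larger than the ep_size=1 baseline.
    """
    for p in params:
        if p.grad is None:
            continue
        if getattr(p, "allreduce", True):
            # Dense parameter: plain average over the data-parallel group.
            p.grad.div_(dist.get_world_size(group=dp_group))
            dist.all_reduce(p.grad, group=dp_group)
        else:
            # Expert parameter (hypothetical marker): average over the expert
            # data-parallel group *and* divide by ep_size to match ep_size=1.
            p.grad.div_(dist.get_world_size(group=expert_dp_group) * ep_size)
            dist.all_reduce(p.grad, group=expert_dp_group)
```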
To Reproduce
1. Prepare two models (model1 and model2) with the same input data and initial parameters, both using the ZeRO stage 1 (or 2) optimizer. model1 uses ep_size=1; model2 uses ep_size=2.
2. Run one forward and backward pass on both models.
3. Dump the gradients of the expert weights from both models.
4. The gradients of the expert weights in model2 are ep_size times those of model1 (see the comparison sketch below).
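A quick way to check steps 3-4, assuming each run dumps its expert-weight gradients with `torch.save`; the file names and dict layout are made up for illustration:

```python
import torch

# Hypothetical dumps: {param_name: grad_tensor}, saved right after backward().
g_ep1 = torch.load("expert_grads_ep1.pt")  # model1, ep_size=1
g_ep2 = torch.load("expert_grads_ep2.pt")  # model2, ep_size=2

for name, grad1 in g_ep1.items():
    grad2 = g_ep2[name]
    # Norm ratio avoids dividing by zero-valued elements; with the bug
    # present this prints ~2.0 (ep_size), expected is ~1.0.
    ratio = (grad2.norm() / grad1.norm()).item()
    print(f"{name}: |grad_ep2| / |grad_ep1| = {ratio:.3f}")
```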
Expected behavior
The gradient should be the same under different ep_size values.