Describe the bug
When using the ZeRO optimizer to train an MoE model, the gradient of the expert weights is ep_size times larger than the true gradient.
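For intuition, here is a toy, non-distributed sketch of where the factor plausibly comes from. This is my assumption about the reduction-group arithmetic, not code from DeepSpeed:

```python
import torch

# Toy numbers: 4 ranks total, expert parallelism of 2.
world_size, ep_size = 4, 2
expert_dp_size = world_size // ep_size  # ranks holding a replica of each expert

# Pretend every rank contributes the same per-token gradient.
per_rank_grad = torch.ones(3)

# ep_size = 1: the expert is an ordinary parameter, averaged over all ranks.
baseline = per_rank_grad * world_size / world_size        # == per_rank_grad

# ep_size = 2: after the all-to-all, each expert replica already accumulates
# tokens from ep_size ranks; averaging over only the expert_dp_size replicas
# leaves that factor in place.
local_grad = per_rank_grad * ep_size
reduced = local_grad * expert_dp_size / expert_dp_size    # == local_grad

print((reduced / baseline).unique())  # tensor([2.]) == ep_size
```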
Related issue & PR
Issue [#5618] described this bug (the second bug in that issue), but it has since been closed, so I am opening a new issue here.
PR [#5259] fixed the bug in the bf16 optimizer. The ZeRO optimizer needs the same fix.
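I have not traced the exact ZeRO code path, but a minimal sketch of the kind of normalization such a fix would apply, written as a generic reduction loop, might look like the following. The `allreduce` attribute test, `ep_size`, and the group handles are assumptions for illustration, not the actual DeepSpeed internals:

```python
import torch.distributed as dist

def reduce_gradients(params, dp_group, expert_dp_group, ep_size):
    """Sketch: average gradients, adding the missing 1/ep_size for experts.

    An expert replica's local gradient already sums token contributions from
    every rank in its expert-parallel group (tokens arrive via all-to-all),
    so averaging it over the expert data-parallel group alone leaves it
    ep_size times larger than the ep_size=1 baseline.
    """
    for p in params:
        if p.grad is None:
            continue
        if getattr(p, "allreduce", True):
            # Dense parameter: plain average over the data-parallel group.
            p.grad.div_(dist.get_world_size(group=dp_group))
            dist.all_reduce(p.grad, group=dp_group)
        else:
            # Expert parameter (hypothetical marker): average over the expert
            # data-parallel group *and* divide by ep_size to match ep_size=1.
            p.grad.div_(dist.get_world_size(group=expert_dp_group) * ep_size)
            dist.all_reduce(p.grad, group=expert_dp_group)
```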
To Reproduce
1. Prepare two models (model1 and model2) with the same input data and initial parameters, both using the ZeRO stage 1 (or 2) optimizer. model1 uses ep_size=1; model2 uses ep_size=2.
2. Run one forward and backward pass on both models.
3. Dump the gradients of the expert weights from both models.
4. The gradients of the expert weights in model2 are ep_size times those of model1 (see the comparison sketch below).
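A quick way to check steps 3-4, assuming each run dumps its expert-weight gradients with `torch.save`; the file names and dict layout are made up for illustration:

```python
import torch

# Hypothetical dumps: {param_name: grad_tensor}, saved right after backward().
g_ep1 = torch.load("expert_grads_ep1.pt")  # model1, ep_size=1
g_ep2 = torch.load("expert_grads_ep2.pt")  # model2, ep_size=2

for name, grad1 in g_ep1.items():
    grad2 = g_ep2[name]
    # Norm ratio avoids dividing by zero-valued elements; with the bug
    # present this prints ~2.0 (ep_size), expected is ~1.0.
    ratio = (grad2.norm() / grad1.norm()).item()
    print(f"{name}: |grad_ep2| / |grad_ep1| = {ratio:.3f}")
```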
Expected behavior
The gradient should be the same under different ep_size values.