[BUG] ZeRO optimizer with MoE Expert Parallelism #5618

Jack47 · 2024-06-05T11:19:21Z

Describe the bug
The ZeRO optimizer needs the same kind of fix as PR #5259.

To Reproduce
Steps to reproduce the behavior:

Train an LLM with ep=4 and the AdamW optimizer.

Expected behavior
Expert gradients should be equal under ep=4 and ep=1, but currently they are 4 times larger under ep=4 than under ep=1.
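
To make the expected behavior concrete, here is a toy sketch of one way a factor of `ep` could arise. It is not DeepSpeed code: the function names, the per-rank contributions, and the assumed cause (averaging over the smaller expert-data-parallel group instead of the full data-parallel world) are all hypothetical illustrations, not a confirmed diagnosis.

```python
# Toy model only: NOT DeepSpeed code. All names and numbers are hypothetical.

def expert_grad_ep1(contribs):
    """ep=1: every rank holds the expert; gradients are averaged over all ranks."""
    return sum(contribs) / len(contribs)


def expert_grad_ep(contribs, ep, divide_by_full_world):
    """ep>1: each expert replica accumulates the contributions of `ep` ranks
    (their tokens are routed to it), then the replicas are reduced across the
    expert-data-parallel group, which has world // ep members."""
    world = len(contribs)
    assert world % ep == 0, "world size must be divisible by ep"
    # Each replica sums the contributions of the ep ranks that route to it.
    replica_grads = [sum(contribs[i:i + ep]) for i in range(0, world, ep)]
    # Assumed source of the mismatch: which group size is used as the divisor.
    divisor = world if divide_by_full_world else len(replica_grads)
    return sum(replica_grads) / divisor


# Hypothetical per-rank gradient contributions for one expert weight, 8 ranks.
contribs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]

baseline = expert_grad_ep1(contribs)                                  # ep=1
naive = expert_grad_ep(contribs, ep=4, divide_by_full_world=False)    # ep=4, small divisor
rescaled = expert_grad_ep(contribs, ep=4, divide_by_full_world=True)  # ep=4, full divisor

print(naive / baseline)      # 4.0
print(rescaled == baseline)  # True
```

In this toy model, dividing by the expert-data-parallel group size yields a gradient exactly `ep` times the ep=1 value, while dividing by the full world size matches it.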

jomayeri · 2024-06-07T00:16:00Z

@Jack47 Can you make a PR for this? Thanks!

ranzhejiang · 2024-09-14T01:47:11Z

#5681 has solved it @Jack47

Jack47 added the bug and training labels Jun 5, 2024

jomayeri self-assigned this Jun 7, 2024

Jack47 changed the title ~~[BUG]~~ [BUG] ZeRO optimizer with MoE Expert Parallelism Jun 7, 2024

jomayeri closed this as completed Sep 16, 2024

wyooyw mentioned this issue Sep 17, 2024

[BUG] Expert gradient scaling problem with ZeRO optimizer #6545

Open
