Comparison of Deepspeed Stage 1,2 and 3 vs DDP #4815
Comments
@jpatel-bdai, all zero stages are expected to match ddp on single gpu runs. So, it appears that you are hitting bugs in zero. Are you able to share detailed steps to help us repro? Thanks!
I will try to share detailed steps to reproduce if possible. I am using pytorch-lightning's DeepSpeed strategy. However, are all zero stages expected to match ddp on multi-gpu runs as well? What are the ways to debug the comparison if I am unable to share the code?
Ideally, we expect zero stages to match ddp in multi-gpu runs, since zero is designed to be a memory-efficient ddp algorithm. In terms of debugging, a first step would be to inspect the training loss of each forward pass to detect deviations.
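For example, a minimal sketch (not from this thread; file names and the tolerance are illustrative) of logging the per-step loss of two runs and finding the first step at which they diverge:

```python
import torch

def log_losses(losses, path):
    """Save a list of per-step loss values so two runs can be diffed later."""
    torch.save(torch.tensor(losses), path)

def compare_losses(path_a, path_b, atol=1e-6):
    """Report the first training step at which the two runs diverge."""
    a, b = torch.load(path_a), torch.load(path_b)
    n = min(len(a), len(b))
    mismatch = ((a[:n] - b[:n]).abs() > atol).nonzero()
    if len(mismatch) == 0:
        print(f"Runs match within atol={atol} for {n} steps")
    else:
        step = mismatch[0].item()
        print(f"First divergence at step {step}: "
              f"{a[step].item():.8f} vs {b[step].item():.8f}")

# Usage: in each training loop append loss.item() to a list every step, call
# log_losses(...) at the end of the run (once for DDP, once for ZeRO), then
# run compare_losses on the two files.
```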
Don't want to hijack this issue, but I noticed that my train loss values are wildly different between stage 2 and stage 3. Is that expected? I take it that minor differences can happen because of different optimizer implementations, but the differences in my case are too severe. I checked that everything was seeded the same way, and across multiple restarts the stage 2 and stage 3 results were not exact but were consistent within the same stage, just not across stages.
@tjruwase I have an issue registered here: Lightning-AI/pytorch-lightning#19246, but it looks like the issue comes from DeepSpeed. Here is the sample script where I tried to compare DDP and DeepSpeed with a simple MNIST example on a single GPU. During the backward pass, the model weights are updated differently by the ZeRO stage 1/2 optimizer in DeepSpeed (https://deepspeed.readthedocs.io/en/stable/_modules/deepspeed/runtime/zero/stage_1_and_2.html) than by Adam in DDP.
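For reference, a hedged single-GPU sketch of the kind of comparison described above (not the author's script; the model, data, and config values are illustrative). It takes one optimizer step with plain torch Adam and one with DeepSpeed ZeRO stage 1 wrapping the same Adam, then compares the updated parameters. It assumes the script is launched with the DeepSpeed launcher (e.g. `deepspeed --num_gpus 1 compare.py`) so the distributed environment is initialized:

```python
import copy
import torch
import deepspeed

torch.manual_seed(0)
model = torch.nn.Linear(10, 1).cuda()
ref_model = copy.deepcopy(model)

x = torch.randn(8, 10).cuda()
y = torch.randn(8, 1).cuda()
loss_fn = torch.nn.MSELoss()

# Reference run: plain FP32 torch Adam, as a single-GPU DDP run would use.
ref_opt = torch.optim.Adam(ref_model.parameters(), lr=1e-3)
loss_fn(ref_model(x), y).backward()
ref_opt.step()

# DeepSpeed run: ZeRO stage 1 wrapping the same torch Adam, FP32 (no fp16 block).
ds_config = {"train_batch_size": 8, "zero_optimization": {"stage": 1}}
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
engine, _, _, _ = deepspeed.initialize(model=model, optimizer=opt, config=ds_config)
ds_loss = loss_fn(engine(x), y)
engine.backward(ds_loss)
engine.step()

# Compare the updated weights parameter by parameter.
for (name, p), q in zip(engine.module.named_parameters(), ref_model.parameters()):
    print(name, torch.allclose(p.data, q.data, atol=1e-6))
```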
Lightning Component (e.g. Trainer, LightningModule): Trainer, LightningModule
@GuanhuaWang, @tjruwase and @jomayeri Do you have any findings to share on this? Is there a minimal example comparing DDP and Deepspeed ZeRO where the parameter updates are identical? |
Hi @jpatel-bdai, |
@jpatel-bdai Let me share my verification script. As long as I set FP32, PyTorch's Adam, and NP=2, it showed exact matches with PyTorch. |
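For anyone trying to reproduce that setting with pytorch-lightning, a hedged sketch of the configuration being described: FP32, the stock torch.optim.Adam returned from the LightningModule's configure_optimizers, and a DeepSpeed config without an "optimizer" or "fp16" section so the client Adam is used as-is. The exact values below are illustrative, not taken from the linked script:

```python
from pytorch_lightning import Trainer
from pytorch_lightning.strategies import DeepSpeedStrategy

# No "optimizer" and no "fp16"/"bf16" blocks: DeepSpeed wraps the torch Adam
# returned by configure_optimizers() and runs everything in FP32.
ds_config = {
    "train_micro_batch_size_per_gpu": 32,   # illustrative
    "gradient_accumulation_steps": 1,
    "zero_optimization": {"stage": 1},
}

trainer = Trainer(
    accelerator="gpu",
    devices=2,                              # NP=2, as in the comment above
    precision="32-true",                    # FP32; older Lightning versions use precision=32
    strategy=DeepSpeedStrategy(config=ds_config),
)
```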
Let me close this issue as we haven't had a new report for a while. Please feel free to reopen it if you still see the issue. |
Describe the bug
When the model fits on a single GPU, how does DeepSpeed ZeRO stage 1 compare with DDP? In my experiments, the overall training loss initially progresses similarly in both cases, but after a few iterations the performance of DeepSpeed ZeRO stage 1 and stage 2 degrades relative to DDP.
Expected behavior
I would expect both DDP and DeepSpeed ZeRO stage 1 to give similar results when run on a single GPU. The total loss is a combination of a few losses, one of which is the trans loss. Do you have experiments comparing DDP with DeepSpeed ZeRO stage 1 or 2 that I can refer to? Are these supposed to give similar performance? The attached screenshots show the total loss and trans loss for the single-GPU and 2-GPU experiments.
Screenshots
System info (please complete the following information):
Docker context
No