
ZeRO0 does not handle BF16 gradients properly #5154

Closed

Conversation


@tohtana tohtana commented Feb 19, 2024

The combination of BF16 and ZeRO0 (no ZeRO optimization) has some issues with gradient handling. BF16_Optimizer appears to be designed to accumulate gradients in FP32, but this does not match the rest of the engine.

  • The DeepSpeed engine converts BF16 gradients and accumulates them in FP32 soon after the backward pass via BF16_Optimizer. However, the engine then performs allreduce on the BF16 gradients. As a result, gradients are not properly reduced.
  • The engine calls BF16_Optimizer's backward(), which clears the BF16 gradients, so gradient accumulation with more than one step does not work (see the sketch after this list).
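
A minimal sketch of the mismatched flow described above, written in plain PyTorch with hypothetical names (`bf16_params`, `fp32_grad_buffers`); this is not DeepSpeed's actual engine code:

```python
import torch.distributed as dist


def backward_and_reduce_buggy(loss, bf16_params, fp32_grad_buffers):
    """Illustrates the mismatch: FP32 accumulation happens right after
    backward, and the later allreduce only sees the cleared BF16 grads."""
    loss.backward()

    # BF16_Optimizer-style bookkeeping right after backward: copy the fresh
    # BF16 gradients into FP32 buffers and clear the BF16 gradients.
    for p, fp32_buf in zip(bf16_params, fp32_grad_buffers):
        fp32_buf.add_(p.grad.float())
        p.grad = None  # cleared every micro-step, so accumulation > 1 breaks

    # The engine then allreduces the BF16 gradients, which were just cleared,
    # so the values accumulated in FP32 are never reduced across ranks.
    for p in bf16_params:
        if p.grad is not None:
            dist.all_reduce(p.grad)
```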

There are two possible approaches to resolving this issue:

  1. Accumulate gradients in BF16 until the gradient accumulation boundary (i.e., clear the BF16 gradients only at the boundary), then perform allreduce, conversion to FP32, and the parameter update.
  2. Accumulate gradients in FP32 and run allreduce on the FP32 buffer.

This PR takes the first approach, which ZeRO 1/2/3 also follow, although the second one has an advantage in terms of precision for gradient accumulation. A sketch of the first approach is shown below.
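
A minimal sketch of approach 1, assuming the same hypothetical setup as above (illustrative names, not DeepSpeed's API): BF16 gradients accumulate across micro-steps, and only at the boundary are they reduced, converted to FP32, and used to update the FP32 master weights.

```python
import torch
import torch.distributed as dist


def step_at_accumulation_boundary(bf16_params, fp32_master_params,
                                  fp32_optimizer, world_size):
    # Called only at the gradient accumulation boundary; on the other
    # micro-steps, BF16 gradients are simply left in place to accumulate.
    for bf16_p, fp32_p in zip(bf16_params, fp32_master_params):
        dist.all_reduce(bf16_p.grad)       # reduce accumulated BF16 grads
        bf16_p.grad.div_(world_size)       # average across ranks

        fp32_p.grad = bf16_p.grad.float()  # convert to FP32 for the update
        bf16_p.grad = None                 # clear only at the boundary

    fp32_optimizer.step()                  # update FP32 master weights

    # Copy the updated master weights back into the BF16 parameters.
    with torch.no_grad():
        for bf16_p, fp32_p in zip(bf16_params, fp32_master_params):
            bf16_p.copy_(fp32_p)
```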

@tohtana tohtana changed the base branch from master to tohtana/fix_fp32_clipping February 19, 2024 08:45
@tohtana tohtana marked this pull request as ready for review February 19, 2024 09:33
@tohtana tohtana changed the title from "Update bf16 optimizer's master weights after allreduce" to "ZeRO0 does not handle BF16 gradients properly" Feb 20, 2024
@tohtana tohtana commented Feb 21, 2024

This PR breaks PP (pipeline parallelism). Opened #5170 as an alternative solution.

@tohtana tohtana closed this Feb 21, 2024
github-merge-queue bot pushed a commit that referenced this pull request Feb 21, 2024
This PR fixes an issue with allreducing for ZeRO0 + BF16. (This replaces
#5154.)

DeepSpeed uses `BF16_Optimizer` when ZeRO0 and BF16 are enabled. The
optimizer accumulates gradients in an FP32 buffer soon after a backward
pass completes. However, the DeepSpeed engine performs allreduce on the BF16
gradients.

This PR fixes the issue by performing allreduce on the FP32 buffer instead. It
also removes an assertion that prohibits BF16+PP+Z1, which is
actually runnable.

The image below shows loss curves for the following configurations:
- BF16/Z0,Z1,Z2,Z3/NoPP
- BF16/Z0,Z1/PP (2 stages)
(all runs used 8 GPUs with gradient accumulation steps = 4)

![image](https://github.com/microsoft/DeepSpeed/assets/81312776/0dc1e9ef-43bc-4b47-8b9e-d6aca137a217)

---------

Co-authored-by: Logan Adams <[email protected]>
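
For illustration, a minimal sketch of what "performing allreduce on the FP32 buffer" could look like, using hypothetical buffer names rather than the actual code from the merged fix:

```python
import torch.distributed as dist


def allreduce_fp32_grad_buffers(fp32_grad_buffers, world_size):
    # Reduce the FP32 accumulation buffers directly, instead of the BF16
    # gradients that BF16_Optimizer has already consumed and cleared.
    for buf in fp32_grad_buffers:
        dist.all_reduce(buf)
        buf.div_(world_size)  # average across data-parallel ranks
```
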
ShellyNR pushed a commit to ShellyNR/DeepSpeed that referenced this pull request Mar 11, 2024
rraminen pushed a commit to ROCm/DeepSpeed that referenced this pull request May 9, 2024