Fixes for training models with bf16 + freshly initialized optimizer via load_module_only
#4141
Conversation
My model was trained in bf16 mode; when loading ckpt with

@haileyschoelkopf and @Quentin-Anthony, apologies for dropping the ball on this. The PR LGTM. Please resolve the conflict so we can merge. Thanks for the contribution!

Thanks @tjruwase, merge conflicts resolved!
How come this PR was never merged? It fixes the fine-tuning bug I ran into myself, which took about 3-4 days to track down. Specifically, the extra check for fp16_enabled is not needed (and hurts bf16, for example); removing it should do the trick.
This PR makes some fixes to the case where we want to resume training from a DeepSpeed ZeRO checkpoint and initialize a new optimizer, while not using the old optimizer in the checkpoint or relying on its existence at all.

In this situation, despite passing `load_module_only=True` and `load_optimizer_states=False` to `load_checkpoint()`, the previous behavior was that:

- `self._load_zero_checkpoint` would still be called, which attempts to load from the (in this case, nonexistent) checkpoint files. This PR stops this function from being called when using `load_module_only=True` and `load_optimizer_states=False`. Alternatively, calling this function may be alright if `"load_from_fp32_weights": true` is set in the DeepSpeed ZeRO config (reference: https://github.com/microsoft/DeepSpeed/blob/ff7d5275f2aa916cb5f320e0d817154e96f9cdb6/deepspeed/runtime/engine.py#L733), but this parameter does not seem to be documented in the docs for ZeRO config dicts (an illustrative config sketch appears further below).
- In `_load_checkpoint`, the following code block:

  ```
  if self.optimizer is not None and self.fp16_enabled():
      self.optimizer.refresh_fp32_params()
  ```

  results in `self.optimizer.refresh_fp32_params()` being called only when FP16 is used. As a result, the FP32 optimizer state is never initialized from the 16-bit model weights. This PR removes the fp16-specific condition (see the toy sketch just after this list).
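For clarity, here is a small self-contained toy illustrating the gating change in the second bullet. The `_FakeOptimizer` class and `refresh_after_module_load` helper are hypothetical and exist only for illustration; in DeepSpeed the real call is `self.optimizer.refresh_fp32_params()` inside `_load_checkpoint`, and this is not the exact diff.

```python
# Toy illustration only -- not the actual DeepSpeed diff.
class _FakeOptimizer:
    """Hypothetical stand-in for a DeepSpeed optimizer wrapper."""

    def __init__(self):
        self.refreshed = False

    def refresh_fp32_params(self):
        # In DeepSpeed this rebuilds the FP32 master weights from the
        # (already loaded) 16-bit module parameters.
        self.refreshed = True


def refresh_after_module_load(optimizer, fp16_enabled: bool) -> None:
    # Old behavior: `if optimizer is not None and fp16_enabled:` -- a bf16 run
    # therefore never re-initialized its FP32 optimizer state.
    # New behavior (per this PR): drop the fp16-specific gate.
    if optimizer is not None:
        optimizer.refresh_fp32_params()


opt = _FakeOptimizer()
refresh_after_module_load(opt, fp16_enabled=False)  # e.g. a bf16 run
assert opt.refreshed  # FP32 state is now refreshed for bf16 checkpoints too
```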
Previously reported in:
EleutherAI/gpt-neox#947
EleutherAI/gpt-neox#843
Should also close: #4017
Fixes: #4944 and #4017
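For reference, a minimal sketch of the alternative mentioned in the first bullet above, i.e. setting `"load_from_fp32_weights"` in the ZeRO section of the DeepSpeed config. The surrounding keys and values are placeholders, and placing the flag under `zero_optimization` is an assumption based on how other ZeRO options are configured:

```python
# Hypothetical config dict -- only "load_from_fp32_weights" is the point here.
ds_config = {
    "train_micro_batch_size_per_gpu": 1,   # placeholder
    "bf16": {"enabled": True},
    "zero_optimization": {
        "stage": 2,                        # placeholder stage
        "load_from_fp32_weights": True,    # the (undocumented) option referenced above
    },
}
```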
This caused problems for a freshly converted LLaMA checkpoint, which did not contain optimizer states, when trying to train with this model as the initialization. I have confirmed that the fixes above prevent this behavior.
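For context, a minimal sketch of the intended usage, assuming a standard DeepSpeed setup; `build_model()`, the checkpoint path, and `ds_config` (which is assumed to also define an optimizer) are placeholders:

```python
import deepspeed

model = build_model()  # placeholder: your torch.nn.Module (e.g. the converted LLaMA model)

# A fresh optimizer is created here (from the config's optimizer section or a
# client-side optimizer); nothing is restored from the old checkpoint's states.
engine, optimizer, _, _ = deepspeed.initialize(
    model=model,
    model_parameters=model.parameters(),
    config=ds_config,
)

# Load only the module weights from the existing ZeRO checkpoint and keep the
# freshly initialized optimizer -- no optimizer states are read from disk.
engine.load_checkpoint(
    "/path/to/checkpoint_dir",  # placeholder path
    load_module_only=True,
    load_optimizer_states=False,
)
```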
cc @Quentin-Anthony @zhangir-azerbayev