
Fixes for training models with bf16 + freshly initialized optimizer via `load_module_only` #4141

Merged · 8 commits into microsoft:master · Jan 18, 2024

Conversation

@haileyschoelkopf (Contributor) commented Aug 13, 2023

This PR makes some fixes to the case where we want to resume training from a DeepSpeed ZeRO checkpoint and initialize a new optimizer, while not using the old optimizer in the checkpoint or relying on its existence at all.

In this situation, despite passing `load_module_only=True` and `load_optimizer_states=False` to `load_checkpoint()`, the previous behavior was that:

  • `self._load_zero_checkpoint` would still be called, which attempts to load from the (in this case, nonexistent) ZeRO checkpoint files. This PR stops this function from being called when `load_module_only=True` and `load_optimizer_states=False` are passed. Alternatively, calling this function may be alright if `"load_from_fp32_weights": true` is set in the DeepSpeed ZeRO config (reference: https://github.com/microsoft/DeepSpeed/blob/ff7d5275f2aa916cb5f320e0d817154e96f9cdb6/deepspeed/runtime/engine.py#L733, i.e. `return self._config.zero_config.load_from_fp32_weights`), but this parameter does not seem to be documented in the docs for ZeRO config dicts.
  • In `_load_checkpoint`, the following code block:

```
if self.optimizer is not None and self.fp16_enabled():
    self.optimizer.refresh_fp32_params()
```

results in `self.optimizer.refresh_fp32_params()` being called only when FP16 is enabled. As a result, when training in bf16, the FP32 optimizer state is never initialized from the 16-bit model weights. This PR removes the fp16-specific condition (see the usage sketch below).
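For context, here is a minimal sketch of the resume-with-fresh-optimizer flow this PR targets, assuming a bf16 + ZeRO setup; the model, config values, and checkpoint path are illustrative placeholders rather than anything taken from this PR:

```
# Minimal sketch (not from this PR): load only module weights from a ZeRO
# checkpoint and start from a freshly initialized optimizer. Placeholder names
# (model, config values, checkpoint path) are assumptions for illustration.
import torch
import deepspeed

model = torch.nn.Linear(16, 16)  # placeholder standing in for the real network

ds_config = {
    "train_micro_batch_size_per_gpu": 8,                     # placeholder value
    "bf16": {"enabled": True},
    "zero_optimization": {"stage": 1},
    "optimizer": {"type": "AdamW", "params": {"lr": 1e-5}},  # fresh optimizer
}

engine, optimizer, _, _ = deepspeed.initialize(
    model=model,
    model_parameters=model.parameters(),
    config=ds_config,
)

# Load only the module weights; do not touch (possibly nonexistent) optimizer shards.
load_path, client_state = engine.load_checkpoint(
    "path/to/checkpoint",                                    # placeholder directory
    load_module_only=True,
    load_optimizer_states=False,
    load_lr_scheduler_states=False,
)

# With this PR, refresh_fp32_params() also runs under bf16, so the optimizer's
# fp32 master copy is initialized from the just-loaded 16-bit weights.
```

Without the fix, the `load_checkpoint()` call above would either try to read the missing ZeRO optimizer files or, under bf16, leave the fp32 master weights uninitialized from the loaded 16-bit parameters.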

Previously reported in:
EleutherAI/gpt-neox#947
EleutherAI/gpt-neox#843

Should also close:
#4017

Fixes: #4944 and #4017

This caused problems for a freshly converted Llama checkpoint, which did not contain optimizer states, when trying to train with this model as the initialization. I have confirmed that these fixes prevent this behavior.

cc @Quentin-Anthony @zhangir-azerbayev

@janelu9 commented Aug 15, 2023

My model was trained in bf16 mode. When loading the checkpoint with `load_optimizer_states=False`, it still tries to load the optimizer states. I avoid that with the following workaround:

```
# Temporarily report bf16 as disabled so load_checkpoint() skips the optimizer/ZeRO state path
engine._config.bfloat16_enabled = False
_, ckpt_config = engine.load_checkpoint("check", load_module_only=True, load_optimizer_states=False)
engine._config.bfloat16_enabled = True
# Rebuild the optimizer's fp32 master weights from the loaded bf16 module weights
engine.optimizer._restore_from_bit16_weights()
```

@Quentin-Anthony (Contributor) commented

@tjruwase and @jeffra -- Want any more detail or testing from our side? This fix resolved a lot of issues on our end, and we suspect non-neox users may face it too.

@tjruwase (Contributor) commented Nov 7, 2023

@haileyschoelkopf and @Quentin-Anthony, apologies for dropping the ball on this. The PR LGTM. Please resolve the conflict so we can merge. Thanks for the contribution!

@haileyschoelkopf (Contributor, Author) commented

Thanks @tjruwase, merge conflicts resolved!

@exnx commented Jan 14, 2024

How come this PR was never merged? It fixed the finetuning bug I experienced myself, which took about 3-4 days to track down...

Specifically, the extra check for `fp16_enabled()` is not needed (and it hurts bf16, for example), so it should be removed.

Before:

```
if self.optimizer is not None and self.fp16_enabled():
```

After:

```
if self.optimizer is not None:
```

This should do the trick.

@loadams enabled auto-merge Jan 17, 2024 23:49
@loadams added this pull request to the merge queue Jan 18, 2024
Merged via the queue into microsoft:master with commit 870ae04 Jan 18, 2024
12 checks passed
mauryaavinash95 pushed a commit to mauryaavinash95/DeepSpeed that referenced this pull request Feb 17, 2024
…ia `load_module_only` (microsoft#4141)

Development

Successfully merging this pull request may close these issues:

[BUG] Setting Finetune=True causes checkpoint loading to not work correctly