Fixes for training models with bf16 + freshly initialized optimizer via load_module_only
#4141
Conversation
My model was trained in bf16 mode; when loading ckpt with

@haileyschoelkopf and @Quentin-Anthony, apologies for dropping the ball on this. The PR LGTM. Please resolve the conflict so we can merge. Thanks for the contribution!

Thanks @tjruwase, merge conflicts resolved!
How come this PR was never merged? It fixes the fine-tuning bug I ran into myself, which took about 3-4 days to track down. Specifically, the extra check for fp16_enabled is not needed (and hurts bf16, for example); removing it should do the trick.
This PR makes some fixes to the case where we want to resume training from a DeepSpeed ZeRO checkpoint and initialize a new optimizer, while not using the old optimizer in the checkpoint or relying on its existence at all.

In this situation, despite passing `load_module_only=True` and `load_optimizer_states=False` to `load_checkpoint()`, the previous behavior was that:

- `self._load_zero_checkpoint` would still be called, which attempts to load from the (in this case, nonexistent) checkpoint files. This PR stops this function from being called when using `load_module_only=True` and `load_optimizer_states=False`. Alternatively, calling this function may be alright if `"load_from_fp32_weights": true` is set in the DeepSpeed ZeRO config (reference: https://github.com/microsoft/DeepSpeed/blob/ff7d5275f2aa916cb5f320e0d817154e96f9cdb6/deepspeed/runtime/engine.py#L733), but this parameter does not seem to be documented in the docs for ZeRO config dicts (an illustrative config sketch appears further below).
- In `_load_checkpoint`, the following code block:

  ```
  if self.optimizer is not None and self.fp16_enabled():
      self.optimizer.refresh_fp32_params()
  ```

  results in `self.optimizer.refresh_fp32_params()` being called only when FP16 is used. As a result, the FP32 optimizer state is never initialized from the 16-bit model weights. This PR removes the fp16-specific condition (see the toy sketch just after this list).
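For clarity, here is a small self-contained toy illustrating the gating change in the second bullet. The `_FakeOptimizer` class and `refresh_after_module_load` helper are hypothetical and exist only for illustration; in DeepSpeed the real call is `self.optimizer.refresh_fp32_params()` inside `_load_checkpoint`, and this is not the exact diff.

```python
# Toy illustration only -- not the actual DeepSpeed diff.
class _FakeOptimizer:
    """Hypothetical stand-in for a DeepSpeed optimizer wrapper."""

    def __init__(self):
        self.refreshed = False

    def refresh_fp32_params(self):
        # In DeepSpeed this rebuilds the FP32 master weights from the
        # (already loaded) 16-bit module parameters.
        self.refreshed = True


def refresh_after_module_load(optimizer, fp16_enabled: bool) -> None:
    # Old behavior: `if optimizer is not None and fp16_enabled:` -- a bf16 run
    # therefore never re-initialized its FP32 optimizer state.
    # New behavior (per this PR): drop the fp16-specific gate.
    if optimizer is not None:
        optimizer.refresh_fp32_params()


opt = _FakeOptimizer()
refresh_after_module_load(opt, fp16_enabled=False)  # e.g. a bf16 run
assert opt.refreshed  # FP32 state is now refreshed for bf16 checkpoints too
```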
Previously reported in:
EleutherAI/gpt-neox#947
EleutherAI/gpt-neox#843
Should also close: #4017
Fixes: #4944 and #4017
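For reference, a minimal sketch of the alternative mentioned in the first bullet above, i.e. setting `"load_from_fp32_weights"` in the ZeRO section of the DeepSpeed config. The surrounding keys and values are placeholders, and placing the flag under `zero_optimization` is an assumption based on how other ZeRO options are configured:

```python
# Hypothetical config dict -- only "load_from_fp32_weights" is the point here.
ds_config = {
    "train_micro_batch_size_per_gpu": 1,   # placeholder
    "bf16": {"enabled": True},
    "zero_optimization": {
        "stage": 2,                        # placeholder stage
        "load_from_fp32_weights": True,    # the (undocumented) option referenced above
    },
}
```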
This caused problems for a freshly converted LLaMA checkpoint, which did not contain optimizer states, when trying to train with this model as the initialization. I have confirmed that the fixes above prevent this behavior.
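For context, a minimal sketch of the intended usage, assuming a standard DeepSpeed setup; `build_model()`, the checkpoint path, and `ds_config` (which is assumed to also define an optimizer) are placeholders:

```python
import deepspeed

model = build_model()  # placeholder: your torch.nn.Module (e.g. the converted LLaMA model)

# A fresh optimizer is created here (from the config's optimizer section or a
# client-side optimizer); nothing is restored from the old checkpoint's states.
engine, optimizer, _, _ = deepspeed.initialize(
    model=model,
    model_parameters=model.parameters(),
    config=ds_config,
)

# Load only the module weights from the existing ZeRO checkpoint and keep the
# freshly initialized optimizer -- no optimizer states are read from disk.
engine.load_checkpoint(
    "/path/to/checkpoint_dir",  # placeholder path
    load_module_only=True,
    load_optimizer_states=False,
)
```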
cc @Quentin-Anthony @zhangir-azerbayev