We noticed that with `finetune: True`, DeepSpeed somehow loads the checkpoint incorrectly. We think it has something to do with module vs. optimizer parameters not being updated (or tied) correctly, similar to what is described in this issue. Supposedly this was fixed with pull requests, but I think it broke again at some point.
We were able to pinpoint this bug to somewhere between DeepSpeed version 0.10.0 (works fine) and 0.10.2 (which has the bug). Specifically, we tested this by using the `finetune` flag like a resume function, setting the learning rate to the same value where pretraining left off. We would expect the loss to stay in the same range as during pretraining. (With `finetune: True`, the only difference from resuming pretraining is that the optimizer states are not loaded, but we set the learning rate to the same value as where it left off.)
What happens instead is that the very first step (step 0) is close to where pretraining left off, but on the next step the loss jumps very high and then tapers down without fully recovering; basically it behaves like it is training from scratch thereafter. This happens in newer DeepSpeed versions.
Using the older DeepSpeed version 0.10.0 avoids this behavior, and the loss starts off and stays in line with pretraining, i.e. it continues to improve. When I upgrade to anything after this version (0.10.2+), I get the bug.
I am using PyTorch 2.0 and CUDA 11.7 on Ubuntu 22.04.
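For anyone trying to reproduce this, here is a quick way to confirm which versions are actually active in your environment (a minimal sketch; it assumes standard pip installs of torch and deepspeed):

```python
# Print the library versions relevant to this report.
import torch
import deepspeed

print("deepspeed:", deepspeed.__version__)
print("torch:", torch.__version__)
print("cuda (torch build):", torch.version.cuda)
```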
Has anybody had issues with finetune settings using DeepSpeed? Thanks!
Update: it's not actually dependent on the DeepSpeed version; it depends on whether optimizer states are loaded or not. If I force skipping the load via `no_load_optim`, the bug appears in both the new and the old DeepSpeed versions. So my guess is that something goes wrong with the optimizer referencing the correct (pretrained) parameters when no optimizer states are loaded.
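For what it's worth, here is a rough diagnostic sketch of that hypothesis, written against plain PyTorch objects. The function names are mine, and note that under ZeRO stages 1-3 the optimizer steps flattened/partitioned fp32 copies rather than the module parameters directly, so the identity check only applies to the unpartitioned case; the norm logging is a cruder proxy that should still work either way:

```python
import torch

def optimizer_params_alias_module(model: torch.nn.Module,
                                  optimizer: torch.optim.Optimizer) -> bool:
    """Check whether every tensor the optimizer will step is one of the
    module's own parameters (i.e. the freshly loaded pretrained weights)."""
    module_param_ids = {id(p) for p in model.parameters()}
    opt_param_ids = {id(p)
                     for group in optimizer.param_groups
                     for p in group["params"]}
    return opt_param_ids <= module_param_ids

def total_param_norm(model: torch.nn.Module) -> float:
    """Scalar summary of the weights; log it right after the checkpoint
    is loaded and again after the first optimizer step."""
    with torch.no_grad():
        return sum(p.detach().float().norm().item() for p in model.parameters())
```

If the norm jumps sharply after the first update only when optimizer states are skipped, that would match the loss spike we see with `finetune: True`.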