Replies: 2 comments 8 replies
-
This only happens when training with ZeRO, which partitions the optimizer states across ranks to reduce per-rank memory consumption. The …
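For context, optimizer-state partitioning is what ZeRO stage 1 enables, and it is turned on through the DeepSpeed config. A minimal sketch of such a config as a Python dict (the batch size here is purely illustrative):

```python
# Illustrative DeepSpeed config enabling ZeRO stage 1, which
# partitions optimizer states across data-parallel ranks.
ds_config = {
    "train_batch_size": 32,  # illustrative value, not from the thread
    "zero_optimization": {
        "stage": 1,  # stage 1 = optimizer-state partitioning only
    },
}
```

Stages 2 and 3 additionally partition gradients and parameters, which is why the per-rank memory savings grow with the stage number.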
-
Can you share …
-
@tjruwase I trained a large model with Hugging Face and DeepSpeed, and I see different checkpoints being created. One is `pytorch_model.bin`, which I think is the usual PyTorch model, but I also see files with `.pt` extensions being saved. I guess these are DeepSpeed checkpoints of the model? I wonder why both PyTorch and DeepSpeed checkpoints are saved. Isn't saving the PyTorch model enough? Is the DeepSpeed checkpoint only useful if we want to do multi-GPU inference? I would be glad to know the difference.
Also, I have set `logging_steps=4` in DeepSpeed, but I only see a file created in the output directory called `runs/<DATE_TIME>/events...`, and I can't open it with a text editor. I would like to know where DeepSpeed logs and what this `events...` file is.