Replies: 2 comments 8 replies
-
This only happens when training with ZeRO, which partitions the optimizer states across ranks to reduce per-rank memory consumption. The …
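For context, optimizer-state partitioning is what ZeRO stage 1 enables, and it is turned on through the DeepSpeed config. A minimal sketch of such a config as a Python dict (the batch size here is purely illustrative):

```python
# Illustrative DeepSpeed config enabling ZeRO stage 1, which
# partitions optimizer states across data-parallel ranks.
ds_config = {
    "train_batch_size": 32,  # illustrative value, not from the thread
    "zero_optimization": {
        "stage": 1,  # stage 1 = optimizer-state partitioning only
    },
}
```

Stages 2 and 3 additionally partition gradients and parameters, which is why the per-rank memory savings grow with the stage number.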
-
Can you share …
-
@tjruwase I trained a large model with Hugging Face and DeepSpeed, and I see different checkpoints being created. One is `pytorch_model.bin`, which I think is the usual PyTorch model, but I also see files with `.pt` extensions being saved. I guess these are DeepSpeed checkpoints of the model? I wonder why both PyTorch and DeepSpeed checkpoints are saved. Isn't saving the PyTorch model enough? Is the DeepSpeed checkpoint only useful if we want to do multi-GPU inference? I would be glad to know the difference.
Also, I have set `logging_steps=4` in DeepSpeed, but I only see a file created in the output directory called `runs/<DATE_TIME>/events...`, and I can't open it with a text editor. I would like to know where DeepSpeed logs and what this `events...` file is.