When training large models (> 10B parameters or so), we found that checkpointing sometimes gets stuck while saving.
For instance, it may save checkpoints a few times without any problem and then suddenly hang during one checkpointing step. The checkpoint files left in the /tmp/ directory are incomplete. It looks like a memory-leak issue.
May I know how to solve it? Thanks!
I'm using Google Cloud Platform with 1024 TPU v3 cores (512 TPU v3 chips).
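
To check the memory-leak hypothesis, I can run a small host-memory monitor alongside training so that RSS growth across checkpoint saves shows up in the logs. This is a minimal sketch on my side, assuming `psutil` is available on the host and using an arbitrary 30-second interval:

```python
# Minimal sketch (my own diagnostic, not part of the training code):
# log host RSS periodically so a leak across checkpoint saves is visible.
import threading
import time

import psutil  # assumed to be installed on the TPU host VM


def log_host_memory(interval_s: float = 30.0) -> None:
    """Periodically print the resident memory of this process."""
    proc = psutil.Process()
    while True:
        rss_gb = proc.memory_info().rss / 1e9
        print(f"[mem-monitor] host RSS: {rss_gb:.2f} GB", flush=True)
        time.sleep(interval_s)


# Started before training so each checkpoint save is bracketed by readings.
threading.Thread(target=log_host_memory, daemon=True).start()
```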