Checkpointing got stuck on Google Cloud TPU #1439

Open
XueFuzhao opened this issue Nov 24, 2023 · 1 comment


XueFuzhao commented Nov 24, 2023

When training large models (> 10B parameters or so), we found that checkpointing sometimes gets stuck while saving.
For instance, saving may work smoothly a few times and then suddenly hang during one checkpointing step. When that happens, the checkpoint in the /tmp/ directory is incomplete. It looks like a memory leak.
Does anyone know how to solve this? Thanks!

I'm using Google Cloud Platform with 1024 TPU v3 cores (512 TPU v3 chips).

@lintangsutawika

I have a similar issue. This seems to be quite recent.
