Checkpointing got stuck on Google Cloud TPU #1439

XueFuzhao · 2023-11-24T08:47:17Z

When training large models (> 10B parameters or so), we found that checkpointing sometimes gets stuck while saving a checkpoint.
For instance, it may save checkpoints smoothly a few times and then suddenly hang during one checkpointing process. I found that the checkpoints in the /tmp/ dir are incomplete. It seems to be a memory leak issue.
May I know how to solve it? Thanks!

I'm using Google Cloud Platform with 1024 TPU v3 cores (512 TPU v3 chips).

lintangsutawika · 2024-01-02T13:15:03Z

I have a similar issue. This seems to be quite recent.
