Minimize impact on distributed training task when spot instance recycled #1653
Replies: 1 comment
-
Hey @QuantumXiecao, thank you for starting the discussion!
This is a good point! If checkpoints are saved on the local disk of the spot cluster, they will be deleted when the cluster is preempted. With that in mind, we provide a SkyPilot Storage feature that mounts a cloud bucket, such as S3 or GCS, to a local directory on the spot VM. As mentioned in our documentation here, the user program can periodically save checkpoints to that directory, and the checkpoints will persist in the cloud bucket even after preemption. After the spot VMs are recovered, the same cloud bucket is mounted to the new VMs again, so the user program can load the latest checkpoint from it and resume training.
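For concreteness, here is a minimal sketch of that checkpoint-to-bucket pattern in PyTorch, assuming the bucket is mounted at /checkpoint; the mount point, file naming, and save interval are illustrative placeholders, not part of SkyPilot's API:

```python
import glob
import os

import torch

# Hypothetical directory where the cloud bucket is mounted on the spot VM.
CKPT_DIR = "/checkpoint"


def save_checkpoint(step, model, optimizer):
    """Write a checkpoint into the bucket-backed directory."""
    path = os.path.join(CKPT_DIR, f"ckpt_{step}.pt")
    torch.save(
        {
            "step": step,
            "model": model.state_dict(),
            "optimizer": optimizer.state_dict(),
        },
        path,
    )


def load_latest_checkpoint(model, optimizer):
    """Resume from the newest checkpoint if one survived a previous preemption."""
    ckpts = sorted(
        glob.glob(os.path.join(CKPT_DIR, "ckpt_*.pt")),
        key=lambda p: int(os.path.basename(p).split("_")[-1].split(".")[0]),
    )
    if not ckpts:
        return 0  # No checkpoint found: start from scratch.
    state = torch.load(ckpts[-1], map_location="cpu")
    model.load_state_dict(state["model"])
    optimizer.load_state_dict(state["optimizer"])
    return state["step"] + 1


# Sketch of the training loop: resume first, then checkpoint every N steps.
# start_step = load_latest_checkpoint(model, optimizer)
# for step in range(start_step, total_steps):
#     train_one_step(...)
#     if step % 100 == 0:
#         save_checkpoint(step, model, optimizer)
```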
With mounted cloud bucket support, SkyPilot already enables automatic training recovery on spot VMs. That said, a preemption notification would still be very useful: the user program can treat it as a hint for when to save a checkpoint, reducing the progress that would otherwise be lost when checkpoints are not saved frequently enough.
-
After reading the code in spot/controller.py, I reckon that we resume training mainly via checkpoints (supported by PyTorch). But if the checkpoint file has been deleted because the spot instance was recycled, resuming could be very costly. Cloud providers usually supply APIs to notify us, or let us poll the spot instance status, so developers know in advance that the instance will be recycled after a certain time (2 minutes for AWS, 5 minutes for AliCloud, ...). During that window, could we transfer the checkpoint file so that a future resume is faster? I'd like to hear your opinions.
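To make the suggestion concrete, here is a rough sketch of how a training script could poll AWS's spot interruption notice and use it as a trigger for an urgent checkpoint save. The endpoint shown is AWS-specific and assumes IMDSv1 is reachable (with IMDSv2 you would first fetch a session token); the helper names are hypothetical and this is only a sketch of the idea, not an existing SkyPilot feature:

```python
import threading
import time

import requests

# AWS spot interruption notice endpoint: returns 404 until an interruption is
# scheduled, then a small JSON body roughly two minutes before the instance is
# reclaimed. (Assumption: IMDSv1 access; other clouds expose different APIs.)
SPOT_ACTION_URL = "http://169.254.169.254/latest/meta-data/spot/instance-action"


def watch_for_interruption(on_notice, poll_seconds=5):
    """Poll the metadata service and invoke `on_notice` once a notice appears."""

    def _poll():
        while True:
            try:
                resp = requests.get(SPOT_ACTION_URL, timeout=2)
                if resp.status_code == 200:
                    # e.g. {"action": "terminate", "time": "2023-01-01T00:00:00Z"}
                    on_notice(resp.json())
                    return
            except requests.RequestException:
                pass  # Metadata service briefly unreachable; keep polling.
            time.sleep(poll_seconds)

    threading.Thread(target=_poll, daemon=True).start()


# Hypothetical usage: set a flag so the training loop saves an extra checkpoint
# to the bucket-mounted directory before the interruption window closes.
interrupted = threading.Event()
watch_for_interruption(lambda notice: interrupted.set())
```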