Minimize impact on distributed training task when spot instance recycled #1653
Replies: 1 comment
-
Hey @QuantumXiecao, thank you for starting the discussion!
This is a good point! If checkpoints are saved on the local disk of the spot cluster, they will be deleted when the cluster is preempted. With that in mind, we provide a SkyPilot Storage feature that mounts a cloud bucket, such as S3 or GCS, to a local directory on the spot VM. As mentioned in our documentation here, the user program can periodically save checkpoints to that directory, and the checkpoints will persist in the cloud bucket even after preemption. After the spot VMs are recovered, the same cloud bucket is mounted to the new VMs again, so the user program can load the latest checkpoint from it and resume training.
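For concreteness, here is a minimal sketch of that checkpoint-to-bucket pattern in PyTorch, assuming the bucket is mounted at /checkpoint; the mount point, file naming, and save interval are illustrative placeholders, not part of SkyPilot's API:

```python
import glob
import os

import torch

# Hypothetical directory where the cloud bucket is mounted on the spot VM.
CKPT_DIR = "/checkpoint"


def save_checkpoint(step, model, optimizer):
    """Write a checkpoint into the bucket-backed directory."""
    path = os.path.join(CKPT_DIR, f"ckpt_{step}.pt")
    torch.save(
        {
            "step": step,
            "model": model.state_dict(),
            "optimizer": optimizer.state_dict(),
        },
        path,
    )


def load_latest_checkpoint(model, optimizer):
    """Resume from the newest checkpoint if one survived a previous preemption."""
    ckpts = sorted(
        glob.glob(os.path.join(CKPT_DIR, "ckpt_*.pt")),
        key=lambda p: int(os.path.basename(p).split("_")[-1].split(".")[0]),
    )
    if not ckpts:
        return 0  # No checkpoint found: start from scratch.
    state = torch.load(ckpts[-1], map_location="cpu")
    model.load_state_dict(state["model"])
    optimizer.load_state_dict(state["optimizer"])
    return state["step"] + 1


# Sketch of the training loop: resume first, then checkpoint every N steps.
# start_step = load_latest_checkpoint(model, optimizer)
# for step in range(start_step, total_steps):
#     train_one_step(...)
#     if step % 100 == 0:
#         save_checkpoint(step, model, optimizer)
```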
With mounted cloud bucket support, SkyPilot already enables automatic training recovery on spot VMs. That said, a preemption notification would still be very useful: the user program can treat it as a hint for when to save a checkpoint, reducing the progress that would otherwise be lost when checkpoints are not saved frequently enough.
-
After reading the code in spot/controller.py, I reckon that we resume training mainly via checkpoints (supported by PyTorch). But if the checkpoint file has been deleted because the spot instance was recycled, resuming could be very costly. Cloud providers usually supply APIs to notify us, or let us poll the spot instance status, so developers know in advance that the instance will be recycled after a certain time (2 minutes for AWS, 5 minutes for AliCloud, ...). During that window, could we transfer the checkpoint file so that a future resume is faster? I'd like to hear your opinions.
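To make the suggestion concrete, here is a rough sketch of how a training script could poll AWS's spot interruption notice and use it as a trigger for an urgent checkpoint save. The endpoint shown is AWS-specific and assumes IMDSv1 is reachable (with IMDSv2 you would first fetch a session token); the helper names are hypothetical and this is only a sketch of the idea, not an existing SkyPilot feature:

```python
import threading
import time

import requests

# AWS spot interruption notice endpoint: returns 404 until an interruption is
# scheduled, then a small JSON body roughly two minutes before the instance is
# reclaimed. (Assumption: IMDSv1 access; other clouds expose different APIs.)
SPOT_ACTION_URL = "http://169.254.169.254/latest/meta-data/spot/instance-action"


def watch_for_interruption(on_notice, poll_seconds=5):
    """Poll the metadata service and invoke `on_notice` once a notice appears."""

    def _poll():
        while True:
            try:
                resp = requests.get(SPOT_ACTION_URL, timeout=2)
                if resp.status_code == 200:
                    # e.g. {"action": "terminate", "time": "2023-01-01T00:00:00Z"}
                    on_notice(resp.json())
                    return
            except requests.RequestException:
                pass  # Metadata service briefly unreachable; keep polling.
            time.sleep(poll_seconds)

    threading.Thread(target=_poll, daemon=True).start()


# Hypothetical usage: set a flag so the training loop saves an extra checkpoint
# to the bucket-mounted directory before the interruption window closes.
interrupted = threading.Event()
watch_for_interruption(lambda notice: interrupted.set())
```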