Hello,

Thank you for the great work! I've been fine-tuning ProLong on my own dataset and it works really well.

I am curious whether you have trained with multiple nodes before. I saw that train_sft.sh can handle more than one node (lines 67-77). However, when I modified the code to work on my custom dataset, I always got torch.distributed.elastic.rendezvous.api.RendezvousTimeoutError, even with the rdzv-conf timeout set to 600.

Here is my code if it helps. I want to fine-tune using 8 A100s across 2 nodes:

Curious to hear your insights!
If you use Slurm, then srun is important: it makes sure the launcher actually starts on all nodes. If those per-node processes never run, the worker nodes never join the rendezvous, and the head node eventually raises the rendezvous timeout error. So it's probably worth understanding why you get the Slurm specification error with srun!
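For reference, a rough sketch of that pattern as an sbatch script (the job name, port, resource lines, and the finetune.py entry point are placeholders for your setup, not the exact train_sft.sh configuration):

```bash
#!/bin/bash
#SBATCH --job-name=prolong-sft     # placeholder job name
#SBATCH --nodes=2                  # 2 nodes x 4 GPUs = 8 A100s
#SBATCH --gres=gpu:4
#SBATCH --ntasks-per-node=1        # one torchrun launcher per node

# Use the first node in the allocation as the rendezvous host.
head_node=$(scontrol show hostnames "$SLURM_JOB_NODELIST" | head -n 1)

# srun runs this torchrun command on EVERY node in the allocation.
# Without a per-node launcher like srun, torchrun only starts on the
# node where the script runs, the second node never joins, and the
# head node eventually raises RendezvousTimeoutError.
srun torchrun \
    --nnodes=2 \
    --nproc-per-node=4 \
    --rdzv-backend=c10d \
    --rdzv-id="$SLURM_JOB_ID" \
    --rdzv-endpoint="${head_node}:29500" \
    --rdzv-conf="timeout=600" \
    finetune.py  # placeholder for your actual training entry point
```

If srun itself fails with the specification error, the rendezvous timeout is just a symptom; the fix is getting that srun line to actually run on both nodes.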