Hi, I'm using ZeRO with optimizer and parameter offload to run MiniLLM on 2 H100 GPUs on a single node. Near the end of the generation evaluation (step 497/499 in the log below), I get a timeout during the all_gather step.
Generation Evaluation: 100%|█████████▉| 497/499 [18:29:58<05:20, 160.10s/it]
[E ProcessGroupNCCL.cpp:475] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=22134630, OpType=ALLGATHER, NumelIn=499, NumelOut=998, Timeout(ms)=18000000) ran for 18000109 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:489] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[E ProcessGroupNCCL.cpp:495] To avoid data inconsistency, we are taking the entire process down.
[E ProcessGroupNCCL.cpp:916] [Rank 1] NCCL watchdog thread terminated with exception: [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=22134630, OpType=ALLGATHER, NumelIn=499, NumelOut=998, Timeout(ms)=18000000) ran for 18000109 milliseconds before timing out.
terminate called after throwing an instance of 'std::runtime_error'
what(): [Rank 1] NCCL watchdog thread terminated with exception: [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=22134630, OpType=ALLGATHER, NumelIn=499, NumelOut=998, Timeout(ms)=18000000) ran for 18000109 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:475] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=22134629, OpType=_ALLGATHER_BASE, NumelIn=65536000, NumelOut=131072000, Timeout(ms)=18000000) ran for 18000929 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:489] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[E ProcessGroupNCCL.cpp:495] To avoid data inconsistency, we are taking the entire process down.
[E ProcessGroupNCCL.cpp:916] [Rank 0] NCCL watchdog thread terminated with exception: [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=22134629, OpType=_ALLGATHER_BASE, NumelIn=65536000, NumelOut=131072000, Timeout(ms)=18000000) ran for 18000929 milliseconds before timing out.
terminate called after throwing an instance of 'std::runtime_error'
what(): [Rank 0] NCCL watchdog thread terminated with exception: [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=22134629, OpType=_ALLGATHER_BASE, NumelIn=65536000, NumelOut=131072000, Timeout(ms)=18000000) ran for 18000929 milliseconds before timing out.
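A note on reading the log above: the two ranks report different timed-out collectives. Rank 0 is stuck on `_ALLGATHER_BASE` at SeqNum 22134629 with 65536000 input elements, while rank 1 is stuck on `ALLGATHER` at SeqNum 22134630 with 499 input elements. That pattern may mean the ranks were no longer issuing the same collectives in the same order, in which case a longer timeout alone would not be expected to help. The sketch below is only a diagnostic aid for extracting and comparing those fields. It is not a fix, the helper names `stuck_collectives` and `ranks_disagree` are hypothetical, and the regex assumes the exact watchdog message format shown in this log.

```python
import re

# Matches the watchdog message format seen in the log above (assumed stable
# only for this log; other PyTorch versions may word it differently).
WATCHDOG = re.compile(
    r"\[Rank (?P<rank>\d+)\] Watchdog caught collective operation timeout: "
    r"WorkNCCL\(SeqNum=(?P<seq>\d+), OpType=(?P<op>\w+), "
    r"NumelIn=(?P<numel_in>\d+), NumelOut=(?P<numel_out>\d+)"
)


def stuck_collectives(log_text):
    """Map each rank to (SeqNum, OpType, NumelIn, NumelOut) of its timed-out collective."""
    stuck = {}
    for m in WATCHDOG.finditer(log_text):
        stuck[int(m["rank"])] = (
            int(m["seq"]),
            m["op"],
            int(m["numel_in"]),
            int(m["numel_out"]),
        )
    return stuck


def ranks_disagree(stuck):
    """True when the ranks did not all time out on the same collective."""
    return len(set(stuck.values())) > 1


# Two lines copied from the log above, one per rank.
LOG = (
    "[E ProcessGroupNCCL.cpp:475] [Rank 1] Watchdog caught collective operation timeout: "
    "WorkNCCL(SeqNum=22134630, OpType=ALLGATHER, NumelIn=499, NumelOut=998, "
    "Timeout(ms)=18000000) ran for 18000109 milliseconds before timing out.\n"
    "[E ProcessGroupNCCL.cpp:475] [Rank 0] Watchdog caught collective operation timeout: "
    "WorkNCCL(SeqNum=22134629, OpType=_ALLGATHER_BASE, NumelIn=65536000, NumelOut=131072000, "
    "Timeout(ms)=18000000) ran for 18000929 milliseconds before timing out.\n"
)

if __name__ == "__main__":
    stuck = stuck_collectives(LOG)
    for rank in sorted(stuck):
        print(rank, stuck[rank])
    print("ranks disagree:", ranks_disagree(stuck))
```

If the ranks disagree, as they do here, it may be worth checking whether both ranks run the same number of forward passes during generation before adjusting the timeout any further.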
I've tried increasing the timeout period without success. Are there any other configurations or steps I can take to resolve this timeout issue?
Thank you for your help!