Timeout Error in all_gather during evaluate_ppo() on 2 H100 GPUs with miniLLM and ZeRO #127

Ispanicus · 2023-12-12T11:51:59Z

Hi, I'm using ZeRO with optimizer and parameter offload to run MiniLLM on 2 H100 GPUs on a single node. After doing the generation evaluation, I get a timeout during the all_gather step.

Generation Evaluation: 100%|█████████▉| 497/499 [18:29:58<05:20, 160.10s/it]
[E ProcessGroupNCCL.cpp:475] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=22134630, OpType=ALLGATHER, NumelIn=499, NumelOut=998, Timeout(ms)=18000000) ran for 18000109 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:489] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[E ProcessGroupNCCL.cpp:495] To avoid data inconsistency, we are taking the entire process down.
[E ProcessGroupNCCL.cpp:916] [Rank 1] NCCL watchdog thread terminated with exception: [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=22134630, OpType=ALLGATHER, NumelIn=499, NumelOut=998, Timeout(ms)=18000000) ran for 18000109 milliseconds before timing out.
terminate called after throwing an instance of 'std::runtime_error'
  what():  [Rank 1] NCCL watchdog thread terminated with exception: [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=22134630, OpType=ALLGATHER, NumelIn=499, NumelOut=998, Timeout(ms)=18000000) ran for 18000109 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:475] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=22134629, OpType=_ALLGATHER_BASE, NumelIn=65536000, NumelOut=131072000, Timeout(ms)=18000000) ran for 18000929 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:489] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[E ProcessGroupNCCL.cpp:495] To avoid data inconsistency, we are taking the entire process down.
[E ProcessGroupNCCL.cpp:916] [Rank 0] NCCL watchdog thread terminated with exception: [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=22134629, OpType=_ALLGATHER_BASE, NumelIn=65536000, NumelOut=131072000, Timeout(ms)=18000000) ran for 18000929 milliseconds before timing out.
terminate called after throwing an instance of 'std::runtime_error'
  what():  [Rank 0] NCCL watchdog thread terminated with exception: [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=22134629, OpType=_ALLGATHER_BASE, NumelIn=65536000, NumelOut=131072000, Timeout(ms)=18000000) ran for 18000929 milliseconds before timing out.

I've tried increasing the timeout period without success. Are there any other configurations or steps I can take to resolve this timeout issue?

Thank you for your help!

donglixp · 2023-12-12T12:54:27Z

Have you tried A100s or V100s? I am unsure whether the above error only appears with H100s.

Ispanicus · 2023-12-12T13:27:10Z

Unfortunately, I only have access to 2 H100s. The hardware could be the issue, since they run on CUDA sm_90, but I wouldn't know where to begin debugging that.
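
One possible starting point that does not depend on the GPU architecture is to compare what each rank was waiting on when the watchdog fired. In the log above, rank 1 timed out in an ALLGATHER (SeqNum=22134630, NumelIn=499) while rank 0 timed out in an _ALLGATHER_BASE (SeqNum=22134629, NumelIn=65536000), so the two ranks were not blocked on the same collective. The sketch below is only an illustration: the helper names `parse_watchdog` and `ranks_desynchronized` are hypothetical, and the regular expression assumes the exact log format shown in this issue.

```python
import re

# Assumed format: the ProcessGroupNCCL watchdog lines quoted in this issue.
WATCHDOG_RE = re.compile(
    r"\[Rank (?P<rank>\d+)\] Watchdog caught collective operation timeout: "
    r"WorkNCCL\(SeqNum=(?P<seq>\d+), OpType=(?P<op>\w+), "
    r"NumelIn=(?P<numel_in>\d+), NumelOut=(?P<numel_out>\d+)"
)


def parse_watchdog(log_text):
    """Map each rank to the (SeqNum, OpType, NumelIn, NumelOut) it timed out on."""
    stuck = {}
    for m in WATCHDOG_RE.finditer(log_text):
        stuck[int(m["rank"])] = (
            int(m["seq"]),
            m["op"],
            int(m["numel_in"]),
            int(m["numel_out"]),
        )
    return stuck


def ranks_desynchronized(stuck):
    """True if the ranks were blocked on different collectives."""
    return len(set(stuck.values())) > 1


# Two lines copied from the log in this issue.
LOG = """\
[E ProcessGroupNCCL.cpp:475] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=22134630, OpType=ALLGATHER, NumelIn=499, NumelOut=998, Timeout(ms)=18000000) ran for 18000109 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:475] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=22134629, OpType=_ALLGATHER_BASE, NumelIn=65536000, NumelOut=131072000, Timeout(ms)=18000000) ran for 18000929 milliseconds before timing out.
"""

stuck = parse_watchdog(LOG)
for rank, info in sorted(stuck.items()):
    print(rank, info)
print("desynchronized:", ranks_desynchronized(stuck))
```

If the ranks turn out to be blocked on different collectives, as they appear to be here, a longer timeout is unlikely to help on its own. One guess, not confirmed anywhere in this thread, is that one rank is still gathering ZeRO-partitioned parameters for a forward pass while the other has already moved on to gathering evaluation results, so it may be worth checking that both ranks execute the same number of generation steps.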

Ispanicus closed this as completed Dec 12, 2023

Ispanicus reopened this Dec 12, 2023
