```shell
# export NCCL_IB_HCA=mlx5_0:1,mlx5_1:1,mlx5_2:1,mlx5_3:1,mlx5_4:1,mlx5_7:1,mlx5_8:1,mlx5_9:1
export NCCL_IB_DISABLE=0
export NCCL_SOCKET_IFNAME=bond0
export NCCL_DEBUG=INFO
export NCCL_NVLS_ENABLE=0
```
How should I adjust these settings for my environment? Unfortunately, I ran into the following error:

File "../miniconda3/envs/rdt/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1910, in broadcast
work = group.broadcast([tensor], opts)
RuntimeError: [1] is setting up NCCL communicator and retrieving ncclUniqueId from [0] via c10d key-value store by key '0', but store->get('0') got error: Connection reset by peer. This may indicate a possible application crash on rank 0 or a network set up issue.
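Since the error message points to either a crash on rank 0 or a network setup issue, a quick check on each node can help narrow it down. The following is only a rough diagnostic sketch, not a confirmed fix. It assumes the job is launched with the `MASTER_ADDR` and `MASTER_PORT` environment variables (not shown in my settings above), and the helper names `interface_exists` and `can_connect` are made up for illustration.

```python
import os
import socket


def interface_exists(name, available=None):
    """Return True if `name` is one of the host's network interfaces.

    This is an exact-name check. NCCL_SOCKET_IFNAME also accepts prefixes
    and the `^` / `=` modifiers, which this sketch does not handle.
    """
    if available is None:
        available = [ifname for _, ifname in socket.if_nameindex()]
    return name in available


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within `timeout`."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name-resolution failures.
        return False


def main():
    # Interface name taken from the settings above (bond0 in my case).
    ifname = os.environ.get("NCCL_SOCKET_IFNAME", "")
    if ifname:
        print(f"interface {ifname!r} present: {interface_exists(ifname)}")

    # Assumption: rendezvous uses MASTER_ADDR / MASTER_PORT.
    addr = os.environ.get("MASTER_ADDR")
    port = os.environ.get("MASTER_PORT")
    if addr and port:
        ok = can_connect(addr, int(port))
        print(f"rank 0 store reachable at {addr}:{port}: {ok}")


if __name__ == "__main__":
    main()
```

The connection check only means something while rank 0 is already running and listening. If the interface is missing on some node, or the port is unreachable from it, the problem is more likely in the network setup than in the NCCL settings themselves.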