mlx5 connect on mlx5_1 failed: Connection timed out #9971
Comments
@shinoharakazuya can you pls post the output of
@jandres742 FYI
NOTE: This issue happens on an Nvidia internal cluster
Describe the bug
I'm running NGC's hpl benchmark test from Slurm. When I ran hpl in an hpl container on two servers with 8 GPUs per node, I encountered a UCX error.
Steps to Reproduce
ucx_info -v: Please see log file.
Setup and versions
cat /etc/issue or cat /etc/redhat-release + uname -a
cat /etc/mlnx-release (the string identifies software and firmware setup)
rpm -q rdma-core or rpm -q libibverbs
ofed_info -s
ibstat or ibv_devinfo -vv command
lsmod|grep nv_peer_mem and/or gdrcopy: lsmod|grep gdrdrv: Please see log file.
Additional information (depending on the issue)
ucx_info -d to show transports and devices recognized by UCX: Please see log file.
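For anyone collecting the same information on their own nodes, here is a minimal sketch of a script that runs the commands listed in the template above and saves their output to a single file. The `run_diag` and `collect_all` helpers and the `ucx_diag.log` file name are hypothetical and not part of UCX. Commands that are not installed on the host are noted and skipped rather than treated as errors.

```shell
#!/bin/sh
# Hypothetical helper: gathers the diagnostics the issue template asks for
# into one log file. The log file name is an assumption, not a UCX convention.
LOG="${UCX_DIAG_LOG:-ucx_diag.log}"

# Run one command line; record its output, or note that it was unavailable.
run_diag() {
    printf '=== %s ===\n' "$1"
    # The first word of the command line is the executable to look up.
    if command -v "${1%% *}" >/dev/null 2>&1; then
        sh -c "$1" 2>&1 || printf '(exit status %s)\n' "$?"
    else
        printf '(not available on this host)\n'
    fi
}

# The command lines below are the ones named in the issue template.
collect_all() {
    run_diag "ucx_info -v"
    run_diag "cat /etc/issue"
    run_diag "uname -a"
    run_diag "ofed_info -s"
    run_diag "ibstat"
    run_diag "ibv_devinfo -vv"
    run_diag "lsmod | grep -E 'nv_peer_mem|gdrdrv'"
    run_diag "ucx_info -d"
}

collect_all > "$LOG"
```

Running it once per node (for example from inside the same Slurm allocation) and attaching the resulting files should cover most of what the template requests. Set `UCX_DIAG_LOG` to write to a different path.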