Hello. First of all, I hope this is the right place to post my question. I am trying to run an MPI application on a cluster with Omni-Path. However, I am out of ideas for debugging it because:
I am using UCX 1.15 (I had the issue with 1.14 as well). My setup is also a bit uncommon, as I am using MPI with NixOS (the packages are defined here: https://github.com/oar-team/nur-kapack/tree/ucx-upgrade). If anyone has an idea on how to debug this, I'll gladly accept your help. This is the command that I execute:
and this is the error:
Replies: 2 comments 5 replies
Hi @adfaure, can you pls try adding
What is the difference between your image and the default environment?
Hello, I don't think the issue was related to UCX.
In fact, I had two network interfaces connected and configured on the same network, which caused MPI to fail with timeouts.
Now that I have identified the issue, shutting down one of the interfaces fixes it.
Thanks for your time :)
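
For anyone hitting similar timeouts, here is a minimal sketch of how one might check for the condition described above: two interfaces configured on the same IP network. It is not part of the original setup. The function name `interfaces_sharing_a_network` and the interface names and addresses in `example` are hypothetical, chosen only for illustration.

```python
import ipaddress
from collections import defaultdict


def interfaces_sharing_a_network(addresses):
    """Return the networks that have more than one interface attached.

    `addresses` maps an interface name to its address in CIDR notation,
    for example {"eth0": "10.0.0.5/24"}. The result maps each shared
    network to the sorted list of interface names configured on it.
    """
    by_network = defaultdict(list)
    for name, cidr in addresses.items():
        # ip_interface() accepts host bits, .network strips them
        network = ipaddress.ip_interface(cidr).network
        by_network[str(network)].append(name)
    return {
        net: sorted(names)
        for net, names in by_network.items()
        if len(names) > 1
    }


# Hypothetical node configuration: two interfaces on the same /16
example = {
    "ib0": "10.1.0.12/16",
    "eno1": "10.1.3.40/16",
    "lo": "127.0.0.1/8",
}
print(interfaces_sharing_a_network(example))
```

The input mapping has to be filled in by hand or from the node's own tooling (for instance by reading the output of `ip -o -4 addr show`). A non-empty result only flags a candidate for the problem; it does not prove that MPI is picking the wrong interface.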