test_mpi_accs stuck #31
Comments
Can you run
Hi Jeff, sorry for the delay. Here are the stack traces: the output of top when test_mpi_accs processes get stuck:
==> attached by GDB::
The hang is intermittent; it does not happen on every run.
The MPICH version:
Line 48 above is the very first time
It seems that it is stuck in the libfabric sockets conduit. I have never seen this before. Try running with If that works, can you then rebuild ARMCI-MPI with If neither works, I have to assume the container situation has broken something. I do not use containers so I don't know how to debug this.
With
Here is some example output (filtered to a few relevant lines); note the three stuck entries:
On a related note, when I tried to build armci-mpi outside the container, I ran into a different problem, which I posted as a separate issue: #32.
While building the armci-mpi library for use on our cluster, I found that the test_mpi_accs program could not make progress. I don't yet have good information about where the program is stuck. The underlying MPI library is MPICH 3.1. The build took place in a Singularity container. The compiler is GCC version 7.3.0 (crosstool-NG 1.23.0.449-a04d0) provided by conda, and the MPICH library was built with the same GCC toolchain.