test_mpi_accs stuck #31

Open · wirawan0 opened this issue Apr 8, 2021 · 5 comments

@wirawan0 commented Apr 8, 2021

While building the armci-mpi library for use on our cluster, I found that the test_mpi_accs program could not make progress. I don't have good information yet about where the program gets stuck. The underlying MPI library is MPICH 3.1. This build was taking place in a Singularity container. The compiler is GCC 7.3.0 (crosstool-NG 1.23.0.449-a04d0) provided by conda, and the MPICH library was built with that same GCC toolchain.

@jeffhammond (Member)

Can you run gstack or attach GDB to the stuck process to see where it is?
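
For example, something along these lines (a minimal sketch: it just looks up the test processes by name and dumps their stacks):

# Find the PIDs of the stuck test processes by name
pids=$(pgrep -f test_mpi_accs)

# One-shot stack traces; gstack ships with the gdb package on most distributions
for pid in $pids; do gstack "$pid"; done

# Or attach GDB non-interactively and print a backtrace for every thread
for pid in $pids; do gdb -p "$pid" -batch -ex "thread apply all bt"; done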

@wirawan0 (Author)

Hi Jeff, sorry for the delay. Here are the stack traces:

The output of top while the test_mpi_accs processes are stuck:

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND                                     
45007 root      20   0   81252  36852   7116 R  82.6  0.0   4:11.94 test_mpi_accs                               
45009 root      20   0   81244  36612   6876 R  82.6  0.0   4:11.46 test_mpi_accs

Attaching with GDB:

(gdb) attach 45007
Attaching to process 45007
Reading symbols from /opt/armci-mpi-master/tests/mpi/test_mpi_accs...done.
Reading symbols from /opt/mpich/lib/libmpi.so.12...(no debugging symbols found)...done.
Loaded symbols for /opt/mpich/lib/libmpi.so.12
Reading symbols from /lib64/libpthread.so.0...(no debugging symbols found)...done.
[New LWP 45022]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
Loaded symbols for /lib64/libpthread.so.0
Reading symbols from /lib64/libc.so.6...(no debugging symbols found)...done.
Loaded symbols for /lib64/libc.so.6
Reading symbols from /opt/conda/lib/libgfortran.so.4...done.
Loaded symbols for /opt/conda/lib/libgfortran.so.4
Reading symbols from /lib64/libm.so.6...(no debugging symbols found)...done.
Loaded symbols for /lib64/libm.so.6
Reading symbols from /lib64/libdl.so.2...(no debugging symbols found)...done.
Loaded symbols for /lib64/libdl.so.2
Reading symbols from /lib64/librt.so.1...(no debugging symbols found)...done.
Loaded symbols for /lib64/librt.so.1
Reading symbols from /opt/conda/lib/libgcc_s.so.1...done.
Loaded symbols for /opt/conda/lib/libgcc_s.so.1
Reading symbols from /lib64/ld-linux-x86-64.so.2...(no debugging symbols found)...done.
Loaded symbols for /lib64/ld-linux-x86-64.so.2
Reading symbols from /opt/conda/lib/libquadmath.so.0...done.
Loaded symbols for /opt/conda/lib/libquadmath.so.0
0x00007fb9ed0c27f3 in sock_pe_progress_tx_entry () from /opt/mpich/lib/libmpi.so.12
Missing separate debuginfos, use: debuginfo-install glibc-2.17-307.el7.1.x86_64
(gdb) where
#0  0x00007fb9ed0c27f3 in sock_pe_progress_tx_entry () from /opt/mpich/lib/libmpi.so.12
#1  0x00007fb9ed0c3d7b in sock_pe_progress_tx_ctx () from /opt/mpich/lib/libmpi.so.12
#2  0x00007fb9ed0c5056 in sock_pe_progress_ep_tx () from /opt/mpich/lib/libmpi.so.12
#3  0x00007fb9ed0b28b2 in sock_cq_progress () from /opt/mpich/lib/libmpi.so.12
#4  0x00007fb9ed0b2f21 in sock_cq_sreadfrom () from /opt/mpich/lib/libmpi.so.12
#5  0x00007fb9ecba1612 in MPIDI_CH4R_mpi_win_unlock () from /opt/mpich/lib/libmpi.so.12
#6  0x00007fb9ecbe0c23 in PMPI_Win_unlock () from /opt/mpich/lib/libmpi.so.12
#7  0x0000555a27b38356 in main (argc=1, argv=0x7ffc89e639b8) at tests/mpi/test_mpi_accs.c:48
(gdb) attach 45009
Attaching to program: /opt/armci-mpi-master/tests/mpi/test_mpi_accs, process 45009
Reading symbols from /opt/mpich/lib/libmpi.so.12...(no debugging symbols found)...done.
Loaded symbols for /opt/mpich/lib/libmpi.so.12
Reading symbols from /lib64/libpthread.so.0...(no debugging symbols found)...done.
[New LWP 45023]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
Loaded symbols for /lib64/libpthread.so.0
Reading symbols from /lib64/libc.so.6...(no debugging symbols found)...done.
Loaded symbols for /lib64/libc.so.6
Reading symbols from /opt/conda/lib/libgfortran.so.4...done.
Loaded symbols for /opt/conda/lib/libgfortran.so.4
Reading symbols from /lib64/libm.so.6...(no debugging symbols found)...done.
Loaded symbols for /lib64/libm.so.6
Reading symbols from /lib64/libdl.so.2...(no debugging symbols found)...done.
Loaded symbols for /lib64/libdl.so.2
Reading symbols from /lib64/librt.so.1...(no debugging symbols found)...done.
Loaded symbols for /lib64/librt.so.1
Reading symbols from /opt/conda/lib/libgcc_s.so.1...done.
Loaded symbols for /opt/conda/lib/libgcc_s.so.1
Reading symbols from /lib64/ld-linux-x86-64.so.2...(no debugging symbols found)...done.
Loaded symbols for /lib64/ld-linux-x86-64.so.2
Reading symbols from /opt/conda/lib/libquadmath.so.0...done.
Loaded symbols for /opt/conda/lib/libquadmath.so.0
0x00007f3404d7baeb in recv () from /lib64/libpthread.so.0
(gdb) where
#0  0x00007f3404d7baeb in recv () from /lib64/libpthread.so.0
#1  0x00007f3406008f41 in sock_comm_peek () from /opt/mpich/lib/libmpi.so.12
#2  0x00007f3406003e42 in sock_pe_progress_rx_pe_entry () from /opt/mpich/lib/libmpi.so.12
#3  0x00007f3406006b4e in sock_pe_progress_rx_ctx () from /opt/mpich/lib/libmpi.so.12
#4  0x00007f3406006c36 in sock_pe_progress_ep_rx () from /opt/mpich/lib/libmpi.so.12
#5  0x00007f3405ff58f3 in sock_cq_progress () from /opt/mpich/lib/libmpi.so.12
#6  0x00007f3405ff5f21 in sock_cq_sreadfrom () from /opt/mpich/lib/libmpi.so.12
#7  0x00007f3405ae4612 in MPIDI_CH4R_mpi_win_unlock () from /opt/mpich/lib/libmpi.so.12
#8  0x00007f3405b23c23 in PMPI_Win_unlock () from /opt/mpich/lib/libmpi.so.12
#9  0x000055950d909356 in main (argc=1, argv=0x7ffc580e9478) at tests/mpi/test_mpi_accs.c:48

This "stuck" is a random occasion; it does not always happen.

@wirawan0 (Author)

The MPICH version:

root@wahab-01:/opt/mpich/bin# ./mpichversion 
MPICH Version:          3.3.2
MPICH Release date:     Tue Nov 12 21:23:16 CST 2019
MPICH Device:           ch4:ofi
MPICH configure:        --prefix=/opt/mpich --with-device=ch4:ofi
MPICH CC:       x86_64-conda_cos6-linux-gnu-gcc -march=nocona -mtune=haswell -ftree-vectorize -fPIC -fstack-protector-strong -fno-plt -O2 -ffunction-sections -pipe -isystem /opt/conda/include   -O2
MPICH CXX:      x86_64-conda_cos6-linux-gnu-g++ -fvisibility-inlines-hidden -std=c++17 -fmessage-length=0 -march=nocona -mtune=haswell -ftree-vectorize -fPIC -fstack-protector-strong -fno-plt -O2 -ffunction-sections -pipe -isystem /opt/conda/include  -O2
MPICH F77:      /opt/conda/bin/x86_64-conda_cos6-linux-gnu-gfortran -fopenmp -march=nocona -mtune=haswell -ftree-vectorize -fPIC -fstack-protector-strong -fno-plt -O2 -ffunction-sections -pipe -isystem /opt/conda/include  -O2
MPICH FC:       x86_64-conda_cos6-linux-gnu-gfortran   -O2
MPICH Custom Information: 

Line 48 above is the very first call to MPI_Win_unlock in the code (the first loop iteration).

@jeffhammond (Member)

It seems to be stuck in the libfabric sockets provider. I have never seen this before.

Try running with MPICH_ASYNC_PROGRESS=1 (no need to recompile anything). If that works, we have a clue.

If that works, can you then rebuild ARMCI-MPI with configure --with-progress .. and run with ARMCI_PROGRESS_THREAD=1 and ARMCI_PROGRESS_USLEEP=100? That is a second clue.
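
Roughly like this (a sketch only; the armci-mpi source path is a placeholder, and any other configure options should match whatever your original build used):

# First clue: enable MPICH's internal asynchronous progress thread (no rebuild needed)
MPICH_ASYNC_PROGRESS=1 mpiexec -n 2 ./test_mpi_accs

# Second clue: rebuild ARMCI-MPI with its own progress thread compiled in...
cd /path/to/armci-mpi
./configure --with-progress
make

# ...then rerun the test with the ARMCI progress thread enabled
ARMCI_PROGRESS_THREAD=1 ARMCI_PROGRESS_USLEEP=100 mpiexec -n 2 ./test_mpi_accs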

If neither works, I have to assume the container situation has broken something. I do not use containers so I don't know how to debug this.

@wirawan0 (Author)

With MPICH_ASYNC_PROGRESS=1 the result is rather erratic (sometimes it works, but about 25% of the time it gets stuck). Here is a test script that launches the same program repeatedly:

#!/bin/bash

# second test: with MPICH_ASYNC_PROGRESS=1

source /etc/container_runtime/runtime
load_mpich

: ${COUNT:=100}

export MPICH_ASYNC_PROGRESS=1

for c in $(seq 1 $COUNT); do
    echo "Iteration $c - $(date +%Y-%m-%dT%H:%M:%S)"
    tm_start=$(date +%s.%N)
    mpiexec -n 2 ./test_mpi_accs
    tm_end=$(date +%s.%N)
    awk -v tm_start="$tm_start" -v tm_end="$tm_end" -v c="$c" \
        'BEGIN { printf("- iteration %3d  time = %.3f secs\n", c, tm_end - tm_start) }'
    sleep 5s
done

Here is example output (filtered to just the relevant lines); you can see three hangs here:

- iteration   1  time = 1.089 secs
- iteration   2  time = 2.035 secs
^C[mpiexec@wahab-01] Sending Ctrl-C to processes as requested
[mpiexec@wahab-01] Press Ctrl-C again to force abort
- iteration   3  time = 15.499 secs
- iteration   4  time = 2.090 secs
- iteration   5  time = 1.494 secs
- iteration   6  time = 2.926 secs
- iteration   7  time = 3.569 secs
- iteration   8  time = 1.292 secs
- iteration   9  time = 1.506 secs
- iteration  10  time = 1.785 secs
- iteration  11  time = 2.163 secs
^C[mpiexec@wahab-01] Sending Ctrl-C to processes as requested
[mpiexec@wahab-01] Press Ctrl-C again to force abort
- iteration  12  time = 215.610 secs
- iteration  13  time = 2.391 secs
^C[mpiexec@wahab-01] Sending Ctrl-C to processes as requested
[mpiexec@wahab-01] Press Ctrl-C again to force abort
- iteration  14  time = 24.370 secs

On a related note, when I tried to build armci-mpi outside the container I ran into a different problem, which I posted as a separate issue: #32.
