
Failed assertion in Intel MPI when running a large number of tests #7

Open
BerengerBerthoul opened this issue Aug 28, 2024 · 1 comment


BerengerBerthoul commented Aug 28, 2024

We see very strange behavior when running the Maia unit test suite with pytest_parallel and Intel MPI:

/[...]/maia/maia/my_debug/transfer/test/test_te_utils.py::test_create_all_elt_distribution[2] Assertion failed in file ../../src/mpid/ch4/src/intel/ch4_shm_coll.c at line 2279: comm->shm_numa_layout[my_numa_node].base_addr

Steps to reproduce:

Versions:

  • Maia (https://github.com/onera/maia) hash c2cc8d606384f457a6cf771065680f8ea48c8fa4
  • pytest_parallel hash ac63c3c6aeeb0747bfd1521f6e59ac7dbaf71130
  • Intel MPI: Intel(R) MPI Library for Linux* OS, Version 2021.6 Build 20220227 (id: 28877f3f32)
  • gcc (GCC) 10.2.0

Build and run:

cmake -S $MAIA_ROOT -DCMAKE_CXX_STANDARD=20 -DCMAKE_BUILD_TYPE=Debug -DPDM_ENABLE_TESTS=OFF -DCMAKE_INTERPROCEDURAL_OPTIMIZATION=OFF
mpirun -np 4 python -u -m pytest -s -vv $MAIA_ROOT/maia

Notes:

  • It is difficult to isolate a specific test case; it seems that a large number of tests (~500) is needed for the failure to occur.
  • It might be a bug in one or several test cases, but it seems to come from an MPI problem instead.
  • No problem on other machines or with other MPI versions, except maybe our dev cluster (same MPI version, but the failure is only triggered when launched through a non-exclusive SLURM job).
  • The same error message is mentioned here: https://community.intel.com/t5/Intel-MPI-Library/MPI-program-aborts-with-an-quot-Assertion-failed-in-file-ch4-shm/td-p/1370537/page/2 and the answer there is that it was corrected in Intel MPI 2023.2.
  • The test suite works with the static and dynamic schedulers; only the sequential one causes trouble.
  • Adding time.sleep(0.1) in pytest_pyfunc_call to slightly change the timing does not change anything (see the sketch after this list).
  • Adding gc.collect() in pytest_runtest_protocol, or conversely gc.disable() at the beginning, does not change anything either.
  • We found a way to rewrite the sequential scheduler that avoids the bug => patch coming soon.
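
For reference, the hooks mentioned above were instrumented roughly as follows. This is a minimal sketch of the debugging code (placed, for example, in a conftest.py or directly in the plugin's hooks; that placement is an assumption), not the actual pytest_parallel implementation. The hook signatures are the standard pytest ones; only the time.sleep / gc.collect calls were added:

import gc
import time

import pytest

@pytest.hookimpl(hookwrapper=True)
def pytest_pyfunc_call(pyfuncitem):
    # Slightly perturb the timing between ranks before each test body runs.
    time.sleep(0.1)
    yield

@pytest.hookimpl(hookwrapper=True)
def pytest_runtest_protocol(item, nextitem):
    # Force a garbage collection between tests, in case the crash depends on
    # when MPI-related Python objects are freed.
    gc.collect()
    yield

Neither variant (nor gc.disable() at interpreter start) changes the outcome, which suggests the problem is not simply a garbage-collection or timing issue on the Python side.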

Complete error message:

/[...]/maia/maia/my_debug/transfer/test/test_te_utils.py::test_create_all_elt_distribution[2] Assertion failed in file ../../src/mpid/ch4/src/intel/ch4_shm_coll.c at line 2279: comm->shm_numa_layout[my_numa_node].base_addr
/[...]/intel/oneapi/mpi/2021.6.0/lib/release/libmpi.so.12(MPL_backtrace_show+0x1c) [0x146be339852c]
/[...]/intel/oneapi/mpi/2021.6.0/lib/release/libmpi.so.12(MPIR_Assert_fail+0x21) [0x146be2d1cc91]
/[...]/intel/oneapi/mpi/2021.6.0/lib/release/libmpi.so.12(+0x264cc6) [0x146be2a57cc6]
/[...]/intel/oneapi/mpi/2021.6.0/lib/release/libmpi.so.12(+0x1b376c) [0x146be29a676c]
/[...]/intel/oneapi/mpi/2021.6.0/lib/release/libmpi.so.12(+0x185be9) [0x146be2978be9]
/[...]/intel/oneapi/mpi/2021.6.0/lib/release/libmpi.so.12(+0x165780) [0x146be2958780]
/[...]/intel/oneapi/mpi/2021.6.0/lib/release/libmpi.so.12(+0x267495) [0x146be2a5a495]
/[...]/intel/oneapi/mpi/2021.6.0/lib/release/libmpi.so.12(MPI_Barrier+0x254) [0x146be2937444]
/[...]/python/3.8.14-intel2220-hpc/lib/python3.8/site-packages/mpi4py/MPI.cpython-38-x86_64-linux-gnu.so(+0x15f16f) [0x146bb914616f]
/[...]/python/3.8.14-intel2220-hpc/lib/libpython3.8.so.1.0(+0xc8e4a) [0x146be4d60e4a]
/[...]/python/3.8.14-intel2220-hpc/lib/libpython3.8.so.1.0(+0x247376) [0x146be4edf376]
/[...]/python/3.8.14-intel2220-hpc/lib/libpython3.8.so.1.0(_PyEval_EvalFrameDefault+0xc1a) [0x146be4ed969a]
/[...]/python/3.8.14-intel2220-hpc/lib/libpython3.8.so.1.0(_PyFunction_Vectorcall+0x101) [0x146be4ea3a61]
/[...]/python/3.8.14-intel2220-hpc/lib/libpython3.8.so.1.0(+0x247376) [0x146be4edf376]
/[...]/python/3.8.14-intel2220-hpc/lib/libpython3.8.so.1.0(_PyEval_EvalFrameDefault+0x20c9) [0x146be4edab49]
/[...]/python/3.8.14-intel2220-hpc/lib/libpython3.8.so.1.0(_PyFunction_Vectorcall+0x101) [0x146be4ea3a61]
/[...]/python/3.8.14-intel2220-hpc/lib/libpython3.8.so.1.0(+0x20c6ce) [0x146be4ea46ce]
/[...]/python/3.8.14-intel2220-hpc/lib/libpython3.8.so.1.0(PyVectorcall_Call+0x62) [0x146be4d5ae12]
/[...]/python/3.8.14-intel2220-hpc/lib/libpython3.8.so.1.0(_PyEval_EvalFrameDefault+0x1adb) [0x146be4eda55b]
/[...]/python/3.8.14-intel2220-hpc/lib/libpython3.8.so.1.0(_PyEval_EvalCodeWithName+0x219) [0x146be4ed81f9]
/[...]/python/3.8.14-intel2220-hpc/lib/libpython3.8.so.1.0(_PyFunction_Vectorcall+0x1f9) [0x146be4ea3b59]
/[...]/python/3.8.14-intel2220-hpc/lib/libpython3.8.so.1.0(+0x247376) [0x146be4edf376]
/[...]/python/3.8.14-intel2220-hpc/lib/libpython3.8.so.1.0(_PyEval_EvalFrameDefault+0xc94) [0x146be4ed9714]
/[...]/python/3.8.14-intel2220-hpc/lib/libpython3.8.so.1.0(_PyFunction_Vectorcall+0x101) [0x146be4ea3a61]
/[...]/python/3.8.14-intel2220-hpc/lib/libpython3.8.so.1.0(+0x20c61d) [0x146be4ea461d]
/[...]/python/3.8.14-intel2220-hpc/lib/libpython3.8.so.1.0(+0x247376) [0x146be4edf376]
/[...]/python/3.8.14-intel2220-hpc/lib/libpython3.8.so.1.0(_PyEval_EvalFrameDefault+0xc94) [0x146be4ed9714]
/[...]/python/3.8.14-intel2220-hpc/lib/libpython3.8.so.1.0(_PyEval_EvalCodeWithName+0x219) [0x146be4ed81f9]
/[...]/python/3.8.14-intel2220-hpc/lib/libpython3.8.so.1.0(_PyFunction_Vectorcall+0x1f9) [0x146be4ea3b59]
/[...]/python/3.8.14-intel2220-hpc/lib/libpython3.8.so.1.0(_PyObject_FastCallDict+0xc3) [0x146be4ea3853]
/[...]/python/3.8.14-intel2220-hpc/lib/libpython3.8.so.1.0(_PyObject_Call_Prepend+0x57) [0x146be4ea3727]
/[...]/python/3.8.14-intel2220-hpc/lib/libpython3.8.so.1.0(+0x10b598) [0x146be4da3598]
Abort(1) on node 0: Internal error

===================================================================================
=   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
=   RANK 1 PID 2424618 RUNNING AT sator5
=   KILLED BY SIGNAL: 9 (Killed)
===================================================================================

===================================================================================
=   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
=   RANK 2 PID 2424619 RUNNING AT sator5
=   KILLED BY SIGNAL: 9 (Killed)
===================================================================================

===================================================================================
=   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
=   RANK 3 PID 2424620 RUNNING AT sator5
=   KILLED BY SIGNAL: 9 (Killed)
===================================================================================

@maugarsb @couletj
