
Runtime failures in XPMEM when running MVAPICH2X or OpenMPI+UCX on POWER9 system #37

Open
jahanzeb-hashmi opened this issue Nov 6, 2020 · 2 comments

@jahanzeb-hashmi

We recently tried XPMEM-enabled builds of MVAPICH2X and OpenMPI+UCX on a POWER9 system, but we are seeing failures at runtime. The details and a reproducer are below.

  • XPMEM version: https://github.com/hjelmn/xpmem as of cae86010097cd85f0e749736dc86850f85f7edbc

  • UCX version:
commit c22daf4fb1e408aedf5a7dc11ed72f87c0f27cc9
Merge: d669d54 e07fd32
Author: Yossi Itigin <[email protected]>
Date:   Fri Nov 6 17:17:08 2020 +0200

    Merge pull request #5881 from brminich/topic/iodemo_rt_exceeded

    TEST/IODEMO/AZP: Fix client tmo option in IODEMO
  • UCX configuration:
--disable-logging --disable-debug --disable-assertions --disable-params-check --enable-optimizations --with-xpmem=/opt/xpmem

  • OpenMPI version: tarball 4.0.5

  • OpenMPI configuration:
 --with-ucx=/home/users/hashmij/xpmem-work/ucx/install --without-verbs
  • Kernel version
$ uname -r 
4.14.0-115.18.1.el7a.ppc64le
  • System details
$ lscpu
Architecture:          ppc64le
Byte Order:            Little Endian
CPU(s):                160
On-line CPU(s) list:   0-159
Thread(s) per core:    4
Core(s) per socket:    20
Socket(s):             2
NUMA node(s):          6
Model:                 2.2 (pvr 004e 1202)
Model name:            POWER9, altivec supported
CPU max MHz:           3800.0000
CPU min MHz:           2300.0000
L1d cache:             32K
L1i cache:             32K
L2 cache:              512K
L3 cache:              10240K
NUMA node0 CPU(s):     0-79
NUMA node8 CPU(s):     80-159
NUMA node252 CPU(s):
NUMA node253 CPU(s):
NUMA node254 CPU(s):
NUMA node255 CPU(s):
  • Reproducer

Build the OSU micro-benchmarks with the OpenMPI that was built against UCX (with XPMEM transport support).

./configure CC=mpicc CXX=mpicxx

Run the basic osu_latency test (a standalone XPMEM check that does not involve MPI/UCX is sketched after the kernel log below)

$ OMPI_DIR/bin/mpirun -np 2 -x UCX_TLS=self,sm ./install/libexec/osu-micro-benchmarks/mpi/pt2pt/osu_latency
  • UCX output
[gorgon:39733:0:39733] Caught signal 7 (Bus error: nonexistent physical address)
[gorgon:39732:0:39732] Caught signal 7 (Bus error: nonexistent physical address)

/home/users/hashmij/xpmem-work/ucx/src/uct/sm/mm/base/mm_ep.c: [ uct_mm_ep_get_remote_seg() ]
      ...
       81
       82     /* slow path - attach new segment */
       83     return uct_mm_ep_attach_remote_seg(ep, seg_id, length, address_p);
==>    84 }
       85
       86
       87 /* send a signal to remote interface using Unix-domain socket */


/home/users/hashmij/xpmem-work/ucx/src/uct/sm/mm/base/mm_ep.c: [ uct_mm_ep_get_remote_seg() ]
      ...
       81
       82     /* slow path - attach new segment */
       83     return uct_mm_ep_attach_remote_seg(ep, seg_id, length, address_p);
==>    84 }
       85
       86
       87 /* send a signal to remote interface using Unix-domain socket */

==== backtrace (tid:  39733) ====
 0 0x0000000000056410 ucs_debug_print_backtrace()  /home/users/hashmij/xpmem-work/ucx/src/ucs/debug/debug.c:656
 1 0x0000000000019b3c uct_mm_ep_get_remote_seg()  /home/users/hashmij/xpmem-work/ucx/src/uct/sm/mm/base/mm_ep.c:84
 2 0x0000000000019d38 uct_mm_ep_t_new()  /home/users/hashmij/xpmem-work/ucx/src/uct/sm/mm/base/mm_ep.c:194
 3 0x0000000000015558 uct_ep_create()  /home/users/hashmij/xpmem-work/ucx/src/uct/base/uct_iface.c:550
 4 0x000000000006598c ucp_wireup_connect_lane_to_iface()  /home/users/hashmij/xpmem-work/ucx/src/ucp/wireup/wireup.c:805
 5 0x000000000006598c ucp_wireup_connect_lane()  /home/users/hashmij/xpmem-work/ucx/src/ucp/wireup/wireup.c:888
 6 0x000000000006598c ucp_wireup_init_lanes()  /home/users/hashmij/xpmem-work/ucx/src/ucp/wireup/wireup.c:1207
 7 0x0000000000022448 ucp_ep_create_to_worker_addr()  /home/users/hashmij/xpmem-work/ucx/src/ucp/core/ucp_ep.c:421
 8 0x00000000000236f0 ucp_ep_create_api_to_worker_addr()  /home/users/hashmij/xpmem-work/ucx/src/ucp/core/ucp_ep.c:674
 9 0x00000000000236f0 ucp_ep_create()  /home/users/hashmij/xpmem-work/ucx/src/ucp/core/ucp_ep.c:740
10 0x0000000000005ca4 mca_pml_ucx_add_proc_common()  pml_ucx.c:0
11 0x0000000000005f68 mca_pml_ucx_add_procs()  ???:0
12 0x0000000000117bb4 ompi_mpi_init()  ???:0
13 0x00000000000ac1b8 MPI_Init()  ???:0
14 0x0000000010001510 main()  /home/users/hashmij/xpmem-work/osu_benchmarks-ompi/mpi/pt2pt/osu_latency.c:37
15 0x0000000000025200 generic_start_main.isra.0()  libc-start.c:0
16 0x00000000000253f4 __libc_start_main()  ???:0
=================================
[gorgon:39733] *** Process received signal ***
[gorgon:39733] Signal: Bus error (7)
[gorgon:39733] Signal code:  (-6)
[gorgon:39733] Failing at address: 0x3cab00009b35
[gorgon:39733] [ 0] [0x7fffb51004d8]
[gorgon:39733] [ 1] /home/users/hashmij/xpmem-work/ucx/install/lib/libuct.so.0(+0x19b3c)[0x7fffb1bc9b3c]
[gorgon:39733] [ 2] /home/users/hashmij/xpmem-work/ucx/install/lib/libuct.so.0(uct_mm_ep_t_new+0x68)[0x7fffb1bc9d38]
[gorgon:39733] [ 3] /home/users/hashmij/xpmem-work/ucx/install/lib/libuct.so.0(uct_ep_create+0x78)[0x7fffb1bc5558]
[gorgon:39733] [ 4] /home/users/hashmij/xpmem-work/ucx/install/lib/libucp.so.0(ucp_wireup_init_lanes+0x8bc)[0x7fffb1c7598c]
[gorgon:39733] [ 5] /home/users/hashmij/xpmem-work/ucx/install/lib/libucp.so.0(ucp_ep_create_to_worker_addr+0x98)[0x7fffb1c32448]
[gorgon:39733] [ 6] /home/users/hashmij/xpmem-work/ucx/install/lib/libucp.so.0(ucp_ep_create+0x6f0)[0x7fffb1c336f0]
[gorgon:39733] [ 7] /home/users/hashmij/xpmem-work/openmpi-4.0.5/install/lib/openmpi/mca_pml_ucx.so(+0x5ca4)[0x7fffb1cf5ca4]
[gorgon:39733] [ 8] /home/users/hashmij/xpmem-work/openmpi-4.0.5/install/lib/openmpi/mca_pml_ucx.so(mca_pml_ucx_add_procs+0x98)[0x7fffb1cf5f68]
[gorgon:39733] [ 9] /home/users/hashmij/xpmem-work/openmpi-4.0.5/install/lib/libmpi.so.40(ompi_mpi_init+0xc54)[0x7fffb50a7bb4]
[gorgon:39733] [10] /home/users/hashmij/xpmem-work/openmpi-4.0.5/install/lib/libmpi.so.40(MPI_Init+0x98)[0x7fffb503c1b8]
[gorgon:39733] [11] ./install/libexec/osu-micro-benchmarks/mpi/pt2pt/osu_latency[0x10001510]
[gorgon:39733] [12] /lib64/libc.so.6(+0x25200)[0x7fffb4ab5200]
[gorgon:39733] [13] /lib64/libc.so.6(__libc_start_main+0xc4)[0x7fffb4ab53f4]
[gorgon:39733] *** End of error message ***

==== backtrace (tid:  39732) ====
 0 0x0000000000056410 ucs_debug_print_backtrace()  /home/users/hashmij/xpmem-work/ucx/src/ucs/debug/debug.c:656
 1 0x0000000000019b3c uct_mm_ep_get_remote_seg()  /home/users/hashmij/xpmem-work/ucx/src/uct/sm/mm/base/mm_ep.c:84
 2 0x0000000000019d38 uct_mm_ep_t_new()  /home/users/hashmij/xpmem-work/ucx/src/uct/sm/mm/base/mm_ep.c:194
 3 0x0000000000015558 uct_ep_create()  /home/users/hashmij/xpmem-work/ucx/src/uct/base/uct_iface.c:550
 4 0x000000000006598c ucp_wireup_connect_lane_to_iface()  /home/users/hashmij/xpmem-work/ucx/src/ucp/wireup/wireup.c:805
 5 0x000000000006598c ucp_wireup_connect_lane()  /home/users/hashmij/xpmem-work/ucx/src/ucp/wireup/wireup.c:888
 6 0x000000000006598c ucp_wireup_init_lanes()  /home/users/hashmij/xpmem-work/ucx/src/ucp/wireup/wireup.c:1207
 7 0x0000000000022448 ucp_ep_create_to_worker_addr()  /home/users/hashmij/xpmem-work/ucx/src/ucp/core/ucp_ep.c:421
 8 0x00000000000236f0 ucp_ep_create_api_to_worker_addr()  /home/users/hashmij/xpmem-work/ucx/src/ucp/core/ucp_ep.c:674
 9 0x00000000000236f0 ucp_ep_create()  /home/users/hashmij/xpmem-work/ucx/src/ucp/core/ucp_ep.c:740
10 0x0000000000005ca4 mca_pml_ucx_add_proc_common()  pml_ucx.c:0
11 0x0000000000005f68 mca_pml_ucx_add_procs()  ???:0
12 0x0000000000117bb4 ompi_mpi_init()  ???:0
13 0x00000000000ac1b8 MPI_Init()  ???:0
14 0x0000000010001510 main()  /home/users/hashmij/xpmem-work/osu_benchmarks-ompi/mpi/pt2pt/osu_latency.c:37
15 0x0000000000025200 generic_start_main.isra.0()  libc-start.c:0
16 0x00000000000253f4 __libc_start_main()  ???:0
=================================
[gorgon:39732] *** Process received signal ***
[gorgon:39732] Signal: Bus error (7)
[gorgon:39732] Signal code:  (-6)
[gorgon:39732] Failing at address: 0x3cab00009b34
[gorgon:39732] [ 0] [0x7fffbc4704d8]
[gorgon:39732] [ 1] /home/users/hashmij/xpmem-work/ucx/install/lib/libuct.so.0(+0x19b3c)[0x7fffb8f39b3c]
[gorgon:39732] [ 2] /home/users/hashmij/xpmem-work/ucx/install/lib/libuct.so.0(uct_mm_ep_t_new+0x68)[0x7fffb8f39d38]
[gorgon:39732] [ 3] /home/users/hashmij/xpmem-work/ucx/install/lib/libuct.so.0(uct_ep_create+0x78)[0x7fffb8f35558]
[gorgon:39732] [ 4] /home/users/hashmij/xpmem-work/ucx/install/lib/libucp.so.0(ucp_wireup_init_lanes+0x8bc)[0x7fffb8fe598c]
[gorgon:39732] [ 5] /home/users/hashmij/xpmem-work/ucx/install/lib/libucp.so.0(ucp_ep_create_to_worker_addr+0x98)[0x7fffb8fa2448]
[gorgon:39732] [ 6] /home/users/hashmij/xpmem-work/ucx/install/lib/libucp.so.0(ucp_ep_create+0x6f0)[0x7fffb8fa36f0]
[gorgon:39732] [ 7] /home/users/hashmij/xpmem-work/openmpi-4.0.5/install/lib/openmpi/mca_pml_ucx.so(+0x5ca4)[0x7fffb9065ca4]
[gorgon:39732] [ 8] /home/users/hashmij/xpmem-work/openmpi-4.0.5/install/lib/openmpi/mca_pml_ucx.so(mca_pml_ucx_add_procs+0x98)[0x7fffb9065f68]
[gorgon:39732] [ 9] /home/users/hashmij/xpmem-work/openmpi-4.0.5/install/lib/libmpi.so.40(ompi_mpi_init+0xc54)[0x7fffbc417bb4]
[gorgon:39732] [10] /home/users/hashmij/xpmem-work/openmpi-4.0.5/install/lib/libmpi.so.40(MPI_Init+0x98)[0x7fffbc3ac1b8]
[gorgon:39732] [11] ./install/libexec/osu-micro-benchmarks/mpi/pt2pt/osu_latency[0x10001510]
[gorgon:39732] [12] /lib64/libc.so.6(+0x25200)[0x7fffbbe25200]
[gorgon:39732] [13] /lib64/libc.so.6(__libc_start_main+0xc4)[0x7fffbbe253f4]
[gorgon:39732] *** End of error message ***
--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that process rank 1 with PID 0 on node gorgon exited on signal 7 (Bus error).
--------------------------------------------------------------------------
  • Kernel Log
$ dmesg | tail
...
[6890564.268944] xpmem_fault_handler: pfn mismatch: 466880 != 1930663
[6890564.269032] xpmem_fault_handler: pfn mismatch: 2088245 != 1302727
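
For reference, below is a minimal standalone check (not part of the original report) that exercises the same XPMEM make/get/attach path without MPI or UCX; if the pfn mismatch comes from the XPMEM kernel module itself, touching the attached mapping here should raise the same SIGBUS. The file name, the gcc line, and the /opt/xpmem include/library paths are assumptions based on the configuration above.

/* xpmem_check.c - minimal XPMEM cross-mapping sanity check (no MPI/UCX).
 * Build (paths are assumptions based on the configuration above):
 *   gcc -o xpmem_check xpmem_check.c -I/opt/xpmem/include -L/opt/xpmem/lib -lxpmem
 */
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/mman.h>
#include <sys/wait.h>
#include <xpmem.h>

#define NPAGES 4

int main(void)
{
    size_t page = (size_t)sysconf(_SC_PAGESIZE);
    size_t len  = NPAGES * page;

    /* Owner buffer: page-aligned, touched so the pages exist. */
    char *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (buf == MAP_FAILED) { perror("mmap"); return 1; }
    memset(buf, 0x5a, len);

    /* Export the region; the segid is inherited by the child via fork. */
    xpmem_segid_t segid = xpmem_make(buf, len, XPMEM_PERMIT_MODE, (void *)0666);
    if (segid == -1) { perror("xpmem_make"); return 1; }

    pid_t pid = fork();
    if (pid == 0) {
        /* Child: attach the parent's pages and read/write through them. */
        xpmem_apid_t apid = xpmem_get(segid, XPMEM_RDWR, XPMEM_PERMIT_MODE, NULL);
        if (apid == -1) { perror("xpmem_get"); _exit(2); }

        struct xpmem_addr addr = { .apid = apid, .offset = 0 };
        char *att = xpmem_attach(addr, len, NULL);
        if (att == (void *)-1) { perror("xpmem_attach"); _exit(3); }

        /* A bus error / pfn mismatch would show up on this access. */
        int bad = 0;
        for (size_t i = 0; i < len; i++)
            bad += (att[i] != 0x5a);
        memset(att, 0xa5, len);          /* write back through the attachment */

        xpmem_detach(att);
        xpmem_release(apid);
        _exit(bad ? 4 : 0);
    }

    int status = 0;
    waitpid(pid, &status, 0);
    int ok = WIFEXITED(status) && WEXITSTATUS(status) == 0 &&
             buf[0] == (char)0xa5 && buf[len - 1] == (char)0xa5;

    printf("xpmem check: %s\n", ok ? "PASS" : "FAIL");
    xpmem_remove(segid);
    munmap(buf, len);
    return ok ? 0 : 1;
}

If this check passes while the mpirun reproducer still fails, the fault is more likely in how UCX creates or offsets its XPMEM attachments than in the basic XPMEM page mapping.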

CC: @shamisp @hjelmn

@jahanzeb-hashmi jahanzeb-hashmi changed the title Runtime failures when running MVAPICH2X and OpenMPI+UCX on POWER9 system Runtime failures in XPMEM when running MVAPICH2X or OpenMPI+UCX on POWER9 system Nov 6, 2020
@hjelmn
Collaborator

hjelmn commented Nov 21, 2020

Hmm, will take a look shortly.

@jahanzeb-hashmi
Author

Thanks @hjelmn. Let me know if you need any help with reproducing the issue.
