Problems running oshmpi with Fujitsu MPI on Fugaku #105

Open
tonycurtis opened this issue Mar 12, 2021 · 5 comments

Comments

@tonycurtis

tonycurtis commented Mar 12, 2021

I can build, but this is what I see on a compute node. Any idea?

(gdb) cont
Continuing.
[New Thread 0x4000025ff010 (LWP 14149)]
[New Thread 0x4000029ff010 (LWP 14150)]

Thread 1 "a.out" received signal SIGSEGV, Segmentation fault.
ompi_mfh_base_real_t_cvar_write () at pcvar_write.c:43
43	pcvar_write.c: No such file or directory.
(gdb) bt
#0  ompi_mfh_base_real_t_cvar_write () at pcvar_write.c:43
#1  ompi_mfh_ptl_t_cvar_write ()
    at ../../../../src/ompi/mca/mfh/ptl/mfh_ptl_call.h:692
#2  PMPI_T_cvar_write ()
    at ../../../../src/ompi/mca/mfh/base/mfh_base_func_defs.h:13523
#3  0x00004000000a62a8 in set_mpit_cvar (cvar_name=<optimized out>,
    val=<optimized out>) at ../oshmpi-git/src/internal/setup_impl.c:698
#4  0x00004000000a6354 in initialize_mpit ()
    at ../oshmpi-git/src/internal/setup_impl.c:708
#5  0x00004000000a65d4 in OSHMPI_initialize_thread (required=<optimized out>,
    provided=<optimized out>) at ../oshmpi-git/src/internal/setup_impl.c:780
#6  0x00004000000b24a0 in shmem_init () at ../oshmpi-git/src/shmem/setup.c:13
#7  0x0000000000400ee4 in main () at hello.c:64
(gdb) q
A debugging session is active.

	Inferior 1 [process 14143] will be detached.

Quit anyway? (y or n) y
Detaching from program: /vol0004/ra010008/XXXXXX/shmem/openshmem-examples/c/a.out, process 14143
[Inferior 1 (process 14143) detached]
[c34-0003c:14143] *** Process received signal ***
[c34-0003c:14143] Signal: Segmentation fault (11)
[c34-0003c:14143] Signal code: Address not mapped (1)
[c34-0003c:14143] Failing at address: 0x1
[c34-0003c:14143] [ 0] linux-vdso.so.1(__kernel_rt_sigreturn+0x0)[0x40000006066c]
[c34-0003c:14143] [ 1] /opt/FJSVxtclanga/tcsds-1.2.30a/lib64/libmpi.so.0(PMPI_T_cvar_write+0x54)[0x40000023d574]
[c34-0003c:14143] [ 2] /home/ra010008/XXXXXX/opt/oshmpi/git/lib/liboshmpi.so.0(+0x162a8)[0x4000000a62a8]
[c34-0003c:14143] [ 3] /home/ra010008/XXXXXXopt/oshmpi/git/lib/liboshmpi.so.0(+0x16354)[0x4000000a6354]
[c34-0003c:14143] [ 4] /home/ra010008/XXXXXX/opt/oshmpi/git/lib/liboshmpi.so.0(OSHMPI_initialize_thread+0x270)[0x4000000a65d4]
[c34-0003c:14143] [ 5] /home/ra010008/XXXXXX/opt/oshmpi/git/lib/liboshmpi.so.0(shmem_init+0x24)[0x4000000b24a0]
[c34-0003c:14143] [ 6] ./a.out[0x400ee4]
[c34-0003c:14143] [ 7] /lib64/libc.so.6(__libc_start_main+0xe4)[0x400001030be4]
[c34-0003c:14143] [ 8] ./a.out[0x400dfc]
[c34-0003c:14143] *** End of error message ***
@minsii
Collaborator

minsii commented May 6, 2021

@tonycurtis Sorry, I somehow did not get a notification for this issue. Can you try #109 when you get a chance? I am not sure whether it resolves the issue, but it fixes an obvious bug in OSHMPI.

@tonycurtis
Author

Same problem, unfortunately.

@minsii
Collaborator

minsii commented May 7, 2021

@tonycurtis Can you please try #112? Set the environment variable OSHMPI_ENABLE_MPI_T=0 when you run, e.g.:

OSHMPI_VERBOSE=1 OSHMPI_ENABLE_MPI_T=0 mpiexec -np 2 ./hello

It disables the MPI_T code.
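
For readers unfamiliar with this kind of switch, here is a minimal sketch of how an environment variable can gate an initialization step. It is not the OSHMPI implementation: `parse_flag`, `env_flag`, `hypothetical_mpit_setup`, and `initialize_mpit_if_enabled` are hypothetical names, and both the default of "enabled" and the rule that only the string "0" turns the feature off are assumptions made for illustration.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: interpret an environment value as a boolean.
 * Assumed convention: unset or empty keeps the default, "0" means off,
 * any other value means on. */
int parse_flag(const char *val, int default_value)
{
    if (val == NULL || *val == '\0')
        return default_value;
    return strcmp(val, "0") != 0;
}

/* Hypothetical helper: look up a named switch in the environment. */
int env_flag(const char *name, int default_value)
{
    assert(name != NULL);
    return parse_flag(getenv(name), default_value);
}

/* Stand-in for the real MPI_T control-variable setup; it only counts
 * how often it was called so the gating can be observed. */
int mpit_setup_calls = 0;

void hypothetical_mpit_setup(void)
{
    mpit_setup_calls++;
}

/* Skip the MPI_T setup entirely when OSHMPI_ENABLE_MPI_T is "0".
 * The default of 1 (enabled) is an assumption. */
void initialize_mpit_if_enabled(void)
{
    if (!env_flag("OSHMPI_ENABLE_MPI_T", 1))
        return;
    hypothetical_mpit_setup();
}
```

With a gate like this, running with OSHMPI_ENABLE_MPI_T=0 would mean the setup function is never entered, so a crash inside it should no longer appear in the backtrace.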

@tonycurtis
Author

Same problem

@minsii
Collaborator

minsii commented May 7, 2021

It should no longer run the set_mpit_cvar function. I just added one debug message to the above PR.

Would you mind updating the code and running again? Please run with OSHMPI_VERBOSE=1 OSHMPI_ENABLE_MPI_T=0 SHMEM_DEBUG=1 and paste the output here.
