Shm failure with PSM2 #48

Open
adrianjhpc opened this issue Mar 24, 2020 · 21 comments
Comments

@adrianjhpc

Running with Intel MPI and PSM2 on a dual-rail Omni-Path network, we're getting these errors with some applications:

Error opening remote shared memory object in shm_open: No such file or directory (err=9)
PSM could not set up shared memory segment (err=9)

When we look in /dev/shm we see files with names like psm2_shm.295510000000000020e02, but the open still fails. We've tried cleaning up /dev/shm, but it does not seem to help.

We've seen this for PSM2 10.3.46, 11.2.23, 11.2.77, and 11.2.78.

Any idea what's going wrong?

@mwheinz

mwheinz commented Mar 24, 2020

Adrian,

Your problem doesn't ring any bells, but I've opened an internal bug report for it. Could you give me a little more info? What version of IFS are you using, and on what distro?

@mwheinz

mwheinz commented Mar 24, 2020

Also, which MPI are you using? If you could provide the mpirun command line, it would help us understand what might be going on.

@adrianjhpc
Author

Thanks.

CentOS Linux release 7.5.1804
Intel(R) MPI Library, Version 2019 Update 3 Build 20190214 (id: b645a4a54)
For the MPI run:

export FI_PROVIDER=psm2
export PSM2_MULTIRAIL=1
export PSM2_MULTIRAIL_MAP=0:1,1:1
export PSM2_MULTI_EP=1
export PSM2_DEVICES=self,shm,hfi
export OMP_NUM_THREADS=2
mpirun -genvall -n 960 -ppn 48 ...

Can you "remind me" how to get the IFS version?

@mwheinz

mwheinz commented Mar 24, 2020

opaconfig -V should do it.

@adrianjhpc
Author

opaconfig -V reports:

10.8.0.0.204

@mwheinz

mwheinz commented Mar 24, 2020

Thanks. Are you using the version of PSM2 that comes packaged with Intel MPI or the upstream version?

@adrianjhpc
Author

For this it was the PSM2 with Intel MPI but we do have other versions installed on the system.

@mwheinz

mwheinz commented Mar 24, 2020

okay.

@mwheinz

mwheinz commented Mar 25, 2020

Adrian, thinking about it, we've never tested PSM2 in conjunction with OpenMP, and we don't provide strong protections for using PSM2 in a multi-threaded environment. Does the problem still exist if you set export OMP_NUM_THREADS=1?

@adrianjhpc
Author

I can check. I should say that we're not doing any MPI from within OpenMP regions, but I'll check nevertheless.

@adrianjhpc
Author

Using a single OpenMP thread doesn't help, I'm afraid. How would you suggest I debug the issue? Can I build PSM2 from source and modify the failing part to see what's going wrong?

Here's a stack trace of the current failure (or at least the relevant part):

#0 pthread_spin_lock () at ../nptl/sysdeps/x86_64/pthread_spin_lock.S:24
#1 0x00002acf2dd7ab00 in psmi_amsh_short_request () from /lib64/libpsm2.so.2
#2 0x00002acf2dd79bdf in amsh_ep_connreq_poll () from /lib64/libpsm2.so.2
#3 0x00002acf2dd7bd4b in amsh_ep_connect () from /lib64/libpsm2.so.2
#4 0x00002acf2dd8aed6 in psm2_ep_connect () from /lib64/libpsm2.so.2

@adrianjhpc
Author

adrianjhpc commented Apr 5, 2020

Debugging PSM2 a bit, the error is happening in this function:

psm2_error_t psmi_shm_create(ptl_t *ptl_gen)

It's this bit of code that's failing:

    for (iterator = 0; iterator <= INT_MAX; iterator++) {
            snprintf(shmbuf,
                     sizeof(shmbuf),
                     "/psm2_shm.%ld%016lx%d",
                     (long int) getuid(),
                     epid,
                     iterator);
            dest_shmfd = shm_open(shmbuf, O_RDWR, S_IRWXU);
            if (dest_shmfd < 0) {
                    if (errno == EACCES && iterator < INT_MAX)
                            continue;
                    else {
                            err = psmi_handle_error(NULL,
                                                    PSM2_SHMEM_SEGMENT_ERR,
                                                    "Error opening remote "
                                                    "shared memory object "
                                                    "in shm_open: %s",
                                                    strerror(errno));

                            goto fail;
                    }
            }
            shmfd =
                shm_open(amsh_keyname, O_RDWR, S_IRUSR | S_IWUSR);

It is looking for a specific psm2 file in /dev/shm that isn't on the current host but is on a remote host (I can find it by searching /dev/shm on all the hosts that were used for the run). For instance, this failed on node 22, but the file it failed on (psm2_shm.2955100000000001624020) was on node 17.
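
To check by hand whether a given segment exists on the node that reports the failure, the name can be rebuilt with the same format string as the loop above and probed with shm_open. This is only a sketch: build_shm_name and probe_shm_object are hypothetical helpers, not PSM2 functions, and the uid and epid values in the note below are inferred from the file name quoted in this comment rather than taken from PSM2 itself.

```c
#include <errno.h>
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Hypothetical helper: format a segment name with the format string
 * shown in the quoted loop (uid, 16-digit hex epid, iterator). */
int build_shm_name(char *buf, size_t len, long uid, uint64_t epid, int iterator)
{
    return snprintf(buf, len, "/psm2_shm.%ld%016lx%d",
                    uid, (unsigned long)epid, iterator);
}

/* Hypothetical helper: returns 0 if the object can be opened read-only
 * on this host, otherwise the errno set by shm_open (ENOENT if absent). */
int probe_shm_object(const char *name)
{
    int fd = shm_open(name, O_RDONLY, 0);
    if (fd < 0)
        return errno;
    close(fd);
    return 0;
}
```

With uid 29551, epid 0x162402 and iterator 0, build_shm_name produces /psm2_shm.2955100000000001624020, the name from the failing run. If the observation above holds, probe_shm_object on that name would return ENOENT on node 22 and 0 on node 17. On older glibc versions this needs to be linked with -lrt.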

@mwheinz

mwheinz commented Apr 5, 2020

Okay - try a workaround. In your mpirun line, add

-X PSM2_DEVICES=self,hfi

That will disable the shm device. I have no explanation for why a machine would be trying to open a shared memory handle on a different machine.
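
When testing a workaround like this, it can help to confirm that every rank actually sees the intended value of PSM2_DEVICES, since launchers differ in how they forward environment variables. The following is a rough sketch, not part of PSM2: device_listed and report_psm2_devices are hypothetical helpers, and treating the value as a plain comma-separated list is an assumption based on the settings shown in this thread.

```c
#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: true if `dev` appears as a whole
 * comma-separated token in `list` (NULL list means nothing listed). */
bool device_listed(const char *list, const char *dev)
{
    size_t n = strlen(dev);
    while (list && *list) {
        size_t tok = strcspn(list, ",");   /* length of current token */
        if (tok == n && strncmp(list, dev, n) == 0)
            return true;
        list += tok;
        if (*list == ',')
            list++;                        /* skip the separator */
    }
    return false;
}

/* Print what this process sees; intended to be called early in each rank. */
void report_psm2_devices(void)
{
    const char *v = getenv("PSM2_DEVICES");
    printf("PSM2_DEVICES=%s shm=%s\n", v ? v : "(unset)",
           device_listed(v, "shm") ? "listed" : "not listed");
}
```

An unset variable is reported as "not listed" here, which only describes the environment; it says nothing about which devices PSM2 enables by default.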

@adrianjhpc
Author

It definitely works if we disable the shm device, we're just trying to get shm to work for better performance.

@mwheinz

mwheinz commented Apr 15, 2020

> It definitely works if we disable the shm device, we're just trying to get shm to work for better performance.

Adrian, I know it's been 10 days; I just wanted to let you know we are looking at this.

@mwheinz

mwheinz commented Apr 15, 2020

We have some ideas, but we were wondering if you could try adding the following to a test run:

-x PSM2_TRACEMASK=0x40 -x HFI_DEBUG_FILENAME="/tmp/%h.%p.out"

This will generate a ton of output in the .out files, but the contents should tell us if different machines are really trying to communicate over SHM.

@adrianjhpc
Author

Thanks for looking into this. I'll try that out and let you know what it produces.

@adrianjhpc
Author

I appreciate some time has passed, but I have had some time to get back and play with PSM2 to find out where the problem is occurring.

I've isolated it (with an Open MPI application using PSM2) to the function psmi_shm_map_remote in the file ptl_am/am_reqrep_shmem.c (note that this was with PSM2 11.2.78).

The shm file opening completes correctly, i.e. this works without any error:

            dest_shmfd = shm_open(shmbuf, O_RDWR|O_CREAT|O_TRUNC, S_IRWXU);

The mmap also works, i.e. this works without any error:

    dest_mapptr = mmap(NULL, segsz,  PROT_READ | PROT_WRITE, MAP_SHARED, dest_shmfd, 0);
    dest_nodeinfo = (struct am_ctl_nodeinfo *)dest_mapptr;

However, any attempt to dereference elements of dest_nodeinfo throws the error. This is the first place in the function where that happens, and the program crashes:

volatile uint16_t *is_init = &dest_nodeinfo->is_init;

Does this provide any pointers (apologies for the pun) on what's going wrong?
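
One pattern that matches these symptoms, offered as a possibility rather than a diagnosis: shm_open with O_CREAT|O_TRUNC succeeds even when the object did not exist before or had contents, but it leaves a zero-length object. mmap of segsz bytes on such an object also succeeds, and the failure only appears as SIGBUS when a page beyond the end of the object is touched. A minimal sketch of a size check before mapping follows; map_if_large_enough is a hypothetical helper, not a PSM2 function.

```c
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Hypothetical helper: map `segsz` bytes of an already-open shm object,
 * but only if the object is at least that large. Returns NULL if the
 * object is too small or if fstat/mmap fails. */
void *map_if_large_enough(int fd, size_t segsz)
{
    struct stat st;

    if (fstat(fd, &st) != 0 || st.st_size < 0 || (size_t)st.st_size < segsz)
        return NULL;    /* touching such a mapping could raise SIGBUS */

    void *p = mmap(NULL, segsz, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    return p == MAP_FAILED ? NULL : p;
}
```

If the helper returns NULL for a descriptor that was opened successfully, that would suggest the segment was newly created or truncated locally rather than set up by the peer process.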

@BrendanCunningham
Contributor

Thanks for the update. Two ideas come to mind:

  1. PSM2_MULTIRAIL=1 is somehow causing this bug.
  2. The shared memory region is being removed out from under the remote mapping process after mmap() succeeds but before dest_nodeinfo is dereferenced.

We have some follow-up questions/requests:

  1. How reliably can you reproduce this issue?
  2. Can you share or point us to a reproducer?
  3. Can you run your job with '-x PSM2_TRACEMASK=0x40 -x HFI_DEBUG_FILENAME="/tmp/%h.%p.out"' and provide that output from a failing run?
  4. Can you run your job with '-x PSM2_MULTIRAIL=0' and report if it fails with the same/similar failure?

@adrianjhpc
Author

Thanks for the response.

  1. I can reproduce it reliably.
  2. I can package up a reproducer if that's useful.
  3. I've attached the output of this.
  4. Setting PSM2_MULTIRAIL to 0 doesn't fix the issue.

psm2_debug.txt

@jtfrey

jtfrey commented Mar 25, 2022

I'm getting this same error, but ONLY when using MPI_Comm_spawn().

Open MPI 4.1.2
libpsm2-10.3.35-1.x86_64

4 participants