Problems with mpi-conduit+PSHM in CI #8

Closed
bonachea opened this issue Nov 9, 2022 · 13 comments
@bonachea (Contributor) commented Nov 9, 2022

Background

This issue is forked from issue #6, where the GASNet-EX configure default of --enable-pshm was restored for most Realm build configurations in 7a073d3, thereby enabling GASNet's efficient shared-memory transport, which provides huge speedups for intranode comms when running multiple processes per node.

Unfortunately, initial CI testing with mpi-conduit+PSHM led to some new failures (log 1, log 2), and as a result PSHM support was quickly re-disabled for the mpi-conduit configuration in e0b5ff4. This issue exists to triage and hopefully resolve those CI failures, so that PSHM can be re-enabled in configs/config.mpi.release.

It's worth noting that mpi-conduit is not considered a "native" GASNet conduit, and exists primarily for portability reasons. mpi-conduit has the advantage of a narrow and easily met set of required dependencies (and allows more flexible interleaving with MPI calls), but the major downside that it is routinely outperformed by a large margin by the native GASNet conduits (e.g. ibv-conduit on InfiniBand, aries-conduit on Cray XC, ofi-conduit on Slingshot, etc). In other words, mpi-conduit should be considered the "conduit of last resort", and users running Legion in production should always favor use of an appropriate native conduit instead. That being said, it still has some value as a testing reference by providing the most hardware-independent GASNet behavior, so it would be nice to fix whatever is breaking the PSHM support in Realm's CI.

Initial requests:

  1. The provided pipeline log reveals it was built against (1.5 year old) GASNet-EX version 2021.3.0. This is despite recent commit 973d1a5 that sets this repo's Makefile default GASNET_VERSION to the current GASNet release, so I'm guessing this is an accidental oversight in the CI scripting. There have been non-trivial improvements made to the PSHM internals since 2021.3.0, so can we please re-run against the current GASNet-EX 2022.9.0 release to avoid potentially wasting time triaging already-fixed defects?
  2. Can we please try re-runs with envvar GASNET_BACKTRACE=1 to get backtraces? This should greatly help us narrow down the failure locations, which should give insight into the details of what's happening. (A sketch of what both requests might look like follows this list.)
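
A rough sketch of what both requests might look like from the command line (assumptions: the Makefile variable GASNET_VERSION from commit 973d1a5 can be overridden on the make command line, and the failing test is launched with mpirun; the actual CI scripting may differ):

  # Build against the current GASNet-EX release instead of the stale pinned version.
  make GASNET_VERSION=2022.9.0

  # Re-run a failing test with backtraces enabled, so GASNet prints a backtrace
  # from the failing process instead of just the fatal error line.
  GASNET_BACKTRACE=1 mpirun -np 2 ./failing_test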

mpi-conduit/startup failure mode

It's hard to be sure, but there might be more than one failure mode in the mpi-conduit+PSHM pipeline logs provided so far (log 1, log 2). Also, there's a lot of diagnostic information missing because we're lacking envvar GASNET_BACKTRACE=1. However, I see a theme in the following failures:

*** FATAL ERROR: fatal SIGBUS while mapping shared memory
...
*** FATAL ERROR (proc 0): in gasnetc_init() at legion/gasnet/GASNet-2021.3.0/mpi-conduit/gasnet_core.c:213: per-host segment limit 2203648 is too small to accommodate 2 aux segments, total size 4194304. You may need to adjust OS shared memory limits.

My best guess here is that Legion's CI container lacks sufficient space in the /dev/shm filesystem to accommodate the sum of whatever the MPI library might be using (with --enable-pshm, the MPICH shared-memory transport should only be used by mpi-conduit during job setup and teardown) and the GASNet-created shared segments (primordial and aux/conduit-internal). The behavioral difference you saw likely arose because in --disable-pshm mode the GASNet memory segments don't consume any /dev/shm space, but in the (default) POSIX mode of --enable-pshm they do. When I run the Legion CI docker container locally on a host system with plenty of /dev/shm, I see a /dev/shm of 64MB inside the docker container. This is quite small for realistic/effective use of PSHM, which I think confirms this diagnosis.
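
For reference, the 64MB figure can be checked from inside the container with something along these lines (the mount point is /dev/shm on most Linux distros, per the README excerpt below):

  # Show the size and current usage of the POSIX shared-memory filesystem.
  df -h /dev/shm

  # List any shared-memory objects currently held by MPI or GASNet
  # (object names are implementation-specific).
  ls -l /dev/shm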

Recommendation: Estimate the necessary segment requirements for the tests, and increase the /dev/shm filesystem in the container to match.

I found that adding --shm-size=2g to the docker run command was sufficient to get a usable /dev/shm filesystem in the Legion CI docker container. The paltry 64MB default shm_size can apparently also be changed at docker container build time.
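
For example, something like the following (a sketch; the image name is a placeholder for whatever the CI builds):

  # Give the container a 2 GiB /dev/shm instead of Docker's 64 MiB default.
  docker run -ti --shm-size=2g <legion-ci-image>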

If for some reason /dev/shm cannot be increased, then GASNet's System V PSHM transport would be the next thing to try (i.e. --disable-pshm-posix --enable-pshm-sysv); however, that usually also requires kernel configuration to ensure sufficient SystemV shared memory, since most distros default to tiny limits.
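
For completeness, the relevant SysV knobs on Linux are the standard sysctls below (a sketch; the values shown are illustrative only, require root, and may not be adjustable from inside an unprivileged container):

  # Inspect the current System V shared-memory limits.
  sysctl kernel.shmmax kernel.shmall kernel.shmmni

  # Example (illustrative) increases: max bytes per segment, then total pages.
  sudo sysctl -w kernel.shmmax=2147483648
  sudo sysctl -w kernel.shmall=524288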

Relevant PSHM docs copied from GASNet README:

System Settings for POSIX Shared Memory:

On most systems (all but Solaris and some cross-compiled platforms), the
default implementation of PSHM uses POSIX shared memory. On many operating
systems the amount of available POSIX shared memory is controlled by the
sizing of a pseudo-filesystem that consumes space in memory (and sometimes
swap space) rather than stable storage. If this filesystem is not large
enough, it can limit the amount of POSIX shared memory which can be allocated
for the GASNet segment.

Insufficient available POSIX shared memory may either lead to failures at
start-up, or to SIGBUS or SIGSEGV later in a run (if the OS has permitted
allocation of more virtual address space than is actually available). If one
encounters either of these failure modes when GASNet is configured to use
POSIX shared memory, then one should check the space available (see below) and
may need to increase the corresponding system settings. Setting these
parameters is system-specific and requires administrator privileges.

On most modern Linux distributions, POSIX shared memory allocations reside in
the /dev/shm filesystem (of type 'tmpfs'), though /var/shm and /run/shm are
also used in some cases. The mechanism for sizing of this filesystem varies
greatly between distributions. Please consult the documentation for your
specific Linux distribution for instructions to resize. In particular, be
advised that distributions may mount this filesystem early in the boot
process, without regard to any entry in /etc/fstab.
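
As one concrete illustration of the distro-dependent resizing described above (a sketch for systems where /dev/shm is a tmpfs mount; requires root, and as noted the /etc/fstab route may be ignored by distros that mount the filesystem earlier in boot):

  # Temporarily grow /dev/shm to 2 GiB (reverts at reboot).
  sudo mount -o remount,size=2g /dev/shm

  # Persistent alternative, on distros that honor it: add to /etc/fstab
  #   tmpfs  /dev/shm  tmpfs  defaults,size=2g  0  0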

CC: @streichler @elliottslaughter @PHHargrove

@elliottslaughter (Contributor)

I have set up some new branches with PSHM re-enabled here:

The Legion repo is running through CI here: https://gitlab.com/StanfordLegion/legion/-/pipelines/691729115

You can reproduce this locally on any Linux machine with Docker. Assuming the same job fails as before, try:

python3 tools/gitlab2docker.py -b gasnet-mpi-pshm .gitlab-ci.yml gcc9_cxx11_debug_gasnetex_mpi_cmake_legion -n

You can do this in any reasonably recent version of Legion. It will check out a fresh copy of the Legion repo on this branch, so you can stay on master while you do this. It will create a Docker image with everything set up, and you can run it with something like docker run -ti c47... where c47... is the ID of the image that it creates. Once you're in the Docker container, use /script.sh to kick off the whole test suite. That should build, run, and fail, at which point you can poke at specific tests to see what's going on.
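
Putting those steps together (a sketch; the image ID is a placeholder for whatever gitlab2docker.py reports):

  # Build a Docker image for the failing CI job from the PSHM branch.
  python3 tools/gitlab2docker.py -b gasnet-mpi-pshm .gitlab-ci.yml gcc9_cxx11_debug_gasnetex_mpi_cmake_legion -n

  # Run the resulting image; add --shm-size=2g only when testing the proposed
  # fix, and omit it to reproduce the original failure.
  docker run -ti <image-id>

  # Inside the container, kick off the whole test suite.
  /script.sh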

@bonachea (Contributor, Author)

@elliottslaughter thanks for the pointers.

As mentioned in my original comment, I'm reasonably confident that the puny default /dev/shm filesystem in the docker container is at least 80% of the problem. How do we modify the docker run command used in CI to add --shm-size=2g or something else reasonable?

@bonachea (Contributor, Author)

@elliottslaughter Also, I don't see where in the scripting the GASNet version backing the tests gets updated. The pipelines you just started are still using the ancient GASNet-2021.3.0 release.

@streichler (Contributor)

Our CI has the gasnet version frozen at 2021.3.0 for reasons I don't remember, which are probably no longer valid. I'm going to remove that, but if you run your own docker container with the technique @elliottslaughter suggested, you should see 2022.9.0 get used.

@elliottslaughter (Contributor)

@streichler Was it configured on our runner? That would explain why I couldn't find the setting in the repo.

I think some sort of pinning makes sense, but it seems better to me to put it in the repo than somewhere outside of it.

@streichler (Contributor)

You can set GASNET_VERSION in .gitlab-ci.yml (which we currently only do for the config that runs against the gasnet development snapshot), and you can override it in the gitlab CI variables section (which we were doing for reasons that are lost to the sands of time).

@elliottslaughter (Contributor)

Ok, here we are finally with a job running 2022.9.0:

https://gitlab.com/StanfordLegion/legion/-/jobs/3308578332

(It's not pinned yet, but if we merge this back to master I'll look into that.)

There is a system for passing flags to Docker via Gitlab Runner, but I'll need to go digging into the Gitlab Runner documentation to reacquaint myself with what it is. (Or if someone wants to beat me to it, be my guest.)

@streichler (Contributor)

I guess I have a more general concern with a gasnet config that depends on a shm-size setting that's not standard on some systems/containers and results in a fatal error whose message doesn't explain the corrective action. @bonachea, is there any way to turn this into a warning similar to the "you're using the mpi conduit when we can see that you've got something better available" one we know and love?

@bonachea (Contributor, Author)

@streichler said:

I have a more general concern with ... fatal error whose message doesn't explain the corrective action

The current error message is IMHO already pretty self-explanatory:

*** FATAL ERROR (proc 0): in gasnetc_init() at legion/gasnet/GASNet-2022.9.0/mpi-conduit/gasnet_core.c:246: per-host segment limit 458752 is too small to accommodate 2 aux segments, total size 4194304. You may need to adjust OS shared memory limits.

This particular message happens in gex_Client_Init(), where the available POSIX shared memory space is so tiny that we cannot even create the aux segment GASNet needs for its own internal operation. Do you have concrete suggestions on text to add to this message that would be factually accurate and not potentially misleading in some circumstances?

Note the details of why there's not enough shared-memory space available at that instant may be complicated (kernel misconfiguration, concurrent use by MPI in this job, concurrent use by other jobs, etc.), or even impossible to determine without admin access, which is why we don't attempt to diagnose a root cause further programmatically. In this particular case the evidence points to a root cause that is a combination of a paltry docker config and the MPI library's utilization. The details of how to resolve the problem (the best corrective action) could also be highly situation/system-dependent (e.g. changing the docker command or build script, adjusting kernel parameters, tweaking MPI usage, killing runaway jobs, etc.), so automating a concrete recommendation is probably also out-of-scope.

It's also worth noting this is not the most common failure mode we see for this root cause. If the shm space in this docker container had been just a bit larger (or MPI was using a bit less shm space), the behavior would instead be that gex_Client_Init() completes but advertises a very small gasnet_getMaxGlobalSegmentSize() limit on the primordial segment. Then it would be up to the client (Realm) to determine if that space is too small for its needs and how it wants to complain to the user about that. In other words, it's not GASNet's place to determine what a "reasonable" segment size is on behalf of the client; how does Realm currently handle that situation?

FWIW any production computing center worth its salt will have already grown the kernel-imposed shm size setting to a reasonable value, because it also impacts the usability and performance of MPI shared-memory transport for the same reasons. We've chosen to address this (complicated) topic with a long section in the GASNet README (see "GASNet inter-Process SHared Memory (PSHM)"), which is definitely way too much information for any console error message.

@streichler (Contributor)

I agree that production computing centers are unlikely to see this issue. I'm much more worried about the novice Legion user (or perhaps the novice user of something that happens to sit on top of Legion) having their first experience with multi-process execution be an error message, a trip into GASNet readme files, and maybe ultimately a fix they can't even apply because they lack root privileges. Enabling pshm by default for the mpi conduit exposes them to this risk, and the only benefit is slightly improved performance for a configuration that loudly declares it's not interested in performance.

For "high-speed" conduits (ibv, aries, ofi, ucx), I think I'm still convinced that enabling pshm is a net improvement, as those already make similar sorts of demands on the system configuration, but I want to change my vote for the mpi conduit to have it remain disabled.

@bonachea (Contributor, Author)

Hi @streichler -

I think I mostly agree with your assessment, but I want to push back on one detail.

mpi-conduit exists primarily for portability: while you can almost always get better network performance from another conduit, it's also the conduit with the fewest build and spawn dependencies to satisfy (assuming you already have an MPI-capable system) and the one with the smoothest runtime interoperability with MPI calls from application code or other libraries (because it's all just MPI).

However, there's one exception where mpi-conduit is not a terrible choice from a performance standpoint; specifically, the important special case of single-node operation with GASNet --enable-pshm. That "sweet spot" allows you to reap all the interoperability benefits of mpi-conduit, but the steady-state GASNet communication operations will use GASNet's very fast PSHM shared-memory bypass (mpi-conduit won't call MPI to communicate after startup). In --disable-pshm mode, mpi-conduit is forced to call MPI for intranode comms and to rely on MPI's own shared-memory transport (assuming that's properly configured), which is still faster than NIC loopback, but adds a lot of unnecessary data copies and overhead on both sides relative to GASNet PSHM.
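
To make that sweet spot concrete, a single-node multi-process run might look like the following (a sketch; --enable-mpi/--enable-pshm are GASNet configure options, while the application name, process count, and launch command are placeholders that depend on the MPI implementation and launcher):

  # Build GASNet-EX with mpi-conduit and PSHM enabled (the configure default).
  ./configure --enable-mpi --enable-pshm

  # Single-node launch of 4 processes: MPI is used only for job startup and
  # teardown, while steady-state intranode communication goes through PSHM.
  mpirun -np 4 ./my_legion_app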

How would you feel about splitting config.mpi.release to meet these disparate use cases? E.g. (names TBD; a rough sketch of the corresponding configure options follows the list):

  1. config.mpi-nopshm.release : The most portable possible setup that should work anywhere MPI works, at a performance cost. Never recommended for production use.
  2. config.mpi-pshm.release : Provides the best combination of interoperability and performance for single-node runs (where the admin has done the bare minimum to configure the system's POSIX shared memory). Not recommended for production multi-node runs.
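
In configure terms, the split might look roughly like this (a sketch only; the file names are the hypothetical ones proposed above, and the remaining options would mirror the existing configs/config.mpi.release):

  # config.mpi-nopshm.release (hypothetical): maximum portability, no PSHM.
  ./configure --enable-mpi --disable-pshm ...

  # config.mpi-pshm.release (hypothetical): best single-node combination of
  # interoperability and performance; needs adequately sized POSIX shared memory.
  ./configure --enable-mpi --enable-pshm ...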

@elliottslaughter (Contributor)

Thanks for the information.

I'm not sure we ever care about MPI conduit performance. We really only use it for testing, and I think for anything where we care about performance, we would recommend either a high-performance conduit or UDP if truly nothing better is available. For single-node runs, we usually have users disable GASNet entirely just to simplify the configuration.

I suppose the only configuration where we might want this is on multi-socket systems where Legion benefits from using a process per NUMA domain, but in the vast majority of these cases we're at a facility that also has a high-performance conduit available, so in practice I don't think we ever hit a case where we'd need MPI conduit.

Given these tradeoffs, and given that we would like to be able to replicate our Docker test results on machines where we don't have direct control (e.g., if a contributor is trying to replicate an issue), I think it's best to keep PSHM off for MPI and deal with this on a case-by-case basis if someone really needs it.

Does that seem fair?

The part I'm less sure about is the sister issue for UCX #7, where we do actually want this to be a high-performance option in some cases and there is a stronger case for finding a way to enable PSHM.

@bonachea (Contributor, Author)

We seem to be in agreement that the root cause of the problem seen in the CI is trying to use PSHM in a docker container that wasn't configured to provide sufficient POSIX shared memory for simultaneous use by both GASNet and the MPI library. This same problem could theoretically also occur on a bare-metal system, although a well-managed HPC system will be configured for plenty of shared memory (and in any case, mpi-conduit is almost never the best choice for inter-node comms).

Given this issue tracker entry now exists as "documentation" for the limitations and tradeoffs, I'm fine resolving this as "no change", since that seems to be the consensus of the Realm devs.

@bonachea closed this as not planned on Nov 22, 2022.