Problems with mpi-conduit+PSHM in CI #8
I have set up some new branches with PSHM re-enabled here:
The Legion repo is running through CI here: https://gitlab.com/StanfordLegion/legion/-/pipelines/691729115

You can reproduce this locally on any Linux machine with Docker. Assuming the same job fails as before, try:

You can do this in any reasonably recent version of Legion. It will check out a new Legion repo in this branch, so you can be in
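A rough sketch of that kind of local reproduction (the image name, branch, and in-container steps are placeholders, not the exact commands from this comment):

```
# Placeholder sketch: start an interactive container from the Legion CI image
# and re-run the failing job's build/test steps by hand inside it.
docker run -ti --rm <legion-ci-image> bash
# inside the container:
#   git clone -b <pshm-test-branch> https://gitlab.com/StanfordLegion/legion.git
#   cd legion && <run the same script the failing CI job runs>
```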
@elliottslaughter thanks for the pointers. As mentioned in my original comment, I'm reasonably confident that the puny default
@elliottslaughter Also, I don't see where in the scripting to update the GASNet version backing the tests. The pipelines you just started are still using the ancient GASNet-2021.3.0 release.
Our CI has the gasnet version frozen at 2021.3.0 for reasons I don't remember, and are probably not
@streichler Was it configured on our runner? That would explain why I couldn't find the setting in the repo. I think some sort of pinning makes sense, but it seems better to me to put it in the repo than somewhere outside of it.
You can set
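For illustration only, assuming the CI's embedded GASNet build honors the GASNET_VERSION variable mentioned elsewhere in this thread (exactly where it is consumed is an assumption here):

```
# Sketch: request the 2022.9.0 release instead of the frozen 2021.3.0 default.
# The value format (with or without the "GASNet-" prefix) may differ.
export GASNET_VERSION=GASNet-2022.9.0
```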
Ok, here we are finally with a job running 2022.9.0: https://gitlab.com/StanfordLegion/legion/-/jobs/3308578332 (It's not pinned yet, but if we merge this back to master I'll look into that.) There is a system for passing flags to Docker via Gitlab Runner, but I need to reacquaint myself with what it is. I'll need to go digging into the Gitlab Runner documentation to find it. (Or if someone wants to beat me to it, be my guest.)
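For the record, a sketch of the Gitlab Runner knob this most likely refers to (not yet verified against our runner's setup): the Docker executor reads an shm_size value, in bytes, from the [runners.docker] section of config.toml on the runner host.

```
# Sketch: on the runner host, edit /etc/gitlab-runner/config.toml so job
# containers get a larger /dev/shm, e.g.
#
#   [runners.docker]
#     shm_size = 2147483648    # 2 GiB, in bytes
#
# then restart the runner to pick up the change:
sudo gitlab-runner restart
```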
I guess I have a more general concern with a gasnet config that depends on a shm size setting that's not standard on some systems/containers and results in a fatal error whose message doesn't explain the corrective action. @bonachea is there any way to turn this into a warning similar to the "you're using the mpi conduit when we can see that you've got something better available" one we know and love?
@streichler said:
The current error message is IMHO already pretty self-explanatory:
This particular message happens in

Note the details of why there's not enough shared-memory space available at that instant may be complicated (kernel misconfiguration, concurrent use by MPI in this job, concurrent use by other jobs, etc.), or even impossible to determine without admin access, which is why we don't attempt to diagnose a root cause further programmatically. In this particular case the evidence points to a root cause that is a combination of a paltry docker config and the MPI library's utilization. The details of how to resolve the problem (best corrective action) could also be highly situation/system-dependent (e.g. changing the

It's also worth noting this is not the most common failure mode we see for this root cause. If the shm space in this docker container had been just a bit larger (or MPI was using a bit less shm space), the behavior would instead be that

FWIW any production computing center worth its salt will have already grown the kernel-imposed shm size setting to a reasonable value, because it also impacts the usability and performance of MPI shared-memory transport for the same reasons. We've chosen to address this (complicated) topic with a long section in the GASNet README (see "GASNet inter-Process SHared Memory (PSHM)"), which is definitely way too much information for any console error message.
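For anyone triaging this themselves, a quick generic check of how much POSIX shared memory a container actually provides (run inside the container):

```
# Report the size and usage of the tmpfs backing POSIX shared memory;
# in the Legion CI image this shows only 64M total.
df -h /dev/shm
```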
I agree that production computing centers are unlikely to see this issue. I'm much more worried about the novice Legion user (or perhaps the novice user of something that happens to sit on top of Legion) having their first experience with multi-process execution be an error message, a trip into GASNet readme files, and maybe ultimately a fix they can't even apply because they lack root privileges. Enabling pshm by default for the mpi conduit exposes them to this risk, and the only benefit is slightly improved performance for a configuration that loudly declares it's not interested in performance. For "high-speed" conduits (ibv, aries, ofi, ucx), I think I'm still convinced that enabling pshm is a net improvement, as those already make similar sorts of demands on the system configuration, but I want to change my vote for the mpi conduit to have it remain disabled.
Hi @streichler - I think I mostly agree with your assessment, but I want to push back on one detail. mpi-conduit exists primarily for portability: while you can almost always get better network performance from another conduit, it's also the conduit with the fewest build and spawn dependencies to satisfy (assuming you already have an MPI-capable system), and the one with the smoothest runtime interoperability with MPI calls from application code or other libraries (because it's all just MPI). However, there's one exception where mpi-conduit is not a terrible thing to do from a performance standpoint; specifically, the important special case of single-node operation with GASNet

How would you feel about splitting
Thanks for the information. I'm not sure we ever care about MPI conduit performance. We really only use it for testing, and I think for anything where we care about performance, we would recommend either a high-performance conduit or UDP if truly nothing better is available.

For single-node runs, we usually have users disable GASNet entirely just to simplify the configuration. I suppose the only configuration where we might want this is on multi-socket systems where Legion benefits from using a process per NUMA domain, but in the vast majority of these cases we're at a facility that also has a high-performance conduit available, so in practice I don't think we ever hit a case where we'd need MPI conduit.

Given these tradeoffs, and given that we would like to be able to replicate our Docker test results on machines where we don't have direct control (e.g., if a contributor is trying to replicate an issue), I think it's best to keep PSHM off for MPI and deal with this on a case-by-case basis if someone really needs it. Does that seem fair?

The part I'm less sure about is the sister issue for UCX #7, where we do actually want this to be a high-performance option in some cases and there is a stronger case for finding a way to enable PSHM.
We seem to be in agreement that the root cause of the problem seen in the CI is trying to use PSHM in a docker container that wasn't configured to provide sufficient POSIX shared memory for simultaneous use by both GASNet and the MPI library. This same problem could theoretically also occur on a bare-metal system, although a well-managed HPC system will be configured for plenty of shared memory (and in any case, mpi-conduit is almost never the best choice for inter-node comms). Given this issue tracker entry now exists as "documentation" for the limitations and tradeoffs, I'm fine resolving this as "no change", since that seems to be the consensus of the Realm devs.
Background
This issue is forked from issue #6, where the GASNet-EX configure default of --enable-pshm was restored for most Realm build configurations in 7a073d3, thereby enabling GASNet's efficient shared-memory transport, which provides huge speedups for intranode comms when running multiple processes per node. Unfortunately, initial CI testing with mpi-conduit+PSHM led to some new failures (log 1, log 2), and as a result PSHM support was quickly re-disabled for the mpi-conduit configuration in e0b5ff4. This issue exists to triage and hopefully solve the CI failures, so the PSHM enable can be restored in configs/config.mpi.release.
It's worth noting that mpi-conduit is not considered a "native" GASNet conduit, and exists primarily for portability reasons. mpi-conduit has the advantage of a narrow and easily met set of required dependencies (and allows more flexible interleaving with MPI calls), but the major downside is that mpi-conduit is routinely outperformed by a large margin by the native GASNet conduits (e.g. ibv-conduit on InfiniBand, aries-conduit on Cray XC, ofi-conduit on Slingshot, etc). IOW mpi-conduit should be considered the "conduit of last resort", and users running Legion in production should always favor use of an appropriate native conduit instead. That being said, it still has some value as a testing reference, by providing the most hardware-independent GASNet behavior, so it would be nice to fix whatever is breaking the PSHM support in Realm's CI.
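For reference, a minimal sketch of the configure-level difference under discussion (install prefix is a placeholder; the flags Realm's embedded GASNet build actually passes may differ):

```
# Sketch: build mpi-conduit with PSHM at its upstream default (enabled)...
./configure --enable-mpi --enable-pshm --prefix=$HOME/gasnet
# ...versus the workaround from e0b5ff4, which turns it back off:
./configure --enable-mpi --disable-pshm --prefix=$HOME/gasnet
```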
Initial requests:

1. The pipelines don't appear to update GASNET_VERSION to the current GASNet release, so I'm guessing this is an accidental oversight in the CI scripting. There have been non-trivial improvements made to the PSHM internals since 2021.3.0, so can we please re-run against the current GASNet-EX 2022.9.0 release to avoid potentially wasting time triaging already-fixed defects?

2. Can the failing jobs be re-run with GASNET_BACKTRACE=1 to get backtraces? This should greatly help us narrow down the failure locations, which can help give insight into the details of what's happening.
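As a minimal sketch of request 2 when reproducing a failure by hand (the launcher, process count, and test binary are placeholders):

```
# GASNET_BACKTRACE=1 asks GASNet to print backtraces on fatal errors/signals.
export GASNET_BACKTRACE=1
mpirun -np 2 ./realm_test   # placeholder for the actual failing test
```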
mpi-conduit/startup failure mode

It's hard to be sure, but there might be more than one failure mode in the mpi-conduit+PSHM pipeline logs provided so far (log 1, log 2). Also, there's a lot of diagnostic information missing because we're lacking envvar GASNET_BACKTRACE=1. However, I see a theme in the following failures:

My best guess here is that Legion's CI container lacks sufficient space in the /dev/shm file system to accommodate the sum of whatever the MPI library might be using (with --enable-pshm, the MPICH shared memory transport should only be used by mpi-conduit during job setup and teardown), and the GASNet-created shared segments (primordial and aux/conduit-internal). The behavioral difference you saw likely arose because in --disable-pshm mode the GASNet memory segments don't consume any /dev/shm space, but in the (default) POSIX mode of --enable-pshm they do. When I run the Legion CI docker container locally on a host system with plenty of /dev/shm, I see a /dev/shm of 64MB inside the docker container. This is quite small for realistic/effective use of PSHM, and I think it confirms this diagnosis.

Recommendation: Estimate the necessary segment requirements for the tests, and increase the /dev/shm file system in the container to match.

I found that adding --shm-size=2g to the docker run command was sufficient to get a usable /dev/shm filesystem in the Legion CI docker container. The paltry 64MB default shm_size can apparently also be changed at docker container build time.
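Concretely, the local workaround looks like this (the image name is a placeholder; --shm-size is a standard docker run flag):

```
# Give the container a 2 GiB /dev/shm instead of Docker's 64 MB default,
# then confirm the size from inside the container.
docker run -ti --rm --shm-size=2g <legion-ci-image> df -h /dev/shm
```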
If for some reason /dev/shm cannot be increased, then GASNet's System V PSHM transport would be the next thing to try (i.e. --disable-pshm-posix --enable-pshm-sysv); however, that usually also requires kernel configuration to ensure sufficient SystemV shared memory, since most distros default to tiny limits.

Relevant PSHM docs copied from GASNet README:
System Settings for POSIX Shared Memory:
On most systems (all but Solaris and some cross-compiled platforms), the
default implementation of PSHM uses POSIX shared memory. On many operating
systems the amount of available POSIX shared memory is controlled by the
sizing of a pseudo-filesystem that consumes space in memory (and sometimes
swap space) rather than stable storage. If this filesystem is not large
enough, it can limit the amount of POSIX shared memory which can be allocated
for the GASNet segment.
Insufficient available POSIX shared memory may either lead to failures at
start-up, or to SIGBUS or SIGSEGV later in a run (if the OS has permitted
allocation of more virtual address space than is actually available). If one
encounters either of these failure modes when GASNet is configured to use
POSIX shared memory, then one should check the space available (see below) and
may need to increase the corresponding system settings. Setting these
parameters is system-specific and requires administrator privileges.
On most modern Linux distributions, POSIX shared memory allocations reside in
the /dev/shm filesystem (of type 'tmpfs'), though /var/shm and /run/shm are
also used in some cases. The mechanism for sizing of this filesystem varies
greatly between distributions. Please consult the documentation for your
specific Linux distribution for instructions to resize. In particular, be
advised that distributions may mount this filesystem early in the boot
process, without regards to any entry in /etc/fstab.
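To make the resizing step concrete, one common approach on Linux hosts where /dev/shm is a tmpfs (requires root; 2G is just an example, and the change reverts at reboot unless made persistent in the distribution's own configuration):

```
# Temporarily grow the tmpfs backing POSIX shared memory to 2 GiB.
sudo mount -o remount,size=2G /dev/shm
df -h /dev/shm   # confirm the new size
```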
CC: @streichler @elliottslaughter @PHHargrove