
Realm: Failed to send message #1597

Closed
Tracked by #1032
mariodirenzo opened this issue Nov 11, 2023 · 35 comments

@mariodirenzo

I've started seeing the following messages when using the current version of shardrefine:

[1 - 14bd921b4000]    0.022116 {5}{mailbox}: Failed to send message: 111

[1 - 14bd921b4000]    0.022362 {4}{realm}: Unable to send shared memory information to node 0, skipping

Do you know what may cause it?
Is it harmful to the performance or correctness of the calculation?

@elliottslaughter, can you please add this to #1032?

@elliottslaughter
Contributor

I don't remember the answer but @syamajala has been seeing the same thing.

@eddy16112
Contributor

These warnings indicate that we are not able to create shared memory between processes. It is harmful to performance, but should not affect correctness. @muraj, do you know why the shared memory setup failed?

@mariodirenzo
Author

After updating the runtime to a newer version of control replication, the mailbox error disappeared but the Unable to send shared memory information warnings are still there.
Is there anything I can do to help debug this issue?

@lightsighter
Contributor

Assigning @eddy16112 and @muraj.

@eddy16112
Contributor

The mailbox error is gone because we switched it from a warning message to an info message. According to the error message, 111 means connection refused. Are rank 1 and rank 0 on the same physical node?

@mariodirenzo
Author

No, they are on different physical nodes

@lightsighter
Contributor

I saw this in the S3D logs too. Those runs were also one process/node, so there shouldn't have been any warnings about trying to exchange information between processes; the warnings need to be suppressed unless they reflect a real problem.

@eddy16112
Contributor

I reproduced it. It is an issue with how shared_peers is created: https://gitlab.com/StanfordLegion/legion/-/blob/master/runtime/realm/gasnet1/gasnet1_module.cc#L633. According to the comments, we conflate the case where GASNetEX cannot detect the shared_peers with the case where there genuinely are no shared peers (e.g. one rank per node). I need to think about how to fix it.

@eddy16112
Contributor

BTW, you can ignore this warning for now. It won't hurt anything.

@elliottslaughter
Contributor

@mariodirenzo Are you still using gasnet1? I believe we had a discussion about deprecating it at some point; while I don't think there are any immediate plans, to the extent that you still have issues with gasnetex we should work through those.

@eddy16112
Contributor

We are having the same issue even with gasnetex: https://gitlab.com/StanfordLegion/legion/-/blob/master/runtime/realm/gasnetex/gasnetex_internal.cc#L3437-3440. I have a new PR, https://gitlab.com/StanfordLegion/legion/-/merge_requests/1054, that will improve the accuracy of shared_peers by using an IPC mailbox, but if the IPC mailbox fails (even though the chance is very small), we fall back to relying on the network modules to report shared_peers, and we will hit this issue again.

@mariodirenzo
Author

@elliottslaughter, these executions were performed with the gasnetex network layer, though gasnet1 is still our default option because of the issues listed at #1508. When the -gex:obcount and -ll:force_kthreads issues are solved, we will be happy to make gasnetex the default network layer for HTR

@elliottslaughter
Contributor

@eddy16112 The GASNet software itself must identify shared-memory peers in some way, because GASNet provides a portable shared-memory bypass to avoid the NIC when moving data between ranks on a node. Perhaps it would be worth learning from what they do? I don't believe I've seen failures like this when just running GASNet PSHM. Maybe @PHHargrove can comment.

@PHHargrove
Contributor

@elliottslaughter I am not 100% sure of the context for the question, but here is the portion of the GASNet README describing the environment variable which determines how we identify shared-memory peers:

* GASNET_HOST_DETECT
 To implement gex_System_QueryHostInfo() and to construct shared-memory
 "nbrhds", GASNet must map the hosts (compute nodes) in a job.  This requires
 a unique identifier for each host.  This string-valued setting selects the
 identifier used.
 The following are implemented for most conduits:
   "gethostid" - the 32-bit value returned by POSIX gethostid()
   "hostname" - a 64-bit hash of the hostname (as reported by POSIX gethostname())
 Some conduits support:
   "conduit" - a network-specific identifier (such as a MAC address)
 On networks providing the "conduit" option, conduit-specific documentation will
 describe whether it is supported in addition to the two listed above, or if
 it is the *only* supported option.

@eddy16112
Contributor

@PHHargrove we use gex_System_QueryNbrhdInfo to detect which ranks are on the same physical node; if that fails (e.g. PSHM is disabled), we fall back to gex_System_QueryHostInfo, but gex_System_QueryHostInfo can still fail when running in containers. Do you know what method gex_System_QueryNbrhdInfo uses? Will it always succeed in containers as long as PSHM is enabled?
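
For context, our detection logic is roughly the following sketch (illustrative only, not the actual Realm code; count_shared_peers is a made-up helper, and the GASNet-EX signatures should be double-checked against the spec):

#include <gasnetex.h>

/* Sketch: count same-node peers, first via the PSHM neighborhood,
   then via the host map. */
static gex_Rank_t count_shared_peers(void)
{
  gex_RankInfo_t *info = NULL;
  gex_Rank_t count = 0, my_index = 0;

  /* Preferred: the PSHM neighborhood (ranks that can share memory with me). */
  gex_System_QueryNbrhdInfo(&info, &count, &my_index);
  if (count > 1)
    return count - 1;            /* peers other than myself */

  /* Fallback: ranks on the same host, as identified per GASNET_HOST_DETECT. */
  gex_System_QueryHostInfo(&info, &count, &my_index);
  if (count > 1)
    return count - 1;

  /* A count of 1 is ambiguous: one rank per node, or detection failed. */
  return 0;
}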

@PHHargrove
Contributor

Do you know what method gex_System_QueryNbrhdInfo uses?

See above. The quoted text "and to construct shared-memory nbrhds" means the implementation of gex_System_QueryNbrhdInfo().

The "conduit" value is available only for aries- and udp-conduits.

If your containers are (mis)configured to all have the same hostid and identical hostnames, then there is currently no mechanism we can use to distinguish them. However, if either of the two is distinct, then the appropriate setting of the environment variable can be used.

The hostid (used by default) is typically derived from the first non-loopback IP address, but can also be set in a configuration file (/etc/hostid, iirc). My guess is that the container image being used has a constant value in such a file. If so, then you should try GASNET_HOST_DETECT=hostname to instead use the (hopefully unique) hostname.

If you have suggestions for other simple identifiers we should consider using, please let me know.

@elliottslaughter
Contributor

elliottslaughter commented Feb 6, 2024

If I understand the original issue and particularly @mariodirenzo's comment at #1597 (comment), this started happening on an actual supercomputer cluster. @mariodirenzo please correct me if I'm wrong, but I do not believe the machine in question uses containers. (@mariodirenzo it might be helpful to know which specific machine we are talking about here.)

Therefore, based on what @PHHargrove said in #1597 (comment), it sounds like either the hostname, the hostid, or both, are identical across both machines. That seems like a challenging situation to be in, and I'm surprised we haven't had issues before, because as far as I know all production usage of Legion enables GASNet PSHM, which would be vulnerable to the same failure modes.

Since we have not heard about this failure from @mariodirenzo previously with respect to PSHM, it makes me wonder if either (a) I have not adequately understood something and the issue is somewhere else, or (b) the Realm code has an additional flaw even beyond what GASNet is doing, which causes Realm to be sensitive to things GASNet is not.

@eddy16112
Contributor

That seems like a challenging situation to be in, and I'm surprised we haven't had issues before

We did not need to know which ranks are on the same physical node before. We now use this information to create a shared memory channel to improve the performance of intra-node copies.

If you have suggestions for other simple identifiers we should consider using, please let me know.

We implemented an IPC mailbox using AF_UNIX sockets. If we can successfully send a message via this type of socket between two ranks, it means they are on the same physical node. Here is our code: https://gitlab.com/StanfordLegion/legion/-/blob/cudaipc_hostname/runtime/realm/runtime_impl.cc#L1548-1584. However, it only works on Linux for now.
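
The idea, in a very simplified sketch (not the actual Realm mailbox code, which also handles naming, errors, and cleanup; peer_is_on_my_node and mailbox_path are made-up names), is that an AF_UNIX socket is only reachable by processes running under the same kernel:

#include <string.h>
#include <unistd.h>
#include <sys/socket.h>
#include <sys/un.h>

/* Sketch: try to connect to a peer's mailbox socket.  The connect can only
   succeed if both processes are on the same physical node, even when the
   socket path itself sits on a shared filesystem. */
static int peer_is_on_my_node(const char *mailbox_path)
{
  int fd = socket(AF_UNIX, SOCK_STREAM, 0);
  if (fd < 0)
    return 0;

  struct sockaddr_un addr;
  memset(&addr, 0, sizeof(addr));
  addr.sun_family = AF_UNIX;
  strncpy(addr.sun_path, mailbox_path, sizeof(addr.sun_path) - 1);

  int same_node = (connect(fd, (struct sockaddr *)&addr, sizeof(addr)) == 0);
  close(fd);
  return same_node;
}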

@mariodirenzo
Author

@elliottslaughter, I've been seeing these messages from multi-node runs both on Lassen (https://hpc.llnl.gov/hardware/compute-platforms/lassen) and Leonardo (https://wiki.u-gov.it/confluence/display/SCAIUS/UG3.2.1%3A+LEONARDO+Booster+UserGuide)

@elliottslaughter
Contributor

A couple of thoughts:

  1. @eddy16112, that code appears to be in a branch cudaipc_hostname. Is that a branch @mariodirenzo should try out (or possibly a merge of that branch into DCR)? I'm pretty sure all of these production machines are Linux.
  2. We still haven't explained why the Realm code is failing but GASNet PSHM (which has been default for a long time) has been succeeding. Is it worth a test on current control_replication with the Realm shared memory disabled to see if we can reproduce the same problem with GASNet PSHM alone?
  3. Separately from the above, should we have @mariodirenzo check the hostname and host ID of the nodes in question? (I think hostname should do it for the former, not sure what the command would be for the latter.)

@eddy16112
Contributor

that code appears to be in a branch cudaipc_hostname.

It will be merged into the master branch soon.

We still haven't explained why the Realm code is failing but GASNet PSHM (which has been default for a long time) has been succeeding.

It is a false alarm. Realm asks GASNet to report whether there are any neighbor ranks (shared_peers) on the same physical node as my_rank. Because the application is run with 1 rank per node, GASNet reports 0 neighbor ranks. However, GASNet will also report 0 neighbor ranks if gex_System_QueryNbrhdInfo or gex_System_QueryHostInfo fails, so we cannot tell why shared_peers is 0 in this case; therefore, Realm decides to set shared_peers to all_peers (all ranks). Later, when creating the shared memory channel using all_peers, we see the warning Unable to send shared memory information to node 0, skipping. With the changes in the cudaipc_hostname branch, we will have a better way to detect shared_peers, so we will not see the warning.
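
In pseudocode, the old behavior is roughly this sketch (illustrative only; choose_shared_peers and the parameter names are made up, not the actual Realm identifiers):

#include <stddef.h>

/* Sketch: how shared_peers was chosen before the IPC-mailbox change. */
static void choose_shared_peers(const int *gasnet_peers, size_t gasnet_count,
                                const int *all_peers,    size_t all_count,
                                const int **peers_out,   size_t *count_out)
{
  if (gasnet_count > 0) {
    /* GASNet told us which ranks share the node: trust it. */
    *peers_out = gasnet_peers;
    *count_out = gasnet_count;
  } else {
    /* Zero is ambiguous: genuinely one rank per node, or detection failed.
       Old behavior: assume every rank might be local, so the later
       shared-memory handshake with remote ranks fails and logs
       "Unable to send shared memory information to node N, skipping". */
    *peers_out = all_peers;
    *count_out = all_count;
  }
}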

@PHHargrove
Contributor

@elliottslaughter wrote
...

  1. Separately from the above, should we have @mariodirenzo check the hostname and host ID of the nodes in question? (I think hostname should do it for the former, not sure what the command would be for the latter.)

The hostid utility should be available on most Linux systems. If not, the following should do the trick:

#include <unistd.h>
#include <stdio.h>
int main(void) {
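  /* print the POSIX host identifier in hex */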
  printf("%lx\n", gethostid());
  return 0;
}
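
Run this on two of the nodes in question; if the printed values match, the hostid cannot distinguish the hosts, and GASNET_HOST_DETECT=hostname is the setting to try.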

@elliottslaughter
Contributor

However, GASNet will also report 0 neighbor ranks if gex_System_QueryNbrhdInfo or gex_System_QueryHostInfo fails

Perhaps these APIs could be tweaked to distinguish between 0 because it really is a rank-per-node job and 0 because of failure? @PHHargrove?

@PHHargrove
Contributor

Perhaps these APIs could be tweaked to distinguish between 0 because it really is a rank-per-node job and 0 because of failure? @PHHargrove?

Feel free to read the sources for both functions and you will find there are no error cases and a void return type.
Both are just copying some "map" information discovered earlier to caller-provided locations.

In the mapping procedure itself, it is not possible in general to distinguish "failure" to discover neighbors from genuinely having none.

@lightsighter
Contributor

In the mapping procedure itself, it is not possible in general to distinguish "failure" to discover neighbors from genuinely having none.

Right, and I think this is the source of the Realm warnings at the moment. We definitely shouldn't be warning users unnecessarily. Perhaps we could add a flag telling Realm that it should expect to find other processes on the same node, and issue the warning only when that expectation is not met; by default, though, we should not warn when we can't find any neighbor processes on the same node.

@elliottslaughter
Contributor

It sounds like we will have (what we believe to be) a much more reliable way of detecting neighbor processes on Linux, based on this comment here: #1597 (comment)

Since Linux accounts for all of our major supercomputer clusters, that would resolve the issue without requiring any further input from users.

At that point we can probably disable the warning because I'm not sure we ever do serious cluster work with Windows or macOS.

@lightsighter
Contributor

At that point we can probably disable the warning because I'm not sure we ever do serious cluster work with Windows or macOS.

We are trying to support Legate on MacOS, and there are a couple of Windows users of Realm at NVIDIA. I'm not sure we can completely write those off, but I agree that we shouldn't be issuing spurious warnings if we can't do precise detection of an issue.

@pmccormick
Contributor

We are trying to support Legate on MacOS, and there are a couple of Windows users of Realm at NVIDIA. I'm not sure we can completely write those off

+1 for MacOS support -- for many of our legate-aligned users (perhaps well over 90%), I strongly suspect MacOS is viewed as a fundamental part of their day-to-day development environment. While they could drop back to using straight Python, we're pushing them to leverage parallelism from the outset.

@PHHargrove
Contributor

Is the macOS use case single-node or multi-node? What GASNet conduit is in use?

I ask because if using smp-conduit (on any OS), then it is literally impossible for gex_System_QueryNbrhdInfo to under-report neighbors.

@lightsighter
Contributor

Is the macOS use case single-node or multi-node? What GASNet conduit is in use?

In general MacOS should be single-process per OS instance. However, there are some CUDA libraries used by Legate that require one process per GPU, so if there were ever a MacOS machine with multiple (CUDA) GPUs, we might need to run in a multi-process per OS instance scenario. The odds of there being such a machine in the future seem exceedingly low, though, given that NVIDIA has dropped CUDA support for MacOS (for the moment), so I think we probably don't need to think too hard about this right now. The only other reason you might run multiple processes per OS instance on MacOS would be to create one process per NUMA domain, but I don't know of anyone who actually wants that currently.

@elliottslaughter
Contributor

@mariodirenzo are you still seeing warnings on this one? I believe that cudaipc_hostname has been merged, so if I understand correctly from reading the history of this issue, the problem on Linux clusters (to a first approximation, all machines we care about) should be fixed.

@mariodirenzo
Author

The issue is fixed for the Linux cluster that I am using. Should I keep the discussion open for non-Linux systems?

@muraj

muraj commented Mar 10, 2024

The issue wouldn't be a problem on other platforms like Windows, especially since we currently don't support the CUDA module on Windows (it currently doesn't compile). Part of the effort to support Windows should include effort to support IPC and the like, so there is no need to file a separate issue for this.

On MacOS, iirc, only older CUDA toolkits are supported (11.6 and older, which is Pascal or older), and I don't expect that to change any time soon. I may be wrong, but I also do not believe peer GPU support is available on OSX, so IPC wouldn't work in this case anyway. If/when there is both support and a use case, we can revisit, but I don't think we need to file a separate issue until the need arises, no?

@lightsighter
Contributor

I agree. I think we can safely close this issue for now and reopen it later if it becomes a problem on other kinds of clusters.

@elliottslaughter
Contributor

Closing. I agree that these systems are highly hypothetical and until we have someone who actually has such a system (along with the necessary CUDA support from NVIDIA), it's not worth holding this open.

@elliottslaughter elliottslaughter added this to the 24.03 milestone Mar 15, 2024