
Realm: Failed to send message #1597

Closed
Tracked by #1032
mariodirenzo opened this issue Nov 11, 2023 · 35 comments

@mariodirenzo

I've started seeing the following messages when using the current version of shardrefine:

[1 - 14bd921b4000]    0.022116 {5}{mailbox}: Failed to send message: 111

[1 - 14bd921b4000]    0.022362 {4}{realm}: Unable to send shared memory information to node 0, skipping

Do you know what may cause it?
Is it harmful to the performance or correctness of the calculation?

@elliottslaughter, can you please add this to #1032?

@elliottslaughter
Contributor

I don't remember the answer but @syamajala has been seeing the same thing.

@eddy16112
Contributor

These warnings indicate that we are not able to create shared memory between processes. It is harmful to performance, but should not affect correctness. @muraj, do you know why the shared memory setup failed?

@mariodirenzo
Author

After updating the runtime to a newer version of control replication, the mailbox error disappeared but the Unable to send shared memory information warnings are still there.
Is there anything I can do to help debug this issue?

@lightsighter
Contributor

Assigning @eddy16112 and @muraj.

@eddy16112
Contributor

The mailbox error is gone because we switched it from a warning message to an info message. According to the error message, 111 means connection refused. Are rank 1 and rank 0 on the same physical node?

@mariodirenzo
Author

No, they are on different physical nodes

@lightsighter
Contributor

I saw this in the S3D logs too. Those runs were also one process/node, so there shouldn't have been any warnings about trying to exchange information between processes; the warnings need to be suppressed unless they reflect a real problem.

@eddy16112
Contributor

I reproduced it. It is an issue with how shared_peers is created: https://gitlab.com/StanfordLegion/legion/-/blob/master/runtime/realm/gasnet1/gasnet1_module.cc#L633. According to the comments, we conflate the case where GASNetEX cannot detect the shared_peers with the case where there genuinely are no shared peers (e.g. one rank per node). I need to think about how to fix it.

@eddy16112
Contributor

BTW, you can ignore this warning for now. It won't hurt anything.

@elliottslaughter
Contributor

@mariodirenzo Are you still using gasnet1? I believe we had a discussion about deprecating it at some point; while I don't think there are any immediate plans, to the extent that you still have issues with gasnetex we should work through those.

@eddy16112
Contributor

We are having the same issue even with gasnetex: https://gitlab.com/StanfordLegion/legion/-/blob/master/runtime/realm/gasnetex/gasnetex_internal.cc#L3437-3440. I have a new PR, https://gitlab.com/StanfordLegion/legion/-/merge_requests/1054, that will improve the accuracy of shared_peers by using an IPC mailbox, but if the IPC mailbox fails (even though the chance is very small), we fall back to relying on the network modules to report shared_peers, and we will hit this issue again.

@mariodirenzo
Author

@elliottslaughter, these executions were performed with the gasnetex network layer, though gasnet1 is still our default option because of the issues listed at #1508. When the -gex:obcount and -ll:force_kthreads issues are solved, we will be happy to make gasnetex the default network layer for HTR

@elliottslaughter
Contributor

@eddy16112 The GASNet software itself must identify shared-memory peers in some way, because GASNet provides a portable shared-memory bypass to avoid the NIC when moving data between ranks on a node. Perhaps it would be worth learning from what they do? I don't believe I've seen failures like this when just running GASNet PSHM. Maybe @PHHargrove can comment.

@PHHargrove
Contributor

@elliottslaughter I am not 100% sure of the context for the question, but here is the portion of the GASNet README describing the environment variable which determines how we identify shared-memory peers:

* GASNET_HOST_DETECT
 To implement gex_System_QueryHostInfo() and to construct shared-memory
 "nbrhds", GASNet must map the hosts (compute nodes) in a job.  This requires
 a unique identifier for each host.  This string-valued setting selects the
 identifier used.
 The following are implemented for most conduits:
   "gethostid" - the 32-bit value returned by POSIX gethostid()
   "hostname" - a 64-bit hash of the hostname (as reported by POSIX gethostname())
 Some conduits support:
   "conduit" - a network-specific identifier (such as a MAC address)
 On networks providing the "conduit" option, conduit-specific documentation will
 describe whether it is supported in addition to the two listed above, or if
 it is the *only* supported option.

@eddy16112
Contributor

@PHHargrove we use gex_System_QueryNbrhdInfo to detect which ranks are on the same physical node; if that fails (e.g. PSHM is disabled), we fall back to gex_System_QueryHostInfo, but gex_System_QueryHostInfo can still fail when running in containers. Do you know what method gex_System_QueryNbrhdInfo uses? Will it always succeed in containers as long as PSHM is enabled?
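
For context, our detection logic is roughly the following sketch (illustrative only, not the actual Realm code; count_shared_peers is a made-up helper, and the GASNet-EX signatures should be double-checked against the spec):

#include <gasnetex.h>

/* Sketch: count same-node peers, first via the PSHM neighborhood,
   then via the host map. */
static gex_Rank_t count_shared_peers(void)
{
  gex_RankInfo_t *info = NULL;
  gex_Rank_t count = 0, my_index = 0;

  /* Preferred: the PSHM neighborhood (ranks that can share memory with me). */
  gex_System_QueryNbrhdInfo(&info, &count, &my_index);
  if (count > 1)
    return count - 1;            /* peers other than myself */

  /* Fallback: ranks on the same host, as identified per GASNET_HOST_DETECT. */
  gex_System_QueryHostInfo(&info, &count, &my_index);
  if (count > 1)
    return count - 1;

  /* A count of 1 is ambiguous: one rank per node, or detection failed. */
  return 0;
}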

@PHHargrove
Contributor

Do you know what method gex_System_QueryNbrhdInfo uses?

See above. The quoted text "and to construct shared-memory nbrhds" means the implementation of gex_System_QueryNbrhdInfo().

The "conduit" value is available only for aries- and udp-conduits.

If your containers are (mis)configured to all have the same hostid and identical hostnames, then there is currently no mechanism we can use to distinguish them. However, if either of the two is distinct, then the appropriate setting of the environment variable can be used.

The hostid (used by default) is typically derived from the first non-loopback IP address, but can also be set in a configuration file (/etc/hostid, iirc). My guess is that the container image being used has a constant value in such a file. If so, then you should try GASNET_HOST_DETECT=hostname to instead use the (hopefully unique) hostname.

If you have suggestions for other simple identifiers we should consider using, please let me know.

@elliottslaughter
Contributor

elliottslaughter commented Feb 6, 2024

If I understand the original issue and particularly @mariodirenzo's comment at #1597 (comment), this started happening on an actual supercomputer cluster. @mariodirenzo please correct me if I'm wrong, but I do not believe the machine in question uses containers. (@mariodirenzo it might be helpful to know which specific machine we are talking about here.)

Therefore, based on what @PHHargrove said in #1597 (comment), it sounds like either the hostname, the hostid, or both, are identical across both machines. That seems like a challenging situation to be in, and I'm surprised we haven't had issues before, because as far as I know all production usage of Legion enables GASNet PSHM, which would be vulnerable to the same failure modes.

Since we have not heard about this failure from @mariodirenzo previously with respect to PSHM, it makes me wonder if either (a) I have not adequately understood something and the issue is somewhere else, or (b) the Realm code has an additional flaw even beyond what GASNet is doing, which causes Realm to be sensitive to things GASNet is not.

@eddy16112
Contributor

That seems like a challenging situation to be in, and I'm surprised we haven't had issues before

We did not need to know which ranks are on the same physical node before. We now use this information to create a shared memory channel to improve the performance of intra-node copies.

If you have suggestions for other simple identifiers we should consider using, please let me know.

We implemented an IPC mailbox using AF_UNIX sockets. If we can successfully send a message via this type of socket between two ranks, it means they are on the same physical node. Here is our code: https://gitlab.com/StanfordLegion/legion/-/blob/cudaipc_hostname/runtime/realm/runtime_impl.cc#L1548-1584. However, it only works on Linux for now.
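
The idea, in a very simplified sketch (not the actual Realm mailbox code, which also handles naming, errors, and cleanup; peer_is_on_my_node and mailbox_path are made-up names), is that an AF_UNIX socket is only reachable by processes running under the same kernel:

#include <string.h>
#include <unistd.h>
#include <sys/socket.h>
#include <sys/un.h>

/* Sketch: try to connect to a peer's mailbox socket.  The connect can only
   succeed if both processes are on the same physical node, even when the
   socket path itself sits on a shared filesystem. */
static int peer_is_on_my_node(const char *mailbox_path)
{
  int fd = socket(AF_UNIX, SOCK_STREAM, 0);
  if (fd < 0)
    return 0;

  struct sockaddr_un addr;
  memset(&addr, 0, sizeof(addr));
  addr.sun_family = AF_UNIX;
  strncpy(addr.sun_path, mailbox_path, sizeof(addr.sun_path) - 1);

  int same_node = (connect(fd, (struct sockaddr *)&addr, sizeof(addr)) == 0);
  close(fd);
  return same_node;
}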

@mariodirenzo
Author

@elliottslaughter, I've been seeing these messages from multi-node runs both on Lassen (https://hpc.llnl.gov/hardware/compute-platforms/lassen) and Leonardo (https://wiki.u-gov.it/confluence/display/SCAIUS/UG3.2.1%3A+LEONARDO+Booster+UserGuide)

@elliottslaughter
Contributor

A couple of thoughts:

  1. @eddy16112, that code appears to be in a branch cudaipc_hostname. Is that a branch @mariodirenzo should try out (or possibly a merge of that branch into DCR)? I'm pretty sure all of these production machines are Linux.
  2. We still haven't explained why the Realm code is failing but GASNet PSHM (which has been default for a long time) has been succeeding. Is it worth a test on current control_replication with the Realm shared memory disabled to see if we can reproduce the same problem with GASNet PSHM alone?
  3. Separately from the above, should we have @mariodirenzo check the hostname and host ID of the nodes in question? (I think hostname should do it for the former, not sure what the command would be for the latter.)

@eddy16112
Contributor

that code appears to be in a branch cudaipc_hostname.

It will be merged into the master branch soon.

We still haven't explained why the Realm code is failing but GASNet PSHM (which has been default for a long time) has been succeeding.

It is a false alarm. Realm asks GASNet to report whether there are any neighbor ranks (shared_peers) on the same physical node as my_rank. Because the application is run with 1 rank per node, GASNet reports 0 neighbor ranks. However, GASNet will also report 0 neighbor ranks if gex_System_QueryNbrhdInfo or gex_System_QueryHostInfo fails, so we cannot tell why shared_peers is 0 in this case; therefore, Realm decides to set shared_peers to all_peers (all ranks). Later, when creating the shared memory channel using all_peers, we see the warning Unable to send shared memory information to node 0, skipping. With the changes in the cudaipc_hostname branch, we will have a better way to detect shared_peers, so we will not see the warning.
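
In pseudocode, the old behavior is roughly this sketch (illustrative only; choose_shared_peers and the parameter names are made up, not the actual Realm identifiers):

#include <stddef.h>

/* Sketch: how shared_peers was chosen before the IPC-mailbox change. */
static void choose_shared_peers(const int *gasnet_peers, size_t gasnet_count,
                                const int *all_peers,    size_t all_count,
                                const int **peers_out,   size_t *count_out)
{
  if (gasnet_count > 0) {
    /* GASNet told us which ranks share the node: trust it. */
    *peers_out = gasnet_peers;
    *count_out = gasnet_count;
  } else {
    /* Zero is ambiguous: genuinely one rank per node, or detection failed.
       Old behavior: assume every rank might be local, so the later
       shared-memory handshake with remote ranks fails and logs
       "Unable to send shared memory information to node N, skipping". */
    *peers_out = all_peers;
    *count_out = all_count;
  }
}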

@PHHargrove
Contributor

@elliottslaughter wrote
...

  1. Separately from the above, should we have @mariodirenzo check the hostname and host ID of the nodes in question? (I think hostname should do it for the former, not sure what the command would be for the latter.)

The hostid utility should be available on most Linux systems. If not, the following should do the trick:

#include <unistd.h>
#include <stdio.h>
int main(void) {
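  /* print the POSIX host identifier in hex */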
  printf("%lx\n", gethostid());
  return 0;
}
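
Run this on two of the nodes in question; if the printed values match, the hostid cannot distinguish the hosts, and GASNET_HOST_DETECT=hostname is the setting to try.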

@elliottslaughter
Contributor

However, GASNet will also report 0 neighbor ranks if gex_System_QueryNbrhdInfo or gex_System_QueryHostInfo fails

Perhaps these APIs could be tweaked to distinguish between 0 because it really is a rank-per-node job and 0 because of failure? @PHHargrove?

@PHHargrove
Contributor

Perhaps these APIs could be tweaked to distinguish between 0 because it really is a rank-per-node job and 0 because of failure? @PHHargrove?

Feel free to read the sources for both functions and you will find there are no error cases and a void return type.
Both are just copying some "map" information discovered earlier to caller-provided locations.

In the mapping procedure itself, it is not possible in general to distinguish "failure" to discover neighbors from genuinely having none.

@lightsighter
Contributor

In the mapping procedure itself, it is not possible in general to distinguish "failure" to discover neighbors from genuinely having none.

Right, and I think this is the source of the Realm warnings at the moment. We definitely shouldn't be warning users unnecessarily. Perhaps we could add a flag telling Realm that it should expect to find other processes on the same node, and issue the warning only when that expectation is not met; by default, though, we should not warn when we can't find any neighbor processes on the same node.

@elliottslaughter
Contributor

It sounds like we will have (what we believe to be) a much more reliable way of detecting neighbor processes on Linux, based on this comment here: #1597 (comment)

Since Linux accounts for all of our major supercomputer clusters, that would resolve the issue without requiring any further input from users.

At that point we can probably disable the warning because I'm not sure we ever do serious cluster work with Windows or macOS.

@lightsighter
Contributor

At that point we can probably disable the warning because I'm not sure we ever do serious cluster work with Windows or macOS.

We are trying to support Legate on MacOS, and there are a couple of Windows users of Realm at NVIDIA. I'm not sure we can completely write those off, but I agree that we shouldn't be issuing spurious warnings if we can't do precise detection of an issue.

@pmccormick
Contributor

We are trying to support Legate on MacOS, and there are a couple of Windows users of Realm at NVIDIA. I'm not sure we can completely write those off

+1 for MacOS support -- for many of our legate-aligned users (perhaps well over 90%), I strongly suspect MacOS is viewed as a fundamental part of their day-to-day development environment. While they could drop back to using straight Python, we're pushing them to leverage parallelism from the outset.

@PHHargrove
Contributor

Is the macOS use case single-node or multi-node? What GASNet conduit is in use?

I ask because if using smp-conduit (on any OS), then it is literally impossible for gex_System_QueryNbrhdInfo to under-report neighbors.

@lightsighter
Contributor

Is the macOS use case single-node or multi-node? What GASNet conduit is in use?

In general MacOS should be single-process per OS instance. However, there are some CUDA libraries used by Legate that require one process per GPU, so if there were ever a MacOS machine with multiple (CUDA) GPUs, we might need to run in a multi-process per OS instance scenario. The odds of there being such a machine in the future seem exceedingly low, though, given that NVIDIA has dropped CUDA support for MacOS (for the moment), so I think we probably don't need to think too hard about this right now. The only other reason you might run multiple processes per OS instance on MacOS would be to create one process per NUMA domain, but I don't know of anyone who actually wants that currently.

@elliottslaughter
Contributor

@mariodirenzo are you still seeing warnings on this one? I believe that cudaipc_hostname has been merged, so if I understand correctly from reading the history of this issue, the problem on Linux clusters (to a first approximation, all machines we care about) should be fixed.

@mariodirenzo
Author

The issue is fixed for the Linux cluster that I am using. Should I keep the discussion open for non-Linux systems?

@muraj

muraj commented Mar 10, 2024

The issue wouldn't be a problem on other platforms like Windows, especially since we currently don't support the CUDA module on Windows (it currently doesn't compile). Part of the effort to support Windows should include effort to support IPC and the like, so there is no need to file a separate issue for this.

On MacOS, iirc, only older CUDA toolkits are supported (11.6 and older, which is Pascal or older), and I don't expect that to change any time soon. I may be wrong, but I also do not believe peer GPU support is available on OSX, so IPC wouldn't work in this case anyway. If/when there is both support and a use case, we can revisit, but I don't think we need to file a separate issue until the need arises, no?

@lightsighter
Contributor

I agree. I think we can safely close this issue for now and reopen it later if it becomes a problem on other kinds of clusters.

@elliottslaughter
Contributor

Closing. I agree that these systems are highly hypothetical and until we have someone who actually has such a system (along with the necessary CUDA support from NVIDIA), it's not worth holding this open.

@elliottslaughter elliottslaughter added this to the 24.03 milestone Mar 15, 2024