Realm: Failed to send message #1597
I don't remember the answer, but @syamajala has been seeing the same thing.
These warnings indicate that we are unable to create shared memory between the processes. That hurts performance, but it should not affect correctness. @muraj, do you know why the shared memory setup failed?
After updating the runtime to a newer version of control replication, the
Assigning @eddy16112 and @muraj.
No, they are on different physical nodes.
I saw this in the S3D logs too. Those runs were also one process per node, so there shouldn't have been any warnings about trying to exchange information between processes; the warnings need to be suppressed unless they reflect a real problem.
I reproduced it. It is an issue when creating shared_peers: https://gitlab.com/StanfordLegion/legion/-/blob/master/runtime/realm/gasnet1/gasnet1_module.cc#L633. According to the comments, we conflated the case where GASNetEX cannot detect the shared_peers with the case where there are no shared_peers (e.g., one rank per node). I need to think about how to fix it.
BTW, you can ignore this warning for now. It won't hurt anything.
@mariodirenzo Are you still using gasnet1? I believe we had a discussion about deprecating it at some point; while I don't think there are any immediate plans, to the extent that you still have issues with gasnetex we should work through those.
We are having the same issue even with gasnetex: https://gitlab.com/StanfordLegion/legion/-/blob/master/runtime/realm/gasnetex/gasnetex_internal.cc#L3437-3440. I have a new PR, https://gitlab.com/StanfordLegion/legion/-/merge_requests/1054, that will improve the accuracy of
@elliottslaughter, these executions were performed with the gasnetex network layer, though gasnet1 is still our default option because of the issues listed at #1508. When the
@eddy16112 The GASNet software itself must identify shared-memory peers in some way, because GASNet provides a portable shared-memory bypass to avoid the NIC when moving data between ranks on a node. Perhaps it would be worth learning from what they do? I don't believe I've seen failures like this when just running GASNet PSHM. Maybe @PHHargrove can comment.
@elliottslaughter I am not 100% sure of the context for the question, but here is the portion of the GASNet
@PHHargrove we use
See above. The quoted text "and to construct shared-memory nbrhds" means the implementation of

The "conduit" value is available only for aries- and udp-conduits.

If your containers are (mis)configured so that they all have the same hostid and identical hostnames, then there is currently no mechanism we can use to distinguish them. However, if either of the two is distinct, then the appropriate setting of the environment variable can be used.

The hostid (used by default) is typically derived from the first non-loopback IP address, but can also be set in a configuration file (

If you have suggestions for other simple identifiers we should consider using, please let me know.
If I understand the original issue, and particularly @mariodirenzo's comment at #1597 (comment), this started happening on an actual supercomputer cluster. @mariodirenzo, please correct me if I'm wrong, but I do not believe the machine in question uses containers. (@mariodirenzo, it might be helpful to know which specific machine we are talking about here.)

Therefore, based on what @PHHargrove said in #1597 (comment), it sounds like either the hostname, the hostid, or both are identical across both machines. That seems like a challenging situation to be in, and I'm surprised we haven't had issues before, because as far as I know all production usage of Legion enables GASNet PSHM, which would be vulnerable to the same failure modes.

Since we have not heard about this failure from @mariodirenzo previously with respect to PSHM, it makes me wonder whether (a) I have not adequately understood something and the issue is somewhere else, or (b) the Realm code has an additional flaw beyond what GASNet is doing, which causes Realm to be sensitive to things GASNet is not.
We did not need to know which ranks are on the same physical node before. We now use this information to create a shared memory channel that improves the performance of intra-node copies.
We implemented an IPC mailbox using an AF_UNIX socket. If we can successfully send a message between two ranks via this type of socket, it means they are on the same physical node. Here is our code: https://gitlab.com/StanfordLegion/legion/-/blob/cudaipc_hostname/runtime/realm/runtime_impl.cc#L1548-1584. However, it only works on Linux for now.
@elliottslaughter, I've been seeing these messages from multi-node runs both on Lassen (https://hpc.llnl.gov/hardware/compute-platforms/lassen) and Leonardo (https://wiki.u-gov.it/confluence/display/SCAIUS/UG3.2.1%3A+LEONARDO+Booster+UserGuide).
A couple of thoughts:
It will be merged into the master branch soon.
It is a false alarm. Realm asks GASNet to report whether there are any neighbor ranks (
@elliottslaughter wrote
The hostid can be checked with the following program:

```c
#include <unistd.h>
#include <stdio.h>

int main(void) {
    printf("%lx\n", gethostid());
    return 0;
}
```
Perhaps these APIs could be tweaked to distinguish between a 0 that really means a rank-per-node job and a 0 caused by failure? @PHHargrove?
Feel free to read the sources for both functions and you will find there are no error cases and a

In the mapping procedure itself, it is not possible in general to distinguish "failure" to discover neighbors from genuinely having none.
Right, and I think this is the source of the Realm warnings at the moment. We definitely shouldn't be warning users unnecessarily. Perhaps if we added a flag to tell Realm that it should expect to find other processes on the same node, then we could issue the warning when we don't actually find any neighbor processes; but the default should be not to warn when we can't find any neighbor processes on the same node.
It sounds like we will have (what we believe to be) a much more reliable way of detecting neighbor processes on Linux, based on this comment: #1597 (comment). Since Linux accounts for all of our major supercomputer clusters, that would resolve the issue without the need for any further user input. At that point we can probably disable the warning, because I'm not sure we ever do serious cluster work with Windows or macOS.
We are trying to support Legate on MacOS, and there are a couple of Windows users of Realm at NVIDIA. I'm not sure we can completely write those off, but I agree that we shouldn't be issuing spurious warnings if we can't do precise detection of an issue.
+1 for MacOS support -- for many of our legate-aligned users (perhaps well over 90%), I fully suspect MacOS is viewed as a fundamental part of their day-to-day development environment. While they could drop back to using straight Python, we're pushing them to leverage parallelism from the outset.
Is the macOS use case single-node or multi-node? What GASNet conduit is in use? I ask because if using smp-conduit (on any OS), then it is literally impossible for
In general, MacOS should be single-process per OS instance. However, some CUDA libraries used by Legate require one process per GPU, so if there were ever a MacOS machine with multiple (CUDA) GPUs then we might need to run in a multi-process per OS instance scenario. The odds of there being such a machine in the future seem exceedingly unlikely given that NVIDIA has dropped CUDA support for MacOS (for the moment), so I think we probably don't need to think too hard about this right now. The only other reason you might do multi-process per OS instance on MacOS might be to create one process per NUMA domain, but I don't know of anyone who actually wants that currently.
@mariodirenzo are you still seeing warnings on this one? I believe that
The issue is fixed for the Linux cluster that I am using. Should I keep the discussion open for non-Linux systems? |
The issue wouldn't be a problem on other platforms like Windows, especially since we currently don't support the CUDA module on Windows (it currently doesn't compile). Part of the effort to support Windows should include supporting IPC and the like, so there's no need to file a separate issue for this. On MacOS, IIRC, only older CUDA toolkits are supported (11.6 and older, which is Pascal or older), and I don't expect that to change any time soon. I may be wrong, but I also do not believe peer GPU support is available for OSX, so IPC wouldn't work in this case anyway. If/when there is both support and a use case, we can revisit, but I don't think we need to file a separate issue until the need arises, no?
I agree. I think we can safely close this issue for now and reopen it later if it becomes a problem on other kinds of clusters. |
Closing. I agree that these systems are highly hypothetical and until we have someone who actually has such a system (along with the necessary CUDA support from NVIDIA), it's not worth holding this open. |
I've started seeing the following message when using the current version of `shardrefine`:
Do you know what may cause it?
Is it harmful to the performance or correctness of the calculation?
@elliottslaughter, can you please add this to #1032?