Hang at startup on multinode runs #1643
@artempriakhin @eddy16112 @muraj This suggests an issue with the DMA system at start-up.
Thanks! Mario, what was the latest successful commit without the hang? Was it on a
411fb72 works for sure.
It looks like it's waiting for the CUDA IPC active messages to complete. This was changed pretty recently to reduce the number of active messages sent at startup and to improve network scalability at init time. What was the commit that you saw the problem on? I don't see the commit in GitHub's master; on GitLab the SHA is e0fc465
I am seeing the problem on
Is there a commit that I should try?
@mariodirenzo can you try just before e0fc465?
e3b8a6f works.
Yeah, then it's that commit. I need logs to understand what's going on here before I can make a change. Can you give me the output of
This is the output on one node
and this is from the other
Ah, that makes some sense, there's a reporting issue in the active message handler. I can fix that real quick, no worries. Thanks for reporting the issue!
@mariodirenzo Sorry, but could you give me the list of arguments you give Realm? I'm curious how you got into the situation where the CUDA IPC paths are enabled, but no memories are sent to be imported. Just to confirm, by multi-node you mean you're running this across two physically different systems, correct? This sounds like the shared_peers path is still hitting the fallback path and collecting all the ranks as IPC capable, which is fine; these paths are robust to that, unfortunately. But the following message perplexes me:
I would have expected that you would have at least a GPU_FB_MEM allocated for each GPU in each rank, so there should be at least 4 entries that would have attempted to import (it would first look at the hostnames; those wouldn't have checked out, and it would likely still have hung because the initialization signal was still escaped, for which I have a change). I need to figure out how to repro your issue as it might uncover other issues with this change that the simple fix I have won't clean up.
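For readers following along, here is a minimal, hypothetical C++ sketch (not the actual Realm code) of the kind of import-counting pattern being discussed above: initialization waits until every expected CUDA IPC import has been acknowledged by the active-message handler, so a rank that never requests any imports can wait forever unless the zero case is handled explicitly. All names below are made up for illustration.

```cpp
#include <condition_variable>
#include <mutex>

// Hypothetical sketch of the failure mode: if 'expect()' is called with a
// count of zero, on_response() never runs, 'done' is never set, and
// wait_for_imports() blocks forever at startup.
struct IpcImportTracker {
  std::mutex mtx;
  std::condition_variable cv;
  int outstanding = 0;
  bool done = false;

  void expect(int count) {            // number of memories we intend to import
    std::lock_guard<std::mutex> g(mtx);
    outstanding = count;
    // Bug pattern: with count == 0 there is no response to decrement the
    // counter, so 'done' would need to be set right here (or the handshake
    // skipped entirely).
  }

  void on_response() {                // called from the active-message handler
    std::lock_guard<std::mutex> g(mtx);
    if (--outstanding == 0) {
      done = true;
      cv.notify_all();
    }
  }

  void wait_for_imports() {           // called during module initialization
    std::unique_lock<std::mutex> lk(mtx);
    cv.wait(lk, [this] { return done; });
  }
};
```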
@mariodirenzo Could you give the following branch a try?
This is the list of realm flags that I am using
Yes, the run was executed on two different nodes of sapling2.
Sure, I'll give it a go tomorrow.
@muraj I can reproduce the hang on sapling and I can also confirm that your patch fixed the bug. However, I am not sure why the
@eddy16112 My guess is the IPC mailbox path is somehow disabled in this compilation. As to the Realm flags, it looks like there's no -ll:gpu given, so no fbmems were allocated. I'll add a quick escape for that case; we really shouldn't be doing much inside the CUDA module if there are no GPUs assigned.
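As an illustration of the "quick escape" proposed here, the sketch below shows the general idea in hypothetical C++ (not the actual Realm source; the function and type names are placeholders): if no GPUs were assigned, there are no framebuffer memories to export or import, so the CUDA IPC handshake is skipped instead of waiting on messages that will never arrive.

```cpp
#include <vector>

struct Gpu;  // placeholder for whatever per-GPU state the module keeps

// Hypothetical early-exit guard: with no -ll:gpu, no FB memories exist,
// so there is nothing to share over CUDA IPC.
void initialize_cuda_ipc(const std::vector<Gpu *> &gpus) {
  if (gpus.empty())
    return;  // nothing to export or import; don't start the handshake
  // ... exchange CUDA IPC handles with peer ranks here ...
}
```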
@muraj The reason why
Yup, that's expected.
cudaipc-fix fixes the issue. Thanks for working on it.
The branch was merged, so is this resolved now? https://gitlab.com/StanfordLegion/legion/-/commit/85d30f7ded41a58c544ac04ae0f4bb845a7a6b12
Yeah, should be okay now.
@mariodirenzo Go ahead and close this when you're ready.
Thanks again for fixing the issue.
If HTR is compiled on the latest version of master, it hangs at startup when executed on multiple nodes. I think that the top-level task does not even start its execution.
The backtraces obtained on a two-node execution are contained in the attached files
bt_0.log
bt_1.log
The backtraces were produced on sapling, but the problem reproduces on every system that I have tried so far.
@elliottslaughter, can you please add this issue to #1032?