Hang at start up on multinode runs #1643

Closed
Tracked by #1032
mariodirenzo opened this issue Mar 6, 2024 · 23 comments

@mariodirenzo

If HTR is compiled on the latest version of master, it hangs at startup when executed on multiple nodes.
I think that the top-level task does not even start its execution.
The backtraces obtained on a two-node execution are in the attached files:
bt_0.log
bt_1.log

The backtraces are produced on sapling but the problem reproduces on every system that I have tried so far.

@elliottslaughter, can you please add this issue to #1032?

@lightsighter
Contributor

@artempriakhin @eddy16112 @muraj This suggests an issue with the DMA system at start-up.

@apryakhin
Contributor

Thanks! Mario, what was the latest successful commit without the hang? Was it on a control_replication branch or master branch after control_replication merge? Are you running one of your standard tests?

@mariodirenzo
Author

411fb72 works for sure.
I haven't had time to bisect further between the current head of master and that commit.

@muraj

muraj commented Mar 6, 2024

It looks like it's waiting for the CUDA IPC active messages to complete. This was changed pretty recently to reduce the number of active messages sent on start-up and to improve network scalability at init time. What was the commit that you saw the problem on? I don't see the commit with that change in GitHub's master; in GitLab the SHA is e0fc465.

@mariodirenzo
Author

I am seeing the problem on b948d941b50d2bfdd01efa4f4eed5bac41b429b4

@mariodirenzo
Author

Is there a commit that I should try?

@muraj

muraj commented Mar 6, 2024

@mariodirenzo can you try just before e0fc465?

@mariodirenzo
Author

e3b8a6f works

@muraj

muraj commented Mar 6, 2024

Yeah, then it's that commit. I need logs to understand what's going on here before I can make a change. Can you give me the output of -level gpu=1 -level cudaipc=1?

@mariodirenzo
Author

This is the output on one node

[0 - 7f4051dffc80]    0.000000 {2}{gpu}: dynamically loading libnvidia-ml.so
[0 - 7f4051dffc80]    0.000000 {2}{gpu}: GPU #0: Tesla P100-SXM2-16GB (6.0) 16276 MB
[0 - 7f4051dffc80]    0.000000 {2}{gpu}: GPU #1: Tesla P100-SXM2-16GB (6.0) 16276 MB
[0 - 7f4051dffc80]    0.000000 {2}{gpu}: GPU #2: Tesla P100-SXM2-16GB (6.0) 16276 MB
[0 - 7f4051dffc80]    0.000000 {2}{gpu}: GPU #3: Tesla P100-SXM2-16GB (6.0) 16276 MB
[0 - 7f4051dffc80]    0.000000 {2}{gpu}: GPU #0 local memory: 366080 MB/s, 13 ns
[0 - 7f4051dffc80]    0.000000 {2}{gpu}: p2p access from device 0 to device 1 bandwidth: 25000 MB/s latency: 100 ns
[0 - 7f4051dffc80]    0.000000 {2}{gpu}: p2p access from device 0 to device 2 bandwidth: 25000 MB/s latency: 100 ns
[0 - 7f4051dffc80]    0.000000 {2}{gpu}: p2p access from device 0 to device 3 bandwidth: 50000 MB/s latency: 100 ns
[0 - 7f4051dffc80]    0.000000 {2}{gpu}: GPU #1 local memory: 366080 MB/s, 13 ns
[0 - 7f4051dffc80]    0.000000 {2}{gpu}: p2p access from device 1 to device 0 bandwidth: 25000 MB/s latency: 100 ns
[0 - 7f4051dffc80]    0.000000 {2}{gpu}: p2p access from device 1 to device 2 bandwidth: 50000 MB/s latency: 100 ns
[0 - 7f4051dffc80]    0.000000 {2}{gpu}: p2p access from device 1 to device 3 bandwidth: 25000 MB/s latency: 100 ns
[0 - 7f4051dffc80]    0.000000 {2}{gpu}: GPU #2 local memory: 366080 MB/s, 13 ns
[0 - 7f4051dffc80]    0.000000 {2}{gpu}: p2p access from device 2 to device 0 bandwidth: 25000 MB/s latency: 100 ns
[0 - 7f4051dffc80]    0.000000 {2}{gpu}: p2p access from device 2 to device 1 bandwidth: 50000 MB/s latency: 100 ns
[0 - 7f4051dffc80]    0.000000 {2}{gpu}: p2p access from device 2 to device 3 bandwidth: 25000 MB/s latency: 100 ns
[0 - 7f4051dffc80]    0.000000 {2}{gpu}: GPU #3 local memory: 366080 MB/s, 13 ns
[0 - 7f4051dffc80]    0.000000 {2}{gpu}: p2p access from device 3 to device 0 bandwidth: 50000 MB/s latency: 100 ns
[0 - 7f4051dffc80]    0.000000 {2}{gpu}: p2p access from device 3 to device 1 bandwidth: 25000 MB/s latency: 100 ns
[0 - 7f4051dffc80]    0.000000 {2}{gpu}: p2p access from device 3 to device 2 bandwidth: 25000 MB/s latency: 100 ns
[0 - 7f4051dffc80]    0.001506 {4}{threads}: reservation ('OMP1 proc 1d00000000000004 (worker 10)') cannot be satisfied
[0 - 7f4051dffc80]    0.004134 {2}{cudaipc}: Sending cuda ipc handles to 1 peers
[0 - 7f404867dc80]    0.025525 {2}{cudaipc}: Sender 1 sent nothing to import

and this is from the other

[1 - 7fa0321c8c80]    0.000000 {2}{gpu}: dynamically loading libnvidia-ml.so
[1 - 7fa0321c8c80]    0.000000 {2}{gpu}: GPU #0: Tesla P100-SXM2-16GB (6.0) 16276 MB
[1 - 7fa0321c8c80]    0.000000 {2}{gpu}: GPU #1: Tesla P100-SXM2-16GB (6.0) 16276 MB
[1 - 7fa0321c8c80]    0.000000 {2}{gpu}: GPU #2: Tesla P100-SXM2-16GB (6.0) 16276 MB
[1 - 7fa0321c8c80]    0.000000 {2}{gpu}: GPU #3: Tesla P100-SXM2-16GB (6.0) 16276 MB
[1 - 7fa0321c8c80]    0.000000 {2}{gpu}: GPU #0 local memory: 366080 MB/s, 13 ns
[1 - 7fa0321c8c80]    0.000000 {2}{gpu}: p2p access from device 0 to device 1 bandwidth: 25000 MB/s latency: 100 ns
[1 - 7fa0321c8c80]    0.000000 {2}{gpu}: p2p access from device 0 to device 2 bandwidth: 25000 MB/s latency: 100 ns
[1 - 7fa0321c8c80]    0.000000 {2}{gpu}: p2p access from device 0 to device 3 bandwidth: 50000 MB/s latency: 100 ns
[1 - 7fa0321c8c80]    0.000000 {2}{gpu}: GPU #1 local memory: 366080 MB/s, 13 ns
[1 - 7fa0321c8c80]    0.000000 {2}{gpu}: p2p access from device 1 to device 0 bandwidth: 25000 MB/s latency: 100 ns
[1 - 7fa0321c8c80]    0.000000 {2}{gpu}: p2p access from device 1 to device 2 bandwidth: 50000 MB/s latency: 100 ns
[1 - 7fa0321c8c80]    0.000000 {2}{gpu}: p2p access from device 1 to device 3 bandwidth: 25000 MB/s latency: 100 ns
[1 - 7fa0321c8c80]    0.000000 {2}{gpu}: GPU #2 local memory: 366080 MB/s, 13 ns
[1 - 7fa0321c8c80]    0.000000 {2}{gpu}: p2p access from device 2 to device 0 bandwidth: 25000 MB/s latency: 100 ns
[1 - 7fa0321c8c80]    0.000000 {2}{gpu}: p2p access from device 2 to device 1 bandwidth: 50000 MB/s latency: 100 ns
[1 - 7fa0321c8c80]    0.000000 {2}{gpu}: p2p access from device 2 to device 3 bandwidth: 25000 MB/s latency: 100 ns
[1 - 7fa0321c8c80]    0.000000 {2}{gpu}: GPU #3 local memory: 366080 MB/s, 13 ns
[1 - 7fa0321c8c80]    0.000000 {2}{gpu}: p2p access from device 3 to device 0 bandwidth: 50000 MB/s latency: 100 ns
[1 - 7fa0321c8c80]    0.000000 {2}{gpu}: p2p access from device 3 to device 1 bandwidth: 25000 MB/s latency: 100 ns
[1 - 7fa0321c8c80]    0.000000 {2}{gpu}: p2p access from device 3 to device 2 bandwidth: 25000 MB/s latency: 100 ns
[1 - 7fa0321c8c80]    0.018090 {4}{threads}: reservation ('OMP1 proc 1d00010000000004 (worker 10)') cannot be satisfied
[1 - 7fa028a37c80]    0.020440 {2}{cudaipc}: Sender 0 sent nothing to import
[1 - 7fa0321c8c80]    0.025115 {2}{cudaipc}: Sending cuda ipc handles to 1 peers

@muraj

muraj commented Mar 6, 2024

Ah, that makes some sense, there's a reporting issue in the active message handler. I can fix that real quick, no worries. Thanks for reporting the issue!

@muraj

muraj commented Mar 6, 2024

@mariodirenzo Sorry, but could you give me the list of arguments you give Realm? I'm curious how you got into the situation where the CUDA IPC paths are enabled but no memories are sent to be imported. Just to confirm: by multi-node, you mean you're running this across two physically different systems, correct? This sounds like the shared_peers path is still hitting the fallback path and collecting all the ranks as IPC capable, which is fine; these paths are robust to that, unfortunately. But the following message perplexes me:

[1 - 7fa028a37c80] 0.020440 {2}{cudaipc}: Sender 0 sent nothing to import

I would have expected that you would have at least a GPU_FB_MEM allocated for each GPU in each rank, so there should be at least 4 entries that it would have attempted to import. (It would first look at the hostnames; those wouldn't have checked out, and it would likely still have hung because the initialization signal was still escaped, which I have a change for.)

I need to figure out how to reproduce your issue, as it might uncover other problems with this change that the simple fix I have won't clean up.
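
For illustration only, here is one generic way a start-up wait can hang when a peer legitimately sends an empty payload. The thread does not show the actual Realm handler, so every name below (`StartupCountdown`, `handle_message_buggy`, `handle_message_fixed`) is hypothetical and the code is a sketch under that assumption, not the real fix.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a start-up barrier: it becomes ready once every
// expected peer has reported, whether or not that peer had anything to share.
struct StartupCountdown {
  std::size_t remaining;
  explicit StartupCountdown(std::size_t expected_peers)
      : remaining(expected_peers) {}
  void peer_reported() {
    if (remaining > 0) --remaining;
  }
  bool ready() const { return remaining == 0; }
};

// Hypothetical handler for one peer's message. `handles` is whatever the
// peer offered for import; it may legitimately be empty.
// Buggy variant: returns early on an empty payload and never reports.
inline void handle_message_buggy(StartupCountdown &countdown,
                                 const std::vector<int> &handles) {
  if (handles.empty()) return;  // countdown never decremented: waiter hangs
  // ... import the handles here ...
  countdown.peer_reported();
}

// Fixed variant: always reports, even when there is nothing to import.
inline void handle_message_fixed(StartupCountdown &countdown,
                                 const std::vector<int> &handles) {
  if (!handles.empty()) {
    // ... import the handles here ...
  }
  countdown.peer_reported();
}
```

In this sketch, a thread that waits on `ready()` would block forever with the buggy handler whenever a peer has nothing to import, which matches the shape of the "sent nothing to import" log line followed by a hang.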

@muraj

muraj commented Mar 6, 2024

@mariodirenzo Could you give the following branch a try?
https://gitlab.com/StanfordLegion/legion/-/tree/cperry/cudaipc-fix

@mariodirenzo
Author

> Sorry, but could you give me the list of arguments you give Realm?

This is the list of realm flags that I am using
-ll:cpu 1 -ll:ocpu 2 -ll:onuma 1 -ll:othr 15 -ll:ostack 8 -ll:util 1 -ll:io 1 -ll:bgwork 1 -ll:cpu_bgwork 100 -ll:util_bgwork 100 -ll:csize 20000 -lg:eager_alloc_percentage 30 -ll:rsize 512 -ll:ib_rsize 512 -ll:gsize 0 -ll:stacksize 8 -lg:sched -1 -lg:hysteresis 0

> Just to confirm, by multi-node, you mean you're running this across two physically different systems correct?

Yes, the run was executed on two different nodes of sapling2

> Could you give the following branch a try?

Sure, I'll give it a go tomorrow

@eddy16112
Contributor

@muraj I can reproduce the hang on sapling, and I can also confirm that your patch fixes the bug. However, I am not sure why shared_peers is not empty. We use the ipc mailbox to create shared_peers, so it should be robust on bare-metal machines. I will need to take a look at it.

@muraj

muraj commented Mar 6, 2024

@eddy16112 My guess is the ipc mailbox path is somehow disabled in this compilation. As to the realm flags, it looks like there's no -ll:gpu given, so no fbmems were allocated. I'll add a quick escape for that case; we really shouldn't be doing much inside the cuda module if there are no GPUs assigned.
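
As a rough sketch of the "quick escape" idea, the check below skips the IPC handle exchange when the command line requests no GPUs. Only the `-ll:gpu` flag comes from the thread; the helper names `requested_gpus` and `should_exchange_ipc_handles` are hypothetical, and the real module may decide this differently (for example, from its parsed configuration rather than raw arguments).

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <string>
#include <vector>

// Returns the GPU count requested via "-ll:gpu N", or 0 when the flag is
// absent (assumed default for this sketch).
inline int requested_gpus(const std::vector<std::string> &args) {
  for (std::size_t i = 0; i + 1 < args.size(); ++i) {
    if (args[i] == "-ll:gpu") return std::atoi(args[i + 1].c_str());
  }
  return 0;
}

// Hypothetical policy: only exchange CUDA IPC handles when at least one GPU
// is in use, since otherwise there is no framebuffer memory to share.
inline bool should_exchange_ipc_handles(const std::vector<std::string> &args) {
  return requested_gpus(args) > 0;
}
```

With the flag list posted earlier in the thread, which sets `-ll:gsize 0` and has no `-ll:gpu`, this check would return false and the exchange would be skipped.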

@eddy16112
Contributor

@muraj shared_peers is not empty because shared memory is not enabled, so we fall back to relying on the network module to report shared_peers. GASNetEX reports an empty shared_peers, which is correct. However, due to the logic here https://gitlab.com/StanfordLegion/legion/-/blob/master/runtime/realm/gasnetex/gasnetex_internal.cc?ref_type=heads#L3437, we do not know whether an empty shared_peers means GASNetEX cannot detect it or there are indeed no shared peers, so we set shared_peers to all_peers.
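
The fallback described above can be sketched roughly as follows. This is not the code at the linked line; `NodeID` and `resolve_shared_peers` are hypothetical names chosen for illustration.

```cpp
#include <cassert>
#include <set>

using NodeID = int;  // hypothetical alias for a rank identifier

// If the network layer reports no shared peers, the caller cannot tell
// "none exist" from "detection is unsupported", so it conservatively treats
// every peer as potentially shared.
inline std::set<NodeID> resolve_shared_peers(const std::set<NodeID> &reported,
                                             const std::set<NodeID> &all_peers) {
  return reported.empty() ? all_peers : reported;
}
```

The consequence of this conservative choice is that code consuming the result has to tolerate peers that turn out not to share a node, which is why the CUDA IPC path ended up contacting a rank on a different machine.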

@muraj

muraj commented Mar 6, 2024

yup, that's expected.

@mariodirenzo
Author

cudaipc-fix fixes the issue. Thanks for working on it

@elliottslaughter
Contributor

The branch was merged, so is this resolved now? https://gitlab.com/StanfordLegion/legion/-/commit/85d30f7ded41a58c544ac04ae0f4bb845a7a6b12

@muraj

muraj commented Mar 9, 2024

Yeah, should be okay now.

@elliottslaughter
Contributor

@mariodirenzo Go ahead and close this when you're ready.

@mariodirenzo
Author

Thanks again for fixing the issue

@elliottslaughter added this to the 24.03 milestone on Mar 15, 2024