Hang at startup on multinode runs #1643
@artempriakhin @eddy16112 @muraj This suggests an issue with the DMA system at start-up.
Thanks! Mario, what was the latest successful commit without the hang? Was it on a
411fb72 works for sure.
It looks like it's waiting for the CUDA IPC active messages to complete. This was changed pretty recently to reduce the number of active messages sent at startup and to improve network scalability at init time. What was the commit that you saw the problem on? I don't see the commit in GitHub's master; on GitLab the SHA is e0fc465
I am seeing the problem on
Is there a commit that I should try?
@mariodirenzo can you try just before e0fc465?
e3b8a6f works.
Yeah, then it's that commit. I need logs to understand what's going on here before I can make a change. Can you give me the output of
This is the output on one node
and this is from the other
Ah, that makes some sense, there's a reporting issue in the active message handler. I can fix that real quick, no worries. Thanks for reporting the issue!
@mariodirenzo Sorry, but could you give me the list of arguments you give Realm? I'm curious how you got into the situation where the CUDA IPC paths are enabled, but no memories are sent to be imported. Just to confirm, by multi-node you mean you're running this across two physically different systems, correct? This sounds like the shared_peers path is still hitting the fallback path and collecting all the ranks as IPC capable, which is fine; these paths are robust to that, unfortunately. But the following message perplexes me:
I would have expected that you would have at least a GPU_FB_MEM allocated for each GPU in each rank, so there should be at least 4 entries that would have attempted to import (it would first look at the hostnames; those wouldn't have checked out, and it would likely still have hung because the initialization signal was still escaped, for which I have a change). I need to figure out how to repro your issue as it might uncover other issues with this change that the simple fix I have won't clean up.
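For readers following along, here is a minimal, hypothetical C++ sketch (not the actual Realm code) of the kind of import-counting pattern being discussed above: initialization waits until every expected CUDA IPC import has been acknowledged by the active-message handler, so a rank that never requests any imports can wait forever unless the zero case is handled explicitly. All names below are made up for illustration.

```cpp
#include <condition_variable>
#include <mutex>

// Hypothetical sketch of the failure mode: if 'expect()' is called with a
// count of zero, on_response() never runs, 'done' is never set, and
// wait_for_imports() blocks forever at startup.
struct IpcImportTracker {
  std::mutex mtx;
  std::condition_variable cv;
  int outstanding = 0;
  bool done = false;

  void expect(int count) {            // number of memories we intend to import
    std::lock_guard<std::mutex> g(mtx);
    outstanding = count;
    // Bug pattern: with count == 0 there is no response to decrement the
    // counter, so 'done' would need to be set right here (or the handshake
    // skipped entirely).
  }

  void on_response() {                // called from the active-message handler
    std::lock_guard<std::mutex> g(mtx);
    if (--outstanding == 0) {
      done = true;
      cv.notify_all();
    }
  }

  void wait_for_imports() {           // called during module initialization
    std::unique_lock<std::mutex> lk(mtx);
    cv.wait(lk, [this] { return done; });
  }
};
```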
@mariodirenzo Could you give the following branch a try?
This is the list of realm flags that I am using
Yes, the run was executed on two different nodes of sapling2.
Sure, I'll give it a go tomorrow.
@muraj I can reproduce the hang on sapling and I can also confirm that your patch fixed the bug. However, I am not sure why the
@eddy16112 My guess is the IPC mailbox path is somehow disabled in this compilation. As to the Realm flags, it looks like there's no -ll:gpu given, so no fbmems were allocated. I'll add a quick escape for that case; we really shouldn't be doing much inside the CUDA module if there are no GPUs assigned.
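As an illustration of the "quick escape" proposed here, the sketch below shows the general idea in hypothetical C++ (not the actual Realm source; the function and type names are placeholders): if no GPUs were assigned, there are no framebuffer memories to export or import, so the CUDA IPC handshake is skipped instead of waiting on messages that will never arrive.

```cpp
#include <vector>

struct Gpu;  // placeholder for whatever per-GPU state the module keeps

// Hypothetical early-exit guard: with no -ll:gpu, no FB memories exist,
// so there is nothing to share over CUDA IPC.
void initialize_cuda_ipc(const std::vector<Gpu *> &gpus) {
  if (gpus.empty())
    return;  // nothing to export or import; don't start the handshake
  // ... exchange CUDA IPC handles with peer ranks here ...
}
```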
@muraj The reason why
Yup, that's expected.
cudaipc-fix fixes the issue. Thanks for working on it.
The branch was merged, so is this resolved now? https://gitlab.com/StanfordLegion/legion/-/commit/85d30f7ded41a58c544ac04ae0f4bb845a7a6b12
Yeah, should be okay now.
@mariodirenzo Go ahead and close this when you're ready.
Thanks again for fixing the issue.
If HTR is compiled on the latest version of master, it hangs at startup when executed on multiple nodes. I think that the top-level task does not even start its execution.
The backtraces obtained on a two-node execution are contained in the attached files
bt_0.log
bt_1.log
The backtraces were produced on sapling, but the problem reproduces on every system that I have tried so far.
@elliottslaughter, can you please add this issue to #1032?