Deadlock when running with AMD GPUs #1688
Presuming these backtraces are not changing over time, this is guaranteed to be a bug in AMD's driver. It should never be possible for a thread to be stuck in here:
All calls into ROCm should always return in finite time.
I agree with @lightsighter's assessment that this is likely a ROCm bug, or at least an issue with how ROCm is configured. @mariodirenzo, can you tell us more about your configuration?
I know there are some variables controlling the resources assigned to each process that, by default, are not configured in an optimal way for Legion.
ROCm 6.0.3
This is a node of Tioga (https://hpc.llnl.gov/hardware/compute-platforms/tioga), which has 4 GPUs per node.
I'm using one process with one GPU.
Is this C++ code? Regent doesn't support that ROCm version. For what it's worth, we've been hitting a lot of ROCm issues with S3D, though our symptoms are different (crashes with an out-of-resource message, rather than hangs). The advice we've been given so far has been to test three things:
Overall, we are probably in territory where it would be appropriate to contact Tioga support and ideally get AMD involved in helping you debug this issue.
Yes, this is C++ only
This didn't make any difference. I've also noticed that the backtrace of thread 4 is changing. Sometimes I get
and sometimes I get
If these backtraces are changing within a single run, that would indicate that the code is not deadlocked but is running very slowly. I don't know if this is still applicable, but at one point fills on HIP were known to be extremely slow: #1236. I haven't had the opportunity to check any recent HIP versions to see if it got fixed, but that seems like a relatively self-contained test you could do.
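For illustration, here is a minimal, hypothetical HIP micro-benchmark (not Realm code; the buffer size, element size, and fill value are arbitrary) that times a fill performed as one tiny `hipMemcpyAsync` per element, which is the kind of un-optimized pathway discussed in #1236:

```cpp
// Hypothetical micro-benchmark: fill 1M 16-byte elements with one small
// host-to-device copy per element and time how long that takes.
#include <hip/hip_runtime.h>
#include <chrono>
#include <cstdio>

#define CHECK(x) do { hipError_t e = (x); if (e != hipSuccess) { \
  fprintf(stderr, "HIP error: %s\n", hipGetErrorString(e)); return 1; } } while (0)

int main() {
  constexpr size_t kElems = 1 << 20;   // 1M elements (arbitrary)
  constexpr size_t kElemSize = 16;     // multi-byte fill value (arbitrary)
  unsigned char fill_value[kElemSize] = {0xAB};

  void* d_buf = nullptr;
  CHECK(hipMalloc(&d_buf, kElems * kElemSize));
  hipStream_t stream;
  CHECK(hipStreamCreate(&stream));

  auto start = std::chrono::steady_clock::now();
  // One tiny copy per element: this is the slow pattern being tested.
  for (size_t i = 0; i < kElems; ++i) {
    CHECK(hipMemcpyAsync(static_cast<unsigned char*>(d_buf) + i * kElemSize,
                         fill_value, kElemSize, hipMemcpyHostToDevice, stream));
  }
  CHECK(hipStreamSynchronize(stream));
  auto stop = std::chrono::steady_clock::now();

  printf("per-element fill of %zu x %zu bytes took %.3f s\n", kElems, kElemSize,
         std::chrono::duration<double>(stop - start).count());

  CHECK(hipStreamDestroy(stream));
  CHECK(hipFree(d_buf));
  return 0;
}
```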
I'm not sure about that. The test should take approximately 0.6 s, and I've run it for more than 30 minutes without getting any progress. So, if it is running slowly, it is incredibly slow. Every time I extract a backtrace, I see thread 4 in this function, either at this line https://gitlab.com/StanfordLegion/legion/-/blob/master/runtime/realm/hip/hip_internal.cc#L951 or at https://gitlab.com/StanfordLegion/legion/-/blob/master/runtime/realm/hip/hip_internal.cc#L1019
What makes you think it should run in 0.6 s? Is that time from an NVIDIA machine?
I'm running a lot of similar unit tests. Those that run to completion are executed in approximately
Let me see if I understand. On AMD GPUs, you have some unit tests that finish in 0.6 seconds, but this particular one (which is similar to at least some of the others) does not complete in 30+ minutes. (And all of the unit tests pass in a short amount of time on NVIDIA hardware.) Assuming this is the case, I guess you could do some delta debugging to figure out what's unique or different about the freezing test. The smaller the test case (and the smaller the difference from another working test case), the more likely it is that we'll be able to spot the root cause.
We can try to run the slow test on an NVIDIA GPU with HIP_TARGET=CUDA to see whether it is an issue with the Realm HIP module or with the AMD driver.
Given the description of the symptoms and the backtraces above, I suspect what is happening is that you're hitting one of the un-optimized DMA pathways in the Realm HIP module. The Realm CUDA module has had significant work put into it by people at NVIDIA to optimize DMA transfers and push them into CUDA kernels where possible. A DMA transfer that used to do 1M cudaMemcpy calls and take multiple minutes is now turned into a single CUDA kernel that does 1M loads and stores and takes effectively zero time. Optimizations like that have not been done in the HIP module (and cannot be done by anyone on the Realm team at NVIDIA). The suggestion by @eddy16112 will give us a good indication of whether that is the case.
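As a rough illustration of the kind of optimization described here (a sketch only, not the actual Realm CUDA-module code; the kernel name, fill-value struct, and launch parameters are made up), a single kernel can replace the per-element copies:

```cpp
// Sketch: instead of 1M tiny memcpys, one kernel launch has every thread
// write the fill value into its own element.
#include <hip/hip_runtime.h>

struct FillValue16 { unsigned long long lo, hi; };  // a 16-byte fill pattern

__global__ void fill_kernel(FillValue16* dst, FillValue16 value, size_t count) {
  size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
  if (i < count)
    dst[i] = value;  // one coalesced 16-byte store per element
}

// Launch helper: a single kernel call covers the whole region.
inline hipError_t launch_fill(FillValue16* d_dst, FillValue16 value,
                              size_t count, hipStream_t stream) {
  const unsigned threads = 256;
  const unsigned blocks = (unsigned)((count + threads - 1) / threads);
  fill_kernel<<<blocks, threads, 0, stream>>>(d_dst, value, count);
  return hipGetLastError();
}
```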
That's right.
Sure, I'll try it.
This is due to a bug in Realm's HIP fill code, i.e., out_alc and total_bytes are not updated in this code block:
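For readers without the diff at hand, here is an illustrative sketch of the bug class (simplified stand-in types and names, not the actual Realm code): if the output address-list cursor and the byte counter are not advanced after each chunk, the loop keeps re-filling the same span and never terminates, which looks like a hang.

```cpp
// Sketch of a chunked fill loop. The names out_alc and total_bytes mirror the
// ones mentioned above; the types are simplified stand-ins.
#include <hip/hip_runtime.h>
#include <cstddef>

struct AddressListCursor {
  unsigned char* base;   // current output pointer
  size_t remaining;      // bytes left in the current contiguous span
  void advance(size_t n) { base += n; remaining -= n; }
};

// Assumes the current span covers bytes_to_fill; real code would fetch the
// next span when the current one is exhausted.
inline hipError_t fill_spans(AddressListCursor& out_alc, size_t bytes_to_fill,
                             int fill_byte, hipStream_t stream) {
  size_t total_bytes = 0;
  while (total_bytes < bytes_to_fill) {
    size_t chunk = out_alc.remaining < (bytes_to_fill - total_bytes)
                       ? out_alc.remaining
                       : (bytes_to_fill - total_bytes);
    hipError_t err = hipMemsetAsync(out_alc.base, fill_byte, chunk, stream);
    if (err != hipSuccess) return err;
    // Recording progress here is the essence of the fix: omitting these two
    // lines reproduces the never-terminating behavior described above.
    out_alc.advance(chunk);
    total_bytes += chunk;
  }
  return hipSuccess;
}
```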
Can we get an MR for this? Ideally it would be nice to commit a test for it as well, since we apparently don't cover this in our CI.
https://gitlab.com/StanfordLegion/legion/-/merge_requests/1531
Merged into master.
When running HTR++ unit tests on Tioga, some tests freeze without any error message.
The freeze is deterministic and happens only when an AMD GPU is utilized.
The backtraces of a hanging execution look like this:
Do you have any advice on what might be going wrong?
@elliottslaughter, can you please add this issue to #1032?