Deadlock when running with AMD GPUs #1688
Presuming these backtraces are not changing over time, this is guaranteed to be a bug in AMD's driver. It should never be possible for a thread to be stuck in here:
All calls into ROCm should always return in finite time.
I agree with @lightsighter's assessment that this is likely a ROCm bug, or at least an issue with how ROCm is configured. @mariodirenzo, can you tell us more about your configuration?
I know there are some variables controlling the resources assigned to each process that, by default, are not configured in an optimal way for Legion.
ROCm 6.0.3
This is a node of Tioga (https://hpc.llnl.gov/hardware/compute-platforms/tioga), which has 4 GPUs per node.
I'm using one process with one GPU.
Is this C++ code? Regent doesn't support that ROCm version. For what it's worth, we've been hitting a lot of ROCm issues with S3D, though our symptoms are different (crashes with an out-of-resource message, rather than hangs). The advice we've been given so far has been to test three things:
Overall, we are probably in territory where it would be appropriate to contact Tioga support and ideally get AMD involved in helping you debug this issue.
Yes, this is C++ only
This didn't make any difference. I've also noticed that the backtrace of thread 4 is changing. Sometimes I get
and sometimes I get
If these backtraces are changing within a single run, that would indicate that the code is not deadlocked but is running very slowly. I don't know if this is still applicable, but at one point fills on HIP were known to be extremely slow: #1236. I haven't had the opportunity to check any recent HIP versions to see if it got fixed, but that seems like a relatively self-contained test you could do.
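For illustration, here is a minimal, hypothetical HIP micro-benchmark (not Realm code; the buffer size, element size, and fill value are arbitrary) that times a fill performed as one tiny `hipMemcpyAsync` per element, which is the kind of un-optimized pathway discussed in #1236:

```cpp
// Hypothetical micro-benchmark: fill 1M 16-byte elements with one small
// host-to-device copy per element and time how long that takes.
#include <hip/hip_runtime.h>
#include <chrono>
#include <cstdio>

#define CHECK(x) do { hipError_t e = (x); if (e != hipSuccess) { \
  fprintf(stderr, "HIP error: %s\n", hipGetErrorString(e)); return 1; } } while (0)

int main() {
  constexpr size_t kElems = 1 << 20;   // 1M elements (arbitrary)
  constexpr size_t kElemSize = 16;     // multi-byte fill value (arbitrary)
  unsigned char fill_value[kElemSize] = {0xAB};

  void* d_buf = nullptr;
  CHECK(hipMalloc(&d_buf, kElems * kElemSize));
  hipStream_t stream;
  CHECK(hipStreamCreate(&stream));

  auto start = std::chrono::steady_clock::now();
  // One tiny copy per element: this is the slow pattern being tested.
  for (size_t i = 0; i < kElems; ++i) {
    CHECK(hipMemcpyAsync(static_cast<unsigned char*>(d_buf) + i * kElemSize,
                         fill_value, kElemSize, hipMemcpyHostToDevice, stream));
  }
  CHECK(hipStreamSynchronize(stream));
  auto stop = std::chrono::steady_clock::now();

  printf("per-element fill of %zu x %zu bytes took %.3f s\n", kElems, kElemSize,
         std::chrono::duration<double>(stop - start).count());

  CHECK(hipStreamDestroy(stream));
  CHECK(hipFree(d_buf));
  return 0;
}
```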
I'm not sure about that. The test should take approximately 0.6 s, and I've run it for more than 30 minutes without getting any progress. So, if it is running slowly, it is incredibly slow. Every time I extract a backtrace, I see thread 4 in this function, either at this line https://gitlab.com/StanfordLegion/legion/-/blob/master/runtime/realm/hip/hip_internal.cc#L951 or at https://gitlab.com/StanfordLegion/legion/-/blob/master/runtime/realm/hip/hip_internal.cc#L1019
What makes you think it should run in 0.6 s? Is that time from an NVIDIA machine?
I'm running a lot of similar unit tests. Those that run to completion are executed in approximately
Let me see if I understand. On AMD GPUs, you have some unit tests that finish in 0.6 seconds, but this particular one (which is similar to at least some of the others) does not complete in 30+ minutes. (And all of the unit tests pass in a short amount of time on NVIDIA hardware.) Assuming this is the case, I guess you could do some delta debugging to figure out what's unique or different about the freezing test. The smaller the test case (and the smaller the difference from another working test case), the more likely it is that we'll be able to spot the root cause.
We can try to run the slow test on an NVIDIA GPU with HIP_TARGET=CUDA to see whether it is an issue with the Realm HIP module or with the AMD driver.
Given the description of the symptoms and the backtraces above, I suspect what is happening is that you're hitting one of the un-optimized DMA pathways in the Realm HIP module. The Realm CUDA module has had significant work put into it by people at NVIDIA to optimize DMA transfers and push them into CUDA kernels where possible. A DMA transfer that used to do 1M cudaMemcpy calls and take multiple minutes is now turned into a single CUDA kernel that does 1M loads and stores and takes effectively zero time. Optimizations like that have not been done in the HIP module (and cannot be done by anyone on the Realm team at NVIDIA). The suggestion by @eddy16112 will give us a good indication of whether that is the case.
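As a rough illustration of the kind of optimization described here (a sketch only, not the actual Realm CUDA-module code; the kernel name, fill-value struct, and launch parameters are made up), a single kernel can replace the per-element copies:

```cpp
// Sketch: instead of 1M tiny memcpys, one kernel launch has every thread
// write the fill value into its own element.
#include <hip/hip_runtime.h>

struct FillValue16 { unsigned long long lo, hi; };  // a 16-byte fill pattern

__global__ void fill_kernel(FillValue16* dst, FillValue16 value, size_t count) {
  size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
  if (i < count)
    dst[i] = value;  // one coalesced 16-byte store per element
}

// Launch helper: a single kernel call covers the whole region.
inline hipError_t launch_fill(FillValue16* d_dst, FillValue16 value,
                              size_t count, hipStream_t stream) {
  const unsigned threads = 256;
  const unsigned blocks = (unsigned)((count + threads - 1) / threads);
  fill_kernel<<<blocks, threads, 0, stream>>>(d_dst, value, count);
  return hipGetLastError();
}
```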
That's right.
Sure, I'll try it.
This is due to a bug in Realm's HIP fill code, i.e., out_alc and total_bytes are not updated in this code block:
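For readers without the diff at hand, here is an illustrative sketch of the bug class (simplified stand-in types and names, not the actual Realm code): if the output address-list cursor and the byte counter are not advanced after each chunk, the loop keeps re-filling the same span and never terminates, which looks like a hang.

```cpp
// Sketch of a chunked fill loop. The names out_alc and total_bytes mirror the
// ones mentioned above; the types are simplified stand-ins.
#include <hip/hip_runtime.h>
#include <cstddef>

struct AddressListCursor {
  unsigned char* base;   // current output pointer
  size_t remaining;      // bytes left in the current contiguous span
  void advance(size_t n) { base += n; remaining -= n; }
};

// Assumes the current span covers bytes_to_fill; real code would fetch the
// next span when the current one is exhausted.
inline hipError_t fill_spans(AddressListCursor& out_alc, size_t bytes_to_fill,
                             int fill_byte, hipStream_t stream) {
  size_t total_bytes = 0;
  while (total_bytes < bytes_to_fill) {
    size_t chunk = out_alc.remaining < (bytes_to_fill - total_bytes)
                       ? out_alc.remaining
                       : (bytes_to_fill - total_bytes);
    hipError_t err = hipMemsetAsync(out_alc.base, fill_byte, chunk, stream);
    if (err != hipSuccess) return err;
    // Recording progress here is the essence of the fix: omitting these two
    // lines reproduces the never-terminating behavior described above.
    out_alc.advance(chunk);
    total_bytes += chunk;
  }
  return hipSuccess;
}
```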
Can we get an MR for this? Ideally it would be nice to commit a test for it as well, since we apparently don't cover this in our CI.
https://gitlab.com/StanfordLegion/legion/-/merge_requests/1531
Merged into master.
When running HTR++ unit tests on Tioga, some tests freeze without any error message.
The freeze is deterministic and happens only when an AMD GPU is utilized.
The backtraces of a hanging execution look like this:
Do you have any advice on what might be going wrong?
@elliottslaughter, can you please add this issue to #1032?