Regression in GPU executions #1550
If this is a single-node run, can you try running inside of |
I can take a look at this |
@cmelone Can you walk me through the steps to get a repro? |
cc: @mariodirenzo |
@elliottslaughter, Can you please add this issue to #1032? |
We've got a reproducer. I am going to root cause it as soon as possible |
The fix is out. I am going to be submitting it soon once reviewed. |
@streichler, do you have any update regarding the merge of Artem's patch into master? |
Yes, we worked through several comments, which have been addressed. Depending on the available bandwidth I am going to ask a few team members to review and approve, so we can submit the patch ASAP. |
The fix has been merged |
Thanks @apryakhin! I'll let our CI run and get back to you |
I've run a quick test on sapling2 and it seems that the current version of control_replication still produces the same failures as before. |
Just responding here, that this commit was merged upstream into control replication as of yesterday:
|
I am on
(these are all errors I'm seeing across different runs) |
I am waiting for CI on this patch: Verified by manual testing on sapling2
That will be merged as soon as possible |
The patch is LGTMd and set for auto-merge once CI passes: I did a number of runs on sapling2 along with @mariodirenzo and can confirm that it fixes the issue. |
So we have a new problem that I am not sure is related to the cuda-dma.
I did not catch this problem yesterday because I was testing the code in debug mode. Any suggestions on how to move forward with debugging? |
Please file another bug for that and we will triage it separately |
This GitHub issue is actually about the GPU regression, so I think we should still triage it here. Would it be possible to get a backtrace on this? Just by looking at the assert we are hitting, I am unable to determine the root cause. I plan to be OOO today but can look as soon as I have online access (hopefully over the weekend). |
The most common cause for that (admittedly fairly unhelpful) error message is that you are |
The executions in debug mode are with bounds checks, and I haven't seen any failures on those. Moreover, the same version of the code used to work fine on |
That makes sense. So although there's no specific evidence to suspect this is due to the cuda-dma bug, |
Yes. The other day I told Artem that the issue was fixed because I tested the code in debug mode, which works fine. As soon as Caetano ran the CI in release mode, the new error message appeared. |
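(Background for the debug/release difference discussed above: the debug runs in this thread include bounds checks on region accesses, while the release runs do not, so an out-of-bounds access can pass silently in release and only surface later as an opaque error. Below is a minimal, hedged C++ sketch of the kind of bug such checks catch; the task, field ID, and deliberate out-of-bounds write are illustrative assumptions, not code from this application.)

```cpp
// Minimal sketch (assumed, not taken from this application) of what a
// bounds-checked Legion accessor catches.
#include "legion.h"
using namespace Legion;

enum FieldIDs { FID_VAL = 0 };

void example_task(const Task *task,
                  const std::vector<PhysicalRegion> &regions,
                  Context ctx, Runtime *runtime)
{
  const FieldAccessor<READ_WRITE, double, 1> acc(regions[0], FID_VAL);
  Rect<1> rect = runtime->get_index_space_domain(
      ctx, task->regions[0].region.get_index_space());

  // In-bounds accesses behave the same in every build.
  for (PointInRectIterator<1> pir(rect); pir(); pir++)
    acc[*pir] = 1.0;

  // One element past the end: a build with bounds checks enabled aborts
  // right here with a clear message, while a plain release build may
  // silently scribble on neighboring memory and only fail much later
  // with an opaque assert or CUDA error.
  acc[Point<1>(rect.hi[0] + 1)] = 2.0;
}
```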
@apryakhin has been able to find the root cause of the regression in the cuda-dma changes. The branch https://gitlab.com/StanfordLegion/legion/-/commits/cuda-dma-fix-2 contains the fix. @cmelone is completing a thorough test of the fix using the https://gitlab.com/StanfordLegion/legion/-/commits/control_replication_staggered_fix?ref_type=heads branch. Preliminary results are promising. |
@apryakhin when you find the root cause, are you adding tests to our Realm test suite so we catch issues like these in the future? By all means, merge the fix once everyone here approves, but I'm concerned that these failures demonstrate a hole in our test coverage, and we'll relapse later if we don't improve the tests as well. |
@elliottslaughter Yes, the fix will come with Realm tests to make sure we don't run into these issues in the future. I do agree that we are missing test coverage here. I am going to prioritize and address it going forward, before merging any more of the cuda-dma changes. |
Everything seems to be working now due to the fix. Can you let us know when it's merged into control replication, and I'll close the issue? Thanks again |
Yes, I will update it here as soon as the fix is merged in. |
The fix has been merged: |
@lightsighter, can you please merge the fix into control replication? |
I've upstreamed master into control replication. |
With the newest version of control replication, a smaller subset of our tests failed. Since we previously verified that @apryakhin's branch didn't throw any errors with our tests, another CUDA bug must have been introduced since. I'm guessing that this commit could be the cause? We're getting errors like
|
No, that commit is a stand-alone new feature. It doesn't change any existing Realm functionality. @muraj to confirm, but I feel pretty confident in that assessment.
Or something changed on the machine or in the way you're testing things. Unfortunately @apryakhin is on vacation next week, so we might need to wait for him to come back to assess. Can you run with the branch again and confirm that it still works without any errors for you? |
@cmelone Can you please confirm whether this branch: is passing or not? As well as how to run and get a reproducer on |
Apologies for the false alarm, Cory. @lightsighter, I did some more testing. When running Legion @ commit … When I move that same configuration to commit … Note that this only happens when we run with CUDA. Let me know if you'd like me to move this to another issue and/or if you need more info. |
Pull and try again. If it still fails then make me a small reproducer on sapling.
You can leave it here for now. |
Note: everything is compiled in debug mode. Once the execution is started, the process freezes pretty much immediately. |
This doesn't appear to be hanging to me. There are real tasks running and making progress. It looks like you built with Regent's bounds checks on so the tasks are very slow as all their memory references are having to go through safe casts, but it is definitely not hanging. |
Maybe the freezing is more stochastic than I thought. Anyway, here is a frozen process
I set the process to run for 24 hours. |
Ok, so it's not actually freezing. It is crashing and you've set
So this does look like one of the crashes from before, not something related to the merge of |
Thanks for the clarifications. Let me outline my testing process:
Note: I also tested with the latest version of control replication (including your fix), and the issues persist. You're right, the merge of this branch may have exposed additional issues that weren't tested before. Let me know if I can provide more details. |
If you build all the device-side code with |
The code is failing on an application assertion that signals bad data is being produced somewhere. The major difference compared to the previous failure mode is that previously data was corrupted everywhere, resulting in all tests failing, whereas in this instance only select configurations fail. We tried running with Legion Spy, and a failing run passed the logical and physical analysis. |
So that at least rules out the runtime analysis. Do you use futures for passing any data around or doing future reductions? |
Yes, we do. |
Can you check that some of your future data is the same? |
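(For context on what "future data" means here: below is a hedged C++ sketch of the two Legion patterns asked about above, passing a value through a plain future and folding per-point results with a future reduction. The task ID, launch domain, and built-in sum reduction constant are assumptions for illustration; the application's actual future usage is not shown in this thread.)

```cpp
// Hedged illustration of Legion futures: a plain future carrying a task's
// return value, and a future reduction folding per-point results from an
// index launch. Task ID and reduction-op constant are assumptions.
#include "legion.h"
using namespace Legion;

Future launch_future_examples(Context ctx, Runtime *runtime,
                              TaskID point_task_id, Domain launch_domain)
{
  // 1) Passing data around: the child task's return value travels back
  //    through a Future; get_result blocks until it is ready.
  TaskLauncher single(point_task_id, TaskArgument(NULL, 0));
  Future f = runtime->execute_task(ctx, single);
  double value = f.get_result<double>();
  (void)value;

  // 2) Future reduction: every point task returns a double and the runtime
  //    folds them with the named reduction operator into one Future.
  IndexTaskLauncher index(point_task_id, launch_domain,
                          TaskArgument(NULL, 0), ArgumentMap());
  Future sum = runtime->execute_index_space(ctx, index,
                                            LEGION_REDOP_SUM_FLOAT64);

  // Comparing sum.get_result<double>() across otherwise identical runs is
  // one way to check whether "future data is the same", as asked above.
  return sum;
}
```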
Ok, I think I pushed a fix for this issue. @cmelone please pull the latest control replication branch and try again. |
thank you for the assistance, Mike and Artem! |
Our GPU test suite last succeeded at Legion commit 61a919f8 and failed last night at da9cefee. I suspect it's due to this merge? We're getting errors like:
I can debug further, just let me know.
@elliottslaughter, could you please add this to #1032?