
Regression in GPU executions #1550

Closed
Tracked by #1032
cmelone opened this issue Sep 12, 2023 · 48 comments
@cmelone
Contributor

cmelone commented Sep 12, 2023

Our GPU test suite last succeeded at Legion commit 61a919f8 and failed last night at da9cefee. I suspect it's due to this merge?

We're getting errors like:

prometeo_ConstPropMix.exec: /home/hpcc/gitlabci/psaap-ci/artifacts/5065111388/legion/runtime/realm/cuda/cuda_module.cc:370: bool Realm::Cuda::GPUStream::reap_events(Realm::TimeLimit): Assertion `0' failed.
prometeo_ConstPropMix.exec: /home/hpcc/gitlabci/psaap-ci/artifacts/5065111388/legion/runtime/realm/cuda/cuda_module.cc:370: bool Realm::Cuda::GPUStream::reap_events(Realm::TimeLimit): Assertion `0' failed.
[0 - 7f55f347ec80]    2.785033 {6}{gpu}: CUDA error reported on GPU 0: device-side assert triggered (CUDA_ERROR_ASSERT)
[0 - 7f55e16f8c80]    2.785032 {6}{gpu}: CUDA error reported on GPU 0: device-side assert triggered (CUDA_ERROR_ASSERT)
[0 - 7f55f348ac80]    2.785078 {6}{gpu}: CUDA error reported on GPU 0: device-side assert triggered (CUDA_ERROR_ASSERT)

I can debug further, just let me know.

@elliottslaughter, could you please add this to #1032?

@streichler streichler self-assigned this Sep 12, 2023
@streichler
Contributor

If this is a single-node run, can you try running inside of cuda-gdb and see if you get better information about the device-side asserts?
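
For reference, here is a minimal sketch of such a cuda-gdb session, assuming the single-node prometeo_ConstPropMix.exec invocation and the -ll:gpu 1 flag seen elsewhere in this thread; adjust the binary and arguments to your setup.

```
# run the failing binary under cuda-gdb on a single node
cuda-gdb --args ./prometeo_ConstPropMix.exec -ll:gpu 1
(cuda-gdb) run
# when the device-side assert fires, cuda-gdb stops in the offending kernel
(cuda-gdb) info cuda kernels
(cuda-gdb) backtrace
```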

@apryakhin
Contributor

I can take a look at this

@apryakhin
Contributor

@cmelone Can you walk me through the steps to get a repro?

@cmelone
Contributor Author

cmelone commented Sep 12, 2023

cc: @mariodirenzo

@mariodirenzo

@elliottslaughter, Can you please add this issue to #1032?

@apryakhin
Contributor

We've got a reproducer. I am going to root cause it as soon as possible

@apryakhin
Contributor

The fix is out for review. I will submit it as soon as it has been reviewed.

@mariodirenzo

@streichler, do you have any update regarding the merge of Artem's patch into master?

@apryakhin
Contributor

Yes, we worked through several review comments, which have all been addressed:
https://gitlab.com/StanfordLegion/legion/-/merge_requests/926

Depending on available bandwidth, I am going to ask a few team members to review and approve so we can submit the patch ASAP.

@apryakhin
Contributor

The fix has been merged

@cmelone
Contributor Author

cmelone commented Sep 21, 2023

Thanks @apryakhin! I'll let our CI run and get back to you

@mariodirenzo

I've run a quick test on sapling2 and it seems that the current version of control_replication still produces the same failures as before.

@lightsighter
Contributor

Just responding here that this commit was merged upstream into control replication as of yesterday:

commit d55d7c17c80ad6615be9339a06e87d8de89bedce
Merge: ee49c0001 fac5e48bf
Author: apryakhin <[email protected]>
Date:   Thu Sep 21 18:23:30 2023 +0000

    Merge branch 'realm-fix-htr' into 'master'
    
    Realm: fix detection of transpose use cases
    
    See merge request StanfordLegion/legion!926

@cmelone
Contributor Author

cmelone commented Sep 26, 2023

I am on 3a9a37a and still getting similar errors.

CU: CUDA_DRIVER_FNPTR(cuMemcpy3DAsync)(&cuda_copy, stream->get_stream()) = 1 (CUDA_ERROR_INVALID_VALUE): invalid argument
...
prometeo_ConstPropMix.exec: /home/hpcc/gitlabci/psaap-ci/artifacts/5163946925/legion/runtime/realm/cuda/cuda_module.cc:370: bool Realm::Cuda::GPUStream::reap_events(Realm::TimeLimit): Assertion `0' failed.
[0 - 7f3a03bffc80]    0.545906 {6}{gpu}: CUDA error reported on GPU 0: device-side assert triggered (CUDA_ERROR_ASSERT)
...
[0 - 7f9f3ad94c80]    1.620636 {6}{gpu}: CUDA error reported on GPU 0: device-side assert triggered (CUDA_ERROR_ASSERT)
CU: cuLaunchKernel(f, grid_dim.x, grid_dim.y, grid_dim.z, block_dim.x, block_dim.y, block_dim.z, shared_memory, stream, args, NULL) = 710 (CUDA_ERROR_ASSERT): device-side assert triggered

(these are all errors I'm seeing across different runs)

@apryakhin
Contributor

I am waiting for CI on this patch:

Verified by manual testing on sapling2

...
.../htr3/unitTests/metricTest$ rm -rf PeriodicDir; srun -p gpu -N 1 ./metricTest_Periodic.exec -ll:gpu 1
...
metricTest_Periodic: TEST OK!

That will be merged as soon as possible

@apryakhin
Contributor

The patch is LGTM'd and set for auto-merge once CI passes.

I did a number of runs on sapling2 along with @mariodirenzo and can confirm that it fixes the issue.

@mariodirenzo

We have a new problem that I am not sure is related to the cuda-dma changes.
The code works fine if everything (Legion + HTR) is compiled in debug mode.
As soon as things are compiled in release mode, it starts producing the following errors:

/home/mariodr/legion3/runtime/realm/inst_layout.inl:1820: FT *Realm::MultiAffineAccessor<FT, N, T>::ptr(const Realm::Point<N, T> &) [with FT = MyArray<double, 3>; int N = 3; T = long long]: block: [0,0,0], thread: [0,7,7] Assertion `i->opcode() == PieceLookup::Opcodes::OP_SPLIT1` failed.

I did not catch this problem yesterday because I was testing the code in debug mode.

Any suggestions on how to move forward with debugging this?

@apryakhin
Contributor

apryakhin commented Sep 29, 2023

Please file another bug for that and we will triage it separately

@apryakhin
Contributor

This GitHub issue is actually about the GPU regression, so I think we should still triage it here. Would it be possible to get a backtrace for this? Just by looking at the assert we are hitting, I am unable to determine the root cause. I plan to be OOO today but can look as soon as I have online access (hopefully over the weekend).

@streichler
Contributor

The most common cause for that (admittedly fairly unhelpful) error message is that you are attempting an access on a point that is outside the bounds of the multi-affine instance. Can you try running with bounds checks enabled and/or break in with a debugger to see what point the unhappy thread is trying to access, and then determine whether that point is expected to be present in the instance?
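
As a sketch, bounds checking is typically enabled when building Legion; the flag names below are assumptions about the usual Legion build options (check runtime.mk or the CMake options for the exact spelling).

```
# Make-based build: rebuild Legion (and the application) with bounds checks
make clean
make DEBUG=1 BOUNDS_CHECKS=1 -j
# CMake-based build (option name assumed):
cmake -DCMAKE_BUILD_TYPE=Debug -DLegion_BOUNDS_CHECKS=ON ..
```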

@mariodirenzo

The executions in debug mode are run with bounds checks, and I haven't seen any failures in those. Moreover, the same version of the code used to work fine on 61a919f8.
The most likely explanation I see is that some corrupted data is reaching the kernel when things run faster in release mode. Since we compute the stencil point based on one data field, if that field is corrupted, the offset to the stencil point will be wrong.

@streichler
Contributor

That makes sense. So although there's no specific evidence to suspect this is due to the cuda-dma bug, there's definitely some chance it has the same root cause. Did we think we had all of the cuda-dma-related fixes in this run?

@mariodirenzo

Yes. The other day I told Artem that the issue was fixed because I had tested the code in debug mode, which works fine. As soon as Caetano ran the CI in release mode, the new error message appeared.

@mariodirenzo

@apryakhin has been able to find the root cause of the regression in the cuda-dma changes. The branch https://gitlab.com/StanfordLegion/legion/-/commits/cuda-dma-fix-2 contains the fix.

@cmelone is completing a thorough test of the fix using the https://gitlab.com/StanfordLegion/legion/-/commits/control_replication_staggered_fix?ref_type=heads branch. Preliminary results are promising.

@elliottslaughter
Contributor

@apryakhin when you find the root cause, are you adding tests to our Realm test suite so we catch issues like these in the future? By all means, merge the fix once everyone here approves, but I'm concerned that these failures demonstrate a hole in our test coverage, and we'll relapse later if we don't improve the tests as well.

@apryakhin
Contributor

@elliottslaughter Yes, the fix will come with Realm tests to make sure we don't run into these issues in the future. I agree that we are missing test coverage here. I am going to prioritize and address that going forward, before merging any more cuda-dma changes.

@cmelone
Contributor Author

cmelone commented Oct 4, 2023

Everything seems to be working now due to the fix. Can you let us know when it's merged into control replication, and I'll close the issue?

Thanks again

@apryakhin
Contributor

Yes, I will update it here as soon as the fix is merged in.

@apryakhin
Contributor

The fix has been merged.

@mariodirenzo

@lightsighter, can you please merge the fix into control replication?

@lightsighter
Contributor

I've upstreamed master into control replication.

@cmelone
Contributor Author

cmelone commented Oct 6, 2023

With the newest version of control replication, a smaller subset of our tests failed. Since we previously verified that @apryakhin's branch didn't throw any errors with our tests, another CUDA bug must have been introduced since.

I'm guessing that this commit could be the cause?

We're getting errors like

/legion/runtime/realm/cuda/cuda_module.cc:370: bool Realm::Cuda::GPUStream::reap_events(Realm::TimeLimit): Assertion `0' failed.
[0 - 7f0e42dd5c80]    4.777734 {6}{gpu}: CUDA error reported on GPU 0: device-side assert triggered (CUDA_ERROR_ASSERT)

@lightsighter
Contributor

I'm guessing that this commit could be the cause?

No, that commit is a stand-alone new feature. It doesn't change any existing Realm functionality. @muraj to confirm, but I feel pretty confident in that assessment.

With the newest version of control replication, a smaller subset of our tests failed. Since we previously verified that @apryakhin's branch didn't throw any errors with our tests, another CUDA bug must have been introduced since.

Or something changed on the machine or in the way you're testing things. Unfortunately @apryakhin is on vacation next week, so we might need to wait for him to come back to assess. Can you run with the branch again and confirm that it still works without any errors for you?

@apryakhin
Contributor

@cmelone Can you please confirm whether this branch:
https://gitlab.com/StanfordLegion/legion/-/commits/control_replication_staggered_fix?ref_type=heads

is passing or not? And, if possible, let me know how to run it and get a reproducer on control_replication.

@cmelone
Contributor Author

cmelone commented Oct 9, 2023

Apologies for the false alarm, Cory.

@lightsighter, I did some more testing.

When running Legion @ commit 0cbee456 and cherry-picking Artem's fix in bb388fa1, our tests pass.

When I move that same configuration to commit 8ccfc138 (merge of betterend), the tests reliably fail.

Note that this only happens when we run with CUDA. Let me know if you'd like me to move this to another issue and/or if you need more info.

@lightsighter
Contributor

When I move that same configuration to commit 8ccfc13 (merge of betterend), the tests reliably fail.

Pull and try again. If it still fails then make me a small reproducer on sapling.

Let me know if you'd like me to move this to another issue and/or if you need more info.

You can leave it here for now.

@cmelone
Contributor Author

cmelone commented Oct 9, 2023

cd /home/cmelone/gpu-oct
# execute
REBUILD=0 ./run.sh
# re-compile
REBUILD=1 ./run.sh

Note: everything is compiled in debug mode. Once the execution is started, the process freezes pretty much immediately.

@lightsighter
Contributor

This doesn't appear to be hanging to me. There are real tasks running and making progress. It looks like you built with Regent's bounds checks on, so the tasks are very slow because all their memory references have to go through safe casts, but it is definitely not hanging.

@cmelone
Contributor Author

cmelone commented Oct 9, 2023

Maybe the freezing is more stochastic than I thought. Anyway, here is a frozen process:

[0 - 7f401fffec80]   36.339152 {6}{gpu}: CUDA error reported on GPU 0: device-side assert triggered (CUDA_ERROR_ASSERT)
prometeo_ConstPropMix.exec: /home/cmelone/gpu-oct/legion/runtime/realm/cuda/cuda_module.cc:370: bool Realm::Cuda::GPUStream::reap_events(Realm::TimeLimit): Assertion `0' failed.
Legion process received signal 6: Aborted
Process 1325936 on node g0004.stanford.edu is frozen!

I set the process to run for 24 hours.

@lightsighter
Contributor

Maybe the freezing is more stochastic than I thought. Anyway, here is a frozen process

Ok, so it's not actually freezing. It is crashing and you've set REALM_FREEZE_ON_ERROR.

Anyway, here is a frozen process

So this does look like one of the crashes from before, not something related to the merge of betterend. The merge of betterend probably did perturb the timing of some things, though, which may have shaken loose some additional issues. Was this run with betterend and with all of Artem's most recent changes upstreamed?
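
Since REALM_FREEZE_ON_ERROR keeps the crashed process alive, a debugger can be attached to collect backtraces; a minimal sketch using the PID and node reported in the log above:

```
# attach to the frozen process reported in the log
ssh g0004.stanford.edu
gdb -p 1325936
(gdb) thread apply all bt
```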

@cmelone
Contributor Author

cmelone commented Oct 10, 2023

Thanks for the clarifications.

Let me outline my testing process:

  1. Compile Legion and HTR with Artem's branch (control_replication_staggered_fix). This does not pose any issues.
  2. Merge betterend into control_replication_staggered_fix and re-compile. CUDA errors like device-side assert triggered reappear. I will note that while the original bug affected a majority of our tests, this new error is triggered by only 1 of our 13 tests.

Note: I also tested with the latest version of control replication (including your fix), and the issues persist.

You're right, the merge of this branch may have exposed additional issues that weren't tested before.

Let me know if I can provide more details.

@lightsighter
Contributor

You're right, the merge of this branch may have exposed additional issues that weren't tested before.

betterend relaxed the timing of some things inside Legion which might have enabled some more parallelism to occur.

CUDA errors like device-side assert triggered reappear.

If you build all the device-side code with -lineinfo (for both application device code and the Realm DMA code), then I think it should give you a line number for the device-side assert that is being triggered. If it doesn't, then see if it reproduces under cuda-memcheck, which should give you a line number for the device-side assert.
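
A rough sketch of that suggestion; how -lineinfo is threaded into the build depends on the Makefiles, so the NVCC_FLAGS variable here is only illustrative, and newer CUDA toolkits ship compute-sanitizer as the successor to cuda-memcheck.

```
# rebuild device code with line info (how the flag is passed is build-specific)
make clean && NVCC_FLAGS="-lineinfo" make -j
# run under the CUDA sanitizer to get a line number for the device assert
cuda-memcheck ./prometeo_ConstPropMix.exec -ll:gpu 1
# or, on newer toolkits:
compute-sanitizer ./prometeo_ConstPropMix.exec -ll:gpu 1
```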

@cmelone
Contributor Author

cmelone commented Oct 12, 2023

The code is failing on an application assertion that signals that bad data is being produced somewhere. The major difference compared to the previous mode of failure is that, before, data was corrupted everywhere, resulting in all tests failing; in this instance, only select configurations are failing. We tried running with Legion Spy, and a failing run passed both the logical and physical analysis.
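
For reference, a sketch of the Legion Spy workflow mentioned above, using the usual flag spellings from the Legion documentation (paths and the binary name are illustrative):

```
# capture Legion Spy logs from a failing run
./prometeo_ConstPropMix.exec -ll:gpu 1 -lg:spy -logfile spy_%.log
# replay the logs through the logical and physical analyses
$LG_RT_DIR/../tools/legion_spy.py -lpa spy_*.log
```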

@lightsighter
Contributor

So that at least rules out the runtime analysis. Do you use futures for passing any data around or doing future reductions?

@cmelone
Contributor Author

cmelone commented Oct 12, 2023 via email

@lightsighter
Contributor

Can you check that some of your future data is the same?

@lightsighter
Contributor

Ok, I think I pushed a fix for this issue. @cmelone please pull the latest control replication branch and try again.

@cmelone
Contributor Author

cmelone commented Oct 17, 2023

thank you for the assistance, Mike and Artem!

@cmelone cmelone closed this as completed Oct 17, 2023