
Regression in GPU executions #1550

Closed
Tracked by #1032
cmelone opened this issue Sep 12, 2023 · 48 comments
@cmelone
Contributor

cmelone commented Sep 12, 2023

Our GPU test suite last succeeded at Legion commit 61a919f8 and failed last night at da9cefee. I suspect it's due to this merge?

We're getting errors like:

prometeo_ConstPropMix.exec: /home/hpcc/gitlabci/psaap-ci/artifacts/5065111388/legion/runtime/realm/cuda/cuda_module.cc:370: bool Realm::Cuda::GPUStream::reap_events(Realm::TimeLimit): Assertion `0' failed.
prometeo_ConstPropMix.exec: /home/hpcc/gitlabci/psaap-ci/artifacts/5065111388/legion/runtime/realm/cuda/cuda_module.cc:370: bool Realm::Cuda::GPUStream::reap_events(Realm::TimeLimit): Assertion `0' failed.
[0 - 7f55f347ec80]    2.785033 {6}{gpu}: CUDA error reported on GPU 0: device-side assert triggered (CUDA_ERROR_ASSERT)
[0 - 7f55e16f8c80]    2.785032 {6}{gpu}: CUDA error reported on GPU 0: device-side assert triggered (CUDA_ERROR_ASSERT)
[0 - 7f55f348ac80]    2.785078 {6}{gpu}: CUDA error reported on GPU 0: device-side assert triggered (CUDA_ERROR_ASSERT)

I can debug further, just let me know.

@elliottslaughter, could you please add this to #1032?

@streichler streichler self-assigned this Sep 12, 2023
@streichler
Contributor

If this is a single-node run, can you try running inside of cuda-gdb and see if you get better information about the device-side asserts?
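
For reference, here is a minimal sketch of such a cuda-gdb session, assuming the single-node prometeo_ConstPropMix.exec invocation and the -ll:gpu 1 flag seen elsewhere in this thread; adjust the binary and arguments to your setup.

```
# run the failing binary under cuda-gdb on a single node
cuda-gdb --args ./prometeo_ConstPropMix.exec -ll:gpu 1
(cuda-gdb) run
# when the device-side assert fires, cuda-gdb stops in the offending kernel
(cuda-gdb) info cuda kernels
(cuda-gdb) backtrace
```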

@apryakhin
Contributor

I can take a look at this

@apryakhin
Contributor

@cmelone Can you walk me through the steps to get a repro?

@cmelone
Contributor Author

cmelone commented Sep 12, 2023

cc: @mariodirenzo

@mariodirenzo

@elliottslaughter, Can you please add this issue to #1032?

@apryakhin
Contributor

We've got a reproducer. I am going to root cause it as soon as possible

@apryakhin
Contributor

The fix is out for review. I will submit it as soon as it has been reviewed.

@mariodirenzo

@streichler, do you have any update regarding the merge of Artem's patch into master?

@apryakhin
Contributor

Yes, we worked through several review comments, which have all been addressed:
https://gitlab.com/StanfordLegion/legion/-/merge_requests/926

Depending on available bandwidth, I am going to ask a few team members to review and approve so we can submit the patch ASAP.

@apryakhin
Contributor

The fix has been merged

@cmelone
Contributor Author

cmelone commented Sep 21, 2023

Thanks @apryakhin! I'll let our CI run and get back to you

@mariodirenzo

I've run a quick test on sapling2 and it seems that the current version of control_replication still produces the same failures as before.

@lightsighter
Contributor

Just responding here that this commit was merged upstream into control replication as of yesterday:

commit d55d7c17c80ad6615be9339a06e87d8de89bedce
Merge: ee49c0001 fac5e48bf
Author: apryakhin <[email protected]>
Date:   Thu Sep 21 18:23:30 2023 +0000

    Merge branch 'realm-fix-htr' into 'master'
    
    Realm: fix detection of transpose use cases
    
    See merge request StanfordLegion/legion!926

@cmelone
Contributor Author

cmelone commented Sep 26, 2023

I am on 3a9a37a and still getting similar errors.

CU: CUDA_DRIVER_FNPTR(cuMemcpy3DAsync)(&cuda_copy, stream->get_stream()) = 1 (CUDA_ERROR_INVALID_VALUE): invalid argument
...
prometeo_ConstPropMix.exec: /home/hpcc/gitlabci/psaap-ci/artifacts/5163946925/legion/runtime/realm/cuda/cuda_module.cc:370: bool Realm::Cuda::GPUStream::reap_events(Realm::TimeLimit): Assertion `0' failed.
[0 - 7f3a03bffc80]    0.545906 {6}{gpu}: CUDA error reported on GPU 0: device-side assert triggered (CUDA_ERROR_ASSERT)
...
[0 - 7f9f3ad94c80]    1.620636 {6}{gpu}: CUDA error reported on GPU 0: device-side assert triggered (CUDA_ERROR_ASSERT)
CU: cuLaunchKernel(f, grid_dim.x, grid_dim.y, grid_dim.z, block_dim.x, block_dim.y, block_dim.z, shared_memory, stream, args, NULL) = 710 (CUDA_ERROR_ASSERT): device-side assert triggered

(these are all errors I'm seeing across different runs)

@apryakhin
Contributor

I am waiting for CI on this patch:

Verified by manual testing on sapling2

...
.../htr3/unitTests/metricTest$ rm -rf PeriodicDir; srun -p gpu -N 1 ./metricTest_Periodic.exec -ll:gpu 1
...
metricTest_Periodic: TEST OK!

That will be merged as soon as possible

@apryakhin
Contributor

The patch is LGTM'd and set for auto-merge once CI passes.

I did a number of runs on sapling2 along with @mariodirenzo and can confirm that it fixes the issue.

@mariodirenzo

We have a new problem that I am not sure is related to the cuda-dma changes.
The code works fine if everything (Legion + HTR) is compiled in debug mode.
As soon as things are compiled in release mode, it starts producing the following errors:

/home/mariodr/legion3/runtime/realm/inst_layout.inl:1820: FT *Realm::MultiAffineAccessor<FT, N, T>::ptr(const Realm::Point<N, T> &) [with FT = MyArray<double, 3>; int N = 3; T = long long]: block: [0,0,0], thread: [0,7,7] Assertion `i->opcode() == PieceLookup::Opcodes::OP_SPLIT1` failed.

I did not catch this problem yesterday because I was testing the code in debug mode.

Any suggestions on how to move forward with debugging this?

@apryakhin
Contributor

apryakhin commented Sep 29, 2023

Please file another bug for that and we will triage it separately

@apryakhin
Contributor

This GitHub issue is actually about the GPU regression, so I think we should still triage it here. Would it be possible to get a backtrace for this? Just by looking at the assert we are hitting, I am unable to determine the root cause. I plan to be OOO today but can look as soon as I have online access (hopefully over the weekend).

@streichler
Contributor

The most common cause for that (admittedly fairly unhelpful) error message is that you are attempting an access on a point that is outside the bounds of the multi-affine instance. Can you try running with bounds checks enabled and/or break in with a debugger to see what point the unhappy thread is trying to access, and then determine whether that point is expected to be present in the instance?
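
As a sketch, bounds checking is typically enabled when building Legion; the flag names below are assumptions about the usual Legion build options (check runtime.mk or the CMake options for the exact spelling).

```
# Make-based build: rebuild Legion (and the application) with bounds checks
make clean
make DEBUG=1 BOUNDS_CHECKS=1 -j
# CMake-based build (option name assumed):
cmake -DCMAKE_BUILD_TYPE=Debug -DLegion_BOUNDS_CHECKS=ON ..
```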

@mariodirenzo

The executions in debug mode are run with bounds checks, and I haven't seen any failures in those. Moreover, the same version of the code used to work fine on 61a919f8.
The most likely explanation I see is that some corrupted data is reaching the kernel when things run faster in release mode. Since we compute the stencil point based on one data field, if that field is corrupted, the offset to the stencil point will be wrong.

@streichler
Contributor

That makes sense. So although there's no specific evidence to suspect this is due to the cuda-dma bug, there's definitely some chance it has the same root cause. Did we think we had all of the cuda-dma-related fixes in this run?

@mariodirenzo

Yes. The other day I told Artem that the issue was fixed because I had tested the code in debug mode, which works fine. As soon as Caetano ran the CI in release mode, the new error message appeared.

@mariodirenzo

@apryakhin has been able to find the root cause of the regression in the cuda-dma changes. The branch https://gitlab.com/StanfordLegion/legion/-/commits/cuda-dma-fix-2 contains the fix.

@cmelone is completing a thorough test of the fix using the https://gitlab.com/StanfordLegion/legion/-/commits/control_replication_staggered_fix?ref_type=heads branch. Preliminary results are promising.

@elliottslaughter
Contributor

@apryakhin when you find the root cause, are you adding tests to our Realm test suite so we catch issues like these in the future? By all means, merge the fix once everyone here approves, but I'm concerned that these failures demonstrate a hole in our test coverage, and we'll relapse later if we don't improve the tests as well.

@apryakhin
Contributor

@elliottslaughter Yes, the fix will come with Realm tests to make sure we don't run into these issues in the future. I agree that we are missing test coverage here. I am going to prioritize and address that going forward, before merging any more cuda-dma changes.

@cmelone
Contributor Author

cmelone commented Oct 4, 2023

Everything seems to be working now due to the fix. Can you let us know when it's merged into control replication, and I'll close the issue?

Thanks again

@apryakhin
Contributor

Yes, I will update it here as soon as the fix is merged in.

@apryakhin
Contributor

The fix has been merged.

@mariodirenzo

@lightsighter, can you please merge the fix into control replication?

@lightsighter
Contributor

I've upstreamed master into control replication.

@cmelone
Contributor Author

cmelone commented Oct 6, 2023

With the newest version of control replication, a smaller subset of our tests failed. Since we previously verified that @apryakhin's branch didn't throw any errors with our tests, another CUDA bug must have been introduced since.

I'm guessing that this commit could be the cause?

We're getting errors like

/legion/runtime/realm/cuda/cuda_module.cc:370: bool Realm::Cuda::GPUStream::reap_events(Realm::TimeLimit): Assertion `0' failed.
[0 - 7f0e42dd5c80]    4.777734 {6}{gpu}: CUDA error reported on GPU 0: device-side assert triggered (CUDA_ERROR_ASSERT)

@lightsighter
Contributor

I'm guessing that this commit could be the cause?

No, that commit is a stand-alone new feature. It doesn't change any existing Realm functionality. @muraj to confirm, but I feel pretty confident in that assessment.

With the newest version of control replication, a smaller subset of our tests failed. Since we previously verified that @apryakhin's branch didn't throw any errors with our tests, another CUDA bug must have been introduced since.

Or something changed on the machine or in the way you're testing things. Unfortunately @apryakhin is on vacation next week, so we might need to wait for him to come back to assess. Can you run with the branch again and confirm that it still works without any errors for you?

@apryakhin
Contributor

@cmelone Can you please confirm whether this branch:
https://gitlab.com/StanfordLegion/legion/-/commits/control_replication_staggered_fix?ref_type=heads

is passing or not? And, if possible, let me know how to run it and get a reproducer on control_replication.

@cmelone
Contributor Author

cmelone commented Oct 9, 2023

Apologies for the false alarm, Cory.

@lightsighter, I did some more testing.

When running Legion @ commit 0cbee456 and cherry-picking Artem's fix in bb388fa1, our tests pass.

When I move that same configuration to commit 8ccfc138 (merge of betterend), the tests reliably fail.

Note that this only happens when we run with CUDA. Let me know if you'd like me to move this to another issue and/or if you need more info.

@lightsighter
Contributor

When I move that same configuration to commit 8ccfc13 (merge of betterend), the tests reliably fail.

Pull and try again. If it still fails then make me a small reproducer on sapling.

Let me know if you'd like me to move this to another issue and/or if you need more info.

You can leave it here for now.

@cmelone
Contributor Author

cmelone commented Oct 9, 2023

cd /home/cmelone/gpu-oct
# execute
REBUILD=0 ./run.sh
# re-compile
REBUILD=1 ./run.sh

Note: everything is compiled in debug mode. Once the execution is started, the process freezes pretty much immediately.

@lightsighter
Contributor

This doesn't appear to be hanging to me. There are real tasks running and making progress. It looks like you built with Regent's bounds checks on, so the tasks are very slow because all their memory references have to go through safe casts, but it is definitely not hanging.

@cmelone
Contributor Author

cmelone commented Oct 9, 2023

Maybe the freezing is more stochastic than I thought. Anyway, here is a frozen process:

[0 - 7f401fffec80]   36.339152 {6}{gpu}: CUDA error reported on GPU 0: device-side assert triggered (CUDA_ERROR_ASSERT)
prometeo_ConstPropMix.exec: /home/cmelone/gpu-oct/legion/runtime/realm/cuda/cuda_module.cc:370: bool Realm::Cuda::GPUStream::reap_events(Realm::TimeLimit): Assertion `0' failed.
Legion process received signal 6: Aborted
Process 1325936 on node g0004.stanford.edu is frozen!

I set the process to run for 24 hours.

@lightsighter
Contributor

Maybe the freezing is more stochastic than I thought. Anyway, here is a frozen process

Ok, so it's not actually freezing. It is crashing and you've set REALM_FREEZE_ON_ERROR.

Anyway, here is a frozen process

So this does look like one of the crashes from before, not something related to the merge of betterend. The merge of betterend probably did perturb the timing of some things, though, which may have shaken loose some additional issues. Was this run with betterend and with all of Artem's most recent changes upstreamed?
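
Since REALM_FREEZE_ON_ERROR keeps the crashed process alive, a debugger can be attached to collect backtraces; a minimal sketch using the PID and node reported in the log above:

```
# attach to the frozen process reported in the log
ssh g0004.stanford.edu
gdb -p 1325936
(gdb) thread apply all bt
```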

@cmelone
Contributor Author

cmelone commented Oct 10, 2023

Thanks for the clarifications.

Let me outline my testing process:

  1. Compile Legion and HTR with Artem's branch (control_replication_staggered_fix). This does not pose any issues.
  2. Merge betterend into control_replication_staggered_fix and re-compile. CUDA errors like device-side assert triggered reappear. I will note that while the original bug affected a majority of our tests, this new error is triggered by only 1 of our 13 tests.

Note: I also tested with the latest version of control replication (including your fix), and the issues persist.

You're right, the merge of this branch may have exposed additional issues that weren't tested before.

Let me know if I can provide more details.

@lightsighter
Contributor

You're right, the merge of this branch may have exposed additional issues that weren't tested before.

betterend relaxed the timing of some things inside Legion which might have enabled some more parallelism to occur.

CUDA errors like device-side assert triggered reappear.

If you build all the device-side code with -lineinfo (for both application device code and the Realm DMA code), then I think it should give you a line number for the device-side assert that is being triggered. If it doesn't, then see if it reproduces under cuda-memcheck, which should give you a line number for the device-side assert.
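
A rough sketch of that suggestion; how -lineinfo is threaded into the build depends on the Makefiles, so the NVCC_FLAGS variable here is only illustrative, and newer CUDA toolkits ship compute-sanitizer as the successor to cuda-memcheck.

```
# rebuild device code with line info (how the flag is passed is build-specific)
make clean && NVCC_FLAGS="-lineinfo" make -j
# run under the CUDA sanitizer to get a line number for the device assert
cuda-memcheck ./prometeo_ConstPropMix.exec -ll:gpu 1
# or, on newer toolkits:
compute-sanitizer ./prometeo_ConstPropMix.exec -ll:gpu 1
```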

@cmelone
Contributor Author

cmelone commented Oct 12, 2023

The code is failing on an application assertion that signals that bad data is being produced somewhere. The major difference compared to the previous mode of failure is that, before, data was corrupted everywhere, resulting in all tests failing; in this instance, only select configurations are failing. We tried running with Legion Spy, and a failing run passed both the logical and physical analysis.
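
For reference, a sketch of the Legion Spy workflow mentioned above, using the usual flag spellings from the Legion documentation (paths and the binary name are illustrative):

```
# capture Legion Spy logs from a failing run
./prometeo_ConstPropMix.exec -ll:gpu 1 -lg:spy -logfile spy_%.log
# replay the logs through the logical and physical analyses
$LG_RT_DIR/../tools/legion_spy.py -lpa spy_*.log
```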

@lightsighter
Contributor

So that at least rules out the runtime analysis. Do you use futures for passing any data around or doing future reductions?

@cmelone
Contributor Author

cmelone commented Oct 12, 2023 via email

@lightsighter
Contributor

Can you check that some of your future data is the same?

@lightsighter
Contributor

Ok, I think I pushed a fix for this issue. @cmelone please pull the latest control replication branch and try again.

@cmelone
Contributor Author

cmelone commented Oct 17, 2023

thank you for the assistance, Mike and Artem!

@cmelone cmelone closed this as completed Oct 17, 2023