
Realm: slow data transposes on CPUs #1493

Open
Tracked by #1032
mariodirenzo opened this issue Jun 21, 2023 · 29 comments

@mariodirenzo

I've recently implemented a version of HTR that uses various layouts of the same data to optimize loop performance.
There are currently two approaches implemented in the solver:

Approach 1
All the instances managed by the runtime are ordered with the dimensions X, Y, Z, and each task requiring transposed data creates a DeferredBuffer and copies the data into the buffer using an OpenMP loop.
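For concreteness, here is a minimal sketch of the kind of OpenMP transpose copy Approach 1 performs when filling the DeferredBuffer. The raw pointers, extents, and the assumption that the first-listed dimension is the fastest-varying one are illustrative; this is not HTR's actual accessor code.

```cpp
// Illustrative sketch of the Approach 1 copy: an OpenMP loop that reads an
// XYZ-ordered source (X fastest) and writes a YXZ-ordered buffer (Y fastest).
// Raw pointers and extents are placeholders, not HTR's actual accessors.
#include <cstddef>

void transpose_xyz_to_yxz(const double *src, double *dst,
                          std::size_t nx, std::size_t ny, std::size_t nz)
{
  #pragma omp parallel for collapse(2)
  for (std::size_t z = 0; z < nz; ++z)
    for (std::size_t x = 0; x < nx; ++x)
      for (std::size_t y = 0; y < ny; ++y)
        dst[(z * nx + x) * ny + y] = src[(z * ny + y) * nx + x];
}
```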

Approach 2
Set layout constraints for each task so that Realm performs the copies between instances with the following dimension orderings: XYZ, YXZ, and ZXY.
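For reference, here is a minimal sketch of how such an ordering constraint can be attached through Legion's layout-constraint API. The assumption that "YXZ" means Y is the fastest-varying dimension, and the task/variant setup, are illustrative rather than HTR's actual registration code.

```cpp
// Register a layout constraint that requests YXZ ordering (Y fastest) plus
// struct-of-arrays field placement, and attach it to a task variant's region
// requirement. Task and variant setup are placeholders, not HTR's code.
#include "legion.h"
using namespace Legion;

LayoutConstraintID register_yxz_layout(Runtime *runtime)
{
  LayoutConstraintRegistrar registrar;
  std::vector<DimensionKind> order;
  order.push_back(DIM_Y);   // fastest-varying dimension first
  order.push_back(DIM_X);
  order.push_back(DIM_Z);
  order.push_back(DIM_F);   // fields last => struct-of-arrays
  registrar.add_constraint(OrderingConstraint(order, true /*contiguous*/));
  return runtime->register_layout(registrar);
}

void attach_to_variant(TaskVariantRegistrar &variant_registrar,
                       unsigned region_idx, LayoutConstraintID layout_id)
{
  // Ask the runtime to give this variant YXZ-ordered instances for the given
  // region requirement; Realm then materializes the transposed copy.
  variant_registrar.add_layout_constraint_set(region_idx, layout_id);
}
```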

Considering that at least two tasks reuse the same data in a given layout, Approach 1 performs many more transposing copies and is expected to be slower, whereas Approach 2 reuses the same "transposed" instance for multiple tasks.

I've tested the two implementations by solving a three-periodic flow on a 128^3 grid, with the following wall times:

  • Approach 1: 37.935 s
  • Approach 1 (deactivating the OpenMP optimization for the loops that copy data to the deferred buffer): 77.906 s
  • Approach 2: 136.324 s

PS: @elliottslaughter could you please add this issue to #1032 with low priority?

@streichler
Contributor

@mariodirenzo this comment on the GPU bug, #1494 (comment), has me wondering: is the field size 24 bytes in the CPU case as well? That may be falling off the fast path there too.

@mariodirenzo
Author

Yes, there is at least one 24-byte field, and there is also a 40-byte field. These are effectively double[n] arrays. Depending on the specific setup we are using, some of these fields can get larger (even to 200 bytes or so).

@apryakhin
Contributor

I have another patch to handle arbitrary field sizes for the GPU transpose, including 24-byte fields. I am testing it right now, and it should be in a branch hopefully within a day or two.

@streichler
Contributor

Sounds good. Once you've got any functional issues cleaned up, can you measure the performance in the unit test of one of the slow copies (we can get the exact shape from the logs that @mariodirenzo has captured) on both the old code and the new code?

@apryakhin
Contributor

apryakhin commented Jun 26, 2023

Sure, I will benchmark the exact shape shortly and post the results.

Just an update: the patch is in the cuda-dma-fix-transpose branch. I am leaving it there and not merging it into cuda-dma, as it needs to go through PR review first.

However, the patch now handles all field sizes (e.g. 24B, 40B, etc.) by splitting them into 8B/16B chunks, as well as various instance shapes.

The only remaining implementation action item here is to figure out whether we could replace while (num_planes) {cuMemcpy2DAsync(plane...)} with cuMemcpy3DAsync. This would be the fallback path for doing a transpose in case we can't (or shouldn't) handle it with CUDA.
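To make the comparison concrete, this is roughly the host-side pattern in question (a simplified sketch, not Realm's actual DMA code): one cuMemcpy2DAsync launch per plane, which a single cuMemcpy3DAsync would have to subsume.

```cpp
// Simplified illustration of a host-side loop issuing one cuMemcpy2DAsync
// per plane for a device-to-device copy. Strides and extents are assumed
// parameters; Realm's actual fallback code is more general.
#include <cuda.h>
#include <cstddef>

CUresult copy_planes(CUdeviceptr src, size_t src_pitch, size_t src_plane_stride,
                     CUdeviceptr dst, size_t dst_pitch, size_t dst_plane_stride,
                     size_t width_bytes, size_t height, size_t num_planes,
                     CUstream stream)
{
  for (size_t p = 0; p < num_planes; ++p) {
    CUDA_MEMCPY2D copy = {};
    copy.srcMemoryType = CU_MEMORYTYPE_DEVICE;
    copy.srcDevice     = src + p * src_plane_stride;
    copy.srcPitch      = src_pitch;          // bytes between consecutive rows
    copy.dstMemoryType = CU_MEMORYTYPE_DEVICE;
    copy.dstDevice     = dst + p * dst_plane_stride;
    copy.dstPitch      = dst_pitch;
    copy.WidthInBytes  = width_bytes;        // contiguous bytes per row
    copy.Height        = height;             // rows per plane
    CUresult res = cuMemcpy2DAsync(&copy, stream);
    if (res != CUDA_SUCCESS)
      return res;
    // Realm's DMA loop also checks a time limit between planes so a large
    // copy doesn't starve other work (see the discussion below).
  }
  return CUDA_SUCCESS;
}
```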

I spent some time reading the API reference and did a few tests to find out whether one can transpose a 2D plane with this, but I couldn't get it to work. @streichler either I am missing something obvious or the API is indeed not flexible enough to specify src/dst strides that would make it work. Let me know what you think:

```c
typedef struct CUDA_MEMCPY3D_st {
    unsigned int srcXInBytes, srcY, srcZ;
    unsigned int srcLOD;
    const void *srcHost;
    unsigned int srcPitch;  // ignored when src is array
    unsigned int srcHeight; // ignored when src is array; may be 0 if Depth==1

    unsigned int dstXInBytes, dstY, dstZ;
    unsigned int dstLOD;
    unsigned int dstPitch;  // ignored when dst is array
    unsigned int dstHeight; // ignored when dst is array; may be 0 if Depth==1

    unsigned int WidthInBytes;
    unsigned int Height;
    unsigned int Depth;
    // (memory-type and device/host/array pointer fields omitted here)
} CUDA_MEMCPY3D;
```
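For comparison, here is a sketch of how the 3D descriptor is typically filled for an ordinary, non-transposing device-to-device copy; the only per-side degrees of freedom are a row pitch and a plane height, which is the crux of the flexibility question. The parameter names are assumptions for illustration.

```cpp
// Sketch of a plain (non-transposing) device-to-device cuMemcpy3DAsync call.
// The descriptor only exposes a row pitch and a plane height per side, so
// there is no way to ask the driver to swap dimensions between src and dst.
#include <cuda.h>
#include <cstddef>

CUresult copy_3d(CUdeviceptr src, size_t src_pitch, size_t src_height,
                 CUdeviceptr dst, size_t dst_pitch, size_t dst_height,
                 size_t width_bytes, size_t height, size_t depth,
                 CUstream stream)
{
  CUDA_MEMCPY3D copy = {};
  copy.srcMemoryType = CU_MEMORYTYPE_DEVICE;
  copy.srcDevice     = src;
  copy.srcPitch      = src_pitch;    // bytes between consecutive rows
  copy.srcHeight     = src_height;   // rows per plane in the src allocation
  copy.dstMemoryType = CU_MEMORYTYPE_DEVICE;
  copy.dstDevice     = dst;
  copy.dstPitch      = dst_pitch;
  copy.dstHeight     = dst_height;
  copy.WidthInBytes  = width_bytes;  // contiguous bytes per row
  copy.Height        = height;       // rows per plane to copy
  copy.Depth         = depth;        // number of planes
  return cuMemcpy3DAsync(&copy, stream);
}
```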

@apryakhin
Contributor

apryakhin commented Jun 27, 2023

@streichler Here are some numbers for the 63x128x128 and field sizes 8B/24B/32B/40B

legacy code

[0 - 7fef00022800]   68.169524 {3}{app}: src=XYZ dst=YXZ bounds=<0,0,0>..<63,128,128> bounds_pad=<0,0,0>..<63,128,128> field_size=8 time=0.0345424 bw=0.246659

[0 - 7f4054dd5800]   75.850780 {3}{app}: src=XYZ dst=YXZ bounds=<0,0,0>..<63,128,128> bounds_pad=<0,0,0>..<63,128,128> field_size=24 time=0.0358856 bw=0.712279

[0 - 7f07f093f800]   71.925815 {3}{app}: src=XYZ dst=YXZ bounds=<0,0,0>..<63,128,128> bounds_pad=<0,0,0>..<63,128,128> field_size=32 time=0.0354556 bw=0.961223

[0 - 7f5c54e6d800]   68.757294 {3}{app}: src=XYZ dst=YXZ bounds=<0,0,0>..<63,128,128> bounds_pad=<0,0,0>..<63,128,128> field_size=40 time=0.0346297 bw=1.23018

[0 - 7f8de01d2800]  137.145434 {3}{app}: src=XYZ dst=YXZ bounds=<0,0,0>..<127,128,128> bounds_pad=<0,0,0>..<127,128,128> field_size=40 time=0.0692203 bw=1.23088

cuda-dma the fallback path with cuMemcpy2DAsync

[0 - 7f6decc1f800]   71.701260 {3}{app}: src=XYZ dst=YXZ bounds=<0,0,0>..<63,128,128> bounds_pad=<0,0,0>..<63,128,128> field_size=8 time=0.0318598 bw=0.267428

[0 - 7fab61085800]   67.850831 {3}{app}: src=XYZ dst=YXZ bounds=<0,0,0>..<63,128,128> bounds_pad=<0,0,0>..<63,128,128> field_size=24 time=0.0423919 bw=0.602959

[0 - 7fc4a4ea5800]   71.327745 {3}{app}: src=XYZ dst=YXZ bounds=<0,0,0>..<63,128,128> bounds_pad=<0,0,0>..<63,128,128> field_size=32 time=0.032147 bw=1.06015

[0 - 7f9c940a9800]   71.310073 {3}{app}: src=XYZ dst=YXZ bounds=<0,0,0>..<63,128,128> bounds_pad=<0,0,0>..<63,128,128> field_size=40 time=0.0320739 bw=1.32821

[0 - 7fa56c707800]  133.768523 {3}{app}: src=XYZ dst=YXZ bounds=<0,0,0>..<127,128,128> bounds_pad=<0,0,0>..<127,128,128> field_size=40 time=0.0634485 bw=1.34285

cuda-dma gpu transpose

[0 - 7ff1eceaf800]   68.316929 {3}{app}: src=XYZ dst=YXZ bounds=<0,0,0>..<63,128,128> bounds_pad=<0,0,0>..<63,128,128> field_size=8 time=0.00099733 bw=8.543

[0 - 7f3fd8bb4800]   70.789758 {3}{app}: src=XYZ dst=YXZ bounds=<0,0,0>..<63,128,128> bounds_pad=<0,0,0>..<63,128,128> field_size=24 time=0.00131024 bw=19.5083

[0 - 7f91f00d7800]   72.633658 {3}{app}: src=XYZ dst=YXZ bounds=<0,0,0>..<63,128,128> bounds_pad=<0,0,0>..<63,128,128> field_size=32 time=0.00116978 bw=29.1343

[0 - 7f15a1ef1800]   72.782363 {3}{app}: src=XYZ dst=YXZ bounds=<0,0,0>..<63,128,128> bounds_pad=<0,0,0>..<63,128,128> field_size=40 time=0.000958376 bw=44.4512

[0 - 7f64f03ec800]  140.584452 {3}{app}: src=XYZ dst=YXZ bounds=<0,0,0>..<127,128,128> bounds_pad=<0,0,0>..<127,128,128> field_size=40 time=0.00107049 bw=79.5919

@streichler
Contributor

Here are some numbers for the 63x128x128 and field sizes 8B/24B/32B/40B

These numbers look much better. Let's get the PR through the system.

The only remaining implementation action item here is to figure out whether we could replace while (num_planes) {cuMemcpy2DAsync(plane...)} with cuMemcpy3DAsync.

Unless the CUDA driver behavior has changed recently, cudaMemcpy3DAsync appears to be implemented as a cpu-side loop of 2D copy launches - i.e. exactly what we're doing in the realm code but without the time limit check. That check is what keeps the DMA system responsive to other work when a large 3D copy is running, so I'd only want to switch to the driver's 3D copy if we are certain that something has changed and the host-side overhead has become constant (e.g. ~2us per entire 3D copy request) rather than linear in the number of planes in the request.

or the API is indeed not flexible enough to specify src/dst strides to make it work

It is indeed not flexible enough. The current cudaMemcpy{2D,3D}... calls require strides to grow monotonically (and also be multiples of the previous one), preventing them from handling transpose or weird padding cases.

@mariodirenzo
Author

Looks great!
Do you have the updated version of the runtime in some branch?
I'd like to run the entire HTR CI on it to confirm that there aren't any other cases that fail.

@apryakhin
Contributor

I apologize as I have re-posted the same comment under:

But yes, the numbers are there: for the problem size in your initial post (the 128^3 grid), we now get about ~6.2 seconds on sapling2. I have similar numbers for ChannelFlow. I will be offline and will likely submit the patch early next week.

@apryakhin
Contributor

@mariodirenzo The patch is in cuda-dma-fix-transpose branch now. Please feel free to give it a shot when you get a chance.

@mariodirenzo
Author

Thanks @apryakhin! The code is much faster now.
I've tried running my CI on cuda-dma-fix-transpose, but I see some deadlocks as soon as I put more than one tile of the computational domain per rank.
The deadlocks do not seem deterministic, but I can provide a simple script to run on sapling2 to reproduce the issue.

@apryakhin
Contributor

apryakhin commented Jul 10, 2023

Thanks, if you can give me the script that would be great!

@mariodirenzo
Author

You need to go into your $HTR_DIR/solverTests and execute the command python3 testAll.py. It will go into each subfolder of the directory and execute the associated solver test for different partitionings. In practice, you will see multiple jobs submitted to the queue for each folder. Each of these tests takes one or two minutes on GPU. If you see a job in the queue that has been executing for more than 5 minutes, you can safely assume that it is hanging.

@apryakhin
Contributor

apryakhin commented Jul 11, 2023

I am running the following:

```
apriakhin@sapling2:~/Development/htr3/solverTests/ChannelFlow$ cat base.json
{
   "Mapping" : {
      "tiles" : [1,2,2],
      "tilesPerRank" : [1,2,2],
      "wallTime" : 60
   },

   "Grid" : {
      "xNum" : 128,
      "yNum" : 128,
      "zNum" : 128,
      "GridInput" : {
         "type" : "Cartesian",
         "origin" : [0.0, 0.0, 0.0],
         "width" : [0.0002819697924580275, 0.00017950754508916364, 0.000563939584916055],
         "xType" : { "type" : "Uniform" },
         "yType" : {
            "type" : "Tanh",
            "Stretching" : 1.697344715398394
         },
         "zType" : { "type" : "Uniform" }
      }
   },
   ...
```

and launching the ChannelFlow solver test with:

```
rm -rf sample0 slurm-*; CUDA_LAUNCH_BLOCKING= DEBUG=0 ../../prometeo.sh -i base.json -logfile %.log -lg:registration
```

Legion is built off the master branch, so without any cuda-dma-related changes, and I observe occasional (non-deterministic) hangs on sapling2. I can give it a go with control_replication, but I am leaning towards this hang being unrelated to cuda-dma.

We should probably continue this discussion and root-causing in a separate GitHub issue.

I am going to do a few more runs with other combinations of solver test vs. Legion branch, just to make sure.

@lightsighter
Contributor

Please first check, with -level dma=2 logs, that there are no copies hanging. I'd be very surprised if you can create a hang on a single process.

@mariodirenzo
Author

I've collected the backtraces of a hanging execution and the dma=1 logs. Both files are attached.
Please let me know if you need further logs or if we should move this discussion to a separate thread.

dma_0.log
bt.log

@apryakhin
Contributor

@mariodirenzo which branch did you run that with?

@mariodirenzo
Author

cuda-dma-fix-transpose

@apryakhin
Contributor

apryakhin commented Jul 13, 2023

We should give it a go on the control_replication branch. Since it hangs on master as well, I don't think it's related to slow transposes anymore. Let's file another GitHub issue so we don't clutter the history of updates here.

@mariodirenzo
Author

Interestingly I can't reproduce the hang on control_replication.

@apryakhin
Contributor

apryakhin commented Jul 13, 2023

Okay, what we could do as an experiment (in addition to root-causing the original hang) is to make a fresh build of cuda-dma based on control_replication and give it a go. Seems like this combination would make sense.

@apryakhin
Contributor

I have just pushed latest cuda-dma changes (including all transpose fixes) to cuda_dma_control_replication

@mariodirenzo
Author

The code works fine on cuda_dma_control_replication. I think we can safely say that it was a bug in master that has been fixed on control_replication.

@lightsighter
Contributor

Agreed. I wouldn't spend a lot of time worrying about it unless we need to run HTR with master.

@mariodirenzo
Author

I believe that all the users of HTR do their production runs on control_replication.

@apryakhin
Contributor

apryakhin commented Jul 14, 2023

Acknowledged. Then I consider the issue with cuda-dma and HTR solved. Thanks!

@mariodirenzo
Author

mariodirenzo commented Jul 14, 2023

Yes, thank you very much for your help!
I'll close #1032, which was about the GPU transposes, and keep this open.
Do you think that similar changes could be applied to the CPU copies?

@apryakhin
Contributor

The cuda-dma branch obviously won't help with CPU transposes, and it is doubtful that similar changes can be applied in this scenario. However, we can explore optimizing either your specific use case or making general improvements.

@seemamirch
Contributor

CPU wall times for 3DPeriodic on shepard are as follows:

  • Modified HTR (prometeo_constraints.cc changed to support transpose layouts): 338 seconds
  • Original HTR: 343 seconds

There are 3 profiles of one of the tests; all 3 took roughly 30 seconds (+/- 0.5 s):

  • no transposes, original HTR -> https://legion.stanford.edu/prof-viewer/?url=https://sapling2.stanford.edu/~seemah/legion_prof_no_transpose/
  • transposes, modified HTR -> https://legion.stanford.edu/prof-viewer/?url=https://sapling2.stanford.edu/~seemah/legion_prof_transpose/
  • transposes + modified Realm (specialized copy for the pattern seen in HTR) -> https://legion.stanford.edu/prof-viewer/?url=https://sapling2.stanford.edu/~seemah/legion_prof_transpose_opt/

apryakhin added this to the realm-backlog milestone Sep 17, 2024