
Realm: slow data transposes on CPUs #1493

Open
Tracked by #1032
mariodirenzo opened this issue Jun 21, 2023 · 29 comments

@mariodirenzo

I've recently implemented a version of HTR that uses various layouts of the same data to optimize loop performance.
There are currently two approaches implemented in the solver:

Approach 1
All the instances managed by the runtime are ordered with the dimensions X, Y, Z, and each task requiring transposed data creates a DeferredBuffer and copies the data into the buffer using an OpenMP loop.
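For concreteness, here is a minimal sketch of the kind of OpenMP transpose copy Approach 1 performs when filling the DeferredBuffer. The raw pointers, extents, and the assumption that the first-listed dimension is the fastest-varying one are illustrative; this is not HTR's actual accessor code.

```cpp
// Illustrative sketch of the Approach 1 copy: an OpenMP loop that reads an
// XYZ-ordered source (X fastest) and writes a YXZ-ordered buffer (Y fastest).
// Raw pointers and extents are placeholders, not HTR's actual accessors.
#include <cstddef>

void transpose_xyz_to_yxz(const double *src, double *dst,
                          std::size_t nx, std::size_t ny, std::size_t nz)
{
  #pragma omp parallel for collapse(2)
  for (std::size_t z = 0; z < nz; ++z)
    for (std::size_t x = 0; x < nx; ++x)
      for (std::size_t y = 0; y < ny; ++y)
        dst[(z * nx + x) * ny + y] = src[(z * ny + y) * nx + x];
}
```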

Approach 2
Set layout constraints for each task so that Realm performs the copies between instances with the following dimension orderings: XYZ, YXZ, and ZXY.
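For reference, here is a minimal sketch of how such an ordering constraint can be attached through Legion's layout-constraint API. The assumption that "YXZ" means Y is the fastest-varying dimension, and the task/variant setup, are illustrative rather than HTR's actual registration code.

```cpp
// Register a layout constraint that requests YXZ ordering (Y fastest) plus
// struct-of-arrays field placement, and attach it to a task variant's region
// requirement. Task and variant setup are placeholders, not HTR's code.
#include "legion.h"
using namespace Legion;

LayoutConstraintID register_yxz_layout(Runtime *runtime)
{
  LayoutConstraintRegistrar registrar;
  std::vector<DimensionKind> order;
  order.push_back(DIM_Y);   // fastest-varying dimension first
  order.push_back(DIM_X);
  order.push_back(DIM_Z);
  order.push_back(DIM_F);   // fields last => struct-of-arrays
  registrar.add_constraint(OrderingConstraint(order, true /*contiguous*/));
  return runtime->register_layout(registrar);
}

void attach_to_variant(TaskVariantRegistrar &variant_registrar,
                       unsigned region_idx, LayoutConstraintID layout_id)
{
  // Ask the runtime to give this variant YXZ-ordered instances for the given
  // region requirement; Realm then materializes the transposed copy.
  variant_registrar.add_layout_constraint_set(region_idx, layout_id);
}
```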

Considering that at least two tasks reuse the same data in a given layout, Approach 1 performs many more transposing copies and is expected to be slower, whereas Approach 2 reuses the same "transposed" instance for multiple tasks.

I've tested the two implementations by solving a three-periodic flow on a 128^3 grid, with the following wall times:

  • Approach 1: 37.935 s
  • Approach 1 (deactivating the OpenMP optimization for the loops that copy data to the deferred buffer): 77.906 s
  • Approach 2: 136.324 s

PS: @elliottslaughter could you please add this issue to #1032 with low priority?

@streichler
Contributor

@mariodirenzo this comment on the GPU bug, #1494 (comment), has me wondering: is the field size 24 bytes in the CPU case as well? That may be falling off the fast path there too.

@mariodirenzo
Author

Yes, there is at least one 24-byte field, and there is also a 40-byte field. These are effectively double[n] arrays. Depending on the specific setup we are using, some of these fields can get larger (even to 200 bytes or so).

@apryakhin
Contributor

I have another patch to handle arbitrary field sizes for the GPU transpose, including 24-byte fields. I am testing it right now, and it should be in a branch hopefully within a day or two.

@streichler
Contributor

Sounds good. Once you've got any functional issues cleaned up, can you measure the performance in the unit test of one of the slow copies (we can get the exact shape from the logs that @mariodirenzo has captured) on both the old code and the new code?

@apryakhin
Contributor

apryakhin commented Jun 26, 2023

Sure, I will benchmark the exact shape shortly and post the results.

Just an update: the patch is in the cuda-dma-fix-transpose branch. I am leaving it there and not merging it into cuda-dma, as it needs to go through PR review first.

However, the patch now handles all field sizes (e.g. 24B, 40B, etc.) by splitting them into 8B/16B chunks, as well as various instance shapes.

The only remaining implementation action item here is to figure out whether we could replace while (num_planes) {cuMemcpy2DAsync(plane...)} with cuMemcpy3DAsync. This would be the fallback path for doing a transpose in case we can't (or shouldn't) handle it with CUDA.
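To make the comparison concrete, this is roughly the host-side pattern in question (a simplified sketch, not Realm's actual DMA code): one cuMemcpy2DAsync launch per plane, which a single cuMemcpy3DAsync would have to subsume.

```cpp
// Simplified illustration of a host-side loop issuing one cuMemcpy2DAsync
// per plane for a device-to-device copy. Strides and extents are assumed
// parameters; Realm's actual fallback code is more general.
#include <cuda.h>
#include <cstddef>

CUresult copy_planes(CUdeviceptr src, size_t src_pitch, size_t src_plane_stride,
                     CUdeviceptr dst, size_t dst_pitch, size_t dst_plane_stride,
                     size_t width_bytes, size_t height, size_t num_planes,
                     CUstream stream)
{
  for (size_t p = 0; p < num_planes; ++p) {
    CUDA_MEMCPY2D copy = {};
    copy.srcMemoryType = CU_MEMORYTYPE_DEVICE;
    copy.srcDevice     = src + p * src_plane_stride;
    copy.srcPitch      = src_pitch;          // bytes between consecutive rows
    copy.dstMemoryType = CU_MEMORYTYPE_DEVICE;
    copy.dstDevice     = dst + p * dst_plane_stride;
    copy.dstPitch      = dst_pitch;
    copy.WidthInBytes  = width_bytes;        // contiguous bytes per row
    copy.Height        = height;             // rows per plane
    CUresult res = cuMemcpy2DAsync(&copy, stream);
    if (res != CUDA_SUCCESS)
      return res;
    // Realm's DMA loop also checks a time limit between planes so a large
    // copy doesn't starve other work (see the discussion below).
  }
  return CUDA_SUCCESS;
}
```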

I spent some time reading the API reference and did a few tests to find out whether one can transpose a 2D plane with this, but I couldn't get it to work. @streichler either I am missing something obvious or the API is indeed not flexible enough to specify src/dst strides that would make it work. Let me know what you think:

```c
typedef struct CUDA_MEMCPY3D_st {
    unsigned int srcXInBytes, srcY, srcZ;
    unsigned int srcLOD;
    const void *srcHost;
    unsigned int srcPitch;  // ignored when src is array
    unsigned int srcHeight; // ignored when src is array; may be 0 if Depth==1

    unsigned int dstXInBytes, dstY, dstZ;
    unsigned int dstLOD;
    unsigned int dstPitch;  // ignored when dst is array
    unsigned int dstHeight; // ignored when dst is array; may be 0 if Depth==1

    unsigned int WidthInBytes;
    unsigned int Height;
    unsigned int Depth;
    // (memory-type and device/host/array pointer fields omitted here)
} CUDA_MEMCPY3D;
```
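For comparison, here is a sketch of how the 3D descriptor is typically filled for an ordinary, non-transposing device-to-device copy; the only per-side degrees of freedom are a row pitch and a plane height, which is the crux of the flexibility question. The parameter names are assumptions for illustration.

```cpp
// Sketch of a plain (non-transposing) device-to-device cuMemcpy3DAsync call.
// The descriptor only exposes a row pitch and a plane height per side, so
// there is no way to ask the driver to swap dimensions between src and dst.
#include <cuda.h>
#include <cstddef>

CUresult copy_3d(CUdeviceptr src, size_t src_pitch, size_t src_height,
                 CUdeviceptr dst, size_t dst_pitch, size_t dst_height,
                 size_t width_bytes, size_t height, size_t depth,
                 CUstream stream)
{
  CUDA_MEMCPY3D copy = {};
  copy.srcMemoryType = CU_MEMORYTYPE_DEVICE;
  copy.srcDevice     = src;
  copy.srcPitch      = src_pitch;    // bytes between consecutive rows
  copy.srcHeight     = src_height;   // rows per plane in the src allocation
  copy.dstMemoryType = CU_MEMORYTYPE_DEVICE;
  copy.dstDevice     = dst;
  copy.dstPitch      = dst_pitch;
  copy.dstHeight     = dst_height;
  copy.WidthInBytes  = width_bytes;  // contiguous bytes per row
  copy.Height        = height;       // rows per plane to copy
  copy.Depth         = depth;        // number of planes
  return cuMemcpy3DAsync(&copy, stream);
}
```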

@apryakhin
Contributor

apryakhin commented Jun 27, 2023

@streichler Here are some numbers for the 63x128x128 and field sizes 8B/24B/32B/40B

legacy code

[0 - 7fef00022800]   68.169524 {3}{app}: src=XYZ dst=YXZ bounds=<0,0,0>..<63,128,128> bounds_pad=<0,0,0>..<63,128,128> field_size=8 time=0.0345424 bw=0.246659

[0 - 7f4054dd5800]   75.850780 {3}{app}: src=XYZ dst=YXZ bounds=<0,0,0>..<63,128,128> bounds_pad=<0,0,0>..<63,128,128> field_size=24 time=0.0358856 bw=0.712279

[0 - 7f07f093f800]   71.925815 {3}{app}: src=XYZ dst=YXZ bounds=<0,0,0>..<63,128,128> bounds_pad=<0,0,0>..<63,128,128> field_size=32 time=0.0354556 bw=0.961223

[0 - 7f5c54e6d800]   68.757294 {3}{app}: src=XYZ dst=YXZ bounds=<0,0,0>..<63,128,128> bounds_pad=<0,0,0>..<63,128,128> field_size=40 time=0.0346297 bw=1.23018

[0 - 7f8de01d2800]  137.145434 {3}{app}: src=XYZ dst=YXZ bounds=<0,0,0>..<127,128,128> bounds_pad=<0,0,0>..<127,128,128> field_size=40 time=0.0692203 bw=1.23088

cuda-dma the fallback path with cuMemcpy2DAsync

[0 - 7f6decc1f800]   71.701260 {3}{app}: src=XYZ dst=YXZ bounds=<0,0,0>..<63,128,128> bounds_pad=<0,0,0>..<63,128,128> field_size=8 time=0.0318598 bw=0.267428

[0 - 7fab61085800]   67.850831 {3}{app}: src=XYZ dst=YXZ bounds=<0,0,0>..<63,128,128> bounds_pad=<0,0,0>..<63,128,128> field_size=24 time=0.0423919 bw=0.602959

[0 - 7fc4a4ea5800]   71.327745 {3}{app}: src=XYZ dst=YXZ bounds=<0,0,0>..<63,128,128> bounds_pad=<0,0,0>..<63,128,128> field_size=32 time=0.032147 bw=1.06015

[0 - 7f9c940a9800]   71.310073 {3}{app}: src=XYZ dst=YXZ bounds=<0,0,0>..<63,128,128> bounds_pad=<0,0,0>..<63,128,128> field_size=40 time=0.0320739 bw=1.32821

[0 - 7fa56c707800]  133.768523 {3}{app}: src=XYZ dst=YXZ bounds=<0,0,0>..<127,128,128> bounds_pad=<0,0,0>..<127,128,128> field_size=40 time=0.0634485 bw=1.34285

cuda-dma gpu transpose

[0 - 7ff1eceaf800]   68.316929 {3}{app}: src=XYZ dst=YXZ bounds=<0,0,0>..<63,128,128> bounds_pad=<0,0,0>..<63,128,128> field_size=8 time=0.00099733 bw=8.543

[0 - 7f3fd8bb4800]   70.789758 {3}{app}: src=XYZ dst=YXZ bounds=<0,0,0>..<63,128,128> bounds_pad=<0,0,0>..<63,128,128> field_size=24 time=0.00131024 bw=19.5083

[0 - 7f91f00d7800]   72.633658 {3}{app}: src=XYZ dst=YXZ bounds=<0,0,0>..<63,128,128> bounds_pad=<0,0,0>..<63,128,128> field_size=32 time=0.00116978 bw=29.1343

[0 - 7f15a1ef1800]   72.782363 {3}{app}: src=XYZ dst=YXZ bounds=<0,0,0>..<63,128,128> bounds_pad=<0,0,0>..<63,128,128> field_size=40 time=0.000958376 bw=44.4512

[0 - 7f64f03ec800]  140.584452 {3}{app}: src=XYZ dst=YXZ bounds=<0,0,0>..<127,128,128> bounds_pad=<0,0,0>..<127,128,128> field_size=40 time=0.00107049 bw=79.5919

@streichler
Contributor

Here are some numbers for the 63x128x128 and field sizes 8B/24B/32B/40B

These numbers look much better. Let's get the PR through the system.

The only remaining implementation action item here is to figure out whether we could replace while (num_planes) {cuMemcpy2DAsync(plane...)} with cuMemcpy3DAsync.

Unless the CUDA driver behavior has changed recently, cudaMemcpy3DAsync appears to be implemented as a cpu-side loop of 2D copy launches - i.e. exactly what we're doing in the realm code but without the time limit check. That check is what keeps the DMA system responsive to other work when a large 3D copy is running, so I'd only want to switch to the driver's 3D copy if we are certain that something has changed and the host-side overhead has become constant (e.g. ~2us per entire 3D copy request) rather than linear in the number of planes in the request.

or the API is indeed not flexible enough to specify src/dst strides to make it work

It is indeed not flexible enough. The current cudaMemcpy{2D,3D}... calls require strides to grow monotonically (and also be multiples of the previous one), preventing them from handling transpose or weird padding cases.

@mariodirenzo
Author

Looks great!
Do you have the updated version of the runtime in some branch?
I'd like to run the entire HTR CI on it to confirm that there aren't any other cases that fail.

@apryakhin
Contributor

I apologize as I have re-posted the same comment under:

But yes, the numbers are there: for the problem size in your initial post (the 128^3 grid), we now get about ~6.2 seconds on sapling2. I have similar numbers for ChannelFlow. I will be offline and will likely submit the patch early next week.

@apryakhin
Contributor

@mariodirenzo The patch is in cuda-dma-fix-transpose branch now. Please feel free to give it a shot when you get a chance.

@mariodirenzo
Author

Thanks @apryakhin! The code is much faster now.
I've tried running my CI on cuda-dma-fix-transpose, but I see some deadlocks as soon as I put more than one tile of the computational domain per rank.
The deadlocks do not seem deterministic, but I can provide a simple script to run on sapling2 to reproduce the issue.

@apryakhin
Contributor

apryakhin commented Jul 10, 2023

Thanks, if you can give me the script that would be great!

@mariodirenzo
Author

You need to go into your $HTR_DIR/solverTests and execute the command python3 testAll.py. It will go into each subfolder of the directory and execute the associated solver test for different partitionings. In practice, you will see multiple jobs submitted to the queue for each folder. Each of these tests takes one or two minutes on GPU. If you see a job in the queue that has been executing for more than 5 minutes, you can safely assume that it is hanging.

@apryakhin
Contributor

apryakhin commented Jul 11, 2023

I am running the following:

```
apriakhin@sapling2:~/Development/htr3/solverTests/ChannelFlow$ cat base.json
{
   "Mapping" : {
      "tiles" : [1,2,2],
      "tilesPerRank" : [1,2,2],
      "wallTime" : 60
   },

   "Grid" : {
      "xNum" : 128,
      "yNum" : 128,
      "zNum" : 128,
      "GridInput" : {
         "type" : "Cartesian",
         "origin" : [0.0, 0.0, 0.0],
         "width" : [0.0002819697924580275, 0.00017950754508916364, 0.000563939584916055],
         "xType" : { "type" : "Uniform" },
         "yType" : {
            "type" : "Tanh",
            "Stretching" : 1.697344715398394
         },
         "zType" : { "type" : "Uniform" }
      }
   },
   ...
```

and launching the ChannelFlow solver test with:

```
rm -rf sample0 slurm-*; CUDA_LAUNCH_BLOCKING= DEBUG=0 ../../prometeo.sh -i base.json -logfile %.log -lg:registration
```

Legion is built off the master branch, so without any cuda-dma-related changes, and I observe occasional (non-deterministic) hangs on sapling2. I can give it a go with control_replication, but I am leaning towards this hang being unrelated to cuda-dma.

We should probably continue this discussion and root-causing in a separate GitHub issue.

I am going to do a few more runs with other combinations of solver test vs. Legion branch, just to make sure.

@lightsighter
Contributor

Please first check, with -level dma=2 logs, that there are no copies hanging. I'd be very surprised if you can create a hang on a single process.

@mariodirenzo
Author

I've collected the backtraces of a hanging execution and the dma=1 logs. Both files are attached.
Please let me know if you need further logs or if we should move this discussion to a separate thread.

dma_0.log
bt.log

@apryakhin
Contributor

@mariodirenzo which branch did you run that with?

@mariodirenzo
Author

cuda-dma-fix-transpose

@apryakhin
Contributor

apryakhin commented Jul 13, 2023

We should give it a go on the control_replication branch. Since it hangs on master as well, I don't think it's related to slow transposes anymore. Let's file another GitHub issue so we don't clutter the history of updates here.

@mariodirenzo
Author

Interestingly I can't reproduce the hang on control_replication.

@apryakhin
Contributor

apryakhin commented Jul 13, 2023

Okay, what we could do as an experiment (in addition to root-causing the original hang) is to make a fresh build of cuda-dma based on control_replication and give it a go. Seems like this combination would make sense.

@apryakhin
Contributor

I have just pushed latest cuda-dma changes (including all transpose fixes) to cuda_dma_control_replication

@mariodirenzo
Author

The code works fine on cuda_dma_control_replication. I think we can safely say that it was a bug in master that has been fixed on control_replication.

@lightsighter
Contributor

Agreed. I wouldn't spend a lot of time worrying about it unless we need to run HTR with master.

@mariodirenzo
Author

I believe that all the users of HTR do their production runs on control_replication.

@apryakhin
Contributor

apryakhin commented Jul 14, 2023

Acknowledged. Then I consider the issue with cuda-dma and HTR solved. Thanks!

@mariodirenzo
Author

mariodirenzo commented Jul 14, 2023

Yes, thank you very much for your help!
I'll close #1032, which was about the GPU transposes, and keep this open.
Do you think that similar changes could be applied to the CPU copies?

@apryakhin
Contributor

The cuda-dma branch obviously won't help with CPU transposes, and it is doubtful that similar changes can be applied in this scenario. However, we can explore optimizing either your specific use case or making general improvements.

@seemamirch
Contributor

CPU wall times for 3DPeriodic on shepard are as follows:

  • Modified HTR (prometeo_constraints.cc changed to support transpose layouts): 338 seconds
  • Original HTR: 343 seconds

There are 3 profiles of one of the tests; all 3 took roughly 30 seconds (+/- 0.5 s):

  • no transposes, original HTR -> https://legion.stanford.edu/prof-viewer/?url=https://sapling2.stanford.edu/~seemah/legion_prof_no_transpose/
  • transposes, modified HTR -> https://legion.stanford.edu/prof-viewer/?url=https://sapling2.stanford.edu/~seemah/legion_prof_transpose/
  • transposes + modified Realm (specialized copy for the pattern seen in HTR) -> https://legion.stanford.edu/prof-viewer/?url=https://sapling2.stanford.edu/~seemah/legion_prof_transpose_opt/

apryakhin added this to the realm-backlog milestone Sep 17, 2024