Realm: slow data transposes on CPUs #1493
Comments
@mariodirenzo see this comment for the GPU bug: #1494 (comment)
Yes, there is at least one 24-byte field, and there is also a 40-byte field. These are effectively
I have another patch to handle arbitrary field sizes for the GPU transpose, which includes 24-byte fields. I am testing it right now, and it should hopefully be in a branch within a day or two.
Sounds good. Once you've got any functional issues cleaned up, can you measure the performance in the unit test of one of the slow copies (we can get the exact shape from the logs that @mariodirenzo has captured) on both the old code and the new code?
Sure, I will benchmark the exact shape shortly and post the results. Just an update: the patch is in the branch. It now handles all field sizes, e.g. 24B, 40B, etc., by splitting them into 8B/16B chunks, as well as various instance shapes. The only remaining implementation action item here is to figure out whether we could replace the custom transpose kernel with the CUDA copy API. I spent some time reading the API reference and did a few tests to find out whether one can transpose a 2D plane with it, but I couldn't get it to work. @streichler, either I am not getting it right and missing something obvious, or the API is indeed not flexible enough to specify src/dst strides to make it work. Let me know what you think.
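To illustrate the chunking idea described above, here is a minimal sketch (`split_field` is a hypothetical helper, not code from the actual patch) of decomposing an arbitrary field size into naturally aligned 16B/8B/... pieces:

```cpp
#include <cstddef>
#include <vector>

// Hypothetical helper: greedily decompose a field size into 16B/8B/4B/...
// chunks, e.g. 24B -> {16, 8} and 40B -> {16, 16, 8}. The real patch's
// logic may differ; this only illustrates the splitting idea.
std::vector<std::size_t> split_field(std::size_t field_size) {
  std::vector<std::size_t> chunks;
  for (std::size_t width : {16u, 8u, 4u, 2u, 1u})
    while (field_size >= width) {
      chunks.push_back(width);
      field_size -= width;
    }
  return chunks;
}
```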
@streichler Here are some numbers for the 63x128x128 shape and field sizes 8B/24B/32B/40B, comparing:
- legacy code
- cuda-dma, the fallback path with cuMemcpy2DAsync
- cuda-dma, GPU transpose
These numbers look much better. Let's get the PR through the system.
Unless the CUDA driver behavior has changed recently,
It is indeed not flexible enough. The current
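For context, here is a minimal sketch of what a `cuMemcpy2DAsync` copy can describe (the struct fields are from the CUDA driver API; the wrapper itself is hypothetical). The only strides it carries are the row pitches; within a row, `WidthInBytes` bytes are contiguous on both sides, so a transpose, which needs different element-level strides on source and destination, cannot be expressed:

```cpp
#include <cuda.h>

// Hypothetical wrapper around cuMemcpy2DAsync: copies `height` rows of
// `width_bytes` contiguous bytes each, with independent row pitches.
CUresult strided_copy_2d(CUdeviceptr src, size_t src_pitch,
                         CUdeviceptr dst, size_t dst_pitch,
                         size_t width_bytes, size_t height,
                         CUstream stream) {
  CUDA_MEMCPY2D cpy = {};
  cpy.srcMemoryType = CU_MEMORYTYPE_DEVICE;
  cpy.srcDevice     = src;
  cpy.srcPitch      = src_pitch;    // stride between consecutive source rows
  cpy.dstMemoryType = CU_MEMORYTYPE_DEVICE;
  cpy.dstDevice     = dst;
  cpy.dstPitch      = dst_pitch;    // stride between consecutive destination rows
  cpy.WidthInBytes  = width_bytes;  // rows are contiguous: no per-element stride
  cpy.Height        = height;
  return cuMemcpy2DAsync(&cpy, stream);
}
```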
Looks great!
I apologize, as I have re-posted the same comment under another issue. But yeah, the numbers are there, including for the problem size in your very initial post here.
@mariodirenzo The patch is in the cuda-dma branch.
Thanks @apryakhin! The code is much faster now.
Thanks! If you can give me the script, that would be great.
You need to go into your
I am running the following:
The Legion build is off the master branch, so without any cuda-dma related stuff, and I observe occasional hangs (not deterministic) on sapling2. I can give it a go with the cuda-dma branch. We should probably continue this discussion and root-causing in a separate GitHub issue. I am going to do a few more runs with other combinations of solver test vs. Legion branch, just to make sure.
Please check that there are no copies hanging with the logs of
@mariodirenzo which branch did you run that with?
We should give it a go on a
Interestingly, I can't reproduce the hang on
Okay, what we could do as an experiment (in addition to root-causing the original hang) is to make a fresh build of
I have just pushed the latest
The code works fine on
Agreed. I wouldn't spend a lot of time worrying about it unless we need to run HTR with master.
I believe that all the users of HTR do their production runs on
Acknowledged. Then I consider the issue with cuda-dma and HTR solved. Thanks!
Yes, thank you very much for your help!
The cuda-dma branch obviously won't help with CPU transposes, and it is doubtful that similar changes can be applied in this scenario. However, we can explore either optimizing your specific use case or making general improvements.
CPU versions of 3DPeriodic on shepard are as follows. There are 3 profiles of one of the tests; all 3 took roughly 30 seconds (+/- 0.5 sec).
I've recently implemented a version of HTR that uses various layouts of the same data to optimize loop performance. There are currently two approaches implemented in the solver:

Approach 1: All the instances managed by the runtime are ordered with the dimensions X, Y, Z, and each task requiring transposed data creates a DeferredBuffer and copies the data into the buffer using an OpenMP loop.
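A minimal sketch of an Approach 1 style copy (hypothetical names and layouts; HTR's actual code and the DeferredBuffer plumbing are not shown), gathering X-fastest data into a Z-fastest scratch buffer with an OpenMP loop:

```cpp
#include <cstddef>

// Hypothetical transpose: src is X-fastest (x + nx*(y + ny*z)),
// dst is Z-fastest (z + nz*(x + nx*y)). In HTR the destination would
// live in a DeferredBuffer; plain pointers are used here for brevity.
void transpose_xyz_to_zxy(const double *src, double *dst,
                          std::size_t nx, std::size_t ny, std::size_t nz) {
#pragma omp parallel for collapse(2)
  for (std::size_t y = 0; y < ny; ++y)
    for (std::size_t x = 0; x < nx; ++x)
      for (std::size_t z = 0; z < nz; ++z)
        dst[z + nz * (x + nx * y)] = src[x + nx * (y + ny * z)];
}
```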
Approach 2: Set the layout constraints for each task so that Realm performs the copies from instances with the orderings X, Y, Z; Y, X, Z; and Z, X, Y (a registration sketch follows after this comment).

Considering that at least two tasks reuse the same data in a single layout, Approach 1 makes many more copies with transposes and is expected to be slower; Approach 2 reuses the same "transposed" instance for multiple tasks.

I've tested the two implementations solving a three-periodic flow on a 128^3 grid with the following wall times:

PS: @elliottslaughter could you please add this issue to #1032 with low priority?
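For reference, a layout constraint in the spirit of Approach 2 might be registered roughly like this (a hedged sketch against Legion's C++ API; the task name, task ID, and the Z-fastest ordering are illustrative, not HTR's actual registration):

```cpp
#include "legion.h"
using namespace Legion;

enum TaskIDs { STENCIL_TASK_ID = 1 };  // hypothetical task ID

void stencil_task(const Task *task, const std::vector<PhysicalRegion> &regions,
                  Context ctx, Runtime *runtime) { /* kernel body elided */ }

// Preregister a CPU variant whose first region argument must be an SOA
// instance with Z varying fastest (OrderingConstraint lists dimensions
// from smallest stride to largest; DIM_F last makes it struct-of-arrays).
void register_zxy_variant() {
  LayoutConstraintRegistrar layout(FieldSpace::NO_SPACE, "zxy_soa");
  std::vector<DimensionKind> order = {DIM_Z, DIM_X, DIM_Y, DIM_F};
  layout.add_constraint(OrderingConstraint(order, false /*contiguous*/));
  LayoutConstraintID zxy = Runtime::preregister_layout(layout);

  TaskVariantRegistrar registrar(STENCIL_TASK_ID, "stencil_zxy");
  registrar.add_constraint(ProcessorConstraint(Processor::LOC_PROC));
  registrar.add_layout_constraint_set(0 /*region index*/, zxy);
  Runtime::preregister_task_variant<stencil_task>(registrar, "stencil_zxy");
}
```

With variants like this registered for each ordering, the runtime can keep one "transposed" instance per layout and route each task to the matching variant, so Realm performs the transposing copies instead of per-task OpenMP loops.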