GPU-friendly loads / merge_frames #2565
Hrm, that does seem like incorrect behavior in the GPU memory case. It also means that these lines weren't triggered: distributed/distributed/protocol/utils.py, lines 60 to 61 in fb30c33.

This is somewhat surprising, especially given that UCX handles message framing itself. I wonder if this is a case where we should be using […]
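For context, distributed splits large frames into fixed-size shards on the send side. A rough sketch of that idea follows; this is not the actual frame_split_size code, and the shard size here is an assumption for illustration only:

```python
SHARD_SIZE = 64 * 2**20  # hypothetical shard size, for illustration only

def split_frame(frame, shard_size=SHARD_SIZE):
    """Split one large host frame (bytes/memoryview) into fixed-size shards."""
    view = memoryview(frame)
    return [view[i:i + shard_size] for i in range(0, len(view), shard_size)]
```

This is why a handful of logical frames can show up as many more frames on the wire.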
Ahh... I completely missed those lines. That may actually explain it. I've been hacking up the serialization code to make things work. I likely have a bug in there.
So, after looking at this further, there may still be some merit to this issue. I'm sending a cudf DataFrame with three columns and no nulls, so that's a total of 3 numba device arrays (I ignore the index for now). Somewhere on the send side (distributed/distributed/protocol/utils.py, lines 60 to 61 in fb30c33) the frames are getting split up:

```
(Pdb) len(frames)
72
(Pdb) len(lengths)
3
(Pdb) len(gpu_frames)
9
(Pdb) sum(x.nbytes for x in gpu_frames) == sum(lengths)
True
```

At the cost of another memory allocation (and a numba dependency), we can do something similar to the host-memory case. For each of the 3 "expected" frames, I allocate an empty numba DeviceNDArray with the expected length. I then iterate over the "split" frames, filling in the empty DeviceNDArray. When I do that, returning the 3 new DeviceArrays from […]
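A minimal sketch of that approach, assuming each split frame is a 1-D uint8 numba DeviceNDArray and that split boundaries never straddle two logical frames; merge_device_frames is a hypothetical helper name, not the real merge_frames:

```python
import numpy as np
from numba import cuda

def merge_device_frames(lengths, split_frames):
    """Rebuild one device buffer per expected length from a flat list of
    smaller device frames, without copying through host memory."""
    merged = []
    pieces = iter(split_frames)
    for length in lengths:
        out = cuda.device_array(length, dtype=np.uint8)  # one allocation per logical frame
        offset = 0
        while offset < length:
            piece = next(pieces)
            # device-to-device copy of this piece into its slice of the output
            out[offset:offset + piece.size].copy_to_device(piece)
            offset += piece.size
        merged.append(out)
    return merged
```

The extra allocation is the price for handing back exactly the frames the deserializer expects, mirroring what the host path does with bytes.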
After writing this up, I had the bright idea of confirming that […]

And, finally getting to the point where we're starting to see some payoff: […]

I wouldn't put too much stock in those numbers yet, but it's at least encouraging. In particular, this is mixing […] but it's something :)
My guess is that we need a […]
Yeah, I think that seems to be necessary. We would dispatch upon header['type'], right?

It's a little unfortunate. bytes/memoryview is sufficient for all host-memory things. It'd be nice if there was a device equivalent, so we would just need to include a kind of 'is_gpu' key in the header and dispatch to host or GPU based on that.
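A minimal sketch of that dispatch; the 'is_gpu' key and both merge helpers are hypothetical names for illustration (not distributed's actual protocol code), and merge_device_frames refers to the device-side sketch above:

```python
def merge_host_frames(lengths, frames):
    """Host path: join the frames as bytes and re-split them into `lengths`."""
    joined = b"".join(bytes(f) for f in frames)
    out, offset = [], 0
    for length in lengths:
        out.append(joined[offset:offset + length])
        offset += length
    return out

def merge_frames_dispatch(header, frames):
    """Pick the host or device merge based on a hypothetical 'is_gpu' header key."""
    if header.get("is_gpu"):
        # device path: keep the frames in GPU memory (see merge_device_frames above)
        return merge_device_frames(header["lengths"], frames)
    return merge_host_frames(header["lengths"], frames)
```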
> We would dispatch upon header['type'], right

Or, at this point, on the type of the frames (so that this wouldn't be as tied to Dask's protocol).

> It'd be nice if there was a device-equivalent

We are probably the most qualified group to start defining what that looks like. I encourage you to consider what the requirements would look like.
This is the first time I've thought "I could see this being a PEP someday".
But not today :)
I suspect that we would start with something lighter weight, like a protocol (like […])
Am curious what other things we wanted to support here @TomAugspurger 🙂 Are there things that we still haven't implemented today?

Edit: It may be that PR (#3732) solves this, or at least makes whatever is left here much easier to solve, IIUC what is being asked.
Does the benchmark in https://github.com/TomAugspurger/dask-perf/blob/master/bench_cudf_join.py run (I don't have easy access to a GPU anymore, otherwise I would check)? If so, I'm guessing things are sufficiently fixed :)
Hmm... I'm not actually sure how one should run that these days. It looks like we need to set up our own scheduler. Also […]
To the spirit of the issue, we do have […]
I think we'll close this and reopen if we come across it with the current implementations.
Just throwing this up here for now, need to investigate more.
I'm working on a distributed cudf join using UCX. Things progress fine until, in the Client process, we attempt to deserialize some data (I think the final result?). We end up calling loads with deserialize=True (distributed/distributed/protocol/core.py, line 101 in fb30c33), which calls merge_frames (distributed/distributed/protocol/utils.py, line 80 in fb30c33), which attempts to convert the data to a byte string.

At this point in the client process, frames is a list of objects representing device memory. If possible (and I think it's possible), I'd like to avoid copying to the host here.

Actually, this may only be possible if the Client happens to have a GPU as well. In this case that's true, but not in general.
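One way to express that "avoid copying to the host" wish is to detect device-memory frames before falling into the byte-string path. A minimal sketch, assuming device frames expose the __cuda_array_interface__ attribute (as numba device arrays do); the helper name is hypothetical:

```python
def frames_are_device(frames):
    """Heuristic: treat any frame exposing __cuda_array_interface__ as device
    memory that should not be coerced to a host byte string."""
    return any(hasattr(f, "__cuda_array_interface__") for f in frames)

# A GPU-friendly merge_frames could branch on this check and hand device
# frames to a device-side merge (see the merge_device_frames sketch above)
# instead of converting them to bytes.
```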
TODO: