-
Notifications
You must be signed in to change notification settings - Fork 3.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
GH-37364: [C++][GPU] Add CUDA impl of Device Event/Stream #37365
Conversation
CC @kkraus14 |
|
cpp/src/arrow/gpu/cuda_context.h
Outdated
explicit Stream(std::shared_ptr<CudaContext> ctx, CUstream stream) noexcept | ||
: context_{std::move(ctx)}, stream_{stream} {} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It would be nice to have a constructor that just works on the default CUDA context similar to how MakeDeviceSyncEvent
and WrapDeviceSyncEvent
work. Should we follow the same pattern for Streams here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was thinking about that and was trying to think about if/how we wanted to handle Stream lifetime stuff and whether we wanted to go the same route we did with events in terms of the whole unique_ptr
and custom deleter etc. to allow either the CudaDevice::Stream
owning the lifetime or not.
Do you think it makes more sense for the constructing function to be part of the memory manager or the device object?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The device object. You could have different memory managers of the same device using the same stream, so I think it logically should be associated to the device.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added the methods to the Device object, following the same pattern I set up for the SyncEvent, but for streams. Let me know what you think.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think I missed something: why does Stream need to own instead of just wrap? It seems that stream lifetime management will always be handled elsewhere. For example, say there's an application built with CUDA which is adding an arrow integration: the application will already have a pool of CUstreams which only need to be wrapped in arrow Device::Streams when calling an arrow function. I don't think an arrow function will ever take ownership of the stream away from the hypothetical application, so to me it seems we don't need arrow functions to wrap them in smart pointers for dynamic lifetimes or functions for producing new streams (MakeStream).
To put this another way: would we ever need to produce a vector<shared_ptr<Stream>>
where the wrapped streams are a mix of CUstream
and hipStream_t
? Even if an application were using both ROCM and CUDA at the same time, the streams would (I'm certain) be maintained in separate pools which obviates the need for polymorphic lifetime management.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I tend to agree with both of you and lean towards the synchronicity and convenience side of things. It's nice having the APIs be similar between the events and streams and so I lean more towards the ownership of the stream itself.
That said, I'm not opposed to having it just wrap and not own. What do you think @bkietz?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the potential aggravation due to arrow's failure to integrate seamlessly with a user's existing stream pool is greater than the potential due to a user needing to implement their own stream pool.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What's the issue with someone using Stream::WrapStream
and passing nullptr
if they don't want any lifetime management or handling something like a shared pointer in a release function if desired?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yea, I'd agree with @kkraus14 that if they are trying to integrate seamlessly with an existing stream pool they could use Device::WrapStream
and pass nullptr
or whatever other releasing back to the pool they want in the release function.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There's not a fundamental issue, it's just an IMHO: avoiding ownership entirely here will be the more user-friendly API. Certainly using Stream::WrapStream
as you describe will also work
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One minor comment issue, otherwise LGTM
cpp/src/arrow/device.h
Outdated
@@ -109,19 +109,50 @@ class ARROW_EXPORT Device : public std::enable_shared_from_this<Device>, | |||
/// should be trivially constructible from it's device-specific counterparts. | |||
class ARROW_EXPORT Stream { | |||
public: | |||
virtual const void* get_raw() const { return NULLPTR; } | |||
using release_fn_t = void (*)(void*); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we consider making this std::function
instead of a raw function pointer here? Using the former would allow for passing lambda captures at the cost of some overhead. Lambda captures would be useful for something like if someone has a smart pointer managing the lifetime of a stream.
After merging your PR, Conbench analyzed the 6 benchmarking runs that have been run so far on merge-commit 3b8ab8e. There were no benchmark performance regressions. 🎉 The full Conbench report has more details. It also includes information about possible false positives for unstable benchmarks that are known to sometimes produce them. |
…pendencies (#37497) ### Rationale for this change #37365 was built locally using a newer version of the CUDA driver API than the crossbow builds run with causing the crossbow build to fail due to a change in macro names. This puts the macro name back to allow it to build with the older version of the cuda driver api. * Closes: #37523 Authored-by: Matt Topol <[email protected]> Signed-off-by: Matt Topol <[email protected]>
…he#37365) ### What changes are included in this PR? Adding `CudaDevice::SyncEvent` and `CudaDevice::Stream` implementations which provide more idiomatic handling of Events and Streams. ### Are these changes tested? Simple SyncEvent test added. More stream tests still being added. * Closes: apache#37364 Authored-by: Matt Topol <[email protected]> Signed-off-by: Matt Topol <[email protected]>
…UDA dependencies (apache#37497) ### Rationale for this change apache#37365 was built locally using a newer version of the CUDA driver API than the crossbow builds run with causing the crossbow build to fail due to a change in macro names. This puts the macro name back to allow it to build with the older version of the cuda driver api. * Closes: apache#37523 Authored-by: Matt Topol <[email protected]> Signed-off-by: Matt Topol <[email protected]>
…he#37365) ### What changes are included in this PR? Adding `CudaDevice::SyncEvent` and `CudaDevice::Stream` implementations which provide more idiomatic handling of Events and Streams. ### Are these changes tested? Simple SyncEvent test added. More stream tests still being added. * Closes: apache#37364 Authored-by: Matt Topol <[email protected]> Signed-off-by: Matt Topol <[email protected]>
…UDA dependencies (apache#37497) ### Rationale for this change apache#37365 was built locally using a newer version of the CUDA driver API than the crossbow builds run with causing the crossbow build to fail due to a change in macro names. This puts the macro name back to allow it to build with the older version of the cuda driver api. * Closes: apache#37523 Authored-by: Matt Topol <[email protected]> Signed-off-by: Matt Topol <[email protected]>
What changes are included in this PR?
Adding
CudaDevice::SyncEvent
andCudaDevice::Stream
implementations which provide more idiomatic handling of Events and Streams.Are these changes tested?
Simple SyncEvent test added. More stream tests still being added.