[RFC]: C API #134

rscohn2 · 2024-08-06T19:40:51Z

I formatted Gengbin's C API proposal as an RFC.

garzaran

Looks good.

blazej-smorawski · 2024-08-07T15:30:50Z

rfcs/20240806-c-api/README.md

+
+| NCCL              | oneCCL (proposed C)          | oneCCL (current, C++)   |
+|-------------------|------------------------------|-------------------------|
+|`cudaError_t`      |`onecclResult_t cudaSetDevice(device)(1)`| N/A          |


What's the device? We should be much more specific here. I think we don't have to cover this API. My proposal is to handle all device handles in the ...Config APIs. For example, passing the user's device to onecclCommInit would look like:

    onecclCommConfig_t config = ONECCL_COMM_CONFIG_INITIALIZER;
    config.sycl.queue = &queue;
    onecclCommInitConfig(rank, size, ..., &config);

A similar pattern could be applied to onecclStreamCreate, so we could get rid of onecclCreateSyclStream, which looks like a workaround rather than an elegant solution:

    onecclStream_t stream;
    onecclStreamConfig_t stream_config = ONECCL_STREAM_SYCL_CONFIG_INITIALIZER;
    // or
    onecclStreamConfig_t config = ONECCL_STREAM_CONFIG_INITIALIZER;
    config.sycl.queue = &queue;
    onecclStreamCreateConfig(&stream, &config);
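
To make the config-initializer pattern proposed above concrete, here is a minimal, self-contained sketch in plain C. Every name in it (`sketch_stream_config_t`, `SKETCH_STREAM_CONFIG_INITIALIZER`, `sketch_stream_create_config`, and so on) is a hypothetical stand-in invented for illustration, not the oneCCL API, and the queue is modelled as an opaque `void *` so the example compiles without SYCL. It also assumes the stream only borrows the queue pointer, which the proposal does not specify.

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-ins for illustration only; none of these names
 * are taken from oneCCL or from the RFC text. */
typedef enum {
    SKETCH_SUCCESS = 0,
    SKETCH_INVALID_ARGUMENT,
    SKETCH_NO_MEMORY
} sketch_result_t;

/* Config struct: backend-specific handles live in named sub-structs. */
typedef struct {
    struct {
        void *queue; /* opaque pointer to a caller-owned queue */
    } sycl;
} sketch_stream_config_t;

/* The initializer gives every field a defined default, so call sites
 * only set the fields they care about. */
#define SKETCH_STREAM_CONFIG_INITIALIZER { { NULL } }

struct sketch_stream {
    void *native_queue; /* borrowed from the config, never freed here */
};
typedef struct sketch_stream *sketch_stream_t;

/* Create a stream from a config; the handle is returned through 'stream'. */
sketch_result_t sketch_stream_create_config(sketch_stream_t *stream,
                                            const sketch_stream_config_t *config)
{
    if (stream == NULL || config == NULL)
        return SKETCH_INVALID_ARGUMENT;

    struct sketch_stream *s = malloc(sizeof *s);
    if (s == NULL)
        return SKETCH_NO_MEMORY;

    s->native_queue = config->sycl.queue;
    *stream = s;
    return SKETCH_SUCCESS;
}

/* Release the stream object; the borrowed queue is left untouched. */
sketch_result_t sketch_stream_destroy(sketch_stream_t stream)
{
    if (stream == NULL)
        return SKETCH_INVALID_ARGUMENT;
    free(stream);
    return SKETCH_SUCCESS;
}

/* Usage: initialize, override one field, create, destroy. */
int sketch_example(void *user_queue)
{
    sketch_stream_t stream = NULL;
    sketch_stream_config_t config = SKETCH_STREAM_CONFIG_INITIALIZER;
    config.sycl.queue = user_queue;

    if (sketch_stream_create_config(&stream, &config) != SKETCH_SUCCESS)
        return 1;
    return sketch_stream_destroy(stream) == SKETCH_SUCCESS ? 0 : 1;
}
```

The point of the sketch is that a single create function can serve several backends: a new handle type becomes a new field in the config struct rather than a new entry point.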

We agreed to use an integer index for devices temporarily, until SYCL releases a device-selection API similar to cudaSetDevice.

For the streams, I think that onecclCreateStreamXpu(&onecclStream_t stream_ptr, void* args) should be enough for all the communication backends we would like to support.
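
As an illustration of the temporary integer device index mentioned above, the following sketch models a per-thread "current device" selected by index, in the spirit of cudaSetDevice, which sets the current device for the calling host thread. All names (`idx_set_device`, `idx_get_device`, `IDX_DEVICE_COUNT`) and the fixed device count are assumptions made for the example; they are not part of oneCCL or SYCL, and a real implementation would query the runtime for the device list.

```c
#include <stddef.h>

/* Hypothetical names for illustration; not part of oneCCL or SYCL. */
typedef enum {
    IDX_SUCCESS = 0,
    IDX_INVALID_ARGUMENT
} idx_result_t;

/* Assumed number of visible devices; a real library would query this. */
#define IDX_DEVICE_COUNT 4

/* Per-thread selection (C11 thread-local storage); defaults to device 0. */
static _Thread_local int idx_current_device = 0;

/* Select the device used by later calls made from this thread. */
idx_result_t idx_set_device(int index)
{
    if (index < 0 || index >= IDX_DEVICE_COUNT)
        return IDX_INVALID_ARGUMENT;
    idx_current_device = index;
    return IDX_SUCCESS;
}

/* Report the device currently selected for this thread. */
idx_result_t idx_get_device(int *index)
{
    if (index == NULL)
        return IDX_INVALID_ARGUMENT;
    *index = idx_current_device;
    return IDX_SUCCESS;
}
```

A rejected index leaves the previous selection in place, so a caller can probe indices without losing its current device.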

JackAKirk · 2024-09-30T15:00:58Z

Have you considered this issue: intel/llvm#15251, which is described more fully here: oneapi-src/unified-runtime#2077?

Are there not similar implementation issues in the level_zero backend due to the implicit device setting nature of SYCL?
I looked into the l0 adapter implementation. I think it is possible that l0 has similar issues, depending on IPC usage; if intel/llvm#15251 could be built for PVC, I think you would see similar issues.

I do not see how implementing a wrapper function for SYCL that matches the API of cudaSetDevice, such as the one suggested here: https://github.com/intel/llvm/pull/15382, can solve such issues.

I think that the only current solution is to rely on the assumption that no more than one GPU device will be used per MPI rank and that environment variables such as CUDA_VISIBLE_DEVICES (or the Intel vendor equivalent) or ONEAPI_DEVICE_SELECTOR are used, as detailed in intel/llvm#15251.

garzaran · 2024-10-04T02:45:13Z

We do need to support the case where one MPI rank opens more than one GPU device. It is hard to understand all the details in the ticket you refer to, but it appears more like a memory leak, which I assume can and has to be fixed. In any case, in general an MPI rank can open all the GPU devices.
Take a look at this example from CUDA: https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/examples.html#example-3-multiple-devices-per-thread. TensorFlow needs such a use case, where a single process opens all the devices.

blazej-smorawski · 2024-10-07T09:56:30Z

rfcs/20240806-c-api/README.md

+   - `onecclResult_t onecclReleaseStream(oneccl_stream)`
+
+   `onecclResult_t onecclStreamDestroy(onecclStream_t oneccl_stream)`


So onecclStreamDestroy is the final version, right?

blazej-smorawski · 2024-10-07T09:56:52Z

rfcs/20240806-c-api/README.md

+
+   `onecclResult_t onecclStreamDestroy(onecclStream_t oneccl_stream)`
+
+   Once the sycl::queue is registered, it is hidden behind the `cclStream_t`


cclStream_t -> onecclStream_t

Yes, this is a typo. @zhenggb72, can you update this and we can approve and merge?

JackAKirk · 2024-10-08T13:27:51Z

Thanks @garzaran, that link was useful. I've opened an Open MPI issue with a CUDA reproducer that represents what happens in DPC++ here: open-mpi/ompi#12848

garzaran · 2024-10-11T22:37:40Z

Looks good

rscohn2 added the RFC label Aug 6, 2024

garzaran approved these changes Aug 6, 2024

C API RFC

a5a4356

rscohn2 force-pushed the dev/new-api branch from 23c7de5 to a5a4356 Compare August 6, 2024 20:56

blazej-smorawski reviewed Aug 7, 2024

blazej-smorawski approved these changes Oct 7, 2024

revised

928e842

zhenggb72 force-pushed the dev/new-api branch from c3a07a8 to 928e842 Compare October 11, 2024 05:24

garzaran approved these changes Oct 11, 2024

Merge branch 'rfcs' into dev/new-api

5340048

rscohn2 merged commit 7e4ff57 into oneapi-src:rfcs Oct 18, 2024
