[RFC]: C API #134 (merged Oct 18, 2024)
File: rfcs/20240806-c-api/README.md (126 additions, 0 deletions)
# C API Design Document (RFC)


## Introduction

The oneCCL communication library’s current API is defined in the [oneAPI
specification][ccl-spec]. However, the APIs used by similar collective
communication libraries differ from oneCCL’s; see, for example,
[NCCL][nccl-spec] from Nvidia, [RCCL][rccl-spec] from AMD, and
[HCCL][hccl-spec] from Habana. This RFC asks for feedback on aligning the
oneCCL API more closely with the other vendors’ libraries, since this
facilitates integration with frameworks and upstreaming to open source
projects.

One difference between oneCCL and the other vendors’ communication libraries
is that they all expose a C API, while oneCCL exposes a C++ API. This is
because oneCCL was designed to integrate with SYCL, which is based on C++. One
of the goals of oneCCL is to support hardware from different vendors, such as
the Intel Data Center GPU Max Series, the Intel Core and Intel Xeon families,
Intel Gaudi, and Nvidia or AMD GPUs, among others.

[ccl-spec]: https://uxlfoundation.github.io/oneAPI-spec/spec/elements/oneCCL/source/index.html
[hccl-spec]: https://docs.habana.ai/en/latest/API_Reference_Guides/HCCL_APIs/C_API.html
[nccl-spec]: https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/api.html
[rccl-spec]: https://rocm.docs.amd.com/projects/rccl/en/latest/api-reference/api-library.html#api-library

## Proposal

The proposal is to define a C-like API that aligns with current APIs in other
communication libraries, while introducing a few changes, as described next:

1. Most APIs are C-based, like those of the other communication libraries. C++
data structures are hidden behind handles returned to the user, such as
`ccl::stream` and `ccl::comm`.

2. The API is extended with two C++ functions to support `sycl::queue` (see
the sketch after this list):

- `onecclResult_t onecclCreateStream(sycl::queue, &oneccl_stream)`
- `onecclResult_t onecclReleaseStream(oneccl_stream)`

Once the `sycl::queue` is registered, it is hidden behind the oneCCL stream
handle.

3. Add functions that allow users to explicitly control the lifetime of
objects, instead of relying on C++ destructors:

- `onecclResult_t onecclCommFinalize(comm)`
- `onecclResult_t onecclCommDestroy(comm)`

4. Drop support for out-of-order SYCL queues. The current oneCCL library
supports out-of-order SYCL queues, but this feature is not used by the
library's users. In general, collective operations are submitted to an
in-order queue. When out-of-order behavior is required, commands are submitted
to a different in-order queue, and the two queues are synchronized.

5. Drop support for SYCL buffers. Only [Unified Shared Memory][usm-example] is
supported.

[usm-example]: https://www.intel.com/content/www/us/en/developer/articles/code-sample/dpcpp-usm-code-sample.html
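As a rough illustration of items 2 through 4, user code under this proposal
might look like the sketch below. The header name `oneccl.h` and the
`onecclStream_t` handle type are assumptions for illustration, not part of the
RFC:

```cpp
#include <sycl/sycl.hpp>
#include <oneccl.h>  // hypothetical header name for the proposed C API

int main() {
  // Item 4: only in-order SYCL queues are supported.
  sycl::queue q{sycl::gpu_selector_v, sycl::property::queue::in_order{}};

  // Item 2: register the queue; it is then hidden behind an opaque handle.
  onecclStream_t stream;
  onecclCreateStream(q, &stream);

  // ... create a communicator and submit collectives on `stream` ...

  // Items 2 and 3: explicit release instead of relying on C++ destructors.
  onecclReleaseStream(stream);
  return 0;
}
```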

### APIs

The tables below contain the NCCL API, the corresponding new proposed oneCCL
API, and the current oneCCL API.

#### APIs related to communicator creation

| NCCL | oneCCL (proposed C) | oneCCL (current, C++) |
|-------------------|------------------------------|-------------------------|
|`cudaError_t cudaSetDevice(device)`|`onecclResult_t onecclSetDevice(device)` (1)| N/A |

> **Review comment:** What's the device? We should be much more specific here.
> I think we don't have to cover this API. My proposal is to handle all device
> handles in `...Config` APIs. For example, passing the user's device to
> `onecclCommInit` would look like:
>
> ```cpp
> onecclCommConfig_t config = ONECCL_COMM_CONFIG_INITIALIZER;
> config.sycl.queue = &queue;
>
> onecclCommInitConfig(rank, size, ..., &config);
> ```
>
> A similar pattern could be applied to `onecclStreamCreate`, so we could get
> rid of `onecclCreateSyclStream`, which looks like a workaround rather than an
> elegant solution:
>
> ```cpp
> onecclStream_t stream;
> onecclStreamConfig_t stream_config = ONECCL_STREAM_SYCL_CONFIG_INITIALIZER;
> // or onecclStreamConfig_t config = ONECCL_STREAM_CONFIG_INITIALIZER;
> config.sycl.queue = &queue;
>
> onecclStreamCreateConfig(&stream, &config);
> ```

> **Reply:** We agreed to use an integer index for the devices temporarily,
> before SYCL releases an API for device selection similar to `cudaSetDevice`.
>
> For the streams, I think that
> `onecclCreateStreamXpu(&onecclStream_t stream_ptr, void* args)` should be
> enough for all communication backends we would like to support.

> **Reply:** fixed

|`ncclResult_t ncclGetUniqueId(id)`|`onecclResult_t onecclGetUniqueId(id)`|`ccl::create_main_kvs(); ccl::create_kvs(main_addr);`|
|`ncclResult_t ncclCommInitRank(comm, size, id, rank)`|`onecclResult_t onecclCommInitRank(comm, size, id, rank)`|`comm = ccl::create_communicator(size, rank, device, context, kvs); comms = ccl::create_communicators(size, rank, device, context, kvs)`|
|`ncclResult_t ncclCommInitRankConfig(comm, size, id, rank, attr)`|`onecclResult_t onecclCommInitRankConfig(comm, size, id, rank, attr)`|`comm = ccl::create_communicator(size, rank, device, context, kvs, attr)`|
|`ncclResult_t ncclCommInitAll(comms, ndev, dev_list)`|`onecclResult_t onecclCommInitAll(comms, ndev, dev_list)`| Not currently available; working on adding support.|
|`ncclCommSplit` | Not implemented | Not implemented |
|`ncclResult_t ncclCommFinalize(comm)`|`onecclResult_t onecclCommFinalize(comm)`| N/A |
|`ncclResult_t ncclCommDestroy(comm)`|`onecclResult_t onecclCommDestroy(comm)`| Destructor |

(1) Notice that `cudaSetDevice(device)` is a CUDA call, not an NCCL call. If
an equivalent call becomes available in SYCL (or the calling language), the
proposed `onecclSetDevice(device)` will not be needed.
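A hedged sketch of the proposed bootstrap flow, modeled on the NCCL pattern.
The `onecclUniqueId` and `onecclComm_t` type names are assumptions, and the
out-of-band exchange of the unique id between ranks is application-specific
and only stubbed here:

```cpp
#include <oneccl.h>  // hypothetical header name for the proposed C API

// Bootstrap a communicator across `size` ranks, NCCL-style.
void setup(int rank, int size) {
  onecclUniqueId id;
  if (rank == 0) {
    onecclGetUniqueId(&id);  // rank 0 generates the id
  }
  // Exchange `id` out of band (MPI_Bcast, sockets, ...) -- not shown.

  onecclComm_t comm;
  onecclCommInitRank(&comm, size, id, rank);

  // ... submit collectives ...

  onecclCommFinalize(comm);  // complete outstanding operations
  onecclCommDestroy(comm);   // explicit destruction (item 3 above)
}
```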

#### APIs related to collective communication operations

| NCCL | oneCCL (proposed C) | oneCCL (current, C++) |
|-------------------|------------------------------|-------------------------|
|`ncclResult_t ncclAllGather(sendbuff, recvbuff, count, datatype, comm, stream)`|`onecclResult_t onecclAllgather(sendbuff, recvbuff, count, datatype, comm, oneccl_stream)`|`ccl::event communicator::allgather (2) (sendbuff, recvbuff, count, datatype, comm, oneccl_stream, deps)`|
|`ncclResult_t ncclAllReduce(sendbuff, recvbuff, count, datatype, op, comm, stream)`|`onecclResult_t onecclAllreduce(sendbuff, recvbuff, count, datatype, op, comm, oneccl_stream)`|`ccl::event communicator::allreduce(sendbuff, recvbuff, count, datatype, op, comm, oneccl_stream, deps)`|
|`ncclResult_t ncclBroadcast(sendbuff, recvbuff, count, datatype, root, comm, stream)`|`onecclResult_t onecclBroadcast(sendbuff, recvbuff, count, datatype, root, comm, oneccl_stream)`|`ccl::event communicator::broadcast (3) (sendbuff, recvbuff, count, datatype, root, comm, oneccl_stream, deps)`|
|`ncclResult_t ncclReduce(sendbuff, recvbuff, count, datatype, op, root, comm, stream)`|`onecclResult_t onecclReduce(sendbuff, recvbuff, count, datatype, op, root, comm, oneccl_stream)`|`ccl::event communicator::reduce(sendbuff, recvbuff, count, datatype, op, root, comm, oneccl_stream, deps)`|
|`ncclResult_t ncclReduceScatter(sendbuff, recvbuff, count, datatype, op, comm, stream)`|`onecclResult_t onecclReduceScatter(sendbuff, recvbuff, count, datatype, op, comm, oneccl_stream)`|`ccl::event communicator::reduce_scatter(sendbuff, recvbuff, count, datatype, op, comm, oneccl_stream, deps)`|
| N/A |`onecclAlltoall`, `onecclAlltoallv` (could be deprecated)|`communicator::alltoall`, `communicator::alltoallv`|
| N/A |`onecclBarrier` (could be deprecated in favor of an Allreduce of 1 byte)|`ccl::event communicator::barrier`|

- (2) oneCCL currently provides Allgatherv, but it will be deprecated in the
future.
- (3) The current Broadcast API is slightly different, but the next oneCCL
release will align it with the one shown here.
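For illustration, an allreduce with the proposed C API might look as follows;
the enum names `onecclFloat` and `onecclSum` are assumptions borrowed from
NCCL's naming style:

```cpp
// Assumes `comm` and `stream` were created as in the previous sketch, and
// that `sendbuf`/`recvbuf` are USM device allocations (see item 5 above).
size_t count = 1024;
onecclAllreduce(sendbuf, recvbuf, count, onecclFloat, onecclSum,
                comm, stream);
// Operations run on the in-order queue behind `stream`; synchronize the
// underlying sycl::queue to observe the result on the host.
```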

#### Group APIs

| NCCL | oneCCL (proposed C) | oneCCL (current, C++) |
|-------------------|------------------------------|-------------------------|
|`ncclResult_t ncclGroupStart()`|`onecclResult_t onecclGroupStart()`| N/A |
|`ncclResult_t ncclGroupEnd()` |`onecclResult_t onecclGroupEnd()` | N/A |

#### Point to Point APIs

| NCCL | oneCCL (proposed C) | oneCCL (current, C++) |
|-------------------|------------------------------|-------------------------|
|`ncclResult_t ncclSend(sendbuf, count, datatype, peer, comm, stream)`|`onecclResult_t onecclSend(sendbuf, count, datatype, peer, comm, oneccl_stream)`|`ccl::event communicator::send(sendbuf, count,datatype, peer, comm, oneccl_stream)`|
|`ncclResult_t ncclRecv(…)`|`onecclResult_t onecclRecv(…)`|`communicator::recv`|
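Group calls let multiple point-to-point operations be aggregated and launched
together, as in NCCL. A hedged sketch of a ring exchange, assuming the peer
enumeration shown here is supplied by the application:

```cpp
// Each rank sends to its right neighbor and receives from its left one.
// Grouping avoids deadlock: neither call blocks until onecclGroupEnd().
int right = (rank + 1) % size;
int left  = (rank + size - 1) % size;

onecclGroupStart();
onecclSend(sendbuf, count, onecclFloat, right, comm, stream);
onecclRecv(recvbuf, count, onecclFloat, left,  comm, stream);
onecclGroupEnd();
```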

#### Other APIs

| NCCL | oneCCL (proposed C) | oneCCL (current, C++) |
|-------------------|------------------------------|-------------------------|
|`ncclResult_t ncclCommCount(comm, size)`|`onecclResult_t onecclCommCount(comm, size)`|`size communicator::size()`|
|`ncclResult_t ncclCommCuDevice(comm, device)`|`onecclResult_t onecclCommGetDevice(comm, device)`|`device communicator::get_device()`|
|`ncclResult_t ncclCommUserRank(comm, rank)`|`onecclResult_t onecclCommUserRank(comm, rank)`|`rank communicator::rank()`|
|`ncclResult_t ncclGetVersion(version)`|`onecclResult_t onecclGetVersion(version)`|`version ccl::get_library_version()`|
|`ncclCommAbort` | Not implemented | N/A |
|`ncclCommGetAsyncError`| Not implemented | N/A |
|`ncclGetLastError` | Not implemented | N/A |
|`ncclGetErrorString`| Not implemented | N/A |
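Since every proposed function returns an `onecclResult_t`, callers can wrap
calls in a checking macro, mirroring common NCCL practice. The `onecclSuccess`
status name is an assumption here, and no string form of the error is used,
given that `onecclGetErrorString` is not yet implemented:

```cpp
#include <stdio.h>
#include <stdlib.h>

// Minimal status-checking macro; aborts on the first failing call.
#define ONECCL_CHECK(call)                            \
  do {                                                \
    onecclResult_t _st = (call);                      \
    if (_st != onecclSuccess) {                       \
      fprintf(stderr, "oneCCL error %d at %s:%d\n",   \
              (int)_st, __FILE__, __LINE__);          \
      exit(EXIT_FAILURE);                             \
    }                                                 \
  } while (0)

// Usage example: query the library version.
// int version;
// ONECCL_CHECK(onecclGetVersion(&version));
```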