CUDA runtime api #262

psmyth94 · 2024-06-20T22:58:46Z

Hello,

I started working on the implementation for the cuda runtime api since I saw some interest in it (#200). I managed to translate the cuda runtime api equivalent for most of the functions in the driver api except for context and module management, which is handled automatically by cudart. Below are some of the issues/limitations:

No cudaOccupancyMaxPotentialBlockSize.*
- bindgen isn't able to generate templated static inline functions [see their unsuppoted-features]
Cannot use cudaLaunchKernel via FFI
- Using cudaLaunchKernel via FFI bindings in Rust isn't possible AFAIK because CUDA runtime expects a specific binary layout to find and launch compiled kernels. Rust's FFI mechanism doesn't natively conform to this layout, preventing the CUDA runtime from resolving and executing kernel functions properly.
- The only way to use the runtime api is by calling precompiled wrappers that use the CUDA <<<...>>> syntax, which what I did with testkernel.cu.

I only have bindings for cuda-12050 and 12020 so far. I want to gauge community interest in this implementation before investing further time.

Edit: the issue with cudaLaunchKernel has been fixed. Please read below

in their name

ya0guang · 2024-06-25T20:28:21Z

Hi Patrick, I'm interested in your proposal! I'm trying to add support for cuda 12.4 but I don't understand how to generate the source file like sys_12040.rs using the bindgen.sh script. I see a line here: CUDART_VERSION=$(cat tmp.rs | grep "CUDART_VERSION" | awk '{ print $6 }' | sed 's/.$//'). What is expected from tmp.rs?

Thanks for your effort for porting CUDA runtime!

ya0guang · 2024-06-25T20:50:59Z

Sorry there is some problem with my rustfmt. the bindgen script works on my side now with CUDA 12.4 and Ubuntu 22.04. All tests from the runtime pass. I'll dive deeper into it, thanks!

coreylowman · 2024-06-28T17:00:08Z

Cannot use cudaLaunchKernel via FFI

Yeah this was the main problem that I ran into as well. I think with rust we always would have to rely on the driver api to call kernels. This is why at this point I just chose to go with driver api.

Do we gain anything with adding runtime api?

At the very least we should document this shortcoming

ya0guang · 2024-06-28T17:10:08Z

My understanding is runtime API works at a higher level and is easier to deal with for developers. From CUDA developer guide:

The runtime API eases device code management by providing implicit initialization, context management, and module management. This leads to simpler code, but it also lacks the level of control that the driver API has.

I believe an alternative way to do cudaLaunchKernel is to launch kernel using cuLaunchKernel, as it cannot be directly implemented via this runtime API call. Another project for CUDA API remoting, cricket, adopted this way: https://github.com/RWTH-ACS/cricket/blob/ce8fdf7d4f4df696cf65c6d35926a76443d18f28/cpu/cpu-server-runtime.c#L875

ahem · 2024-07-06T07:41:24Z

If you are trying to gauge community interest, then I at least would be super interested in this. I am working on a project that provides Rust support for a large-ish in-house C/C++ project that is build exclusively with the Runtime API. Since many of the types are interchangeable with the driver API this is doable already, but would of course be much easier with direct bindings to the Runtime API!

We are still using CUDA 11, though...

psmyth94 · 2024-07-08T17:11:49Z

Hey @ahem, that's awesome to hear. As for using CUDA 11, I don't think the code will vastly differ. The only difference is that I don't think the primary context is initialized after cudaSetDevice() in CUDA 11. I think we can use a no-op call via cudaFree(0) to init the primary context, tho. I can test this once I set-up an env with CUDA 11.8.

psmyth94 · 2024-07-19T11:42:42Z

CUDA versions from 11.4.0 to 12.5.0 have been added. Did not encounter any issues for the tests. I also added a module in result called version that checks which versions of runtime api and driver api is currently being used (instead of using features to check).

Note though, while my tests on 11.x were using the bindings of 11.x for both driver and runtime api, my libcuda.so was still 12.5.0. This hasn't been properly tested on NVIDIA drivers using cuda-11.x.

therishidesai · 2024-09-05T23:10:54Z

I am very interested in the work here. I was trying to previously use cust but it doesn't support Pitched memory and unfortunately it doesn't look like the project is actively maintained.

I was looking at the diff here and I don't see support for pitched memory and page-locked memory. These are both going to be useful to have if more people are going to use this crate.

coreylowman · 2024-09-06T18:07:53Z

I think my biggest concern here is just the amount of shared code between driver & runtime that would need to be maintained together.

If we are maintaining the exact same api between driver::safe::CudaDevice and runtime::safe::CudaDevice, I'm wondering if there's a better way to do all of this without having copies of everything (Although it does make it less complex because we don't need extra abstractions).

I guess I'm wondering: for downstream crates, what are the differences when using driver vs runtime?

If there aren't really any differences, we could probably just stick to exposing the sys/result for runtime, but not expose a safe api for runtime at all.

Thoughts?

therishidesai · 2024-09-06T18:58:51Z

At least for memory I think cudarc should take some inspiration from the cust::memory module. It would be nice to have a safe rust way of managing all of the forms of CUDA memory.

I do agree with @coreylowman that the current work seems to be a lot like the unsafe API.

psmyth94 · 2024-09-07T20:39:21Z

I think my biggest concern here is just the amount of shared code between driver & runtime that would need to be maintained together.

If we are maintaining the exact same api between driver::safe::CudaDevice and runtime::safe::CudaDevice, I'm wondering if there's a better way to do all of this without having copies of everything (Although it does make it less complex because we don't need extra abstractions).

I guess I'm wondering: for downstream crates, what are the differences when using driver vs runtime?

If there aren't really any differences, we could probably just stick to exposing the sys/result for runtime, but not expose a safe api for runtime at all.

Thoughts?

Yeah, I agree there's a lot of code copying. I looked into a more integrated approach, but it would require a lot of abstraction through traits, like having CudaBlasLT accept CudaDevice from either runtime or driver. Personally, I'm not a fan of abstraction. My idea of having it separate, is that even if you update the driver API, it won't cause integration issues with runtime. Managing the abstraction would be more cumbersome IMO (in the short term at least).

From what I gather, users seem to prefer using this as a standalone module rather than integrating it with the rest.

My thoughts regarding whether to include a safe API or not, I think it’s more practical (and less time-consuming) to stick with the sys/result for now. We can always allow contributions toward a safe API later on.

At least for memory I think cudarc should take some inspiration from the cust::memory module. It would be nice to have a safe rust way of managing all of the forms of CUDA memory.

Yeah, I agree. While page-locked memory isn't exclusive to cudart, it does make things easier.

I do agree with @coreylowman that the current work seems to be a lot like the unsafe API.

The result module is a thin wrapper around sys, which is why it has that similarity.

psmyth94 · 2024-10-11T15:03:28Z

Had to step away from this for a bit. I removed the safe API and made changes to pass all the checks. Should be good to go now.

HenrikStenbyAndresen · 2024-11-09T10:15:05Z

I'm interested in this PR. Some of the runtime support for memory, streams etc. etc. would be really interesting in terms of QoL

Edit: I can't see any page-locked host memory allocation. Is that on purpose or an oversight?

psmyth94 added 22 commits June 7, 2024 09:44

add rust bindings cudart using bindgen

651df01

add cudart as a target for binding generation

3d68169

change the name to runtime

c61f530

change the name to runtime instead of cudart

366b0f6

rename module to runtime instead of cudart

82c22f4

convert driver error to runtime error

44136c6

removed lib since runtime does not need libloading

1c95532

added stream, device, and function

c3696cc

add sys_12020

46727b0

modify bindings

490547b

fix: update function to reflect the runtime api

3f115cf

add cuda_runtime.h in wrapper and bind objects with both CUDA and cuda

161722b

in their name

initial commit of the safe implementation of runtime api

f85ec2f

first final draft of cuda runtime api in cudarc

44717be

create test kernels

a6a2596

change tests to use statically compiled code

9016c92

use cudaFunction_t instead of raw pointer

0332cd7

fix: add missing sys

dc6f817

run cargo fix

886fea6

Merge remote-tracking branch 'origin' into cudart

899822d

remove get_function_by_symbol; no idea how it works

7fbbcc3

add missing import

f635a00

psmyth94 changed the title ~~Cudart~~ CUDA runtime api Jun 21, 2024

psmyth94 added 3 commits July 1, 2024 10:38

add ptx loading via driver api

731085c

replace device pointers as raw pointers and remove _driver_device

7d96aed

no longer need to compile test kernels

f63a2d3

Merge branch 'coreylowman:main' into cudart

9934fc1

psmyth94 marked this pull request as ready for review July 2, 2024 20:57

coreylowman mentioned this pull request Jul 16, 2024

cudarc fails to load libraries on official nvidia ubuntu images #274

Closed

psmyth94 added 9 commits July 18, 2024 12:41

add version management functions

cca6aa0

add cudaFree

0026837

remove unused DriverError

a5d1d17

redundant conversion

10a3c46

add cuda version 11.8

3aa26e8

Merge branch 'coreylowman:main' into cudart

dab3f60

lib retrieval handled internally

4951394

add all supported cuda versions

7fadea0

update sys_12020

b2f34c2

psmyth94 added 3 commits August 9, 2024 16:14

add 12060

951b94e

Merge branch 'main' into cudart

8925020

add runtime in the defaults

57e7faa

psmyth94 added 6 commits October 11, 2024 09:11

Merge branch 'main' of github.com:coreylowman/cudarc into cudart

2f34c2f

add 11040, 11050, 11060, and 12060 in results.rs and sys/mod.rs

5c97bff

remove the safe api

ffc2bfc

remove documentation about safe api and remove safe api from mod

c7342f0

cuda 12.0.0 doesn't support cudaDeviceSyncMemops

7bc94b9

add 12040

f045bf7

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

CUDA runtime api #262

CUDA runtime api #262

psmyth94 commented Jun 20, 2024 •

edited

Loading

ya0guang commented Jun 25, 2024

ya0guang commented Jun 25, 2024

coreylowman commented Jun 28, 2024

ya0guang commented Jun 28, 2024

ahem commented Jul 6, 2024

psmyth94 commented Jul 8, 2024 •

edited

Loading

psmyth94 commented Jul 19, 2024

therishidesai commented Sep 5, 2024

coreylowman commented Sep 6, 2024

therishidesai commented Sep 6, 2024

psmyth94 commented Sep 7, 2024

psmyth94 commented Oct 11, 2024

HenrikStenbyAndresen commented Nov 9, 2024 •

edited

Loading

CUDA runtime api #262

Are you sure you want to change the base?

CUDA runtime api #262

Conversation

psmyth94 commented Jun 20, 2024 • edited Loading

ya0guang commented Jun 25, 2024

ya0guang commented Jun 25, 2024

coreylowman commented Jun 28, 2024

ya0guang commented Jun 28, 2024

ahem commented Jul 6, 2024

psmyth94 commented Jul 8, 2024 • edited Loading

psmyth94 commented Jul 19, 2024

therishidesai commented Sep 5, 2024

coreylowman commented Sep 6, 2024

therishidesai commented Sep 6, 2024

psmyth94 commented Sep 7, 2024

psmyth94 commented Oct 11, 2024

HenrikStenbyAndresen commented Nov 9, 2024 • edited Loading

psmyth94 commented Jun 20, 2024 •

edited

Loading

psmyth94 commented Jul 8, 2024 •

edited

Loading

HenrikStenbyAndresen commented Nov 9, 2024 •

edited

Loading