CUB device sort with NVRTC #1590

m-schuetz · 2024-04-04T16:48:11Z

m-schuetz
Apr 4, 2024

Hi, I've been trying to add CUB to my nvrtc-and-cooperative-group-based CUDA project to do some device-wide sorting. It seems like there is some very recent development in actually enabling CUB in combination with nvrtc, which is awesome!

I've been loosely following this example to make it work and I was able to compile with #include <cub/warp/warp_reduce.cuh>, but things fell apart when trying to include #include <cub/device/device_radix_sort.cuh>. There were some errors with stdio or other includes that could not be found, and after adding some more include paths I ultimately ended up with following error:

[...]/libs/cccl-main/cub/cub/detail/choose_offset.cuh(64): error: namespace "std" has no member "uint32_t"
    using type = typename ::cuda::std::conditional<sizeof(NumItemsT) <= 4, std::uint32_t, unsigned long long>::type;
                                                                                ^

[...]/libs/cccl-main/cub/cub/detail/choose_offset.cuh(87): error: namespace "std" has no member "int32_t"
    using type = typename ::cuda::std::conditional<sizeof(NumItemsT) < 4, std::int32_t, NumItemsT>::type;
                                                                               ^

[...]/libs/cccl-main/libcudacxx/include/cuda/std/detail/libcxx/include/stdint.h(129): catastrophic error: cannot open source file "stdint.h"
  #include_next <stdint.h>

Are the device-wide sort algorithms not ready for nvrtc yet, or am I doing something wrong?

nvrtcCompileProgram arguments:

--gpu-architecture=compute_89
--use_fast_math
--extra-device-vectorization
-lineinfo
-I D:/dev/workspaces/CudaPlayground/rasterizer/libs/cccl-main/cub
-I D:/dev/workspaces/CudaPlayground/rasterizer/libs/cccl-main/libcudacxx/include
-I D:/dev/workspaces/CudaPlayground/rasterizer/libs/cccl-main/libcudacxx/include/cuda/std/detail/libcxx/include/
-I D:/dev/workspaces/CudaPlayground/rasterizer/libs/cccl-main/libcudacxx/include/cuda/std
-I D:/dev/workspaces/CudaPlayground/rasterizer/libs/cccl-main/thrust
-I C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.4/include
--relocatable-device-code=true
-default-device
-dlto
--std=c++17

Relevant discussions:

jrhemstad · 2024-04-18T16:51:14Z

jrhemstad
Apr 18, 2024
Maintainer

Hey @m-schuetz, apologies I never followed up on this when I intended to.

So the problem you're running into here is that the device-wide CUB APIs (i.e., cub::Device*) don't work with NVRTC and this is expected.

The reason for this is because NVRTC can only compile device code (a __global__ or __device__ function). It cannot compile host code.

For example, if you had a kernel like:

__global__ void kernel(...){
...
}

NVRTC can compile this just fine. However, imagine you had a host function that launches this kernel:

void host_function(...){
   do_stuff();
   kernel<<<....>>>(....);
   do_other_stuff();
}

NVRTC cannot compile host_function() because it contains host code.

The cub::Device* APIs are like host_function() here. They contain host code that do host-side business logic of allocation temporary memory, orchestrating launch one or more kernels, etc.

As a result, NVRTC cannot compile the cub::Device* APIs in the way that you'd like.

There is one potential trick/hack that technically would allow you to compile a cub::Device* API with NVRTC. Most cub::Device* APIs support being called from device code and use CUDA Dynamic Parallelism to launch child kernels from within a parent kernel. So you could do:

__global__ void kernel(){
   cub::DeviceSort(...); // this launches a child kernel
}

Now, because kernel is a __global__ function, NVRTC could compile this. I know it might seem like a contrived limitation, but it results from how NVRTC works and its out of CUB's control to do anything about it.

(edit: disregard the stuff about CUDA Dynamic Parallelism. This doesn't actually work today.)

0 replies

m-schuetz · 2024-04-18T17:54:09Z

m-schuetz
Apr 18, 2024
Author

Hi, thanks for the response! Yeah, I see now that a device-wide algorithm meant for compile-time CUDA isn't easily ported to runtime CUDA. The scenario I was going for is being able to sort large data sets within a persistent-threads-megakernel fully on device without involving the host. Had some great experiences doing that with a custom counting sort in the past, so I guess now it's time for me to attempt implementing a radix sort in a persistent-thread fashion! :)

1 reply

jrhemstad Apr 18, 2024
Maintainer

It's conceivable that some point in the future we could offer device-wide CUB algorithms that are cooperatively invoked from within a kernel by all threads in a grid (assuming a cooperative kernel launch). Unfortunately that's not on the radar for now 🙂.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

CUB device sort with NVRTC #1590

{{title}}

{{editor}}'s edit

{{editor}}'s edit

Replies: 2 comments 1 reply

{{title}}

{{editor}}'s edit

{{editor}}'s edit

{{title}}

{{title}}

Select a reply

CUB device sort with NVRTC #1590

m-schuetz Apr 4, 2024

Replies: 2 comments · 1 reply

jrhemstad Apr 18, 2024 Maintainer

m-schuetz Apr 18, 2024 Author

jrhemstad Apr 18, 2024 Maintainer

m-schuetz
Apr 4, 2024

Replies: 2 comments 1 reply

jrhemstad
Apr 18, 2024
Maintainer

m-schuetz
Apr 18, 2024
Author

jrhemstad Apr 18, 2024
Maintainer