-
Hey @m-schuetz, apologies I never followed up on this when I intended to. The problem you're running into here is that the device-wide CUB APIs (i.e., `cub::DeviceRadixSort` and friends) are host APIs. NVRTC can only compile device code (a `__global__` or `__device__` function), while the device-wide algorithms are ordinary host functions that launch kernels under the hood. For example, if you had a kernel like:
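(The original code block here did not survive; a minimal stand-in kernel, purely illustrative:)

```cuda
// A plain __global__ function: pure device code.
__global__ void scale(float* data, float factor, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] *= factor;
}
```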
NVRTC can compile this just fine. However, imagine you had a host function that launches this kernel:
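(Again, the original snippet was lost; a hypothetical host launcher for the kernel above:)

```cuda
// Host-side launcher: the <<<...>>> launch syntax and the runtime-API call
// below are host constructs, which is exactly what NVRTC cannot handle.
void scale_on_device(float* d_data, float factor, int n)
{
    int block = 256;
    int grid  = (n + block - 1) / block;
    scale<<<grid, block>>>(d_data, factor, n);
    cudaDeviceSynchronize();
}
```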
NVRTC cannot compile this host function, because a kernel launch is host code. The device-wide CUB algorithms (e.g., `cub::DeviceRadixSort::SortKeys`) are exactly this kind of host function: they configure and launch kernels internally. As a result, NVRTC cannot compile the headers that provide the device-wide algorithms.
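To make the distinction concrete, here is a sketch of how the device-wide radix sort is typically invoked, entirely from host code, using CUB's standard two-phase temp-storage pattern. None of this can go through NVRTC:

```cuda
#include <cub/device/device_radix_sort.cuh>

// Host code: query temp storage, allocate, then sort.
void sort_keys(const int* d_keys_in, int* d_keys_out, int num_items)
{
    void*  d_temp     = nullptr;
    size_t temp_bytes = 0;
    // First call only computes the required temporary storage size.
    cub::DeviceRadixSort::SortKeys(d_temp, temp_bytes,
                                   d_keys_in, d_keys_out, num_items);
    cudaMalloc(&d_temp, temp_bytes);
    // Second call performs the sort (launches kernels internally).
    cub::DeviceRadixSort::SortKeys(d_temp, temp_bytes,
                                   d_keys_in, d_keys_out, num_items);
    cudaFree(d_temp);
}
```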
(edit: disregard the stuff about CUDA Dynamic Parallelism. This doesn't actually work today.)
-
Hi, thanks for the response! Yeah, I see now that a device-wide algorithm meant for compile-time CUDA isn't easily ported to runtime CUDA. The scenario I was going for is sorting large data sets within a persistent-threads megakernel, fully on device, without involving the host. I've had some great experiences doing that with a custom counting sort in the past, so I guess now it's time for me to attempt implementing a radix sort in a persistent-threads fashion! :)
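(For readers unfamiliar with the counting-sort approach mentioned above, a fully device-side version can be sketched roughly as below. This is a hypothetical single-block variant for brevity; a real persistent-threads megakernel would use cooperative groups and grid-wide synchronization, and a parallel scan instead of the single-threaded prefix sum.)

```cuda
// Hypothetical single-block counting sort for small keys in [0, num_bins).
// All phases run on device; no host involvement after launch.
// Note: the atomic scatter makes this unstable, which is fine for keys-only.
__global__ void counting_sort(const int* keys_in, int* keys_out,
                              int n, int* histogram, int num_bins)
{
    int tid = threadIdx.x;

    // Phase 1: histogram of key occurrences.
    for (int b = tid; b < num_bins; b += blockDim.x)
        histogram[b] = 0;
    __syncthreads();
    for (int i = tid; i < n; i += blockDim.x)
        atomicAdd(&histogram[keys_in[i]], 1);
    __syncthreads();

    // Phase 2: exclusive prefix sum over the histogram
    // (single thread for clarity).
    if (tid == 0) {
        int sum = 0;
        for (int b = 0; b < num_bins; ++b) {
            int count    = histogram[b];
            histogram[b] = sum;
            sum += count;
        }
    }
    __syncthreads();

    // Phase 3: scatter each key to its sorted position.
    for (int i = tid; i < n; i += blockDim.x)
        keys_out[atomicAdd(&histogram[keys_in[i]], 1)] = keys_in[i];
}
```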
-
Hi, I've been trying to add CUB to my nvrtc-and-cooperative-group-based CUDA project to do some device-wide sorting. It seems like there is some very recent development in actually enabling CUB in combination with nvrtc, which is awesome!
I've been loosely following this example to make it work, and I was able to compile with `#include <cub/warp/warp_reduce.cuh>`, but things fell apart when trying to `#include <cub/device/device_radix_sort.cuh>`. There were some errors about stdio and other includes that could not be found, and after adding more include paths I ultimately ended up with one final error I couldn't get past. Are the device-wide sort algorithms not ready for NVRTC yet, or am I doing something wrong?
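(For context, the kind of NVRTC setup involved looks roughly like this; the include paths are placeholders for wherever CUB and its libcu++ dependency live on disk:)

```cuda
#include <nvrtc.h>
#include <cstdio>

int main()
{
    const char* src =
        "#include <cub/warp/warp_reduce.cuh>\n"
        "__global__ void k() {}\n";

    nvrtcProgram prog;
    nvrtcCreateProgram(&prog, src, "test.cu", 0, nullptr, nullptr);

    const char* opts[] = {
        "--include-path=/path/to/cub",        // placeholder
        "--include-path=/path/to/libcudacxx", // placeholder
        "--std=c++17",
    };
    nvrtcResult res = nvrtcCompileProgram(prog, 3, opts);

    // Print the compile log, which is where the missing-header errors show up.
    size_t log_size = 0;
    nvrtcGetProgramLogSize(prog, &log_size);
    if (log_size > 1) {
        char* log = new char[log_size];
        nvrtcGetProgramLog(prog, log);
        printf("%s\n", log);
        delete[] log;
    }
    nvrtcDestroyProgram(&prog);
    return res == NVRTC_SUCCESS ? 0 : 1;
}
```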
Relevant discussions: