-
I read an article that describes three different methods for computing the prefix sum of an array: (1) using global memory, (2) using shared memory, and (3) using thrust::inclusive_scan. Each method was run with 20 similar tests.
Replies: 1 comment 6 replies
-
It would help to share the article so we know what you mean. Assuming there is enough data, I'm sure that CUB's implementation uses both: shared memory for the local scans within each block, and global memory for communication between blocks, so that each block can update its local scan with the final term from the previous blocks. As far as I know, the paper "Single-pass Parallel Prefix Scan with Decoupled Look-back" describes what CUB is doing. Could you also provide more information on that slow test case? How many elements are scanned? What GPU are you using?
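To make the "local scan plus carry from previous blocks" idea concrete, here is a minimal sequential C++ sketch that runs on the CPU. It is not CUB's implementation and it does not model decoupled look-back or any parallelism; the function name `tiled_inclusive_scan` and the `tile_size` parameter are hypothetical, chosen only for illustration. Each tile stands in for the items one thread block would process, and the `carry` variable plays the role of the value that blocks would exchange through global memory.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: sequential sketch of a tiled inclusive prefix sum.
// On a GPU, each tile would be handled by one thread block.
std::vector<int> tiled_inclusive_scan(const std::vector<int>& in,
                                      std::size_t tile_size) {
    assert(tile_size > 0);
    std::vector<int> out(in.size());
    int carry = 0;  // inclusive prefix of all previous tiles

    for (std::size_t begin = 0; begin < in.size(); begin += tile_size) {
        const std::size_t end = std::min(begin + tile_size, in.size());

        // Step 1: local inclusive scan of this tile
        // (on a GPU this part would typically live in shared memory).
        int running = 0;
        for (std::size_t i = begin; i < end; ++i) {
            running += in[i];
            out[i] = running;
        }

        // Step 2: update the local scan with the total of the previous tiles
        // (on a GPU this value would be read from global memory).
        for (std::size_t i = begin; i < end; ++i) {
            out[i] += carry;
        }

        // Step 3: publish this tile's inclusive prefix for the next tile.
        carry += running;
    }
    return out;
}
```

Calling `tiled_inclusive_scan(data, 3)` on a `std::vector<int>` gives the same result as a plain running sum; the tiling only changes how the work is partitioned, which is what lets separate blocks work on separate tiles in the GPU version.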
@pauleonix is exactly right. CUB uses a single-pass prefix scan to minimize the memory traffic incurred. Global memory serves to communicate the partial and inclusive prefix scan results of each tile (by "a tile" I mean the items that one thread block processes). To compute the results within one tile, you can assume that the CUB implementation is at least as sophisticated as the shared memory variant you referred to above.
In general, I would strongly advise using the CUB algorithms. A lot of thought went into the design of these algorithms and you don't have to ma…