CUDA implementation of parallel radix sort using Blelloch scan
- Implementation of 4-way radix sort as described in this paper by Ha, Krüger, and Silva
- 2 bits per pass, resulting in 4-way split each pass
- No order checking at every pass yet
- Each block's internal scans now use Hillis-Steele instead of Blelloch, since the internal scan's input size is roughly the same size as the number of threads per block. In this case, Hillis-Steele's larger work complexity than Blelloch's is worth having for Hillis-Steele halving the span of Blelloch's.
- Each block sorts its own local portion of the global array for greater memory coalescing during global shuffles
- Prefix summing the global block sums uses the large-scale bank-conflict free Blelloch scan, which in turn uses the padded addressing solution for bank conflicts, described in this presentation by Mark Harris
- For randomly ordered 134 million unsigned ints, this outperforms
std::sort()
by about 9.84x - For descendingly ordered 134 million unsigned ints, this outperforms
std::sort()
by about 1.30x - The results above were observed using a p2.xlarge AWS instance running the NVIDIA CUDA Toolkit 7.5 AMI. The instance is equipped with 12 EC2 Compute Units (4 virtual cores), plus 1 NVIDIA K80 (GK210) GPU.