University of Pennsylvania, CIS 565: GPU Programming and Architecture, Project 2
- Kyle Bauer
- Tested on: Windows 10, i7-12700 @ 2.1GHz, 32GB RAM, NVIDIA T1000 4GB (CETS Virtual Lab)
Implemented features:
- CPU Scan and Stream Compaction
- Naive Scan
- Work-Efficient Scan and Stream Compaction
- Thrust Scan Wrapper
The CPU, Naive, and Work-Efficient implementations all scaled similarly as the array size increased: doubling the array size roughly doubled each algorithm's runtime.
The CPU and Work-Efficient implementations performed nearly identically, with the Work-Efficient runtimes never deviating more than 3% from the CPU runtimes.
The Naive implementation's runtime diverged slightly from the CPU and Work-Efficient runtimes at around 2^21 elements. Below that size, Naive ran up to 6% faster than the CPU implementation (at 2^20 elements); above it, Naive ran at most 10% slower (at 2^24 elements).
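For context, the naive approach launches a kernel once per level, with every thread reading and writing global memory. A minimal sketch of one such level (kernel and parameter names are illustrative, not the code measured here):

```cuda
// One level of a naive (Hillis-Steele) scan. Each element at index k >= offset
// adds the value `offset` positions to its left; lower indices carry their
// value forward unchanged. Ping-pong buffers (in/out) avoid read/write races;
// the host launches this log2(n) times with offset = 1, 2, 4, ...
__global__ void kernNaiveScanLevel(int n, int offset, int* out, const int* in) {
    int k = blockIdx.x * blockDim.x + threadIdx.x;
    if (k >= n) return;
    out[k] = (k >= offset) ? in[k - offset] + in[k] : in[k];
}
```

Note that this produces an inclusive scan; a final shift converts it to the exclusive scan the tests check.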
The Thrust implementation is clearly the most performant option overall, pulling away from every other implementation as the array size increases.
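For reference, the Thrust path reduces to a single call to thrust::exclusive_scan; a minimal sketch of such a wrapper, assuming staging through thrust::device_vector (the measured code may stage memory differently):

```cuda
#include <thrust/device_vector.h>
#include <thrust/scan.h>
#include <thrust/copy.h>

// Thin wrapper around Thrust's exclusive scan: stage host data on the
// device, scan on the GPU, and copy the result back out.
void thrustScan(int n, int* odata, const int* idata) {
    thrust::device_vector<int> dev_in(idata, idata + n);  // host -> device
    thrust::device_vector<int> dev_out(n);
    thrust::exclusive_scan(dev_in.begin(), dev_in.end(), dev_out.begin());
    thrust::copy(dev_out.begin(), dev_out.end(), odata);  // device -> host
}
```

Thrust dispatches to a heavily tuned scan under the hood (CUB's single-pass implementation on recent toolkits), which likely accounts for the large gap over the hand-rolled kernels.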
Potential Bottlenecks:
- Global Memory: Both the Naive and Work-Efficient algorithms were implemented entirely in global memory with no shared memory, so every read and write pays full global-memory latency each time the kernels touch data.
- Memory Locality: Both the Naive and Work-Efficient algorithms read and write across very large arrays, and as the algorithms progress these accesses grow progressively sparser. The scattered access pattern causes cache thrashing and reduces memory-bus utilization.
- GPU Utilization: The Naive algorithm does not keep the GPU saturated: many threads exit early at each level, leaving only a few active threads per warp. This inherently reduces parallelism, and the cost grows with the array size (one common mitigation is sketched below).
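One common mitigation for partially active warps, shown here for the work-efficient up-sweep where the stride pattern makes the problem most acute, is to launch only as many threads as there are active tree nodes at a given level and remap each dense thread index to its sparse element, so warps stay full. A sketch under that assumption (names and launch configuration are illustrative):

```cuda
#include <cuda_runtime.h>

// One level of the work-efficient up-sweep. Instead of launching n threads
// and predicating most of them off (k % stride != 0), launch n / stride
// threads and map each dense thread index k to the sparse element it owns,
// so whole warps stay active together instead of idling most lanes.
__global__ void kernUpSweepLevel(int n, int stride, int* data) {
    int k = blockIdx.x * blockDim.x + threadIdx.x;  // dense index: 0 .. n/stride - 1
    int index = (k + 1) * stride - 1;               // sparse index into the scan tree
    if (index < n) {
        data[index] += data[index - stride / 2];
    }
}

// Host-side driver for the up-sweep phase (n assumed to be a power of two).
void upSweep(int n, int* dev_data, int blockSize) {
    for (int stride = 2; stride <= n; stride *= 2) {
        int threads = n / stride;                   // active nodes at this level
        int blocks = (threads + blockSize - 1) / blockSize;
        kernUpSweepLevel<<<blocks, blockSize>>>(n, stride, dev_data);
    }
}
```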
Test program output:

```
****************
** SCAN TESTS **
****************
[ 33 40 46 12 48 15 5 37 39 42 27 41 35 ... 10 0 ]
==== cpu scan, power-of-two ====
elapsed time: 27.4656ms (std::chrono Measured)
[ 0 33 73 119 131 179 194 199 236 275 317 344 385 ... 410928744 410928754 ]
==== cpu scan, non-power-of-two ====
elapsed time: 26.8243ms (std::chrono Measured)
[ 0 33 73 119 131 179 194 199 236 275 317 344 385 ... 410928700 410928722 ]
passed
==== naive scan, power-of-two ====
elapsed time: 31.6926ms (CUDA Measured)
passed
==== naive scan, non-power-of-two ====
elapsed time: 30.7692ms (CUDA Measured)
passed
==== work-efficient scan, power-of-two ====
elapsed time: 23.55ms (CUDA Measured)
passed
==== work-efficient scan, non-power-of-two ====
elapsed time: 23.0375ms (CUDA Measured)
passed
==== thrust scan, power-of-two ====
elapsed time: 1.71158ms (CUDA Measured)
passed
==== thrust scan, non-power-of-two ====
elapsed time: 1.14893ms (CUDA Measured)
passed
*****************************
** STREAM COMPACTION TESTS **
*****************************
[ 0 3 2 0 2 0 0 2 0 0 3 3 2 ... 1 0 ]
==== cpu compact without scan, power-of-two ====
elapsed time: 31.584ms (std::chrono Measured)
[ 3 2 2 2 3 3 2 2 1 1 2 3 1 ... 1 1 ]
passed
==== cpu compact without scan, non-power-of-two ====
elapsed time: 35.5074ms (std::chrono Measured)
[ 3 2 2 2 3 3 2 2 1 1 2 3 1 ... 2 2 ]
passed
==== cpu compact with scan, power-of-two ====
elapsed time: 74.7157ms (std::chrono Measured)
[ 3 2 2 2 3 3 2 2 1 1 2 3 1 ... 1 1 ]
passed
==== cpu compact with scan, non-power-of-two ====
elapsed time: 73.4743ms (std::chrono Measured)
[ 3 2 2 2 3 3 2 2 1 1 2 3 1 ... 2 2 ]
passed
==== work-efficient compact, power-of-two ====
elapsed time: 33.6798ms (CUDA Measured)
passed
==== work-efficient compact, non-power-of-two ====
elapsed time: 24.5682ms (CUDA Measured)
passed
```