Project 2: Zhihao Ruan #2

Open · wants to merge 30 commits into base: main

Commits (30)
- 17ccf92 (Sep 15, 2021): add clang-format & ignore vscode folder
- 466bb30 (Sep 15, 2021): add _WIN32 guard around `system("pause");`
- be9077e (Sep 16, 2021): finish CPU scan & stream compaction
- 7c58e96 (Sep 18, 2021): WIP on naive parallel scan implementation
- 39ffd98 (Sep 18, 2021): add __host__ and __device__ specifier to ilog2ceil
- b553536 (Sep 18, 2021): finish block-wise parallel scan kernel function
- 75e6900 (Sep 18, 2021): enable printing of naive scan
- 43cf87f (Sep 18, 2021): define global block_size; add cudaDeviceSync()
- d0d3de0 (Sep 19, 2021): add handling of arbitrary length input on naive scan
- 3ba2ba3 (Sep 19, 2021): add docs to arbitrary length array handling progress
- c56536f (Sep 19, 2021): finish initial version of work-efficient scan
- 18326dd (Sep 19, 2021): increase block size to 1024
- 6b34b00 (Sep 19, 2021): finish first version of work-efficient stream compaction
- 5069185 (Sep 19, 2021): free all CUDA data arrays
- d3372e5 (Sep 19, 2021): finish recursive block scan in Naive scan
- d7205a0 (Sep 19, 2021): rewrite work-efficient scan as recursive scan on blocks
- 941e9f6 (Sep 19, 2021): finish recursive block scan in work-efficient stream compaction
- dbd6cf3 (Sep 19, 2021): implement thrust scan
- 64cddb6 (Sep 19, 2021): add sample log
- 5294179 (Sep 20, 2021): add include "device_launch_parameters.h"
- 0488b13 (Sep 21, 2021): Finish introduction of README
- 8befcb2 (Sep 21, 2021): Add project highlights
- a7d8e8d (Sep 21, 2021): add sample output in readme
- cc22684 (Sep 21, 2021): fix sample output array size error
- 1617aa9 (Sep 21, 2021): Update README
- 7a23d9b (Sep 21, 2021): add separate block size for naive & efficient
- 7a8964e (Sep 21, 2021): calibrated optimal block size for both modes
- 84d3ee6 (Sep 21, 2021): add profiling result & data
- 8f5d566 (Sep 21, 2021): relabel y axis with correct unit
- 1419db4 (Sep 21, 2021): add performance analysis in README
5 changes: 4 additions & 1 deletion .gitignore
@@ -25,7 +25,8 @@ build
.LSOverride

# Icon must end with two \r
Icon

# Thumbnails
._*

@@ -560,3 +561,5 @@ xcuserdata
*.xccheckout
*.moved-aside
*.xcuserstate

.vscode
143 changes: 137 additions & 6 deletions README.md
@@ -3,12 +3,143 @@ CUDA Stream Compaction

**University of Pennsylvania, CIS 565: GPU Programming and Architecture, Project 2**

* Zhihao Ruan ([email protected])
* [LinkedIn](https://www.linkedin.com/in/zhihao-ruan-29b29a13a/), [personal website](https://zhihaoruan.xyz/)
* Tested on: Ubuntu 20.04 LTS, Ryzen 3700X @ 2.22GHz 48GB, RTX 2060 Super @ 7976MB

## Highlights
This project implements:
- a naive parallel scan algorithm compatible with arbitrary sized input arrays;
- a work-efficient parallel scan algorithm compatible with arbitrary sized input arrays;
- a stream compaction algorithm built upon the work-efficient parallel scan compatible with arbitrary sized input arrays.

The GPU stream compaction algorithm is demonstrated to be over 4x faster than the CPU version.

A sample of the test output with `block_size` = 1024 and `array_size` = 2^27 **(the largest array that fits on the local GPU)**:
```
****************
** SCAN TESTS **
****************
[ 5 33 25 22 48 26 23 19 36 32 2 17 45 ... 22 0 ]
==== cpu scan, power-of-two ====
elapsed time: 79.9027ms (std::chrono Measured)
[ 0 5 38 63 85 133 159 182 201 237 269 271 288 ... -1006515866 -1006515844 ]
==== cpu scan, non-power-of-two ====
elapsed time: 81.4093ms (std::chrono Measured)
[ 0 5 38 63 85 133 159 182 201 237 269 271 288 ... -1006515949 -1006515918 ]
passed
==== naive scan, power-of-two ====
elapsed time: 31.3315ms (CUDA Measured)
[ 0 5 38 63 85 133 159 182 201 237 269 271 288 ... -1006515866 -1006515844 ]
passed
==== naive scan, non-power-of-two ====
elapsed time: 24.8398ms (CUDA Measured)
[ 0 5 38 63 85 133 159 182 201 237 269 271 288 ... 0 0 ]
passed
==== work-efficient scan, power-of-two ====
elapsed time: 37.6307ms (CUDA Measured)
[ 0 5 38 63 85 133 159 182 201 237 269 271 288 ... -1006515866 -1006515844 ]
passed
==== work-efficient scan, non-power-of-two ====
elapsed time: 37.6407ms (CUDA Measured)
[ 0 5 38 63 85 133 159 182 201 237 269 271 288 ... -1006515949 -1006515918 ]
passed
==== thrust scan, power-of-two ====
elapsed time: 3.16525ms (CUDA Measured)
[ 0 5 38 63 85 133 159 182 201 237 269 271 288 ... -1006515866 -1006515844 ]
passed
==== thrust scan, non-power-of-two ====
elapsed time: 3.12653ms (CUDA Measured)
[ 0 5 38 63 85 133 159 182 201 237 269 271 288 ... -1006515949 -1006515918 ]
passed

*****************************
** STREAM COMPACTION TESTS **
*****************************
[ 0 3 1 1 1 3 0 1 2 1 1 1 2 ... 3 0 ]
==== cpu compact without scan, power-of-two ====
elapsed time: 250.09ms (std::chrono Measured)
[ 3 1 1 1 3 1 2 1 1 1 2 2 2 ... 3 3 ]
passed
==== cpu compact without scan, non-power-of-two ====
elapsed time: 247.095ms (std::chrono Measured)
[ 3 1 1 1 3 1 2 1 1 1 2 2 2 ... 2 3 ]
passed
==== cpu compact with scan ====
elapsed time: 886.643ms (std::chrono Measured)
[ 3 1 1 1 3 1 2 1 1 1 2 2 2 ... 3 3 ]
passed
==== work-efficient compact, power-of-two ====
elapsed time: 58.4025ms (CUDA Measured)
[ 3 1 1 1 3 1 2 1 1 1 2 2 2 ... 3 3 ]
passed
==== work-efficient compact, non-power-of-two ====
elapsed time: 48.5331ms (CUDA Measured)
[ 3 1 1 1 3 1 2 1 1 1 2 2 2 ... 2 3 ]
passed
```

## Introduction: Stream Compaction
Stream compaction is a technique for removing from a list (a.k.a. stream) all elements that fail to satisfy a given criterion. For example, if we have a stream of integers `[1 2 3 2 1 5 23 4 0 0 3 4 2 0 3 8 0]` and we wish to remove "all elements that are 0" (the *criterion*), we are left with the compact list `[1 2 3 2 1 5 23 4 3 4 2 3 8]`.
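As a point of reference, here is a minimal serial sketch of the idea; the function name and the zero-removal criterion are illustrative, not this project's API:

```
#include <vector>

// Serial compaction sketch: keep only the elements that pass the test.
std::vector<int> compact(const std::vector<int>& in) {
  std::vector<int> out;
  for (int x : in) {
    if (x != 0) out.push_back(x);  // criterion: remove all zeros
  }
  return out;
}

// compact({1,2,3,2,1,5,23,4,0,0,3,4,2,0,3,8,0})
//   == {1,2,3,2,1,5,23,4,3,4,2,3,8}
```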

Stream compaction is widely used in rendering and ray tracing. Although it seems straightforward to implement at first glance, implementing it efficiently on the GPU with parallel algorithms is non-trivial. This project discusses a method for parallel stream compaction and its essential building block, the parallel scan algorithm.

**For a more detailed description of the project, please refer to the [project instructions](INSTRUCTION.md).**

## Parallel Scan
Parallel scan, a.k.a. parallel prefix sum, is the task of generating a list in which each element is the sum of all elements that come before it in the input. There are two variants: *exclusive* scan and *inclusive* scan. The exclusive scan inserts a 0 at the beginning of the output and discards the total sum at the end of the list, while the inclusive scan keeps the total sum at the end and does not introduce a leading 0.

![](img/scan_inclusive_exclusive.png)
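A minimal serial sketch of the two flavors (the function names are illustrative, not this project's API):

```
#include <cstdio>

// Inclusive: out[i] = in[0] + ... + in[i]
void inclusiveScan(const int* in, int* out, int n) {
  int sum = 0;
  for (int i = 0; i < n; ++i) {
    sum += in[i];
    out[i] = sum;
  }
}

// Exclusive: out[i] = in[0] + ... + in[i-1], with out[0] = 0 (the identity)
void exclusiveScan(const int* in, int* out, int n) {
  int sum = 0;
  for (int i = 0; i < n; ++i) {
    out[i] = sum;  // sum of everything strictly before i
    sum += in[i];
  }
}

int main() {
  const int in[5] = {3, 1, 7, 0, 4};
  int inc[5], exc[5];
  inclusiveScan(in, inc, 5);  // -> 3 4 11 11 15
  exclusiveScan(in, exc, 5);  // -> 0 3  4 11 11
  for (int i = 0; i < 5; ++i) printf("%d %d\n", inc[i], exc[i]);
  return 0;
}
```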

### Naive Parallel Scan
A naive parallel scan algorithm is illustrated below. In each iteration, a subset of the threads each add a pair of elements, and the final result is produced after log2(n) iterations.

![](img/naive_scan.png)
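The core of one iteration can be sketched as a global-memory kernel with double buffering. This is a simplified sketch, not this project's exact code, which performs the scan block-wise and recursively to support arbitrary-length inputs:

```
// One naive (Hillis–Steele) scan step; kernel name and buffering are
// illustrative assumptions, not this repository's exact implementation.
__global__ void kernNaiveScanStep(int n, int offset, const int* in, int* out) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i >= n) return;
  if (i >= offset) {
    out[i] = in[i - offset] + in[i];  // add the element `offset` slots back
  } else {
    out[i] = in[i];                   // nothing in range yet; copy through
  }
}

// Host side: ping-pong two device buffers, doubling the offset each pass.
// This yields an inclusive scan; shifting right by one and inserting a
// leading 0 converts it to the exclusive scan used by compaction.
// for (int offset = 1; offset < n; offset <<= 1) {
//   kernNaiveScanStep<<<blocks, blockSize>>>(n, offset, dev_in, dev_out);
//   std::swap(dev_in, dev_out);
// }
```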

### Work Efficient Parallel Scan
There is also a much more efficient version of parallel scan, which involves 1) a list reduction and 2) a down-sweep. The list reduction, also called the "up-sweep", produces the total sum of the list together with a tree of partial sums. The down-sweep then propagates those partial sums back down the tree, filling in the missing contributions for the middle elements and completing the scan.

![](img/upsweep.png)

![](img/downsweep.png)
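A single-block shared-memory sketch of the two phases, assuming the array fits in one block of `n` threads with `n` a power of two. The interleaved `(k + 1) % stride` test mirrors the divergence issue discussed in the performance analysis below; this project's real kernels handle arbitrary lengths by scanning blocks recursively, so this is illustrative only:

```
// Work-efficient (Blelloch) exclusive scan of one block, in shared memory.
// Launch as: kernBlockScan<<<1, n, n * sizeof(int)>>>(n, dev_data);
__global__ void kernBlockScan(int n, int* data) {
  extern __shared__ int s[];
  int k = threadIdx.x;
  s[k] = data[k];
  __syncthreads();

  // Up-sweep: build partial sums in place.
  for (int stride = 2; stride <= n; stride <<= 1) {
    if ((k + 1) % stride == 0) {
      s[k] += s[k - stride / 2];
    }
    __syncthreads();
  }

  // Down-sweep: zero the root, then push sums back down the tree.
  if (k == n - 1) s[k] = 0;
  __syncthreads();
  for (int stride = n; stride >= 2; stride >>= 1) {
    if ((k + 1) % stride == 0) {
      int left = s[k - stride / 2];
      s[k - stride / 2] = s[k];  // pass the parent's value to the left child
      s[k] += left;              // right child = parent + left subtree sum
    }
    __syncthreads();
  }
  data[k] = s[k];  // write back the exclusive scan of this block
}
```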

## Parallel Stream Compaction
After solving the parallel scan problem, we can now turn to the actual algorithm for parallel stream compaction. An effective stream compaction procedure consists of the following steps (sketched in code after the diagram below):
1. Generate a boolean array marking the validity of each element: "0" for elements to remove, "1" otherwise.
2. Compute an exclusive parallel scan of the boolean array.
3. Scatter the surviving elements into the output array: for each element marked "1" in the boolean array, write it to the output at the index given by the scan result.

![](img/stream_compaction.png)
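Hedged sketches of the mapping and scatter kernels, with "keep non-zero elements" as the criterion; the names follow common CIS 565 starter-code convention and may not match this repository exactly:

```
// Step 1: build the validity mask.
__global__ void kernMapToBoolean(int n, int* bools, const int* idata) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) bools[i] = (idata[i] != 0) ? 1 : 0;
}

// Step 2 (not shown): exclusive scan over bools[] produces indices[].

// Step 3: the scan result gives each kept element its output slot.
__global__ void kernScatter(int n, int* odata, const int* idata,
                            const int* bools, const int* indices) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n && bools[i]) {
    odata[indices[i]] = idata[i];
  }
}

// The compacted length is indices[n - 1] + bools[n - 1].
```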

## Performance Analysis
**All tests are conducted on a local desktop with random input arrays seeded by `srand(0)`.**

I found the optimal block size to be roughly 256 for the naive scan and 128 for the work-efficient scan. With these values fixed, I ran the program against inputs of multiple sizes to evaluate performance.

After careful evaluation, the current performance bottlenecks most likely lie in:
1. Warp divergence and `__syncthreads()`. Both the naive scan and the work-efficient scan utilize threads in an interleaved pattern, which produces a large amount of warp divergence.
2. Uncoalesced global memory accesses. These arise for the same reason as (1): global memory is accessed in an interleaved fashion.

Further improvements to the kernel functions include re-indexing the active threads to minimize warp divergence (see the sketch below), as well as breaking the work-efficient scan kernel into two smaller kernels (up-sweep and down-sweep) to eliminate the cost of `__syncthreads()` and warp divergence.
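As an illustration of the re-indexing idea, here is a sketch of an up-sweep level launched with one thread per *active* node; this is an assumed improvement, not code from this repository:

```
// Launch only as many threads as there are active nodes at this level and
// map each thread densely onto its node, so whole warps retire instead of
// idling on a failed (k % stride) test. Names are illustrative.
__global__ void kernUpSweepCompact(int n, int stride, int* data) {
  int t = blockIdx.x * blockDim.x + threadIdx.x;  // dense thread index
  int k = t * stride + stride - 1;                // node handled by thread t
  if (k < n) {
    data[k] += data[k - stride / 2];
  }
}

// Host side: one launch per level with ceil((n / stride) / blockSize)
// blocks, so the grid shrinks as the reduction tree narrows.
```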

### Parallel Scan, Array Size Power-of-Two
In this diagram we can see that for large inputs, the CPU scan takes the most time, while the naive and work-efficient scans perform similarly. When the data size is small, all four methods take roughly the same amount of time. Thrust outperforms the other three methods on large inputs.

![](profiling/img/Figure_1.png)

### Parallel Scan, Array Size Non-Power-of-Two
In this diagram the four methods show roughly the same behavior as with [array sizes that are a power of two](#parallel-scan-array-size-power-of-two).

![](profiling/img/Figure_2.png)

### Stream Compaction, Array Size Power-of-Two
We can see that when the data size is small, CPU compaction performs roughly the same as work-efficient GPU compaction. However, as the data size increases, GPU compaction outperforms the CPU version.

![](profiling/img/Figure_3.png)

### Stream Compaction, Array Size Non-Power-of-Two
This diagram shows behavior similar to the [power-of-two case](#stream-compaction-array-size-power-of-two).

![](profiling/img/Figure_4.png)


9 changes: 9 additions & 0 deletions cmake/.clang-format
@@ -0,0 +1,9 @@
---
BasedOnStyle: Google
---
Language: Cpp
AccessModifierOffset: -2
AlignConsecutiveAssignments: true
AlignConsecutiveMacros: true
---

91 changes: 46 additions & 45 deletions cmake/cuda_compute_capability.cpp
@@ -1,58 +1,59 @@
/*
 * Copyright (C) 2011 Florian Rathgeber, [email protected]
 *
 * This code is licensed under the MIT License. See the FindCUDA.cmake script
 * for the text of the license.
 *
 * Based on code by Christopher Bruns published on Stack Overflow (CC-BY):
 * http://stackoverflow.com/questions/2285185
 */

#include <cuda_runtime.h>
#include <stdio.h>

#include <iterator>
#include <set>

int main() {
  int deviceCount;
  int gpuDeviceCount = 0;
  struct cudaDeviceProp properties;

  if (cudaGetDeviceCount(&deviceCount) != cudaSuccess) {
    printf("Couldn't get device count: %s\n",
           cudaGetErrorString(cudaGetLastError()));
    return 1;
  }

  std::set<int> computes;
  typedef std::set<int>::iterator iter;

  // machines with no GPUs can still report one emulation device
  for (int device = 0; device < deviceCount; ++device) {
    int major = 9999, minor = 9999;
    cudaGetDeviceProperties(&properties, device);
    if (properties.major != 9999) {  // 9999 means emulation only
      ++gpuDeviceCount;
      major = properties.major;
      minor = properties.minor;
      if ((major == 2 && minor == 1)) {
        // There is no --arch compute_21 flag for nvcc, so force minor to 0
        minor = 0;
      }
      computes.insert(10 * major + minor);
    }
  }
  int i = 0;
  for (iter it = computes.begin(); it != computes.end(); it++, i++) {
    if (i > 0) {
      printf(" ");
    }
    printf("%d", *it);
  }
  /* don't just return the number of gpus, because other runtime cuda
     errors can also yield non-zero return values */
  if (gpuDeviceCount <= 0 || computes.size() <= 0) {
    return 1;  // failure
  }
  return 0;  // success
}
Binary file added img/downsweep.png
Binary file added img/naive_scan.png
Binary file added img/scan_inclusive_exclusive.png
Binary file added img/stream_compaction.png
Binary file added img/upsweep.png