Project 2: Zhihao Ruan #2

Open · wants to merge 30 commits into base: main

Commits (30)
- 17ccf92 (Sep 15, 2021): add clang-format & ignore vscode folder
- 466bb30 (Sep 15, 2021): add _WIN32 guard around `system("pause");`
- be9077e (Sep 16, 2021): finish CPU scan & stream compaction
- 7c58e96 (Sep 18, 2021): WIP on naive parallel scan implementation
- 39ffd98 (Sep 18, 2021): add __host__ and __device__ specifier to ilog2ceil
- b553536 (Sep 18, 2021): finish block-wise parallel scan kernel function
- 75e6900 (Sep 18, 2021): enable printing of naive scan
- 43cf87f (Sep 18, 2021): define global block_size; add cudaDeviceSync()
- d0d3de0 (Sep 19, 2021): add handling of arbitrary length input on naive scan
- 3ba2ba3 (Sep 19, 2021): add docs to arbitrary length array handling progress
- c56536f (Sep 19, 2021): finish initial version of work-efficient scan
- 18326dd (Sep 19, 2021): increase block size to 1024
- 6b34b00 (Sep 19, 2021): finish first version of work-efficient stream compaction
- 5069185 (Sep 19, 2021): free all CUDA data arrays
- d3372e5 (Sep 19, 2021): finish recursive block scan in Naive scan
- d7205a0 (Sep 19, 2021): rewrite work-efficient scan as recursive scan on blocks
- 941e9f6 (Sep 19, 2021): finish recursive block scan in work-efficient stream compaction
- dbd6cf3 (Sep 19, 2021): implement thrust scan
- 64cddb6 (Sep 19, 2021): add sample log
- 5294179 (Sep 20, 2021): add include "device_launch_parameters.h"
- 0488b13 (Sep 21, 2021): Finish introduction of README
- 8befcb2 (Sep 21, 2021): Add project highlights
- a7d8e8d (Sep 21, 2021): add sample output in readme
- cc22684 (Sep 21, 2021): fix sample output array size error
- 1617aa9 (Sep 21, 2021): Update README
- 7a23d9b (Sep 21, 2021): add separate block size for naive & efficient
- 7a8964e (Sep 21, 2021): calibrated optimal block size for both modes
- 84d3ee6 (Sep 21, 2021): add profiling result & data
- 8f5d566 (Sep 21, 2021): relabel y axis with correct unit
- 1419db4 (Sep 21, 2021): add performance analysis in README
5 changes: 4 additions & 1 deletion .gitignore
@@ -25,7 +25,8 @@ build
.LSOverride

# Icon must end with two \r
Icon

# Thumbnails
._*

@@ -560,3 +561,5 @@ xcuserdata
*.xccheckout
*.moved-aside
*.xcuserstate

.vscode
143 changes: 137 additions & 6 deletions README.md
@@ -3,12 +3,143 @@ CUDA Stream Compaction

**University of Pennsylvania, CIS 565: GPU Programming and Architecture, Project 2**

* Zhihao Ruan ([email protected])
* [LinkedIn](https://www.linkedin.com/in/zhihao-ruan-29b29a13a/), [personal website](https://zhihaoruan.xyz/)
* Tested on: Ubuntu 20.04 LTS, Ryzen 3700X @ 2.22GHz 48GB, RTX 2060 Super @ 7976MB

## Highlights
This project implements:
- a naive parallel scan algorithm compatible with arbitrary sized input arrays;
- a work-efficient parallel scan algorithm compatible with arbitrary sized input arrays;
- a stream compaction algorithm built upon the work-efficient parallel scan compatible with arbitrary sized input arrays.

The GPU stream compaction algorithm is demonstrated to be over 4x faster than the CPU version.

A sample of the test output with `block_size` = 1024 and `array_size` = 2^27 **(the largest array that fits on the local GPU)**:
```
****************
** SCAN TESTS **
****************
[ 5 33 25 22 48 26 23 19 36 32 2 17 45 ... 22 0 ]
==== cpu scan, power-of-two ====
elapsed time: 79.9027ms (std::chrono Measured)
[ 0 5 38 63 85 133 159 182 201 237 269 271 288 ... -1006515866 -1006515844 ]
==== cpu scan, non-power-of-two ====
elapsed time: 81.4093ms (std::chrono Measured)
[ 0 5 38 63 85 133 159 182 201 237 269 271 288 ... -1006515949 -1006515918 ]
passed
==== naive scan, power-of-two ====
elapsed time: 31.3315ms (CUDA Measured)
[ 0 5 38 63 85 133 159 182 201 237 269 271 288 ... -1006515866 -1006515844 ]
passed
==== naive scan, non-power-of-two ====
elapsed time: 24.8398ms (CUDA Measured)
[ 0 5 38 63 85 133 159 182 201 237 269 271 288 ... 0 0 ]
passed
==== work-efficient scan, power-of-two ====
elapsed time: 37.6307ms (CUDA Measured)
[ 0 5 38 63 85 133 159 182 201 237 269 271 288 ... -1006515866 -1006515844 ]
passed
==== work-efficient scan, non-power-of-two ====
elapsed time: 37.6407ms (CUDA Measured)
[ 0 5 38 63 85 133 159 182 201 237 269 271 288 ... -1006515949 -1006515918 ]
passed
==== thrust scan, power-of-two ====
elapsed time: 3.16525ms (CUDA Measured)
[ 0 5 38 63 85 133 159 182 201 237 269 271 288 ... -1006515866 -1006515844 ]
passed
==== thrust scan, non-power-of-two ====
elapsed time: 3.12653ms (CUDA Measured)
[ 0 5 38 63 85 133 159 182 201 237 269 271 288 ... -1006515949 -1006515918 ]
passed

*****************************
** STREAM COMPACTION TESTS **
*****************************
[ 0 3 1 1 1 3 0 1 2 1 1 1 2 ... 3 0 ]
==== cpu compact without scan, power-of-two ====
elapsed time: 250.09ms (std::chrono Measured)
[ 3 1 1 1 3 1 2 1 1 1 2 2 2 ... 3 3 ]
passed
==== cpu compact without scan, non-power-of-two ====
elapsed time: 247.095ms (std::chrono Measured)
[ 3 1 1 1 3 1 2 1 1 1 2 2 2 ... 2 3 ]
passed
==== cpu compact with scan ====
elapsed time: 886.643ms (std::chrono Measured)
[ 3 1 1 1 3 1 2 1 1 1 2 2 2 ... 3 3 ]
passed
==== work-efficient compact, power-of-two ====
elapsed time: 58.4025ms (CUDA Measured)
[ 3 1 1 1 3 1 2 1 1 1 2 2 2 ... 3 3 ]
passed
==== work-efficient compact, non-power-of-two ====
elapsed time: 48.5331ms (CUDA Measured)
[ 3 1 1 1 3 1 2 1 1 1 2 2 2 ... 2 3 ]
passed
```

## Introduction: Stream Compaction
Stream compaction is a technique for removing from a list (a.k.a. stream) all elements that fail to satisfy a given criterion. For example, if we have a stream of integers `[1 2 3 2 1 5 23 4 0 0 3 4 2 0 3 8 0]` and we wish to remove "all elements that are 0" (the *criterion*), we are left with the compact list `[1 2 3 2 1 5 23 4 3 4 2 3 8]`.
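As a point of reference, here is a minimal serial sketch of the idea; the function name and the zero-removal criterion are illustrative, not this project's API:

```
#include <vector>

// Serial compaction sketch: keep only the elements that pass the test.
std::vector<int> compact(const std::vector<int>& in) {
  std::vector<int> out;
  for (int x : in) {
    if (x != 0) out.push_back(x);  // criterion: remove all zeros
  }
  return out;
}

// compact({1,2,3,2,1,5,23,4,0,0,3,4,2,0,3,8,0})
//   == {1,2,3,2,1,5,23,4,3,4,2,3,8}
```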

Stream compaction is widely used in rendering and ray tracing. Although it seems straightforward to implement at first glance, implementing it efficiently on the GPU with parallel algorithms is non-trivial. This project discusses a method for parallel stream compaction and its essential building block, the parallel scan algorithm.

**For a more detailed description of the project, please refer to the [project instructions](INSTRUCTION.md).**

## Parallel Scan
Parallel scan, a.k.a. parallel prefix sum, is the task of generating a list in which each element is the sum of all elements that come before it in the input. There are two variants: *exclusive* scan and *inclusive* scan. The exclusive scan inserts a 0 at the beginning of the output and discards the total sum at the end of the list, while the inclusive scan keeps the total sum at the end and does not introduce a leading 0.

![](img/scan_inclusive_exclusive.png)
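A minimal serial sketch of the two flavors (the function names are illustrative, not this project's API):

```
#include <cstdio>

// Inclusive: out[i] = in[0] + ... + in[i]
void inclusiveScan(const int* in, int* out, int n) {
  int sum = 0;
  for (int i = 0; i < n; ++i) {
    sum += in[i];
    out[i] = sum;
  }
}

// Exclusive: out[i] = in[0] + ... + in[i-1], with out[0] = 0 (the identity)
void exclusiveScan(const int* in, int* out, int n) {
  int sum = 0;
  for (int i = 0; i < n; ++i) {
    out[i] = sum;  // sum of everything strictly before i
    sum += in[i];
  }
}

int main() {
  const int in[5] = {3, 1, 7, 0, 4};
  int inc[5], exc[5];
  inclusiveScan(in, inc, 5);  // -> 3 4 11 11 15
  exclusiveScan(in, exc, 5);  // -> 0 3  4 11 11
  for (int i = 0; i < 5; ++i) printf("%d %d\n", inc[i], exc[i]);
  return 0;
}
```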

### Naive Parallel Scan
A naive parallel scan algorithm is illustrated below. In each iteration, a subset of the threads each add a pair of elements, and the final result is produced after log2(n) iterations.

![](img/naive_scan.png)
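The core of one iteration can be sketched as a global-memory kernel with double buffering. This is a simplified sketch, not this project's exact code, which performs the scan block-wise and recursively to support arbitrary-length inputs:

```
// One naive (Hillis–Steele) scan step; kernel name and buffering are
// illustrative assumptions, not this repository's exact implementation.
__global__ void kernNaiveScanStep(int n, int offset, const int* in, int* out) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i >= n) return;
  if (i >= offset) {
    out[i] = in[i - offset] + in[i];  // add the element `offset` slots back
  } else {
    out[i] = in[i];                   // nothing in range yet; copy through
  }
}

// Host side: ping-pong two device buffers, doubling the offset each pass.
// This yields an inclusive scan; shifting right by one and inserting a
// leading 0 converts it to the exclusive scan used by compaction.
// for (int offset = 1; offset < n; offset <<= 1) {
//   kernNaiveScanStep<<<blocks, blockSize>>>(n, offset, dev_in, dev_out);
//   std::swap(dev_in, dev_out);
// }
```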

### Work Efficient Parallel Scan
There is also a much more efficient version of parallel scan, which involves 1) a list reduction and 2) a down-sweep. The list reduction, also called the "up-sweep", produces the total sum of the list together with a tree of partial sums. The down-sweep then propagates those partial sums back down the tree, filling in the missing contributions for the middle elements and completing the scan.

![](img/upsweep.png)

![](img/downsweep.png)
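A single-block shared-memory sketch of the two phases, assuming the array fits in one block of `n` threads with `n` a power of two. The interleaved `(k + 1) % stride` test mirrors the divergence issue discussed in the performance analysis below; this project's real kernels handle arbitrary lengths by scanning blocks recursively, so this is illustrative only:

```
// Work-efficient (Blelloch) exclusive scan of one block, in shared memory.
// Launch as: kernBlockScan<<<1, n, n * sizeof(int)>>>(n, dev_data);
__global__ void kernBlockScan(int n, int* data) {
  extern __shared__ int s[];
  int k = threadIdx.x;
  s[k] = data[k];
  __syncthreads();

  // Up-sweep: build partial sums in place.
  for (int stride = 2; stride <= n; stride <<= 1) {
    if ((k + 1) % stride == 0) {
      s[k] += s[k - stride / 2];
    }
    __syncthreads();
  }

  // Down-sweep: zero the root, then push sums back down the tree.
  if (k == n - 1) s[k] = 0;
  __syncthreads();
  for (int stride = n; stride >= 2; stride >>= 1) {
    if ((k + 1) % stride == 0) {
      int left = s[k - stride / 2];
      s[k - stride / 2] = s[k];  // pass the parent's value to the left child
      s[k] += left;              // right child = parent + left subtree sum
    }
    __syncthreads();
  }
  data[k] = s[k];  // write back the exclusive scan of this block
}
```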

## Parallel Stream Compaction
After solving the parallel scan problem, we can now turn to the actual algorithm for parallel stream compaction. An effective stream compaction procedure consists of the following steps (sketched in code after the diagram below):
1. Generate a boolean array marking the validity of each element: "0" for elements to remove, "1" otherwise.
2. Compute an exclusive parallel scan of the boolean array.
3. Scatter the surviving elements into the output array: for each element marked "1" in the boolean array, write it to the output at the index given by the scan result.

![](img/stream_compaction.png)
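Hedged sketches of the mapping and scatter kernels, with "keep non-zero elements" as the criterion; the names follow common CIS 565 starter-code convention and may not match this repository exactly:

```
// Step 1: build the validity mask.
__global__ void kernMapToBoolean(int n, int* bools, const int* idata) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) bools[i] = (idata[i] != 0) ? 1 : 0;
}

// Step 2 (not shown): exclusive scan over bools[] produces indices[].

// Step 3: the scan result gives each kept element its output slot.
__global__ void kernScatter(int n, int* odata, const int* idata,
                            const int* bools, const int* indices) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n && bools[i]) {
    odata[indices[i]] = idata[i];
  }
}

// The compacted length is indices[n - 1] + bools[n - 1].
```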

## Performance Analysis
**All tests are conducted on a local desktop with random input arrays seeded by `srand(0)`.**

I found the optimal block size to be roughly 256 for the naive scan and 128 for the work-efficient scan. With these values fixed, I ran the program against inputs of multiple sizes to evaluate performance.

After careful evaluation, the current performance bottlenecks most likely lie in:
1. Warp divergence and `__syncthreads()`. Both the naive scan and the work-efficient scan utilize threads in an interleaved pattern, which produces a large amount of warp divergence.
2. Uncoalesced global memory accesses. These arise for the same reason as (1): global memory is accessed in an interleaved fashion.

Further improvements to the kernel functions include re-indexing the active threads to minimize warp divergence (see the sketch below), as well as breaking the work-efficient scan kernel into two smaller kernels (up-sweep and down-sweep) to eliminate the cost of `__syncthreads()` and warp divergence.
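As an illustration of the re-indexing idea, here is a sketch of an up-sweep level launched with one thread per *active* node; this is an assumed improvement, not code from this repository:

```
// Launch only as many threads as there are active nodes at this level and
// map each thread densely onto its node, so whole warps retire instead of
// idling on a failed (k % stride) test. Names are illustrative.
__global__ void kernUpSweepCompact(int n, int stride, int* data) {
  int t = blockIdx.x * blockDim.x + threadIdx.x;  // dense thread index
  int k = t * stride + stride - 1;                // node handled by thread t
  if (k < n) {
    data[k] += data[k - stride / 2];
  }
}

// Host side: one launch per level with ceil((n / stride) / blockSize)
// blocks, so the grid shrinks as the reduction tree narrows.
```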

### Parallel Scan, Array Size Power-of-Two
In this diagram we can see that for large inputs, the CPU scan takes the most time, while the naive and work-efficient scans perform similarly. When the data size is small, all four methods take roughly the same amount of time. Thrust outperforms the other three methods on large inputs.

![](profiling/img/Figure_1.png)

### Parallel Scan, Array Size Non-Power-of-Two
In this diagram the four methods show roughly the same behavior as with [array sizes that are a power of two](#parallel-scan-array-size-power-of-two).

![](profiling/img/Figure_2.png)

### Stream Compaction, Array Size Power-of-Two
We can see that when the data size is small, CPU compaction performs roughly the same as work-efficient GPU compaction. However, as the data size increases, GPU compaction outperforms the CPU version.

![](profiling/img/Figure_3.png)

### Stream Compaction, Array Size Non-Power-of-Two
This diagram shows behavior similar to the [power-of-two case](#stream-compaction-array-size-power-of-two).

![](profiling/img/Figure_4.png)


9 changes: 9 additions & 0 deletions cmake/.clang-format
@@ -0,0 +1,9 @@
---
BasedOnStyle: Google
---
Language: Cpp
AccessModifierOffset: -2
AlignConsecutiveAssignments: true
AlignConsecutiveMacros: true
---

91 changes: 46 additions & 45 deletions cmake/cuda_compute_capability.cpp
@@ -1,58 +1,59 @@
/*
 * Copyright (C) 2011 Florian Rathgeber, [email protected]
 *
 * This code is licensed under the MIT License. See the FindCUDA.cmake script
 * for the text of the license.
 *
 * Based on code by Christopher Bruns published on Stack Overflow (CC-BY):
 * http://stackoverflow.com/questions/2285185
 */

#include <cuda_runtime.h>
#include <stdio.h>

#include <iterator>
#include <set>

int main() {
  int deviceCount;
  int gpuDeviceCount = 0;
  struct cudaDeviceProp properties;

  if (cudaGetDeviceCount(&deviceCount) != cudaSuccess) {
    printf("Couldn't get device count: %s\n",
           cudaGetErrorString(cudaGetLastError()));
    return 1;
  }

  std::set<int> computes;
  typedef std::set<int>::iterator iter;

  // machines with no GPUs can still report one emulation device
  for (int device = 0; device < deviceCount; ++device) {
    int major = 9999, minor = 9999;
    cudaGetDeviceProperties(&properties, device);
    if (properties.major != 9999) {  // 9999 means emulation only
      ++gpuDeviceCount;
      major = properties.major;
      minor = properties.minor;
      if ((major == 2 && minor == 1)) {
        // There is no --arch compute_21 flag for nvcc, so force minor to 0
        minor = 0;
      }
      computes.insert(10 * major + minor);
    }
  }
  int i = 0;
  for (iter it = computes.begin(); it != computes.end(); it++, i++) {
    if (i > 0) {
      printf(" ");
    }
    printf("%d", *it);
  }
  /* don't just return the number of gpus, because other runtime cuda
     errors can also yield non-zero return values */
  if (gpuDeviceCount <= 0 || computes.size() <= 0) {
    return 1;  // failure
  }
  return 0;  // success
}
Binary file added img/downsweep.png
Binary file added img/naive_scan.png
Binary file added img/scan_inclusive_exclusive.png
Binary file added img/stream_compaction.png
Binary file added img/upsweep.png