Project 2: Alex Fu #12

Open: wants to merge 14 commits into base: main
70 changes: 37 additions & 33 deletions INSTRUCTION.md
@@ -29,8 +29,8 @@ on the implementation of scan and stream compaction.
* The [slides on Parallel Algorithms](https://docs.google.com/presentation/d/1ETVONA7QDM-WqsEj4qVOGD6Kura5I6E9yqH-7krnwZ0/edit#slide=id.p126)
for Scan, Stream Compaction, and Work-Efficient Parallel Scan.
* GPU Gems 3, Chapter 39 - [Parallel Prefix Sum (Scan) with CUDA](https://developer.nvidia.com/gpugems/GPUGems3/gpugems3_ch39.html).
  - This online version contains a few small errors (in superscripting, missing braces, bad indentation, etc.)
  - We maintain a fix for this at [GPU Gem 3 Ch 39 Patch](https://github.com/CIS565-Fall-2017/Project2-Stream-Compaction/blob/master/INSTRUCTION.md#gpu-gem-3-ch-39-patch). If you find more errors in the chapter, you are welcome to open new pull requests to contribute.
* If you are still unclear after reading the steps, take a look at the last chapter - [Algorithm Examples](https://github.com/CIS565-Fall-2017/Project2-Stream-Compaction/blob/master/INSTRUCTION.md#algorithm-examples).
* [Recitation slides](https://docs.google.com/presentation/d/1daOnWHOjMp1sIqMdVsNnvEU1UYynKcEMARc_W6bGnqE/edit?usp=sharing)

@@ -116,8 +116,9 @@ Most of the text in Part 2 applies.
Since the work-efficient scan operates on a binary tree structure, it works
best with arrays with power-of-two length. Make sure your implementation works
on non-power-of-two sized arrays (see `ilog2ceil`). This requires extra memory

- your intermediate array sizes will need to be rounded to the next power of
two.
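
As a hedged illustration of this rounding step (not part of the original instructions): `ilog2ceil` is assumed to return ceil(log2(n)) as described above, and the helper and buffer names below are made up for the example.

```cpp
#include <cuda_runtime.h>

// Minimal sketch: copy an n-element input into a zero-padded device buffer whose
// size is the next power of two, so the work-efficient scan can run on it.
// ilog2ceil(n) is assumed to behave as described above; other names are illustrative.
int* makePaddedDeviceCopy(const int* hostIn, int n, int* paddedSizeOut) {
    int paddedSize = 1 << ilog2ceil(n);                    // next power of two >= n
    int* devData = nullptr;
    cudaMalloc(&devData, paddedSize * sizeof(int));
    cudaMemset(devData, 0, paddedSize * sizeof(int));      // zero-pad the tail (0 is the scan identity)
    cudaMemcpy(devData, hostIn, n * sizeof(int), cudaMemcpyHostToDevice);
    *paddedSizeOut = paddedSize;
    return devData;  // scan paddedSize elements; only the first n outputs are used
}
```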

### 3.2. Stream Compaction

@@ -152,13 +153,12 @@ For thrust stream compaction, take a look at [thrust::remove_if](https://thrust.

## Part 5: Why is My GPU Approach So Slow? (Extra Credit) (+5)

If you implement your efficient scan version following the slides closely, there's a good chance that your "efficient" GPU scan is actually not that efficient -- it may even be slower than the CPU approach.

Though this is totally acceptable for this assignment, in addition to explaining the reason for this phenomenon, you are encouraged to try to upgrade your work-efficient GPU scan.

Thinking about these may lead you to an aha moment:

- What is the occupancy at a deeper level in the upper/down sweep? Are most threads actually working?
- Are you always launching the same number of blocks throughout each level of the upper/down sweep?
- If some threads are being lazy, can we do an early termination on them?
@@ -199,11 +199,13 @@ Always profile with Release mode builds and run without debugging.

* Roughly optimize the block sizes of each of your implementations for minimal
run time on your GPU.

* (You shouldn't compare unoptimized implementations to each other!)

* Compare all of these GPU Scan implementations (Naive, Work-Efficient, and
Thrust) to the serial CPU version of Scan. Plot a graph of the comparison
(with array size on the independent axis).

* We wrapped up both CPU and GPU timing functions as a performance timer class for you to conveniently measure the time cost.
* We use `std::chrono` to provide CPU high-precision timing and CUDA event to measure the CUDA performance.
* For CPU, put your CPU code between `timer().startCpuTimer()` and `timer().endCpuTimer()`.
@@ -215,11 +217,13 @@ Always profile with Release mode builds and run without debugging.
even looking at the code for the implementation.

* Write a brief explanation of the phenomena you see here.

* Can you find the performance bottlenecks? Is it memory I/O? Computation? Is
it different for each implementation?

* Paste the output of the test program into a triple-backtick block in your
README.

* If you add your own tests (e.g. for radix sort or to test additional corner
cases), be sure to mention it explicitly.

@@ -238,24 +242,24 @@ The template of the comment section of your pull request is attached below, you

* [Repo Link](https://link-to-your-repo)
* (Briefly) Mention features that you've completed, especially those bells and whistles you want to highlight
  * Feature 0
  * Feature 1
  * ...
* Feedback on the project itself, if any.

## GPU Gem 3 Ch 39 Patch

* Example 1

![](img/example-1.png)

* Example 2

![](img/example-2.jpg)

* Figure-39-4

![](img/figure-39-4.jpg)

* Figure-39-2. This image shows a naive inclusive scan. We should convert this to an exclusive one for compaction.

![](img/figure-39-2.jpg)

## Algorithm Examples

@@ -284,24 +288,24 @@ The template of the comment section of your pull request is attached below, you
    + output
      - [1 1 0 1 1 0 1]
  - scan
    + take the output of last step as input
    + input
      - [1 1 0 1 1 0 1]
    + output
      - [0 1 2 2 3 4 4]
  - scatter
    + preserve non-zero elements and compact them into a new array
    + input:
      + original array
        - [1 5 0 1 2 0 3]
      + mapped array
        - [1 1 0 1 1 0 1]
      + scanned array
        - [0 1 2 2 3 4 4]
    + output:
      - [1 5 1 2 3]
    + This can be done in parallel on GPU
    + You can try multi-threading on CPU if you want (not required and not our focus)
    + for each element input[i] in original array
      - if it's non-zero (given by mapped array)
      - then put it at output[index], where index = scanned[i]
184 changes: 177 additions & 7 deletions README.md
@@ -1,14 +1,184 @@
CUDA Stream Compaction
======================

**University of Pennsylvania, CIS 565: GPU Programming and Architecture, Project 2**

- Alex Fu
  - [LinkedIn](https://www.linkedin.com/in/alex-fu-b47b67238/)
  - [Twitter](https://twitter.com/AlexFu8304)
  - [Personal Website](https://thecger.com/)
- Tested on: Windows 10, i7-10750H @ 2.60GHz, 16GB, GTX 3060 6GB

## Features Implemented

* CPU Scan and Compaction

* Naive GPU Scan

* Work-efficient GPU Scan

* GPU Stream Compaction

* Radix Sort (Extra Credit)

* When input is:

`{41, 17, 34, 0, 19, 24, 28, 8, 12, 14, 5, 45, 31, 27, 11, 41, 45, 42, 27, 36, 41, 4, 2, 3, 42, 32, 21, 16, 18, 45, 47, 26, 21, 38, 19, 12, 17, 49, 35, 44, 3, 11, 22, 33, 23, 14, 41, 11, 3, 18, 47, 44, 12, 7, 37, 9, 23, 41, 29, 28, 16, 35, 40, 0}`,

the output is:

`{0, 0, 2, 3, 3, 3, 4, 5, 7, 8, 9, 11, 11, 11, 12, 12, 12, 14, 14, 16, 16, 17, 17, 18, 18, 19, 19, 21, 21, 22, 23, 23, 24, 26, 27, 27, 28, 28, 29, 31, 32, 33, 34, 35, 35, 36, 37, 38, 40, 41, 41, 41, 41, 41, 42, 42, 44, 44, 45, 45, 45, 47, 47, 49}`.

I also compared my radix sort against `thrust::sort` (see [Example Output](#example-output) and [Performance Analysis](#radix-sort)).

### Example Output


`SIZE` is $2^{20} \approx 1.05 \times 10^6$. The test size for the non-power-of-two case is `SIZE - 3`. The CUDA block size is 128.

```
****************
** SCAN TESTS **
****************
[ 41 17 34 0 19 24 28 8 12 14 5 45 31 ... 20 0 ]
==== cpu scan, power-of-two ====
elapsed time: 0.5527ms (std::chrono Measured)
[ 0 41 58 92 92 111 135 163 171 183 197 202 247 ... 25683436 25683456 ]
==== cpu scan, non-power-of-two ====
elapsed time: 0.5991ms (std::chrono Measured)
[ 0 41 58 92 92 111 135 163 171 183 197 202 247 ... 25683337 25683375 ]
passed
==== naive scan, power-of-two ====
elapsed time: 0.607232ms (CUDA Measured)
passed
==== naive scan, non-power-of-two ====
elapsed time: 0.884736ms (CUDA Measured)
passed
==== work-efficient scan, power-of-two ====
elapsed time: 0.359712ms (CUDA Measured)
passed
==== work-efficient scan, non-power-of-two ====
elapsed time: 0.351232ms (CUDA Measured)
passed
==== thrust scan, power-of-two ====
elapsed time: 0.16576ms (CUDA Measured)
passed
==== thrust scan, non-power-of-two ====
elapsed time: 0.390144ms (CUDA Measured)
passed

*****************************
** STREAM COMPACTION TESTS **
*****************************
[ 1 3 2 0 1 0 2 2 2 0 1 1 1 ... 0 0 ]
==== cpu compact without scan, power-of-two ====
elapsed time: 2.5172ms (std::chrono Measured)
[ 1 3 2 1 2 2 2 1 1 1 3 1 3 ... 2 3 ]
passed
==== cpu compact without scan, non-power-of-two ====
elapsed time: 2.2528ms (std::chrono Measured)
[ 1 3 2 1 2 2 2 1 1 1 3 1 3 ... 2 2 ]
passed
==== cpu compact with scan, power-of-two ====
elapsed time: 4.2481ms (std::chrono Measured)
passed
==== cpu compact with scan, non-power-of-two ====
elapsed time: 4.5622ms (std::chrono Measured)
passed
==== work-efficient compact, power-of-two ====
elapsed time: 0.42496ms (CUDA Measured)
passed
==== work-efficient compact, non-power-of-two ====
elapsed time: 0.42096ms (CUDA Measured)
passed

**********************
** RADIX SORT TESTS **
**********************
[ 41 17 34 0 19 24 28 8 12 14 5 45 31 ... 20 0 ]
==== thrust sort, power-of-two ====
elapsed time: 0.421664ms (CUDA Measured)
[ 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 49 49 ]
==== thrust sort, non-power-of-two ====
elapsed time: 0.342016ms (CUDA Measured)
[ 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 49 49 ]
==== radix sort, power-of-two ====
elapsed time: 20.2885ms (CUDA Measured)
passed
==== radix sort, non-power-of-two ====
elapsed time: 23.1066ms (CUDA Measured)
passed
```

## Performance Analysis

### Performance Impacted by Array Size

CUDA block size is 128.

#### Scan

When the array size is small, the CPU implementation is faster than the GPU implementations, and the GPU timings fluctuate only slightly with array size. Once the array size exceeds 2.62e5, both the Thrust scan and my work-efficient implementation outperform my CPU implementation.

![Scan Time Impacted by Array Size Power of Two](img/Scan_Time_Impacted_by_Array_Size_Power_of_Two.png)

<!-- ![Scan Time Impacted by Array Size (Non Power of Two)](img/Scan Time Impacted by Array Size Non Power of Two.png) -->

#### Compaction

The situation is the same as for scan; here the crossover point is 6.55e4.

![Compaction Time Impacted by Array Size Power of Two](img/Compaction_Time_Impacted_by_Array_Size_Power_of_Two.png)

<!-- ![Compaction Time Impacted by Array Size (Non Power of Two)](img/Compaction Time Impacted by Array Size Non Power of Two.png) -->

#### Radix Sort

My radix sort implementation is much slower than the Thrust sort function.

![Sort Time Impacted by Array Size Power of Two](img/Sort_Time_Impacted_by_Array_Size_Power_of_Two.png)

<!-- ![Sort Time Impacted by Array Size (Non Power of Two)](img/Sort Time Impacted by Array Size Non Power of Two.png) -->

### Performance Impacted by CUDA Block Size

`SIZE` is $2^{20} \approx 1.05 \times 10^6$.

#### Scan

![Scan Time Impacted by CUDA Block Size Power of Two](img/Scan_Time_Impacted_by_CUDA_Block_Size_Power_of_Two.png)

#### Compaction

![Compaction Time Impacted by CUDA Block Size Power of Two](img/Compaction_Time_Impacted_by_CUDA_Block_Size_Power_of_Two.png)

#### Radix Sort

![Radix Sort Time Impacted by CUDA Block Size Power of Two](img/Radix_Sort_Impacted_by_CUDA_Block_Size_Power_of_Two.png)

## Why is My GPU Approach So Slow?

The optimization I made to the work-efficient scan is to avoid warp divergence by compacting the threads, so that each level of the up/down sweep launches only as many threads as there are active nodes (a sketch of this indexing follows the figure):

![Threads Allocation of the Down Sweep Function](img/Threads_Allocation_of_the_Down_Sweep_Function.png)
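
To make this concrete, here is a minimal sketch of the per-level thread compaction for the up sweep, assuming a power-of-two array; the kernel and host function names are placeholders rather than the exact code in this repo:

```cpp
__global__ void kernUpSweepLevel(int n, int stride, int* data) {
    int k = blockIdx.x * blockDim.x + threadIdx.x;  // dense index over active threads only
    int idx = (k + 1) * stride * 2 - 1;             // map it onto the strided tree node
    if (idx < n) {
        data[idx] += data[idx - stride];
    }
}

void upSweep(int n, int blockSize, int* devData) {
    // n is assumed to be a power of two here
    for (int stride = 1; stride < n; stride *= 2) {
        int activeThreads = n / (2 * stride);       // halves at every level
        int blocks = (activeThreads + blockSize - 1) / blockSize;
        kernUpSweepLevel<<<blocks, blockSize>>>(n, stride, devData);
    }
}
```

The down sweep can use the same index mapping with the stride halving each level, so deeper levels launch fewer, fully occupied blocks instead of many mostly idle ones.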

Due to time constraints, I haven't implemented the shared-memory optimization. I guess this is where the Thrust function surpasses mine.

The time cost of my radix sort (6-bit keys) is about 10 times that of my work-efficient scan. This matches my intuition, because radix sort repeats the scan for every bit. However, I noticed that the Thrust sort function is not that much slower than the Thrust scan: for instance, at an array size of 65536, the Thrust scan costs 0.04 ms while the sort costs 0.09 ms. This makes me wonder whether there are more optimizations I can apply to my radix sort. A sketch of one bit's pass is shown below.
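
For reference, here is a sketch (with illustrative names, not this repo's exact code) of what one bit's pass looks like in a split-based LSD radix sort. The host loops over the key bits and runs a full exclusive scan of the `falseFlags` buffer between these two kernels, which is why the total cost ends up roughly equal to the number of bits times the cost of one scan:

```cpp
// 1 where the current key bit is 0 ("false"), else 0
__global__ void kernMapBitToFalseFlag(int n, int bit, int* falseFlags, const int* in) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) falseFlags[i] = ((in[i] >> bit) & 1) ? 0 : 1;
}

// falseScan = exclusive scan of falseFlags; totalFalse = total number of "false" keys
__global__ void kernScatterByBit(int n, int bit, int totalFalse, int* out,
                                 const int* in, const int* falseScan) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    bool bitIsOne = (in[i] >> bit) & 1;
    // "false" keys keep their scanned index; "true" keys are placed after all of them
    int dst = bitIsOne ? (i - falseScan[i] + totalFalse) : falseScan[i];
    out[dst] = in[i];
}
```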

One drawback of my radix sort that I can identify is that, in order to compute `totalFalse`, I make two device-to-host memory copies:

```cpp
int totalFalse;
int lastNum;
// last element of the exclusive scan of the "bit is 0" flags
cudaMemcpy(&totalFalse, devFalse + n - 1, sizeof(int), cudaMemcpyDeviceToHost);
// last input element, to check whether it is also a "false"
cudaMemcpy(&lastNum, devInp + n - 1, sizeof(int), cudaMemcpyDeviceToHost);
if ((lastNum & (1 << bit)) == 0) totalFalse += 1;
```

I believe this can be optimized somehow.

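One possible direction, as a rough sketch rather than tested code: compute `totalFalse` in a one-thread kernel so the value never leaves the device. The buffer names follow the snippet above; the kernel itself is hypothetical.

```cpp
// devTotalFalse is a one-element device buffer; devFalse holds the exclusive scan
// of the "bit is 0" flags, as in the snippet above.
__global__ void kernComputeTotalFalse(int n, int bit, const int* devFalse,
                                      const int* devInp, int* devTotalFalse) {
    int total = devFalse[n - 1];
    if ((devInp[n - 1] & (1 << bit)) == 0) {
        total += 1;  // the exclusive scan misses the last element if it is also "false"
    }
    *devTotalFalse = total;
}

// Launched with a single thread; the scatter kernel can then read *devTotalFalse
// on the device instead of receiving the value from the host:
// kernComputeTotalFalse<<<1, 1>>>(n, bit, devFalse, devInp, devTotalFalse);
```
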
## Some Mistakes I Made

* At first I put the for loop inside the kernel functions and used `__syncthreads()` at the beginning of each iteration. However, since `__syncthreads()` only synchronizes within a block, my results went wrong once the array size exceeded my block size. I then moved the for loop outside the kernel functions, as sketched below.
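
As an illustration of that fix, here is a minimal naive-scan sketch (illustrative names, not the repo's exact code) where the loop lives on the host and each level is its own kernel launch, so the kernel boundary provides the grid-wide synchronization that `__syncthreads()` cannot:

```cpp
#include <utility>  // std::swap

__global__ void kernNaiveScanStep(int n, int offset, const int* in, int* out) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    // each level adds the value `offset` positions to the left
    out[i] = (i >= offset) ? in[i - offset] + in[i] : in[i];
}

void naiveScanHostLoop(int n, int blockSize, int* devA, int* devB) {
    int blocks = (n + blockSize - 1) / blockSize;
    for (int offset = 1; offset < n; offset *= 2) {
        // one launch per level: all blocks finish before the next level starts
        kernNaiveScanStep<<<blocks, blockSize>>>(n, offset, devA, devB);
        std::swap(devA, devB);  // ping-pong buffers
    }
    // devA now holds the inclusive scan; shift right to obtain the exclusive scan
}
```
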
Binary file added img/performance_analysis.xlsx