Project 2: Alex Fu #12

Open: wants to merge 14 commits into base: main
70 changes: 37 additions & 33 deletions INSTRUCTION.md
@@ -29,8 +29,8 @@ on the implementation of scan and stream compaction.
* The [slides on Parallel Algorithms](https://docs.google.com/presentation/d/1ETVONA7QDM-WqsEj4qVOGD6Kura5I6E9yqH-7krnwZ0/edit#slide=id.p126)
for Scan, Stream Compaction, and Work-Efficient Parallel Scan.
* GPU Gems 3, Chapter 39 - [Parallel Prefix Sum (Scan) with CUDA](https://developer.nvidia.com/gpugems/GPUGems3/gpugems3_ch39.html).
  - This online version contains a few small errors (in superscripting, missing braces, bad indentation, etc.)
  - We maintain a fix for this at [GPU Gem 3 Ch 39 Patch](https://github.com/CIS565-Fall-2017/Project2-Stream-Compaction/blob/master/INSTRUCTION.md#gpu-gem-3-ch-39-patch). If you find more errors in the chapter, you are welcome to open new pull requests to contribute.
* If you are still unclear after reading the steps, take a look at the last chapter - [Algorithm Examples](https://github.com/CIS565-Fall-2017/Project2-Stream-Compaction/blob/master/INSTRUCTION.md#algorithm-examples).
* [Recitation slides](https://docs.google.com/presentation/d/1daOnWHOjMp1sIqMdVsNnvEU1UYynKcEMARc_W6bGnqE/edit?usp=sharing)

@@ -116,8 +116,9 @@ Most of the text in Part 2 applies.
Since the work-efficient scan operates on a binary tree structure, it works
best with arrays with power-of-two length. Make sure your implementation works
on non-power-of-two sized arrays (see `ilog2ceil`). This requires extra memory

- your intermediate array sizes will need to be rounded to the next power of
two.
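
As a hedged illustration of this rounding step (not part of the original instructions): `ilog2ceil` is assumed to return ceil(log2(n)) as described above, and the helper and buffer names below are made up for the example.

```cpp
#include <cuda_runtime.h>

// Minimal sketch: copy an n-element input into a zero-padded device buffer whose
// size is the next power of two, so the work-efficient scan can run on it.
// ilog2ceil(n) is assumed to behave as described above; other names are illustrative.
int* makePaddedDeviceCopy(const int* hostIn, int n, int* paddedSizeOut) {
    int paddedSize = 1 << ilog2ceil(n);                    // next power of two >= n
    int* devData = nullptr;
    cudaMalloc(&devData, paddedSize * sizeof(int));
    cudaMemset(devData, 0, paddedSize * sizeof(int));      // zero-pad the tail (0 is the scan identity)
    cudaMemcpy(devData, hostIn, n * sizeof(int), cudaMemcpyHostToDevice);
    *paddedSizeOut = paddedSize;
    return devData;  // scan paddedSize elements; only the first n outputs are used
}
```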

### 3.2. Stream Compaction

@@ -152,13 +153,12 @@ For thrust stream compaction, take a look at [thrust::remove_if](https://thrust.

## Part 5: Why is My GPU Approach So Slow? (Extra Credit) (+5)

If you implement your efficient scan version following the slides closely, there's a good chance that your "efficient" GPU scan is actually not that efficient -- it may even be slower than the CPU approach.

Though this is totally acceptable for this assignment, in addition to explaining the reason for this phenomenon, you are encouraged to try to upgrade your work-efficient GPU scan.

Thinking about these may lead you to an aha moment:

- What is the occupancy at a deeper level in the upper/down sweep? Are most threads actually working?
- Are you always launching the same number of blocks throughout each level of the upper/down sweep?
- If some threads are being lazy, can we do an early termination on them?
@@ -199,11 +199,13 @@ Always profile with Release mode builds and run without debugging.

* Roughly optimize the block sizes of each of your implementations for minimal
run time on your GPU.

* (You shouldn't compare unoptimized implementations to each other!)

* Compare all of these GPU Scan implementations (Naive, Work-Efficient, and
Thrust) to the serial CPU version of Scan. Plot a graph of the comparison
(with array size on the independent axis).

* We wrapped up both CPU and GPU timing functions as a performance timer class for you to conveniently measure the time cost.
* We use `std::chrono` to provide CPU high-precision timing and CUDA event to measure the CUDA performance.
* For CPU, put your CPU code between `timer().startCpuTimer()` and `timer().endCpuTimer()`.
@@ -215,11 +217,13 @@ Always profile with Release mode builds and run without debugging.
even looking at the code for the implementation.

* Write a brief explanation of the phenomena you see here.

* Can you find the performance bottlenecks? Is it memory I/O? Computation? Is
it different for each implementation?

* Paste the output of the test program into a triple-backtick block in your
README.

* If you add your own tests (e.g. for radix sort or to test additional corner
cases), be sure to mention it explicitly.

@@ -238,24 +242,24 @@ The template of the comment section of your pull request is attached below, you

* [Repo Link](https://link-to-your-repo)
* (Briefly) Mention features that you've completed, especially those bells and whistles you want to highlight
  * Feature 0
  * Feature 1
  * ...
* Feedback on the project itself, if any.

## GPU Gem 3 Ch 39 Patch

* Example 1

![](img/example-1.png)

* Example 2

![](img/example-2.jpg)

* Figure-39-4

![](img/figure-39-4.jpg)

* Figure-39-2. This image shows a naive inclusive scan. We should convert this to an exclusive one for compaction.

![](img/figure-39-2.jpg)

## Algorithm Examples

@@ -284,24 +288,24 @@ The template of the comment section of your pull request is attached below, you
    + output
      - [1 1 0 1 1 0 1]
  - scan
    + take the output of last step as input
    + input
      - [1 1 0 1 1 0 1]
    + output
      - [0 1 2 2 3 4 4]
  - scatter
    + preserve non-zero elements and compact them into a new array
    + input:
      + original array
        - [1 5 0 1 2 0 3]
      + mapped array
        - [1 1 0 1 1 0 1]
      + scanned array
        - [0 1 2 2 3 4 4]
    + output:
      - [1 5 1 2 3]
    + This can be done in parallel on GPU
    + You can try multi-threading on CPU if you want (not required and not our focus)
    + for each element input[i] in original array
      - if it's non-zero (given by mapped array)
      - then put it at output[index], where index = scanned[i]
184 changes: 177 additions & 7 deletions README.md
@@ -1,14 +1,184 @@
CUDA Stream Compaction
======================

**University of Pennsylvania, CIS 565: GPU Programming and Architecture, Project 2**

- Alex Fu
  - [LinkedIn](https://www.linkedin.com/in/alex-fu-b47b67238/)
  - [Twitter](https://twitter.com/AlexFu8304)
  - [Personal Website](https://thecger.com/)
- Tested on: Windows 10, i7-10750H @ 2.60GHz, 16GB, GTX 3060 6GB

## Features Implemented

* CPU Scan and Compaction

* Naive GPU Scan

* Work-efficient GPU Scan

* GPU Stream Compaction

* Radix Sort (Extra Credit)

* When input is:

`{41, 17, 34, 0, 19, 24, 28, 8, 12, 14, 5, 45, 31, 27, 11, 41, 45, 42, 27, 36, 41, 4, 2, 3, 42, 32, 21, 16, 18, 45, 47, 26, 21, 38, 19, 12, 17, 49, 35, 44, 3, 11, 22, 33, 23, 14, 41, 11, 3, 18, 47, 44, 12, 7, 37, 9, 23, 41, 29, 28, 16, 35, 40, 0}`,

the output is:

`{0, 0, 2, 3, 3, 3, 4, 5, 7, 8, 9, 11, 11, 11, 12, 12, 12, 14, 14, 16, 16, 17, 17, 18, 18, 19, 19, 21, 21, 22, 23, 23, 24, 26, 27, 27, 28, 28, 29, 31, 32, 33, 34, 35, 35, 36, 37, 38, 40, 41, 41, 41, 41, 41, 42, 42, 44, 44, 45, 45, 45, 47, 47, 49}`.

I also compared my radix sort against `thrust::sort` (see [Example Output](#example-output) and [Performance Analysis](#radix-sort)).

### Example Output


`SIZE` is $2^{20} \approx 1.05 \times 10^6$. The test size for the non-power-of-two case is `SIZE - 3`. The CUDA block size is 128.

```
****************
** SCAN TESTS **
****************
[ 41 17 34 0 19 24 28 8 12 14 5 45 31 ... 20 0 ]
==== cpu scan, power-of-two ====
elapsed time: 0.5527ms (std::chrono Measured)
[ 0 41 58 92 92 111 135 163 171 183 197 202 247 ... 25683436 25683456 ]
==== cpu scan, non-power-of-two ====
elapsed time: 0.5991ms (std::chrono Measured)
[ 0 41 58 92 92 111 135 163 171 183 197 202 247 ... 25683337 25683375 ]
passed
==== naive scan, power-of-two ====
elapsed time: 0.607232ms (CUDA Measured)
passed
==== naive scan, non-power-of-two ====
elapsed time: 0.884736ms (CUDA Measured)
passed
==== work-efficient scan, power-of-two ====
elapsed time: 0.359712ms (CUDA Measured)
passed
==== work-efficient scan, non-power-of-two ====
elapsed time: 0.351232ms (CUDA Measured)
passed
==== thrust scan, power-of-two ====
elapsed time: 0.16576ms (CUDA Measured)
passed
==== thrust scan, non-power-of-two ====
elapsed time: 0.390144ms (CUDA Measured)
passed

*****************************
** STREAM COMPACTION TESTS **
*****************************
[ 1 3 2 0 1 0 2 2 2 0 1 1 1 ... 0 0 ]
==== cpu compact without scan, power-of-two ====
elapsed time: 2.5172ms (std::chrono Measured)
[ 1 3 2 1 2 2 2 1 1 1 3 1 3 ... 2 3 ]
passed
==== cpu compact without scan, non-power-of-two ====
elapsed time: 2.2528ms (std::chrono Measured)
[ 1 3 2 1 2 2 2 1 1 1 3 1 3 ... 2 2 ]
passed
==== cpu compact with scan, power-of-two ====
elapsed time: 4.2481ms (std::chrono Measured)
passed
==== cpu compact with scan, non-power-of-two ====
elapsed time: 4.5622ms (std::chrono Measured)
passed
==== work-efficient compact, power-of-two ====
elapsed time: 0.42496ms (CUDA Measured)
passed
==== work-efficient compact, non-power-of-two ====
elapsed time: 0.42096ms (CUDA Measured)
passed

**********************
** RADIX SORT TESTS **
**********************
[ 41 17 34 0 19 24 28 8 12 14 5 45 31 ... 20 0 ]
==== thrust sort, power-of-two ====
elapsed time: 0.421664ms (CUDA Measured)
[ 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 49 49 ]
==== thrust sort, non-power-of-two ====
elapsed time: 0.342016ms (CUDA Measured)
[ 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 49 49 ]
==== radix sort, power-of-two ====
elapsed time: 20.2885ms (CUDA Measured)
passed
==== radix sort, non-power-of-two ====
elapsed time: 23.1066ms (CUDA Measured)
passed
```

## Performance Analysis

### Performance Impacted by Array Size

CUDA block size is 128.

#### Scan

When the array size is small, the CPU implementation is faster than the GPU implementations, and the GPU timings fluctuate only slightly with array size. Once the array size exceeds 2.62e5, both the Thrust scan and my work-efficient implementation outperform my CPU implementation.

![Scan Time Impacted by Array Size Power of Two](img/Scan_Time_Impacted_by_Array_Size_Power_of_Two.png)

<!-- ![Scan Time Impacted by Array Size (Non Power of Two)](img/Scan Time Impacted by Array Size Non Power of Two.png) -->

#### Compaction

The situation is the same as for scan; here the crossover point is 6.55e4.

![Compaction Time Impacted by Array Size Power of Two](img/Compaction_Time_Impacted_by_Array_Size_Power_of_Two.png)

<!-- ![Compaction Time Impacted by Array Size (Non Power of Two)](img/Compaction Time Impacted by Array Size Non Power of Two.png) -->

#### Radix Sort

My radix sort implementation is much slower than the Thrust sort function.

![Sort Time Impacted by Array Size Power of Two](img/Sort_Time_Impacted_by_Array_Size_Power_of_Two.png)

<!-- ![Sort Time Impacted by Array Size (Non Power of Two)](img/Sort Time Impacted by Array Size Non Power of Two.png) -->

### Performance Impacted by CUDA Block Size

`SIZE` is $2^{20} \approx 1.05 \times 10^6$.

#### Scan

![Scan Time Impacted by CUDA Block Size Power of Two](img/Scan_Time_Impacted_by_CUDA_Block_Size_Power_of_Two.png)

#### Compaction

![Compaction Time Impacted by CUDA Block Size Power of Two](img/Compaction_Time_Impacted_by_CUDA_Block_Size_Power_of_Two.png)

#### Radix Sort

![Radix Sort Time Impacted by CUDA Block Size Power of Two](img/Radix_Sort_Impacted_by_CUDA_Block_Size_Power_of_Two.png)

## Why is My GPU Approach So Slow?

The optimization I made to the work-efficient scan is to avoid warp divergence by compacting the threads, so that each level of the up/down sweep launches only as many threads as there are active nodes (a sketch of this indexing follows the figure):

![Threads Allocation of the Down Sweep Function](img/Threads_Allocation_of_the_Down_Sweep_Function.png)
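
To make this concrete, here is a minimal sketch of the per-level thread compaction for the up sweep, assuming a power-of-two array; the kernel and host function names are placeholders rather than the exact code in this repo:

```cpp
__global__ void kernUpSweepLevel(int n, int stride, int* data) {
    int k = blockIdx.x * blockDim.x + threadIdx.x;  // dense index over active threads only
    int idx = (k + 1) * stride * 2 - 1;             // map it onto the strided tree node
    if (idx < n) {
        data[idx] += data[idx - stride];
    }
}

void upSweep(int n, int blockSize, int* devData) {
    // n is assumed to be a power of two here
    for (int stride = 1; stride < n; stride *= 2) {
        int activeThreads = n / (2 * stride);       // halves at every level
        int blocks = (activeThreads + blockSize - 1) / blockSize;
        kernUpSweepLevel<<<blocks, blockSize>>>(n, stride, devData);
    }
}
```

The down sweep can use the same index mapping with the stride halving each level, so deeper levels launch fewer, fully occupied blocks instead of many mostly idle ones.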

Due to time constraints, I haven't implemented the shared-memory optimization. I guess this is where the Thrust function surpasses mine.

The time cost of my radix sort (6-bit keys) is about 10 times that of my work-efficient scan. This matches my intuition, because radix sort repeats the scan for every bit. However, I noticed that the Thrust sort function is not that much slower than the Thrust scan: for instance, at an array size of 65536, the Thrust scan costs 0.04 ms while the sort costs 0.09 ms. This makes me wonder whether there are more optimizations I can apply to my radix sort. A sketch of one bit's pass is shown below.
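
For reference, here is a sketch (with illustrative names, not this repo's exact code) of what one bit's pass looks like in a split-based LSD radix sort. The host loops over the key bits and runs a full exclusive scan of the `falseFlags` buffer between these two kernels, which is why the total cost ends up roughly equal to the number of bits times the cost of one scan:

```cpp
// 1 where the current key bit is 0 ("false"), else 0
__global__ void kernMapBitToFalseFlag(int n, int bit, int* falseFlags, const int* in) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) falseFlags[i] = ((in[i] >> bit) & 1) ? 0 : 1;
}

// falseScan = exclusive scan of falseFlags; totalFalse = total number of "false" keys
__global__ void kernScatterByBit(int n, int bit, int totalFalse, int* out,
                                 const int* in, const int* falseScan) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    bool bitIsOne = (in[i] >> bit) & 1;
    // "false" keys keep their scanned index; "true" keys are placed after all of them
    int dst = bitIsOne ? (i - falseScan[i] + totalFalse) : falseScan[i];
    out[dst] = in[i];
}
```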

One drawback of my radix sort that I can identify is that, in order to compute `totalFalse`, I make two device-to-host memory copies:

```cpp
int totalFalse;
int lastNum;
// last element of the exclusive scan of the "bit is 0" flags
cudaMemcpy(&totalFalse, devFalse + n - 1, sizeof(int), cudaMemcpyDeviceToHost);
// last input element, to check whether it is also a "false"
cudaMemcpy(&lastNum, devInp + n - 1, sizeof(int), cudaMemcpyDeviceToHost);
if ((lastNum & (1 << bit)) == 0) totalFalse += 1;
```

I believe this can be optimized somehow.

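One possible direction, as a rough sketch rather than tested code: compute `totalFalse` in a one-thread kernel so the value never leaves the device. The buffer names follow the snippet above; the kernel itself is hypothetical.

```cpp
// devTotalFalse is a one-element device buffer; devFalse holds the exclusive scan
// of the "bit is 0" flags, as in the snippet above.
__global__ void kernComputeTotalFalse(int n, int bit, const int* devFalse,
                                      const int* devInp, int* devTotalFalse) {
    int total = devFalse[n - 1];
    if ((devInp[n - 1] & (1 << bit)) == 0) {
        total += 1;  // the exclusive scan misses the last element if it is also "false"
    }
    *devTotalFalse = total;
}

// Launched with a single thread; the scatter kernel can then read *devTotalFalse
// on the device instead of receiving the value from the host:
// kernComputeTotalFalse<<<1, 1>>>(n, bit, devFalse, devInp, devTotalFalse);
```
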
## Some Mistakes I Made

* At first I put the for loop inside the kernel functions and used `__syncthreads()` at the beginning of each iteration. However, since `__syncthreads()` only synchronizes within a block, my results went wrong once the array size exceeded my block size. I then moved the for loop outside the kernel functions, as sketched below.
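
As an illustration of that fix, here is a minimal naive-scan sketch (illustrative names, not the repo's exact code) where the loop lives on the host and each level is its own kernel launch, so the kernel boundary provides the grid-wide synchronization that `__syncthreads()` cannot:

```cpp
#include <utility>  // std::swap

__global__ void kernNaiveScanStep(int n, int offset, const int* in, int* out) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    // each level adds the value `offset` positions to the left
    out[i] = (i >= offset) ? in[i - offset] + in[i] : in[i];
}

void naiveScanHostLoop(int n, int blockSize, int* devA, int* devB) {
    int blocks = (n + blockSize - 1) / blockSize;
    for (int offset = 1; offset < n; offset *= 2) {
        // one launch per level: all blocks finish before the next level starts
        kernNaiveScanStep<<<blocks, blockSize>>>(n, offset, devA, devB);
        std::swap(devA, devB);  // ping-pong buffers
    }
    // devA now holds the inclusive scan; shift right to obtain the exclusive scan
}
```
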
Binary file added img/performance_analysis.xlsx