Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Project 2: Tong Hu #21

Open
wants to merge 4 commits into
base: main
Choose a base branch
from
Open
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
118 changes: 111 additions & 7 deletions README.md
Original file line number Diff line number Diff line change
@@ -1,14 +1,118 @@
CUDA Stream Compaction
======================

**University of Pennsylvania, CIS 565: GPU Programming and Architecture, Project 2**
**University of Pennsylvania, CIS 565: GPU Programming and Architecture, Project 2 - Stream Compaction**

* (TODO) YOUR NAME HERE
* (TODO) [LinkedIn](), [personal website](), [twitter](), etc.
* Tested on: (TODO) Windows 22, i7-2222 @ 2.22GHz 22GB, GTX 222 222MB (Moore 2222 Lab)
* Tong Hu
* [LinkedIn](https://www.linkedin.com/in/tong-hu-5819a122a/)
* Time tested on: Windows 11, Ryzen 7 1700X @ 3.4GHz 16GB, GTX 1080 16GB (Personal Computer)
* Nsight system analysis tested on: Windows 11, i5-11600K @ 3.91GHz 32GB, RTX 2060 6GB (Personal Desktop) (since GTX 1080 does not support GPU metric collection)

### (TODO: Your README)
### Features
- CPU Scan & Stream Compaction
- Naive GPU Scan algorithm
- Work-Efficient GPU Scan (extra credit version) & Stream Compaction
- Using Thrust's Implementation

Include analysis, etc. (Remember, this is public, so don't put
anything here that you don't want to share with the world.)
### Introduction
In this project, I implemented the scan function commonly used in stream compaction of int array. I implemented 3 different version (CPU version, naive parallel version, and efficient work version) of the Scan(exlusive Prefix Sum) algorithm, and use thrust's implemntation of exclusive scan, then compared the performance of each version of Scan.

### Roughly optimize the block sizes
Figure1. Time(ms) of scan function vs. block size (array size = $2^{27}$)
![Time(ms) vs. Block Size](/img/time_vs_blockSize.png)

Table 1. Time(ms) of scan function in different block size (array size = $2^{27}$)
| Block size | 1 | 8 | 16 | 32 | 64 | 128 | 256 | 512 |
| --------------- | ------- | ------- | ------- | ------- | ------- | ------- | ------- | ------- |
| naïve pow2 | 4378.43 | 588.428 | 313.288 | 164.764 | 146.405 | 144.845 | 144.111 | 144.72 |
| naïve npow2 | 4373.24 | 571.921 | 291.82 | 164.399 | 146.764 | 144.936 | 144.916 | 144.649 |
| efficient pow2 | 328.222 | 64.9941 | 57.1655 | 58.7879 | 62.4402 | 62.0726 | 62.0135 | 60.9551 |
| efficient npow2 | 332.084 | 65.0913 | 56.9548 | 58.7963 | 62.0282 | 62.4519 | 61.8724 | 61.3648 |

From Figure 1 we can see that as the block size increases, the time of Scan function decreases rapidly, and after some point, the time does not change much although the number of block increased. From Table 1 we can see the naive parallel scan performs best when the block size equals to 256, while the work efficient parallel scan performs best when the block size is 16. Therefore I select 256 and 16 as block sizes for naive scan and work-efficient scan in the following performance analysis.

### Performance analysis
Figure 2. Time(ms) of scan function in diffrent array size.
![Time(ms) vs. array size](/img/time_vs_arraySize.png)

Table 2. Time(ms) of scan function in diffrent array size.
| log of array size | 4 | 8 | 16 | 20 | 22 | 24 | 25 | 26 | 27 | 28 | 29 | 30 |
| ----------------- | -------- | -------- | -------- | -------- | ------- | ------- | ------- | ------- | ------- | ------- | ------- | ------- |
| cpu | 0.0002 | 0.0008 | 0.153 | 2.4054 | 9.0351 | 37.6997 | 78.1179 | 153.777 | 304.135 | 592.904 | 1266.01 | 2379.02 |
| naïve | 0.02048 | 0.19456 | 0.318272 | 0.918432 | 3.64445 | 16.1397 | 33.5231 | 68.8091 | 144.143 | 302.976 | 639.805 | 4626.05 |
| efficient | 0.306176 | 0.326656 | 0.566368 | 1.32307 | 2.16864 | 7.9192 | 15.282 | 29.2795 | 57.1435 | 113.729 | 234.424 | 429.09 |
| thrust | 0.233504 | 0.131072 | 0.111616 | 0.929792 | 0.75264 | 1.57184 | 2.51597 | 3.45184 | 5.73133 | 10.7909 | 20.3489 | 112.254 |


From Figure 2 we can tell that when the array size increased, the time cost of Scan will increase.

When the array size is small (smaller than $2^{16}$), CPU Scan performs better than GPU Scans and thrust's Scan. This is probably because the overhead of invoking GPU kernels overwhelms the benefit of parallel scans. When the array size is relative large, the time cost less for GPU Scans.

Comparing the naive parallel scan and work-efficient parallel scan, we can see from the figure that the work-efficient parallel scan performs better when the array size is large (greater than $2^{20}$). It takes naive parallel scan $O(n\log n)$ floating point adds operations while takes work-efficient scan $O(n)$ adds. Although both algorithm seems to run in parallel in the ideal case, in reality the number of threads run in parallel is bounded by hardware, and therefore, work-efficient scan performs better since the number of threads it need to lauch is smaller.

Figure 3. Nsight system trace
![Nsight system trace](/img/overall_2.png)

From Figure 3, we can see that compared with self-implemented Scan, the thrust's scan has very low DRAM bandwidth usage, and the unallocated warps in active SMs are also very low. Following are factors that might affect the performance:

1. Thrust's Scan might use shared memory, and memory coalescing when accessing global memory. This improves the memory throughput.
2. Thrust's Scan might optimize the block size and launch parameters based on workload and GPU type while our self-implemented scans hard code the block size only roughly optimized by eye.

### Bottleneck
Also, from Figure 3, we can know that the bottleneck of performance of Scan functions should be the memory bandwidth. The bandwidth of DRAM of both naive and work-efficient Scan are almost full.

### The output of the test program when the array size is $2^{25}$.
```
****************
** SCAN TESTS **
****************
[ 26 0 20 47 36 3 37 33 24 20 25 43 43 ... 6 0 ]
==== cpu scan, power-of-two ====
elapsed time: 79.4345ms (std::chrono Measured)
[ 0 26 26 46 93 129 132 169 202 226 246 271 314 ... 821748880 821748886 ]
==== cpu scan, non-power-of-two ====
elapsed time: 78.2092ms (std::chrono Measured)
[ 0 26 26 46 93 129 132 169 202 226 246 271 314 ... 821748831 821748870 ]
passed
==== naive scan, power-of-two ====
elapsed time: 34.0556ms (CUDA Measured)
passed
==== naive scan, non-power-of-two ====
elapsed time: 33.1287ms (CUDA Measured)
passed
==== work-efficient scan, power-of-two ====
elapsed time: 14.6806ms (CUDA Measured)
passed
==== work-efficient scan, non-power-of-two ====
elapsed time: 14.6292ms (CUDA Measured)
passed
==== thrust scan, power-of-two ====
elapsed time: 2.33472ms (CUDA Measured)
passed
==== thrust scan, non-power-of-two ====
elapsed time: 2.2352ms (CUDA Measured)
passed

*****************************
** STREAM COMPACTION TESTS **
*****************************
[ 1 0 0 1 1 0 0 2 2 1 3 1 0 ... 3 0 ]
==== cpu compact without scan, power-of-two ====
elapsed time: 82.3193ms (std::chrono Measured)
[ 1 1 1 2 2 1 3 1 1 1 2 1 2 ... 2 3 ]
passed
==== cpu compact without scan, non-power-of-two ====
elapsed time: 82.2665ms (std::chrono Measured)
[ 1 1 1 2 2 1 3 1 1 1 2 1 2 ... 3 2 ]
passed
==== cpu compact with scan ====
elapsed time: 329.982ms (std::chrono Measured)
[ 1 1 1 2 2 1 3 1 1 1 2 1 2 ... 2 3 ]
passed
==== work-efficient compact, power-of-two ====
elapsed time: 21.669ms (CUDA Measured)
passed
==== work-efficient compact, non-power-of-two ====
elapsed time: 21.466ms (CUDA Measured)
passed
```
Binary file added img/overall.png
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
Binary file added img/overall_2.png
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
Binary file added img/thrust.png
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
Binary file added img/time_vs_arraySize.png
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
Binary file added img/time_vs_blockSize.png
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
2 changes: 1 addition & 1 deletion src/main.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -13,7 +13,7 @@
#include <stream_compaction/thrust.h>
#include "testing_helpers.hpp"

const int SIZE = 1 << 8; // feel free to change the size of array
const int SIZE = 1 << 25; // feel free to change the size of array
const int NPOT = SIZE - 3; // Non-Power-Of-Two
int *a = new int[SIZE];
int *b = new int[SIZE];
Expand Down
9 changes: 9 additions & 0 deletions stream_compaction/common.cu
Original file line number Diff line number Diff line change
@@ -1,4 +1,5 @@
#include "common.h"
#include "device_launch_parameters.h"

void checkCUDAErrorFn(const char *msg, const char *file, int line) {
cudaError_t err = cudaGetLastError();
Expand All @@ -24,6 +25,9 @@ namespace StreamCompaction {
*/
__global__ void kernMapToBoolean(int n, int *bools, const int *idata) {
// TODO
int idx = (blockIdx.x * blockDim.x) + threadIdx.x;
if (idx >= n) return;
bools[idx] = idata[idx] == 0 ? 0 : 1;
}

/**
Expand All @@ -33,6 +37,11 @@ namespace StreamCompaction {
__global__ void kernScatter(int n, int *odata,
const int *idata, const int *bools, const int *indices) {
// TODO
int idx = (blockIdx.x * blockDim.x) + threadIdx.x;
if (idx >= n) return;
if (bools[idx] == 1) {
odata[indices[idx]] = idata[idx];
}
}

}
Expand Down
1 change: 1 addition & 0 deletions stream_compaction/common.h
Original file line number Diff line number Diff line change
Expand Up @@ -12,6 +12,7 @@

#define FILENAME (strrchr(__FILE__, '/') ? strrchr(__FILE__, '/') + 1 : __FILE__)
#define checkCUDAError(msg) checkCUDAErrorFn(msg, FILENAME, __LINE__)
//#define BLOCK_SIZE 128

/**
* Check for CUDA errors; print and exit if there was a problem.
Expand Down
36 changes: 32 additions & 4 deletions stream_compaction/cpu.cu
Original file line number Diff line number Diff line change
@@ -1,5 +1,6 @@
#include <cstdio>
#include "cpu.h"
#include <iostream>

#include "common.h"

Expand All @@ -19,7 +20,11 @@ namespace StreamCompaction {
*/
void scan(int n, int *odata, const int *idata) {
timer().startCpuTimer();
// TODO
// TODO: traverse all elements and record the exclusive prefix sum to odata.
odata[0] = 0;
for (int i = 1; i < n; i++) {
odata[i] = odata[i-1] + idata[i-1];
}
timer().endCpuTimer();
}

Expand All @@ -30,9 +35,16 @@ namespace StreamCompaction {
*/
int compactWithoutScan(int n, int *odata, const int *idata) {
timer().startCpuTimer();
// TODO
// TODO: traverse all elements and copy to odata without 0's.
int count = 0;
for (int i = 0; i < n; i++) {
if (idata[i] != 0) {
odata[count] = idata[i];
count++;
}
}
timer().endCpuTimer();
return -1;
return count;
}

/**
Expand All @@ -41,10 +53,26 @@ namespace StreamCompaction {
* @returns the number of elements remaining after compaction.
*/
int compactWithScan(int n, int *odata, const int *idata) {
int* scanResult = new int[n];
int* bools = new int[n];
timer().startCpuTimer();
// TODO
//std::cout << "start traverse and build temp arr" << std::endl;
int sum = 0;
for (int i = 0; i < n; i++) {
bools[i] = idata[i] == 0 ? 0 : 1;
scanResult[i] = sum;
sum += bools[i];
}
for (int i = 0; i < n; i++) {
if (bools[i] == 1) {
odata[scanResult[i]] = idata[i];
}
}
timer().endCpuTimer();
return -1;
int count = scanResult[n - 1] + bools[n - 1];
delete[] scanResult;
return count;
}
}
}
88 changes: 86 additions & 2 deletions stream_compaction/efficient.cu
Original file line number Diff line number Diff line change
Expand Up @@ -2,6 +2,10 @@
#include <cuda_runtime.h>
#include "common.h"
#include "efficient.h"
#include "device_launch_parameters.h"
#include <iostream>

#define BLOCK_SIZE 16

namespace StreamCompaction {
namespace Efficient {
Expand All @@ -12,13 +16,65 @@ namespace StreamCompaction {
return timer;
}

__global__ void efficientScanUpSweep(int n, int nThread, int d, int *data) {
int idx = (blockIdx.x * blockDim.x) + threadIdx.x;
if (idx >= nThread) return;
int currIdx = ((idx + 1) << (d + 1)) - 1;
int prevIdx = currIdx - (1 << d);
data[currIdx] += data[prevIdx];
}

__global__ void efficientScanDownSweep(int n, int nThread, int d, int *data) {
int idx = (blockIdx.x * blockDim.x) + threadIdx.x;
if (idx >= nThread) return;
int currIdx = (n >> d) * (idx + 1) - 1;
int prevIdx = currIdx - (n >> (d + 1));
int temp = data[currIdx];
data[currIdx] += data[prevIdx];
data[prevIdx] = temp;
}

__global__ void replaceWithZero(int n, int nThread, int* data) {
int idx = (blockIdx.x * blockDim.x) + threadIdx.x;
if (idx >= nThread) return;
data[n - 1] = 0;
}

void efficientScanUpDownSweep(int n, int newN, int* dev_idata) {
dim3 numBlocks;
int nThread = newN;
// up sweep
for (int d = 0; d < ilog2ceil(n); d++) {
nThread = newN >> (d + 1);
numBlocks = (nThread + BLOCK_SIZE - 1) / BLOCK_SIZE;
efficientScanUpSweep <<<numBlocks, BLOCK_SIZE >>> (newN, nThread, d, dev_idata);
}
// replace the last number of the array with 0.
replaceWithZero <<<1, 1 >>> (newN, 1, dev_idata);
// down sweep
for (int d = 0; d < ilog2ceil(n); d++) {
nThread = 1 << d;
numBlocks = (nThread + BLOCK_SIZE - 1) / BLOCK_SIZE;
efficientScanDownSweep <<<numBlocks, BLOCK_SIZE >>> (newN, nThread, d, dev_idata);
}
}

/**
* Performs prefix-sum (aka scan) on idata, storing the result into odata.
*/
void scan(int n, int *odata, const int *idata) {
int* dev_idata;
int newN = 1 << ilog2ceil(n);
cudaMalloc((void**)&dev_idata, sizeof(int) * newN);
checkCUDAError("Efficient scan: cudaMalloc failed (dev_idata)");
cudaMemcpy(dev_idata, idata, sizeof(int) * n, cudaMemcpyHostToDevice);
timer().startGpuTimer();
// TODO
//// TODO
efficientScanUpDownSweep(n, newN, dev_idata);

timer().endGpuTimer();
cudaMemcpy(odata, dev_idata, sizeof(int) * n, cudaMemcpyDeviceToHost);
cudaFree(dev_idata);
}

/**
Expand All @@ -31,10 +87,38 @@ namespace StreamCompaction {
* @returns The number of elements remaining after compaction.
*/
int compact(int n, int *odata, const int *idata) {
int newN = 1 << ilog2ceil(n);
int* dev_idata;
cudaMalloc((void**)&dev_idata, sizeof(int) * newN);
checkCUDAError("Efficient scan: cudaMalloc failed (dev_idata)");
int* dev_bools;
cudaMalloc((void**)&dev_bools, sizeof(int) * newN);
checkCUDAError("Efficient scan: cudaMalloc failed (dev_bools)");
int* dev_indices;
cudaMalloc((void**)&dev_indices, sizeof(int) * newN);
checkCUDAError("Efficient scan: cudaMalloc failed (dev_indices)");

cudaMemcpy(dev_idata, idata, sizeof(int) * n, cudaMemcpyHostToDevice);
dim3 numBlocks = (n + BLOCK_SIZE - 1) / BLOCK_SIZE;
timer().startGpuTimer();
// TODO
Common::kernMapToBoolean <<<numBlocks, BLOCK_SIZE>>> (n, dev_bools, dev_idata);
cudaMemcpy(dev_indices, dev_bools, sizeof(int) * n, cudaMemcpyDeviceToDevice);

efficientScanUpDownSweep(n, newN, dev_indices);

Common::kernScatter <<<numBlocks, BLOCK_SIZE>>> (n, dev_bools, dev_idata, dev_bools, dev_indices);

timer().endGpuTimer();
return -1;
int lastIdx;
cudaMemcpy(&lastIdx, dev_indices + n - 1, sizeof(int), cudaMemcpyDeviceToHost);
int lastBool;
cudaMemcpy(&lastBool, dev_bools + n - 1, sizeof(int), cudaMemcpyDeviceToHost);
cudaMemcpy(odata, dev_bools, sizeof(int) * n, cudaMemcpyDeviceToHost);
cudaFree(dev_idata);
cudaFree(dev_bools);
cudaFree(dev_indices);
return lastIdx + lastBool;
}
}
}
Loading