Project 2: Zhangkaiwen Chu #30

Open · wants to merge 1 commit into base: main
6 changes: 6 additions & 0 deletions .ipynb_checkpoints/Untitled-checkpoint.ipynb
Original file line number Diff line number Diff line change
@@ -0,0 +1,6 @@
{
"cells": [],
"metadata": {},
"nbformat": 4,
"nbformat_minor": 5
}
40 changes: 34 additions & 6 deletions README.md
@@ -3,12 +3,40 @@ CUDA Stream Compaction

**University of Pennsylvania, CIS 565: GPU Programming and Architecture, Project 2**

* (TODO) YOUR NAME HERE
* (TODO) [LinkedIn](), [personal website](), [twitter](), etc.
* Tested on: (TODO) Windows 22, i7-2222 @ 2.22GHz 22GB, GTX 222 222MB (Moore 2222 Lab)
* Zhangkaiwen Chu
* [LinkedIn](https://www.linkedin.com/in/zhangkaiwen-chu-b53060225/)
* Tested on: Windows 10, R7-5800H @ 3.20GHz 16GB, RTX 3070 Laptop GPU 16310MB (Personal Laptop)

### (TODO: Your README)
This project implements GPU stream compaction in CUDA from scratch, including the naive scan and the work-efficient scan described in GPU Gems 3, Chapter 39.

Include analysis, etc. (Remember, this is public, so don't put
anything here that you don't want to share with the world.)
## Features:
* CPU Scan & Stream Compaction
* Naive GPU Scan Algorithm
* Work-Efficient GPU Scan & Stream Compaction
* Thrust's Scan

## Optimizing Block Size
Tested with array size = 2^28
![](img/1.png)
* Optimal block size for naive scan: 128
* Optimal block size for work-efficient scan: 128

## Scan Performance Comparison
![](img/2.png)

## Compact Performance Comparison
![](img/3.png)

## Analysis
The bottleneck differs between implementations, and it also varies with the array size.

* For the CPU implementation, the runtime is always linear in the array size. The bottleneck is most likely memory I/O.
* For the naive scan, the runtime first scales with the log of the array size, then becomes linear. The total computation is linear, but the number of kernel launches is logarithmic, so for small arrays branch divergence and launch overhead dominate, while for large arrays memory I/O becomes the main factor.
* For the work-efficient scan, the pattern is similar, but it is much slower for small arrays. Profiling shows that cudaDeviceSynchronize() consumes most of the time, and the number of synchronization calls also scales with the log of the array size. For large arrays, memory I/O again becomes the main factor.
* The thrust implementation is very efficient. It appears to choose the block size dynamically and performs fewer memory transactions.
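
The log-depth structure behind these numbers can be sketched on the CPU (plain C++, illustrative only): each `stride` level of the two loops corresponds to one kernel launch in the GPU version, which is why launch and synchronization overhead scales with log of the array size.

```cpp
#include <vector>

// CPU simulation of the GPU Gems 3 work-efficient exclusive scan,
// in place, on a power-of-two-sized array.
void workEfficientScan(std::vector<int>& x) {
    int n = (int)x.size();  // assumed to be a power of two
    // Up-sweep: build a balanced tree of partial sums (log2(n) passes).
    for (int stride = 1; stride < n; stride *= 2) {
        for (int i = 0; i < n; i += 2 * stride) {
            x[i + 2 * stride - 1] += x[i + stride - 1];
        }
    }
    // Down-sweep: zero the root, then push prefix sums back down the tree.
    x[n - 1] = 0;
    for (int stride = n / 2; stride >= 1; stride /= 2) {
        for (int i = 0; i < n; i += 2 * stride) {
            int t = x[i + stride - 1];
            x[i + stride - 1] = x[i + 2 * stride - 1];
            x[i + 2 * stride - 1] += t;
        }
    }
}
```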

## Why is My GPU Approach So Slow?
My work-efficient implementation outperforms the CPU version only for arrays larger than 2^24. That is a very large array, which shows my implementation is not especially efficient: most threads in the up-sweep and down-sweep do no work. Once the array is very large, however, the computational advantage outweighs the overhead.
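
One common fix for the idle threads (not implemented in this PR) is to launch only the active threads at each level and remap the dense thread index onto the tree node it owns. A hedged CPU sketch of the up-sweep remapping, under the same power-of-two assumption as above:

```cpp
#include <vector>

// Up-sweep where level d "launches" exactly n / 2^(d+1) threads: thread k
// owns node (k + 1) * 2^(d+1) - 1, so no thread ever fails the divergent
// `index % (1 << (d + 1)) == 0` test used in the PR's kernels.
void upSweepCompactThreads(std::vector<int>& x) {
    int n = (int)x.size();  // assumed to be a power of two
    for (int stride = 1; stride < n; stride *= 2) {
        int active = n / (2 * stride);            // threads that do real work
        for (int k = 0; k < active; k++) {
            int node = (k + 1) * 2 * stride - 1;  // dense index -> tree node
            x[node] += x[node - stride];
        }
    }
}
```

On the GPU this lets the grid shrink as `d` grows, so warps stay fully occupied instead of mostly returning early.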

## Output of the Test Program
![](img/4.png)
134 changes: 134 additions & 0 deletions Untitled.ipynb

Large diffs are not rendered by default.

Binary file added img/1.png
Binary file added img/2.png
Binary file added img/3.png
Binary file added img/4.png
4 changes: 2 additions & 2 deletions src/main.cpp
@@ -13,7 +13,7 @@
#include <stream_compaction/thrust.h>
#include "testing_helpers.hpp"

const int SIZE = 1 << 8; // feel free to change the size of array
const int SIZE = 1 << 28; // feel free to change the size of array
const int NPOT = SIZE - 3; // Non-Power-Of-Two
int *a = new int[SIZE];
int *b = new int[SIZE];
@@ -140,7 +140,7 @@ int main(int argc, char* argv[]) {
//printArray(count, c, true);
printCmpLenResult(count, expectedCount, b, c);

zeroArray(SIZE, c);
zeroArray(SIZE, c);
printDesc("work-efficient compact, non-power-of-two");
count = StreamCompaction::Efficient::compact(NPOT, c, a);
printElapsedTime(StreamCompaction::Efficient::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)");
12 changes: 12 additions & 0 deletions stream_compaction/common.cu
@@ -24,6 +24,11 @@ namespace StreamCompaction {
*/
__global__ void kernMapToBoolean(int n, int *bools, const int *idata) {
// TODO
int index = threadIdx.x + (blockIdx.x * blockDim.x);
if (index >= n) {
return;
}
bools[index] = idata[index] == 0 ? 0 : 1;
}

/**
@@ -33,6 +38,13 @@
__global__ void kernScatter(int n, int *odata,
const int *idata, const int *bools, const int *indices) {
// TODO
int index = threadIdx.x + (blockIdx.x * blockDim.x);
if (index >= n) {
return;
}
if (bools[index] != 0) {
odata[indices[index]] = idata[index];
}
}

}
29 changes: 26 additions & 3 deletions stream_compaction/cpu.cu
@@ -20,6 +20,10 @@ namespace StreamCompaction {
void scan(int n, int *odata, const int *idata) {
timer().startCpuTimer();
// TODO
odata[0] = 0;
for (int i = 1; i < n; i++) {
odata[i] = odata[i - 1] + idata[i - 1];
}
timer().endCpuTimer();
}

@@ -31,8 +35,15 @@
int compactWithoutScan(int n, int *odata, const int *idata) {
timer().startCpuTimer();
// TODO
int i = 0;
for (int j = 0; j < n; j++) {
if (idata[j] != 0) {
odata[i] = idata[j];
i++;
}
}
timer().endCpuTimer();
return -1;
return i;
}

/**
@@ -41,10 +52,22 @@
* @returns the number of elements remaining after compaction.
*/
int compactWithScan(int n, int *odata, const int *idata) {
timer().startCpuTimer();

// TODO
int* temp = new int[n];
timer().startCpuTimer();
// scan(n, temp, idata);
temp[0] = 0;
for (int i = 1; i < n; i++) {
temp[i] = temp[i - 1] + (idata[i - 1] == 0 ? 0 : 1);
}
for (int i = 0; i < n; i++) {
if (idata[i] != 0) {
odata[temp[i]] = idata[i];
}
}
timer().endCpuTimer();
return -1;
// The exclusive scan value temp[n-1] omits the last element, so add
// its boolean; also free the scratch buffer instead of leaking it.
int count = temp[n - 1] + (idata[n - 1] == 0 ? 0 : 1);
delete[] temp;
return count;
}
}
}
102 changes: 100 additions & 2 deletions stream_compaction/efficient.cu
@@ -3,6 +3,8 @@
#include "common.h"
#include "efficient.h"

#define blockSize 128

namespace StreamCompaction {
namespace Efficient {
using StreamCompaction::Common::PerformanceTimer;
@@ -11,16 +13,70 @@ namespace StreamCompaction {
static PerformanceTimer timer;
return timer;
}

__global__ void kernUpSweep(int n, int d, int* x) {
int index = threadIdx.x + (blockIdx.x * blockDim.x);
if (index >= n) {
return;
}
if (index % (1 << (d + 1)) == 0) {
x[index + (1 << (d + 1)) - 1] += x[index + (1 << d) - 1];
}
}

__global__ void kernDownSweep(int n, int d, int* x) {
int index = threadIdx.x + (blockIdx.x * blockDim.x);
if (index >= n) {
return;
}
if (index % (1 << (d + 1)) == 0) {
int t = x[index + (1 << d) - 1];
x[index + (1 << d) - 1] = x[index + (1 << (d+1)) - 1];
x[index + (1 << (d + 1)) - 1] = t + x[index + (1 << (d + 1)) - 1];
}
}


/**
* Performs prefix-sum (aka scan) on idata, storing the result into odata.
*/
void scan(int n, int *odata, const int *idata) {
void scan(int n, int* odata, const int* idata) {
int* dev_idata;
int size = 1 << ilog2ceil(n);

cudaMalloc((void**)&dev_idata, size * sizeof(int));

cudaMemcpy(dev_idata, idata, n * sizeof(int), cudaMemcpyHostToDevice);
// Zero the padded tail so uninitialized values past n cannot leak
// into the scan tree.
if (size > n) {
    cudaMemset(dev_idata + n, 0, (size - n) * sizeof(int));
}

dim3 fullBlocksPerGrid((size + blockSize - 1) / blockSize);
timer().startGpuTimer();
// TODO

for (int d = 0; d < ilog2ceil(size); d++) {
cudaDeviceSynchronize();
kernUpSweep <<<fullBlocksPerGrid, blockSize>>> (size, d, dev_idata);
}

cudaDeviceSynchronize();
cudaMemset(dev_idata + size - 1, 0, sizeof(int));

for (int d = ilog2ceil(size) - 1; d >= 0; d--) {
cudaDeviceSynchronize();
kernDownSweep <<<fullBlocksPerGrid, blockSize>>> (size, d, dev_idata);
}


timer().endGpuTimer();

cudaMemcpy(odata, dev_idata, n * sizeof(int), cudaMemcpyDeviceToHost);

cudaFree(dev_idata);
}





/**
* Performs stream compaction on idata, storing the result into odata.
* All zeroes are discarded.
@@ -31,10 +87,52 @@
* @returns The number of elements remaining after compaction.
*/
int compact(int n, int *odata, const int *idata) {
int* dev_odata;
int* dev_idata;
int* dev_indices;
int* dev_bools;
int size = 1 << ilog2ceil(n);
int retSize = 0;

cudaMalloc((void**)&dev_idata, size * sizeof(int));
cudaMalloc((void**)&dev_odata, size * sizeof(int));
cudaMalloc((void**)&dev_bools, size * sizeof(int));
cudaMalloc((void**)&dev_indices, size * sizeof(int));

cudaMemcpy(dev_idata, idata, n * sizeof(int), cudaMemcpyHostToDevice);

dim3 fullBlocksPerGrid((n + blockSize - 1) / blockSize);
dim3 fullBlocksPerGridCeil((size + blockSize - 1) / blockSize);
timer().startGpuTimer();
// TODO
Common::kernMapToBoolean <<<fullBlocksPerGrid, blockSize>>> (n, dev_bools, dev_idata);
cudaDeviceSynchronize();
cudaMemcpy(dev_indices, dev_bools, n * sizeof(int), cudaMemcpyDeviceToDevice);
// Zero the padded tail so uninitialized values past n cannot leak
// into the scan tree.
if (size > n) {
    cudaMemset(dev_indices + n, 0, (size - n) * sizeof(int));
}

for (int d = 0; d < ilog2ceil(size); d++) {
kernUpSweep <<<fullBlocksPerGridCeil, blockSize>>> (size, d, dev_indices);
cudaDeviceSynchronize();
}

cudaMemset(dev_indices + size - 1, 0, sizeof(int));

for (int d = ilog2ceil(size) - 1; d >= 0; d--) {
kernDownSweep <<<fullBlocksPerGridCeil, blockSize>>> (size, d, dev_indices);
cudaDeviceSynchronize();
}

Common::kernScatter <<<fullBlocksPerGrid, blockSize>>> (n, dev_odata, dev_idata, dev_bools, dev_indices);

timer().endGpuTimer();
return -1;
// Compacted length = last exclusive-scan value + last boolean; reading
// dev_indices[size - 1] alone drops the final element when n == size.
int lastBool = 0;
cudaMemcpy(&retSize, dev_indices + n - 1, sizeof(int), cudaMemcpyDeviceToHost);
cudaMemcpy(&lastBool, dev_bools + n - 1, sizeof(int), cudaMemcpyDeviceToHost);
retSize += lastBool;
cudaMemcpy(odata, dev_odata, retSize * sizeof(int), cudaMemcpyDeviceToHost);

cudaFree(dev_bools);
cudaFree(dev_indices);
cudaFree(dev_odata);
cudaFree(dev_idata);

return retSize;
}
}
}
40 changes: 39 additions & 1 deletion stream_compaction/naive.cu
@@ -3,6 +3,8 @@
#include "common.h"
#include "naive.h"

#define blockSize 128

namespace StreamCompaction {
namespace Naive {
using StreamCompaction::Common::PerformanceTimer;
@@ -12,14 +14,50 @@ namespace StreamCompaction {
return timer;
}
// TODO: __global__
__global__ void kernNaiveScan(int n, int d, int* odata, int* idata) {
int index = threadIdx.x + (blockIdx.x * blockDim.x);
if (index >= n) {
return;
}
if (index >= (1 << (d-1))) {
odata[index] = idata[index - (1 << (d - 1))] + idata[index];
}
else {
odata[index] = idata[index];
}
}




/**
* Performs prefix-sum (aka scan) on idata, storing the result into odata.
*/
void scan(int n, int *odata, const int *idata) {
void scan(int n, int* odata, const int* idata) {
int* dev_odata;
int* dev_idata;

cudaMalloc((void**)&dev_odata, n * sizeof(int));
cudaMalloc((void**)&dev_idata, n * sizeof(int));

cudaMemcpy(dev_idata, idata, n * sizeof(int), cudaMemcpyHostToDevice);

dim3 fullBlocksPerGrid((n + blockSize - 1) / blockSize);
timer().startGpuTimer();
// TODO
for (int d = 1; d <= ilog2ceil(n); d++) {
kernNaiveScan <<<fullBlocksPerGrid, blockSize>>> (n, d, dev_odata, dev_idata);
std::swap(dev_odata, dev_idata);
}
timer().endGpuTimer();

odata[0] = 0;
cudaMemcpy(odata + 1, dev_idata, (n - 1) * sizeof(int), cudaMemcpyDeviceToHost);

cudaFree(dev_odata);
cudaFree(dev_idata);
}
}
}


5 changes: 5 additions & 0 deletions stream_compaction/thrust.cu
@@ -18,11 +18,16 @@ namespace StreamCompaction {
* Performs prefix-sum (aka scan) on idata, storing the result into odata.
*/
void scan(int n, int *odata, const int *idata) {
thrust::device_vector<int> dv_in(idata, idata + n);
thrust::device_vector<int> dv_out(n);
timer().startGpuTimer();
// TODO use `thrust::exclusive_scan`
// example: for device_vectors dv_in and dv_out:
// thrust::exclusive_scan(dv_in.begin(), dv_in.end(), dv_out.begin());
thrust::exclusive_scan(dv_in.begin(), dv_in.end(), dv_out.begin());
timer().endGpuTimer();

thrust::copy(dv_out.begin(), dv_out.end(), odata);
}
}
}