Project 2: Constance Wang #20

Open · wants to merge 18 commits into base: main
145 changes: 139 additions & 6 deletions README.md
@@ -3,12 +3,145 @@ CUDA Stream Compaction

**University of Pennsylvania, CIS 565: GPU Programming and Architecture, Project 2**

* (TODO) YOUR NAME HERE
* (TODO) [LinkedIn](), [personal website](), [twitter](), etc.
* Tested on: (TODO) Windows 22, i7-2222 @ 2.22GHz 22GB, GTX 222 222MB (Moore 2222 Lab)
Constance Wang
* [LinkedIn](https://www.linkedin.com/in/conswang/)

### (TODO: Your README)
Tested on AORUS 15P XD laptop with specs:
- Windows 11 22000.856
- 11th Gen Intel(R) Core(TM) i7-11800H @ 2.30GHz
- NVIDIA GeForce RTX 3070 Laptop GPU

Include analysis, etc. (Remember, this is public, so don't put
anything here that you don't want to share with the world.)
I implemented the following parallel algorithms on the GPU and benchmarked them against my own CPU implementations and Thrust on the GPU:
- Naive Scan
- Work efficient scan
- Stream compaction
- Radix sort

I roughly optimized the block size for each algorithm by measuring which block size ran fastest on arrays of size 2^22 (approximately 4 million elements).

| Block Size | Runtime - Naive (ms) | Runtime - Work efficient (ms) |
| ----------- | ----------- | ----------- |
| 32 | 4.21734 | 1.32106 |
| 64 | 2.11395 | 1.36259 |
| 128 | 2.09267 | 1.26221 |
| 256 | 2.09258 | 1.28563 |
| 384 | 2.11395 | 1.3327 |
| 768 | 2.11405 | 1.26701 |

Performance was similar for most block sizes, but both the naive and work-efficient scans started to suffer at around 32 or 64 threads per block. Based on this, I used a block size of 128 threads to compare the algorithms across different array sizes.

![](img/Performance%20of%20Scan%20Implementations%20on%20Different%20Array%20Sizes.svg)

| Array size | CPU (ms) | Naive (ms) | Work efficient (ms) | Thrust (ms) |
|------------|---------|----------|----------------|----------|
| 65536 | 0.1023 | 0.158464 | 0.266592 | 0.045472 |
| 262144 | 0.399 | 0.2616 | 0.33888 | 0.194144 |
| 1048576 | 1.6835 | 0.636288 | 0.472416 | 0.351648 |
| 4194304 | 6.392 | 2.20544 | 1.27302 | 0.523776 |
| 16777216 | 25.5751 | 8.98938 | 4.05302 | 1.09213 |
| 67108864 | 100.736 | 38.8708 | 15.4414 | 2.14362 |
| 268435456 | 410.365 | 169.486 | 60.6265 | 6.23341 |

### Analysis
The CPU implementation's run time appears to be linear with respect to the number of array elements. This makes sense because each element is processed one at a time inside a for loop.
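
As a minimal sketch of that sequential structure (illustrative only, not the exact code in `cpu.cu`):

```
// Sequential exclusive scan: one pass over the array, O(n) work.
void cpuExclusiveScan(int n, int *odata, const int *idata) {
    int sum = 0;
    for (int i = 0; i < n; ++i) {
        odata[i] = sum;   // write the running total before adding idata[i]
        sum += idata[i];
    }
}
```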

Thrust is the fastest by far. This is probably because it uses shared memory, while all of my implementations use only global memory, which is much slower to access and makes each kernel thread slower; Thrust likely applies other optimizations as well.
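
For reference, the Thrust comparison point boils down to a call like the following (a sketch; the project's `thrust.cu` wraps it with its own timers):

```
#include <thrust/device_vector.h>
#include <thrust/scan.h>

// Copy to device, let Thrust pick its own launch configuration (and use
// shared memory internally), then copy the result back to the host.
void thrustScan(int n, int *odata, const int *idata) {
    thrust::device_vector<int> dv_in(idata, idata + n);
    thrust::device_vector<int> dv_out(n);
    thrust::exclusive_scan(dv_in.begin(), dv_in.end(), dv_out.begin());
    thrust::copy(dv_out.begin(), dv_out.end(), odata);
}
```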

The work-efficient scan is faster than the naive scan. This should be because I made optimizations (see next section) to reduce the number of threads at each iteration, whereas the naive scan still launches n threads each iteration.
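
To make the contrast concrete, here is a sketch of one pass of the naive (Hillis-Steele style) scan; the indexing is illustrative rather than copied from `naive.cu`:

```
// One pass of the naive scan at a given offset (1, 2, 4, ...). Every pass
// launches a thread per element, even though only threads with
// index >= offset actually add two values; the rest just copy. Ping-pong
// buffers are required because threads read values other threads overwrite.
__global__ void kernNaiveScanPass(int n, int offset, int *odata, const int *idata) {
    int index = threadIdx.x + (blockIdx.x * blockDim.x);
    if (index >= n) {
        return;
    }
    if (index >= offset) {
        odata[index] = idata[index - offset] + idata[index];
    } else {
        odata[index] = idata[index];
    }
}
```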

In all implementations, computation should not be the performance bottleneck: each kernel thread does roughly O(1) work, so we can't really do better than that.

Aside from the above trends, memory I/O (cudaMemcpy) is a giant performance bottleneck. This is not shown in the performance graph, since we start measuring runtime after the initial cudaMemcpy and stop measuring before the final cudaMemcpy. Still, cudaMemcpy runs in O(n) time, which effectively makes any GPU algorithm O(n), even though the scan itself runs in O(log n) parallel steps.

However, in practice, cudaMemcpy is still very fast, probably because the hardware bandwidth for copying data from host to device is very large. For example, on an array of size 2^26, the CPU radix sort took about 7 seconds (7131.1 ms), while the GPU implementation took about 1 second (826.585 ms) including the cudaMemcpy and about half a second (434.71 ms) without it. So while cudaMemcpy is still a huge bottleneck for GPU performance (taking up about half the runtime), it does not drag the overall runtime to the point of being effectively linear, even at large array sizes. In the future, I could try to measure the bandwidth of cudaMemcpy on my GPU.
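
A minimal sketch of how that measurement could look, using CUDA events (the buffer names and helper function are hypothetical, not part of the current code):

```
// Time a single host-to-device copy with CUDA events and report the
// effective bandwidth in GB/s. dev_buf and host_buf are hypothetical
// buffers assumed to be already allocated.
float measureH2DBandwidthGBps(int *dev_buf, const int *host_buf, size_t n) {
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    cudaMemcpy(dev_buf, host_buf, n * sizeof(int), cudaMemcpyHostToDevice);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);

    // bytes per second, converted to GB/s
    return (n * sizeof(int)) / (ms * 1e-3f) / 1e9f;
}
```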

### Extra credit

#### Performance
My work-efficient scan halves the total number of threads launched each iteration. This means fewer threads sit idle and take up space on the GPU multiprocessors while other threads could be running. As a result, the work-efficient scan is faster than both the naive and CPU implementations at array sizes of 2^18 and larger.
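
A sketch of the idea (illustrative indexing, not a verbatim copy of `efficient.cu`): at depth `d`, the up-sweep only needs `n / 2^(d+1)` threads, so the host shrinks the launch each iteration instead of covering all `n` elements every time.

```
// Up-sweep (reduce) step: each launched thread owns exactly one tree node
// at this depth, so numThreads = n >> (d + 1) and no thread idles behind an
// early-return branch. stride = 2^(d+1); the grid halves every iteration.
__global__ void kernUpSweep(int numThreads, int stride, int *data) {
    int index = threadIdx.x + (blockIdx.x * blockDim.x);
    if (index >= numThreads) {
        return;
    }
    // Map the dense thread index onto the sparse pair of nodes it combines.
    int right = (index + 1) * stride - 1;   // node this thread writes
    int left = right - (stride >> 1);       // its left child
    data[right] += data[left];
}
```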

#### Radix sort

I implemented Radix sort on the CPU, wrote two test cases, and added Radix sort on the GPU, which calls my work-efficient scan. The functions to look at are:
- `radixSort` in `naive.cu`
- `radixSort` in `cpu.cu`

Example of usage:
`StreamCompaction::CPU::radixSort(NPOT, RADIX_NUM_BITS, b, a);`

A few notes: you can pass in the number of bits to sort by, which should be `ilog2ceil(MAX_ARRAY_ELEMENT_VALUE)`. For simplicity, I assumed each element is a non-negative integer (although still using `int` rather than `unsigned int` types) so that I can just use a bitmask to compact the arrays. Finally, when testing, the array size should not be too close to 2^31 because of integer overflow issues.
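
As a rough sketch of what each pass does, here is a sequential version of one stable LSD pass, assuming non-negative elements (the GPU version computes the destination indices with the work-efficient scan instead of these loops):

```
// One stable LSD radix-sort pass: elements whose current bit is 0 come
// first, then elements whose bit is 1, preserving relative order within
// each group.
void radixSortPass(int n, int bit, int *odata, const int *idata) {
    int out = 0;
    for (int i = 0; i < n; ++i) {      // place all 0-bit elements first
        if (((idata[i] >> bit) & 1) == 0) {
            odata[out++] = idata[i];
        }
    }
    for (int i = 0; i < n; ++i) {      // then all 1-bit elements
        if (((idata[i] >> bit) & 1) == 1) {
            odata[out++] = idata[i];
        }
    }
}
```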

#### Sample program output
Settings: `blockSize` = 128, `SIZE` = 1 << 26

```

****************
** SCAN TESTS **
****************
[ 29 25 4 37 8 30 21 31 8 21 22 19 3 ... 49 0 ]
==== cpu scan, power-of-two ====
elapsed time: 102.046ms (std::chrono Measured)
[ 0 29 54 58 95 103 133 154 185 193 214 236 255 ... 1643658734 1643658783 ]
==== cpu scan, non-power-of-two ====
elapsed time: 99.084ms (std::chrono Measured)
[ 0 29 54 58 95 103 133 154 185 193 214 236 255 ... 1643658648 1643658670 ]
passed
==== naive scan, power-of-two ====
elapsed time: 38.8846ms (CUDA Measured)
passed
==== naive scan, non-power-of-two ====
elapsed time: 37.4897ms (CUDA Measured)
passed
==== work-efficient scan, power-of-two ====
elapsed time: 15.4577ms (CUDA Measured)
passed
==== work-efficient scan, non-power-of-two ====
elapsed time: 15.4086ms (CUDA Measured)
passed
==== thrust scan, power-of-two ====
elapsed time: 2.09901ms (CUDA Measured)
passed
==== thrust scan, non-power-of-two ====
elapsed time: 2.36995ms (CUDA Measured)
passed

*****************************
** STREAM COMPACTION TESTS **
*****************************
[ 0 3 2 3 3 1 3 2 2 3 1 3 0 ... 3 0 ]
==== cpu compact without scan, power-of-two ====
elapsed time: 137.514ms (std::chrono Measured)
[ 3 2 3 3 1 3 2 2 3 1 3 1 3 ... 2 3 ]
passed
==== cpu compact without scan, non-power-of-two ====
elapsed time: 137.567ms (std::chrono Measured)
[ 3 2 3 3 1 3 2 2 3 1 3 1 3 ... 2 1 ]
passed
==== cpu compact with scan ====
elapsed time: 348.893ms (std::chrono Measured)
[ 3 2 3 3 1 3 2 2 3 1 3 1 3 ... 2 3 ]
passed
==== work-efficient compact, power-of-two ====
elapsed time: 19.1836ms (CUDA Measured)
passed
==== work-efficient compact, non-power-of-two ====
elapsed time: 19.1201ms (CUDA Measured)
passed

*****************************
** RADIX SORT TESTS **
*****************************
[ 31399 13580 25635 22845 23360 14322 9628 3467 20074 16251 14385 30083 26014 ... 230 0 ]
==== cpu radix sort, power-of-two ====
elapsed time: 7131.1ms (std::chrono Measured)
[ 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 32767 32767 ]
==== radix sort, power of two ====
elapsed time: 826.585ms (CUDA Measured)
passed
==== cpu radix sort, non-power-of-two ====
elapsed time: 7102.31ms (std::chrono Measured)
[ 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 32767 32767 ]
==== radix sort, non-power of two ====
elapsed time: 788.974ms (CUDA Measured)
passed
```
55 changes: 55 additions & 0 deletions performance-at-last-commit.txt
@@ -0,0 +1,55 @@
On length = 1 << 18

****************
** SCAN TESTS **
****************
[ 4 25 22 2 14 31 30 0 44 13 11 45 24 ... 5 0 ]
==== cpu scan, power-of-two ====
elapsed time: 0.3819ms (std::chrono Measured)
[ 0 4 29 51 53 67 98 128 128 172 185 196 241 ... 6433433 6433438 ]
==== cpu scan, non-power-of-two ====
elapsed time: 0.4017ms (std::chrono Measured)
[ 0 4 29 51 53 67 98 128 128 172 185 196 241 ... 6433393 6433393 ]
passed
==== naive scan, power-of-two ====
elapsed time: 0.327168ms (CUDA Measured)
passed
==== naive scan, non-power-of-two ====
elapsed time: 0.229856ms (CUDA Measured)
passed
==== work-efficient scan, power-of-two ====
elapsed time: 0.331456ms (CUDA Measured)
passed
==== work-efficient scan, non-power-of-two ====
elapsed time: 0.343808ms (CUDA Measured)
passed
==== thrust scan, power-of-two ====
elapsed time: 0.146592ms (CUDA Measured)
passed
==== thrust scan, non-power-of-two ====
elapsed time: 0.190304ms (CUDA Measured)
passed

*****************************
** STREAM COMPACTION TESTS **
*****************************
[ 3 0 0 1 3 3 3 3 1 1 1 1 2 ... 1 0 ]
==== cpu compact without scan, power-of-two ====
elapsed time: 0.5161ms (std::chrono Measured)
[ 3 1 3 3 3 3 1 1 1 1 2 3 3 ... 2 1 ]
passed
==== cpu compact without scan, non-power-of-two ====
elapsed time: 0.5447ms (std::chrono Measured)
[ 3 1 3 3 3 3 1 1 1 1 2 3 3 ... 2 3 ]
passed
==== cpu compact with scan ====
elapsed time: 1.4394ms (std::chrono Measured)
[ 3 1 3 3 3 3 1 1 1 1 2 3 3 ... 2 1 ]
passed
==== work-efficient compact, power-of-two ====
elapsed time: 0.3688ms (CUDA Measured)
passed
==== work-efficient compact, non-power-of-two ====
elapsed time: 0.500928ms (CUDA Measured)
passed
Press any key to continue...
39 changes: 37 additions & 2 deletions src/main.cpp
@@ -13,8 +13,8 @@
#include <stream_compaction/thrust.h>
#include "testing_helpers.hpp"

const int SIZE = 1 << 8; // feel free to change the size of array
const int NPOT = SIZE - 3; // Non-Power-Of-Two
const int SIZE = 1 << 26; // feel free to change the size of array
const int NPOT = SIZE - 3; // Non-Power-Of-Two
int *a = new int[SIZE];
int *b = new int[SIZE];
int *c = new int[SIZE];
@@ -147,6 +147,41 @@ int main(int argc, char* argv[]) {
//printArray(count, c, true);
printCmpLenResult(count, expectedNPOT, b, c);

printf("\n");
printf("*****************************\n");
printf("** RADIX SORT TESTS **\n");
printf("*****************************\n");

genArray(SIZE - 1, a, 696969); // Leave a 0 at the end to test that edge case
a[SIZE - 1] = 0;
printArray(SIZE, a, true);

#define RADIX_NUM_BITS ilog2ceil(696969)

zeroArray(SIZE, b);
printDesc("cpu radix sort, power-of-two");
StreamCompaction::CPU::radixSort(SIZE, RADIX_NUM_BITS, b, a);
printElapsedTime(StreamCompaction::CPU::timer().getCpuElapsedTimeForPreviousOperation(), "(std::chrono Measured)");
printArray(SIZE, b, true);

zeroArray(SIZE, c);
printDesc("radix sort, power of two");
StreamCompaction::Naive::radixSort(SIZE, RADIX_NUM_BITS, c, a);
printElapsedTime(StreamCompaction::Naive::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)");
printCmpResult(SIZE, b, c); // compare against the CPU radix sort result in b

zeroArray(SIZE, b);
printDesc("cpu radix sort, non-power-of-two");
StreamCompaction::CPU::radixSort(NPOT, RADIX_NUM_BITS, b, a);
printElapsedTime(StreamCompaction::CPU::timer().getCpuElapsedTimeForPreviousOperation(), "(std::chrono Measured)");
printArray(NPOT, b, true);

zeroArray(SIZE, c);
printDesc("radix sort, non-power of two");
StreamCompaction::Naive::radixSort(NPOT, RADIX_NUM_BITS, c, a);
printElapsedTime(StreamCompaction::Naive::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)");
printCmpResult(NPOT, b, c); // compare against the CPU radix sort result in b

system("pause"); // stop Win32 console from closing on exit
delete[] a;
delete[] b;
21 changes: 20 additions & 1 deletion stream_compaction/common.cu
@@ -17,7 +17,6 @@ void checkCUDAErrorFn(const char *msg, const char *file, int line) {

namespace StreamCompaction {
namespace Common {

/**
* Maps an array to an array of 0s and 1s for stream compaction. Elements
* which map to 0 will be removed, and elements which map to 1 will be kept.
@@ -35,5 +34,25 @@
// TODO
}

// unlike naive impl, this one doesn't shift the array
__global__ void kernPadArray(int n, int paddedLen, int* odata, const int* idata) {
int index = threadIdx.x + (blockIdx.x * blockDim.x);
if (index < n) {
odata[index] = idata[index];
}
else if (index < paddedLen) {
odata[index] = 0;
}
}

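// Maps nonzero elements to 1 and zeros to 0, and zero-fills the padding
// between n and paddedLength so the scan over a power-of-two length sees
// valid values.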
__global__ void kernGetPaddedBoolArray(int n, int paddedLength, int* odata, const int* idata) {
int index = threadIdx.x + (blockIdx.x * blockDim.x);
if (index < n) {
odata[index] = idata[index] == 0 ? 0 : 1;
}
else if (index < paddedLength) {
odata[index] = 0;
}
}
}
}
14 changes: 14 additions & 0 deletions stream_compaction/common.h
@@ -30,13 +30,27 @@ inline int ilog2ceil(int x) {
return x == 1 ? 0 : ilog2(x - 1) + 1;
}

inline void printCudaArray(int n, int* dev_array) {
int* tempArray = (int*)malloc(n * sizeof(int));
cudaMemcpy(tempArray, dev_array, n * sizeof(int), cudaMemcpyDeviceToHost);
printf("Print array -----------\n");
for (int i = 0; i < n; ++i) {
printf("%d ", tempArray[i]);
}
printf("\n");
free(tempArray);
}

namespace StreamCompaction {
namespace Common {
__global__ void kernMapToBoolean(int n, int *bools, const int *idata);

__global__ void kernScatter(int n, int *odata,
const int *idata, const int *bools, const int *indices);

__global__ void kernPadArray(int n, int paddedLength, int* odata, const int* idata);

__global__ void kernGetPaddedBoolArray(int n, int paddedLength, int* odata, const int* idata);

/**
* This class is used for timing the performance
* Uncopyable and unmovable