diff --git a/README.md b/README.md index d63a6a1..9a6c4e3 100644 --- a/README.md +++ b/README.md @@ -1,11 +1,57 @@ **University of Pennsylvania, CIS 565: GPU Programming and Architecture, Project 1 - Flocking** -* (TODO) YOUR NAME HERE - * (TODO) [LinkedIn](), [personal website](), [twitter](), etc. -* Tested on: (TODO) Windows 22, i7-2222 @ 2.22GHz 22GB, GTX 222 222MB (Moore 2222 Lab) +* Yuanqi Wang + * [LinkedIn](https://www.linkedin.com/in/yuanqi-wang-414b26106/), [GitHub](https://github.com/plasmas). +* Tested on: Windows 11, i5-11600K @ 3.91GHz 32GB, RTX 2060 6GB (Personal Desktop) -### (TODO: Your README) +## Visualizations (GIFs) -Include screenshots, analysis, etc. (Remember, this is public, so don't put -anything here that you don't want to share with the world.) +### Simulation of 5k boids +![5000 Boids simulation](./images/b_5k.gif) + +### Simulation of 10k boids +![10k Boids simulation](./images/b_10k.gif) + +### Simulation of 20k boids +![20k Boids simulation](./images/b_20k.gif) + +## Performance Analysis + +To measure the performance of the simulation accurately, FPS info is measured at each configuration with visualization disabled. + +### 1. FPS vs. #Boid +With every other parameter fixed, I tested FPS at the following #boid: `1000 5000 10000 50000 100000 200000 500000`. + +![FPS vs. #Boids](./images/fps_vs_boids.svg) +>`Block Size = 128, Scene Scale = 100.0` + +We can see that all implementations has decreasing FPS when #boid increases. This is likely due to drastically more compute loads due to more boids and more neighbors to consider. The naive implementation has the fastest FPS decrease while the coherent implementation has the slowest decrease, which is due to uniform grid optimization and faster data fetching by rearranging data buffers. + +It is worth noting that when #boid is very small, at around `1000`, the naive implementation is actually faster than the simple uniform grid implementation. This might be due to very limited number of neighbors to consider, which leads the optimization overheads to standout. + +### 2. FPS vs. Block Size +Block size at `1 5 10 20 50 100 200 500` are tested to measure the FPS performance. + +![FPS vs. Block Size](。/../images/fps_vs_blocksize.svg) +>`Scene Scale = 100.0 #Boid = 10k` + +We can see that FPS increases with block size for all implementations and generally stops increasing as block size is over 32. This is becuase the maximum number of concurrent wraps per SM on Turing GPUs (RTX 2060) is 32. +When launching kernels with block that has size smaller than 32, each wrap will be underutilized, leading to idle core and resource waste. + +### Questions: +>* For each implementation, how does changing the number of boids affect performance? Why do you think this is? + +In all three implementations, increasing #boids decreases the performance. This is because new velocity and new position must be calculated for each boid, and there is a hardware limit of how much calculation for each boid can be parallelized. When passed, increasing #boids will require more time as we cannot parallelize calculations for all boids. Besides, when holding other parameters, increasing #boids requires to check more neighboring boids to update its velocity, which leads to more time cost and lower performance. + +>* For each implementation, how does changing the block count and block size affect performance? Why do you think this is? + +As analyzed above, increasing the block size leads to increased FPS until a certain point - in my case is 32, beyond which increasing block size has little impact on FPS. This is most likely due to underutilization of wraps because on Turing GPUs, the maximum number of concurrent wraps per SM is 32. When launching a kernel with blocks that have less than 32 threads, some cores will be idle, which leads to waste of resources. + +>* For the coherent uniform grid: did you experience any performance improvements with the more coherent uniform grid? Was this the outcome you expected? Why or why not? + +Compared with the original uniform grid implementation, the coherent grid implementation uses rearranged position and velocity data buffers. This means data for boids that are within a cell are gathered more closely together, which means more locality, and thus better utilization of cache and data fetching. + +>* Did changing cell width and checking 27 vs 8 neighboring cells affect performance? Why or why not? Be careful: it is insufficient (and possibly incorrect) to say that 27-cell is slower simply because there are more cells to check! + +From a theoretical perspective, I would predict that checking 27 neighboring cells is actually more efficient than checking 8 cells. Say the maximum radius of all three rules is $r$. To ensure the surrounding 8 cells contain all potential boids, the side length of each cell must be at least $2r$. Therefore, the entire volume of space we will check for neighboring boids is $8(2r)^3 = 64r^3$. Similarly, if we want to check 27 neighboring cells, the size length of each cell must be at least $r$. Hence the volume of space we will check is now $27r^3$. We see that this volume is generally smaller than $64r^3$ - which means we need to check a smaller volume for neighbors. This in turn means a higher hit rate for boids that are actually taken into consideration and in general less boids we have to check. Therefore, checking 27 neighboring cells might be more efficient, if the optimum side length is chosen for cells. diff --git a/images/b_10k.gif b/images/b_10k.gif new file mode 100644 index 0000000..347d6d3 Binary files /dev/null and b/images/b_10k.gif differ diff --git a/images/b_20k.gif b/images/b_20k.gif new file mode 100644 index 0000000..ada6cde Binary files /dev/null and b/images/b_20k.gif differ diff --git a/images/b_5k.gif b/images/b_5k.gif new file mode 100644 index 0000000..3ca73ed Binary files /dev/null and b/images/b_5k.gif differ diff --git a/images/fps_vs_blocksize.svg b/images/fps_vs_blocksize.svg new file mode 100644 index 0000000..0b090cd --- /dev/null +++ b/images/fps_vs_blocksize.svg @@ -0,0 +1 @@ +FPS vs. Block Size (Visualization OFF)NaiveUniformCoherent010020030000.51510501005001,0005,000Block SizeFPS \ No newline at end of file diff --git a/images/fps_vs_boids.svg b/images/fps_vs_boids.svg new file mode 100644 index 0000000..3b4f978 --- /dev/null +++ b/images/fps_vs_boids.svg @@ -0,0 +1 @@ +FPS vs. #Boid (Visualization OFF)NaiveUniformCoherent0100,000200,000300,000400,000500,00001,0002,0003,000#BoidFPS \ No newline at end of file diff --git a/src/kernel.cu b/src/kernel.cu index 74dffcb..6cf1974 100644 --- a/src/kernel.cu +++ b/src/kernel.cu @@ -5,6 +5,7 @@ #include #include "utilityCore.hpp" #include "kernel.h" +#include "device_launch_parameters.h" // LOOK-2.1 potentially useful for doing grid-based neighbor search #ifndef imax @@ -85,6 +86,8 @@ int *dev_gridCellEndIndices; // to this cell? // TODO-2.3 - consider what additional buffers you might need to reshuffle // the position and velocity data to be coherent within cells. +glm::vec3 *dev_pos_reordered; +glm::vec3 *dev_vel1_reordered; // LOOK-2.1 - Grid parameters based on simulation parameters. // These are automatically computed for you in Boids::initSimulation @@ -169,6 +172,26 @@ void Boids::initSimulation(int N) { gridMinimum.z -= halfGridWidth; // TODO-2.1 TODO-2.3 - Allocate additional buffers here. + // 2.1 + cudaMalloc((void**)&dev_particleArrayIndices, N * sizeof(int)); + checkCUDAErrorWithLine("cudaMalloc dev_particleArrayIndices failed"); + dev_thrust_particleArrayIndices = thrust::device_ptr(dev_particleArrayIndices); + + cudaMalloc((void**)&dev_particleGridIndices, N * sizeof(int)); + checkCUDAErrorWithLine("cudaMalloc dev_particleGridIndices failed"); + dev_thrust_particleGridIndices = thrust::device_ptr(dev_particleGridIndices); + + cudaMalloc((void**)&dev_gridCellStartIndices, gridCellCount * sizeof(int)); + checkCUDAErrorWithLine("cudaMalloc dev_gridCellStartIndices failed"); + + cudaMalloc((void**)&dev_gridCellEndIndices, gridCellCount * sizeof(int)); + checkCUDAErrorWithLine("cudaMalloc dev_gridCellEndIndices failed"); + + // 2.3 + cudaMalloc((void**)&dev_pos_reordered, N * sizeof(glm::vec3)); + checkCUDAErrorWithLine("cudaMalloc dev_pos_reordered failed"); + cudaMalloc((void**)&dev_vel1_reordered, N * sizeof(glm::vec3)); + checkCUDAErrorWithLine("cudaMalloc dev_vel1_reordered failed"); cudaDeviceSynchronize(); } @@ -233,7 +256,44 @@ __device__ glm::vec3 computeVelocityChange(int N, int iSelf, const glm::vec3 *po // Rule 1: boids fly towards their local perceived center of mass, which excludes themselves // Rule 2: boids try to stay a distance d away from each other // Rule 3: boids try to match the speed of surrounding boids - return glm::vec3(0.0f, 0.0f, 0.0f); + glm::vec3 perceived_center(0.0f, 0.0f, 0.0f); + glm::vec3 other_boids_pos_sum(0.0f, 0.0f, 0.0f); + glm::vec3 perceived_velocity(0.0f, 0.0f, 0.0f); + // position and velocity of + glm::vec3 iPos = pos[iSelf]; + glm::vec3 iVel = vel[iSelf]; + int rule1_count = 0; + int rule3_count = 0; + for (int i = 0; i < N; i++) { + if (iSelf == i) { + continue; + } + glm::vec3 neighbor_pos = pos[i]; + glm::vec3 neighbor_vel = vel[i]; + float dist = glm::distance(iPos, neighbor_pos); + if (dist < rule1Distance) { + perceived_center += neighbor_pos; + rule1_count++; + } + if (dist < rule2Distance) { + other_boids_pos_sum -= (neighbor_pos - iPos); + } + if (dist < rule3Distance) { + perceived_velocity += neighbor_vel; + rule3_count++; + } + } + glm::vec3 vel_change(0.0f, 0.0f, 0.0f); + if (rule1_count > 0) { + perceived_center /= rule1_count; + vel_change += (perceived_center - iPos) * rule1Scale; + } + vel_change += other_boids_pos_sum * rule2Scale; + if (rule3_count > 0) { + perceived_velocity /= rule3_count; + vel_change += perceived_velocity * rule3Scale; + } + return vel_change; } /** @@ -245,6 +305,17 @@ __global__ void kernUpdateVelocityBruteForce(int N, glm::vec3 *pos, // Compute a new velocity based on pos and vel1 // Clamp the speed // Record the new velocity into vel2. Question: why NOT vel1? + int index = threadIdx.x + (blockIdx.x * blockDim.x); + if (index >= N) { + return; + } + glm::vec3 vel = vel1[index]; + glm::vec3 vel_change = computeVelocityChange(N, index, pos, vel1); + vel += vel_change; + if (glm::length(vel) > maxSpeed) { + vel = glm::normalize(vel) * maxSpeed; + } + vel2[index] = vel; } /** @@ -277,7 +348,7 @@ __global__ void kernUpdatePos(int N, float dt, glm::vec3 *pos, glm::vec3 *vel) { // order for iterating over neighboring grid cells? // for(x) // for(y) -// for(z)? Or some other order? +// for(z)? Or some other order? z -> y -> x __device__ int gridIndex3Dto1D(int x, int y, int z, int gridResolution) { return x + y * gridResolution + z * gridResolution * gridResolution; } @@ -285,10 +356,19 @@ __device__ int gridIndex3Dto1D(int x, int y, int z, int gridResolution) { __global__ void kernComputeIndices(int N, int gridResolution, glm::vec3 gridMin, float inverseCellWidth, glm::vec3 *pos, int *indices, int *gridIndices) { - // TODO-2.1 - // - Label each boid with the index of its grid cell. - // - Set up a parallel array of integer indices as pointers to the actual - // boid data in pos and vel1/vel2 + // TODO-2.1 + // - Label each boid with the index of its grid cell. + // - Set up a parallel array of integer indices as pointers to the actual + // boid data in pos and vel1/vel2 + int index = threadIdx.x + (blockIdx.x * blockDim.x); + if (index >= N) { + return; + } + glm::vec3 iPos = pos[index]; + glm::ivec3 gridCoord = glm::floor((iPos - gridMin) * inverseCellWidth); + int gridIndex = gridIndex3Dto1D(gridCoord.x, gridCoord.y, gridCoord.z, gridResolution); + indices[index] = index; + gridIndices[index] = gridIndex; } // LOOK-2.1 Consider how this could be useful for indicating that a cell @@ -306,6 +386,31 @@ __global__ void kernIdentifyCellStartEnd(int N, int *particleGridIndices, // Identify the start point of each cell in the gridIndices array. // This is basically a parallel unrolling of a loop that goes // "this index doesn't match the one before it, must be a new cell!" + int index = threadIdx.x + (blockIdx.x * blockDim.x); + if (index >= N) { + return; + } + int gridCellIndex = particleGridIndices[index]; + if (index == 0 || particleGridIndices[index - 1] != gridCellIndex) { + gridCellStartIndices[gridCellIndex] = index; + } + if (index == N - 1 || particleGridIndices[index + 1] != gridCellIndex) { + gridCellEndIndices[gridCellIndex] = index; + } +} + +__global__ void kernScatter(int N, int* particleArrayIndices, + glm::vec3* pos_reordered, glm::vec3* pos, + glm::vec3* vel_reordered, glm::vec3* vel) { + // 2.3 helper + // rearranges pos and vel according to new index arrangements + // similar to scatter function in pyTorch + int index = threadIdx.x + (blockIdx.x * blockDim.x); + if (index >= N) { + return; + } + pos_reordered[index] = pos[particleArrayIndices[index]]; + vel_reordered[index] = vel[particleArrayIndices[index]]; } __global__ void kernUpdateVelNeighborSearchScattered( @@ -316,12 +421,91 @@ __global__ void kernUpdateVelNeighborSearchScattered( glm::vec3 *pos, glm::vec3 *vel1, glm::vec3 *vel2) { // TODO-2.1 - Update a boid's velocity using the uniform grid to reduce // the number of boids that need to be checked. + int index = threadIdx.x + (blockIdx.x * blockDim.x); + if (index >= N) { + return; + } + // - Identify the grid cell that this particle is in + glm::vec3 iPos = pos[index]; + glm::vec3 iVel = vel1[index]; + glm::ivec3 gridCoord = glm::floor((iPos - gridMin) * inverseCellWidth); + int gridIndex = gridIndex3Dto1D(gridCoord.x, gridCoord.y, gridCoord.z, gridResolution); + // - Identify which cells may contain neighbors. This isn't always 8. + glm::vec3 gridCenter = gridMin + cellWidth * (glm::vec3(gridCoord) + 0.5f); + // check next or previous cell in each dimension? + int dx = iPos.x < gridCenter.x ? -1 : 1; + int dy = iPos.y < gridCenter.y ? -1 : 1; + int dz = iPos.z < gridCenter.z ? -1 : 1; + + // ranges to search w/ boundary check + int x_min = imax(imin(gridCoord.x, gridCoord.x + dx), 0); + int x_max = imin(imax(gridCoord.x, gridCoord.x + dx), gridResolution - 1); + int y_min = imax(imin(gridCoord.y, gridCoord.y + dy), 0); + int y_max = imin(imax(gridCoord.y, gridCoord.y + dy), gridResolution - 1); + int z_min = imax(imin(gridCoord.z, gridCoord.z + dz), 0); + int z_max = imin(imax(gridCoord.z, gridCoord.z + dz), gridResolution - 1); + // - For each cell, read the start/end indices in the boid pointer array. // - Access each boid in the cell and compute velocity change from // the boids rules, if this boid is within the neighborhood distance. + glm::vec3 perceived_center(0.0f, 0.0f, 0.0f); + glm::vec3 other_boids_pos_sum(0.0f, 0.0f, 0.0f); + glm::vec3 perceived_velocity(0.0f, 0.0f, 0.0f); + int rule1_count = 0; + int rule3_count = 0; + for (int z = z_min; z <= z_max; z++) { + for (int y = y_min; y <= y_max; y++) { + for (int x = x_min; x <= x_max; x++) { + // find index range of boids in particleArrayIndices + int gridIndex = gridIndex3Dto1D(x, y, z, gridResolution); + int startIndex = gridCellStartIndices[gridIndex]; + int endIndex = gridCellEndIndices[gridIndex]; + // skip grid cell if it contains no boids + if (startIndex == -1) { + continue; + } + for (int i = startIndex; i <= endIndex; i++) { + int neighborIndex = particleArrayIndices[i]; + glm::vec3 neighbor_pos = pos[neighborIndex]; + glm::vec3 neighbor_vel = vel1[neighborIndex]; + if (neighborIndex == index) { + continue; + } + float dist = glm::distance(iPos, neighbor_pos); + if (dist < rule1Distance) { + perceived_center += neighbor_pos; + rule1_count++; + } + if (dist < rule2Distance) { + other_boids_pos_sum -= (neighbor_pos - iPos); + } + if (dist < rule3Distance) { + perceived_velocity += neighbor_vel; + rule3_count++; + } + } + } + } + } + glm::vec3 vel_change(0.0f, 0.0f, 0.0f); + if (rule1_count > 0) { + perceived_center /= rule1_count; + vel_change += (perceived_center - iPos) * rule1Scale; + } + vel_change += other_boids_pos_sum * rule2Scale; + if (rule3_count > 0) { + perceived_velocity /= rule3_count; + vel_change += perceived_velocity * rule3Scale; + } + // - Clamp the speed change before putting the new speed in vel2 + glm::vec3 vel_new = iVel + vel_change; + if (glm::length(vel_new) > maxSpeed) { + vel_new = glm::normalize(vel_new) * maxSpeed; + } + vel2[index] = vel_new; } __global__ void kernUpdateVelNeighborSearchCoherent( @@ -333,14 +517,92 @@ __global__ void kernUpdateVelNeighborSearchCoherent( // except with one less level of indirection. // This should expect gridCellStartIndices and gridCellEndIndices to refer // directly to pos and vel1. + int index = threadIdx.x + (blockIdx.x * blockDim.x); + if (index >= N) { + return; + } + // - Identify the grid cell that this particle is in + glm::vec3 iPos = pos[index]; + glm::vec3 iVel = vel1[index]; + glm::ivec3 gridCoord = glm::floor((iPos - gridMin) * inverseCellWidth); + int gridIndex = gridIndex3Dto1D(gridCoord.x, gridCoord.y, gridCoord.z, gridResolution); + // - Identify which cells may contain neighbors. This isn't always 8. + glm::vec3 gridCenter = gridMin + cellWidth * (glm::vec3(gridCoord) + 0.5f); + // check next or previous cell in each dimension? + int dx = iPos.x < gridCenter.x ? -1 : 1; + int dy = iPos.y < gridCenter.y ? -1 : 1; + int dz = iPos.z < gridCenter.z ? -1 : 1; + + // ranges to search w/ boundary check + int x_min = imax(imin(gridCoord.x, gridCoord.x + dx), 0); + int x_max = imin(imax(gridCoord.x, gridCoord.x + dx), gridResolution - 1); + int y_min = imax(imin(gridCoord.y, gridCoord.y + dy), 0); + int y_max = imin(imax(gridCoord.y, gridCoord.y + dy), gridResolution - 1); + int z_min = imax(imin(gridCoord.z, gridCoord.z + dz), 0); + int z_max = imin(imax(gridCoord.z, gridCoord.z + dz), gridResolution - 1); + // - For each cell, read the start/end indices in the boid pointer array. // DIFFERENCE: For best results, consider what order the cells should be // checked in to maximize the memory benefits of reordering the boids data. // - Access each boid in the cell and compute velocity change from // the boids rules, if this boid is within the neighborhood distance. + glm::vec3 perceived_center(0.0f, 0.0f, 0.0f); + glm::vec3 other_boids_pos_sum(0.0f, 0.0f, 0.0f); + glm::vec3 perceived_velocity(0.0f, 0.0f, 0.0f); + int rule1_count = 0; + int rule3_count = 0; + for (int z = z_min; z <= z_max; z++) { + for (int y = y_min; y <= y_max; y++) { + for (int x = x_min; x <= x_max; x++) { + // find index range of boids in particleArrayIndices + int gridIndex = gridIndex3Dto1D(x, y, z, gridResolution); + int startIndex = gridCellStartIndices[gridIndex]; + int endIndex = gridCellEndIndices[gridIndex]; + // skip grid cell if it contains no boids + if (startIndex == -1) { + continue; + } + for (int i = startIndex; i <= endIndex; i++) { + glm::vec3 neighbor_pos = pos[i]; + glm::vec3 neighbor_vel = vel1[i]; + if (i == index) { + continue; + } + float dist = glm::distance(iPos, neighbor_pos); + if (dist < rule1Distance) { + perceived_center += neighbor_pos; + rule1_count++; + } + if (dist < rule2Distance) { + other_boids_pos_sum -= (neighbor_pos - iPos); + } + if (dist < rule3Distance) { + perceived_velocity += neighbor_vel; + rule3_count++; + } + } + } + } + } + glm::vec3 vel_change(0.0f, 0.0f, 0.0f); + if (rule1_count > 0) { + perceived_center /= rule1_count; + vel_change += (perceived_center - iPos) * rule1Scale; + } + vel_change += other_boids_pos_sum * rule2Scale; + if (rule3_count > 0) { + perceived_velocity /= rule3_count; + vel_change += perceived_velocity * rule3Scale; + } + // - Clamp the speed change before putting the new speed in vel2 + glm::vec3 vel_new = iVel + vel_change; + if (glm::length(vel_new) > maxSpeed) { + vel_new = glm::normalize(vel_new) * maxSpeed; + } + vel2[index] = vel_new; } /** @@ -349,39 +611,170 @@ __global__ void kernUpdateVelNeighborSearchCoherent( void Boids::stepSimulationNaive(float dt) { // TODO-1.2 - use the kernels you wrote to step the simulation forward in time. // TODO-1.2 ping-pong the velocity buffers + dim3 fullBlocksPerGrid((numObjects + blockSize - 1) / blockSize); + kernUpdateVelocityBruteForce<<>>(numObjects, dev_pos, dev_vel1, dev_vel2); + checkCUDAErrorWithLine("kernel update velocity bruteforce failed"); + std::swap(dev_vel1, dev_vel2); + kernUpdatePos<<>>(numObjects, dt, dev_pos, dev_vel1); + checkCUDAErrorWithLine("kernel update pos failed"); } void Boids::stepSimulationScatteredGrid(float dt) { // TODO-2.1 // Uniform Grid Neighbor search using Thrust sort. // In Parallel: + dim3 fullBlocksPerGrid((numObjects + blockSize - 1) / blockSize); + dim3 fullBlocksPerCell((gridCellCount + blockSize - 1) / blockSize); + // - label each particle with its array index as well as its grid index. // Use 2x width grids. + kernComputeIndices<<>>( + numObjects, + gridSideCount, + gridMinimum, + gridInverseCellWidth, + dev_pos, + dev_particleArrayIndices, + dev_particleGridIndices + ); + // - Unstable key sort using Thrust. A stable sort isn't necessary, but you // are welcome to do a performance comparison. + thrust::sort_by_key( + dev_thrust_particleGridIndices, + dev_thrust_particleGridIndices + numObjects, + dev_thrust_particleArrayIndices + ); + // - Naively unroll the loop for finding the start and end indices of each // cell's data pointers in the array of boid indices + kernResetIntBuffer<<>>( + gridCellCount, + dev_gridCellStartIndices, + -1 + ); + kernResetIntBuffer<<>>( + gridCellCount, + dev_gridCellEndIndices, + -1 + ); + kernIdentifyCellStartEnd<<>>( + numObjects, + dev_particleGridIndices, + dev_gridCellStartIndices, + dev_gridCellEndIndices + ); + // - Perform velocity updates using neighbor search + kernUpdateVelNeighborSearchScattered<<>>( + numObjects, + gridSideCount, + gridMinimum, + gridInverseCellWidth, + gridCellWidth, + dev_gridCellStartIndices, + dev_gridCellEndIndices, + dev_particleArrayIndices, + dev_pos, + dev_vel1, + dev_vel2 + ); + // - Update positions + kernUpdatePos<<>>( + numObjects, + dt, + dev_pos, + dev_vel2 + ); + // - Ping-pong buffers as needed + std::swap(dev_vel1, dev_vel2); } void Boids::stepSimulationCoherentGrid(float dt) { // TODO-2.3 - start by copying Boids::stepSimulationNaiveGrid // Uniform Grid Neighbor search using Thrust sort on cell-coherent data. // In Parallel: + dim3 fullBlocksPerGrid((numObjects + blockSize - 1) / blockSize); + dim3 fullBlocksPerCell((gridCellCount + blockSize - 1) / blockSize); + // - Label each particle with its array index as well as its grid index. // Use 2x width grids + kernComputeIndices<<>>( + numObjects, + gridSideCount, + gridMinimum, + gridInverseCellWidth, + dev_pos, + dev_particleArrayIndices, + dev_particleGridIndices + ); + // - Unstable key sort using Thrust. A stable sort isn't necessary, but you // are welcome to do a performance comparison. + thrust::sort_by_key( + dev_thrust_particleGridIndices, + dev_thrust_particleGridIndices + numObjects, + dev_thrust_particleArrayIndices + ); + // - Naively unroll the loop for finding the start and end indices of each // cell's data pointers in the array of boid indices + kernResetIntBuffer<<>>( + gridCellCount, + dev_gridCellStartIndices, + -1 + ); + kernResetIntBuffer<<>> ( + gridCellCount, + dev_gridCellEndIndices, + -1 + ); + kernIdentifyCellStartEnd<<>> ( + numObjects, + dev_particleGridIndices, + dev_gridCellStartIndices, + dev_gridCellEndIndices + ); + // - BIG DIFFERENCE: use the rearranged array index buffer to reshuffle all // the particle data in the simulation array. + kernScatter<<>>( + numObjects, + dev_particleArrayIndices, + dev_pos_reordered, + dev_pos, + dev_vel1_reordered, + dev_vel1 + ); + // CONSIDER WHAT ADDITIONAL BUFFERS YOU NEED // - Perform velocity updates using neighbor search + kernUpdateVelNeighborSearchCoherent<<>>( + numObjects, + gridSideCount, + gridMinimum, + gridInverseCellWidth, + gridCellWidth, + dev_gridCellStartIndices, + dev_gridCellEndIndices, + dev_pos_reordered, + dev_vel1_reordered, + dev_vel2 + ); + // - Update positions + kernUpdatePos<<>>( + numObjects, + dt, + dev_pos_reordered, + dev_vel2 + ); + // - Ping-pong buffers as needed. THIS MAY BE DIFFERENT FROM BEFORE. + std::swap(dev_pos, dev_pos_reordered); + std::swap(dev_vel1, dev_vel2); } void Boids::endSimulation() { @@ -390,6 +783,14 @@ void Boids::endSimulation() { cudaFree(dev_pos); // TODO-2.1 TODO-2.3 - Free any additional buffers here. + // 2.1 + cudaFree(dev_particleArrayIndices); + cudaFree(dev_particleGridIndices); + cudaFree(dev_gridCellStartIndices); + cudaFree(dev_gridCellEndIndices); + // 2.3 + cudaFree(dev_pos_reordered); + cudaFree(dev_vel1_reordered); } void Boids::unitTest() { diff --git a/src/main.cpp b/src/main.cpp index b82c8c6..8a53937 100644 --- a/src/main.cpp +++ b/src/main.cpp @@ -14,11 +14,11 @@ // LOOK-2.1 LOOK-2.3 - toggles for UNIFORM_GRID and COHERENT_GRID #define VISUALIZE 1 -#define UNIFORM_GRID 0 -#define COHERENT_GRID 0 +#define UNIFORM_GRID 1 +#define COHERENT_GRID 1 // LOOK-1.2 - change this to adjust particle count in the simulation -const int N_FOR_VIS = 5000; +const int N_FOR_VIS = 20000; const float DT = 0.2f; /** diff --git a/src/main.hpp b/src/main.hpp index 88e9df7..d9b9fc2 100644 --- a/src/main.hpp +++ b/src/main.hpp @@ -36,7 +36,7 @@ const float fovy = (float) (PI / 4); const float zNear = 0.10f; const float zFar = 10.0f; // LOOK-1.2: for high DPI displays, you may want to double these settings. -int width = 1280; +int width = 1331; int height = 720; int pointSize = 2;