diff --git a/README.md b/README.md
index d63a6a1..62f729e 100644
--- a/README.md
+++ b/README.md
@@ -1,11 +1,41 @@
 **University of Pennsylvania, CIS 565: GPU Programming and Architecture,
 Project 1 - Flocking**
 
-* (TODO) YOUR NAME HERE
-  * (TODO) [LinkedIn](), [personal website](), [twitter](), etc.
-* Tested on: (TODO) Windows 22, i7-2222 @ 2.22GHz 22GB, GTX 222 222MB (Moore 2222 Lab)
+* Nithin Pranesh
+  * Here are some links to connect with me: [LinkedIn](https://www.linkedin.com/in/nithin-pranesh), [YouTube](https://www.youtube.com/channel/UCAQwYrQblfN8qeDW28KkH7g/featured), [Twitter](https://twitter.com/NithinPranesh1).
+* Tested on: XPS 15 7590, Windows 10 20H2, i7-9750H @ 2.60GHz 22GB, GTX 1650.
 
-### (TODO: Your README)
+### Project 1: CUDA Flocking
+
+![Gif of Boids simulation](images/boids.gif)
+
+![Screenshot of Boids simulation](images/boids.png)
+
+### Overview
+
+This is a simulation of the flocking behavior of an idealized species referred to as "Boids". While there are many other behavioral forces that could be added, this simulation only considers cohesion, spacing, and momentum. More precisely, an individual boid attempts to stick near the rest of the flock, maintain a small distance from other boids to avoid collision, and maintain a similar velocity to the local flock. 
+
+### Questions
+
+1) _For each implementation, how does changing the number of boids affect
+performance? Why do you think this is?_
+
+- Naive implementation: To update the velocity of a single boid, the naive implementation checks every other boid for velocity contributions, even those that are obviously too far away to affect it. The cost per boid therefore grows linearly with the number of boids and the total work per step grows quadratically, so once the GPU is saturated the step time grows faster than linearly.
+
+- Scattered uniform grid: The uniform grid implementation adds a GPU-accelerated sorting step as precomputation, but as a result we only need to check a constant number of cells per boid (in our case 8 cells). Each cell may technically hold an arbitrary number of the n total boids, but in practice the spacing rule keeps the worst-case density quite low.
+Increasing the total number of boids will not noticeably increase the local boid density. This holds until the entire scene is saturated with boids, at which point the simulation is useless anyhow. Since the runtime of the velocity calculation in the uniform grid approach is a linear function of the local density, its per-boid cost barely deteriorates with more boids. Overall performance still gets worse as the number of boids increases, as a result of the sorting step, cache non-locality, the use of global memory instead of shared memory, and the limited number of GPU cores.
+
+- Coherent uniform grid: The coherent uniform grid approach keeps all the computational savings of the scattered uniform grid above. In addition, it resolves the cache non-locality we saw, since the position and velocity buffers are now reordered by grid cell id after the sorting step. As a result, when we iterate over the boids in a cell and in adjacent cells, each retrieved cache line contains boid data in the same order in which we iterate through it. As the total number of boids increases, performance is still bounded by the use of global memory instead of shared memory.
+
+2) _For the coherent uniform grid: did you experience any performance improvements
+with the more coherent uniform grid? Was this the outcome you expected?
+Why or why not?_
+- There did not appear to be a difference in framerate at 50,000 boids with the visualization enabled: both implementations hovered around 550 fps. At 70,000 boids with the visualization off, the difference was much clearer: the scattered uniform grid struggled at around 8 fps, while the coherent uniform grid comfortably hovered around 290 fps.
+
+3) _Did changing cell width and checking 27 vs 8 neighboring cells affect performance?_
+- Using 8 cells (a 2x2x2 cube) instead of 27 (a 3x3x3 cube) was slightly faster, not necessarily because there are fewer cells, but because a 3x3x3 cube spans 9 disjoint memory segments that need to be fetched while a 2x2x2 cube spans only 4. To count them, look at which cells in each cube are contiguous with other cells in memory, then count how many disjoint groups of contiguous cells make up the entire cube. In the 2x2x2 search cube there are 4 mutually disjoint segments of memory, each holding 2 cells that are adjacent in memory. In the 3x3x3 case there are 9 disjoint segments with 3 adjacent cells in each.
+
+### Analysis
+
+![](images/FPS-per-boids.jpg)
 
-Include screenshots, analysis, etc. (Remember, this is public, so don't put
-anything here that you don't want to share with the world.)
diff --git a/images/FPS-per-boids.jpg b/images/FPS-per-boids.jpg
new file mode 100644
index 0000000..32626b7
Binary files /dev/null and b/images/FPS-per-boids.jpg differ
diff --git a/images/boids.gif b/images/boids.gif
new file mode 100644
index 0000000..3a26430
Binary files /dev/null and b/images/boids.gif differ
diff --git a/images/boids.png b/images/boids.png
new file mode 100644
index 0000000..af0e05a
Binary files /dev/null and b/images/boids.png differ
diff --git a/src/kernel.cu b/src/kernel.cu
index 74dffcb..08f2cd9 100644
--- a/src/kernel.cu
+++ b/src/kernel.cu
@@ -59,7 +59,8 @@ void checkCUDAError(const char *msg, int line = -1) {
 ***********************************************/
 
 int numObjects;
-dim3 threadsPerBlock(blockSize);
+dim3 threadsPerBlock(blockSize); 
+dim3 fullBlocksPerGrid;
 
 // LOOK-1.2 - These buffers are here to hold all your boid information.
 // These get allocated for you in Boids::initSimulation.
@@ -70,6 +71,9 @@ glm::vec3 *dev_pos;
 glm::vec3 *dev_vel1;
 glm::vec3 *dev_vel2;
 
+// used to reorder the position buffer for 2.3
+glm::vec3* dev_pos2;
+
 // LOOK-2.1 - these are NOT allocated for you. You'll have to set up the thrust
 // pointers on your own too.
 
@@ -127,9 +131,16 @@ __global__ void kernGenerateRandomPosArray(int time, int N, glm::vec3 * arr, flo
   int index = (blockIdx.x * blockDim.x) + threadIdx.x;
   if (index < N) {
     glm::vec3 rand = generateRandomVec3(time, index);
-    arr[index].x = scale * rand.x;
-    arr[index].y = scale * rand.y;
-    arr[index].z = scale * rand.z;
+    arr[index].x = (scale - maxSpeed) * rand.x;
+    arr[index].y = (scale - maxSpeed) * rand.y;
+    arr[index].z = (scale - maxSpeed) * rand.z;
+  }
+}
+
+__global__ void kernGenerateRandomVelArray(int time, int N, glm::vec3* arr) {
+  int index = (blockIdx.x * blockDim.x) + threadIdx.x;
+  if (index < N) {
+    arr[index] = generateRandomVec3(time, index);
   }
 }
 
@@ -138,7 +149,7 @@ __global__ void kernGenerateRandomPosArray(int time, int N, glm::vec3 * arr, flo
 */
 void Boids::initSimulation(int N) {
   numObjects = N;
-  dim3 fullBlocksPerGrid((N + blockSize - 1) / blockSize);
+  fullBlocksPerGrid = ((N + blockSize - 1) / blockSize);
 
   // LOOK-1.2 - This is basic CUDA memory management and error checking.
   // Don't forget to cudaFree in  Boids::endSimulation.
@@ -155,6 +166,10 @@ void Boids::initSimulation(int N) {
   kernGenerateRandomPosArray<<<fullBlocksPerGrid, blockSize>>>(1, numObjects,
     dev_pos, scene_scale);
   checkCUDAErrorWithLine("kernGenerateRandomPosArray failed!");
+ 
+  kernGenerateRandomVelArray << <fullBlocksPerGrid, blockSize >> > (2, numObjects,
+    dev_vel1);
+  checkCUDAErrorWithLine("kernGenerateRandomVelArray failed!");
 
   // LOOK-2.1 computing grid params
   gridCellWidth = 2.0f * std::max(std::max(rule1Distance, rule2Distance), rule3Distance);
@@ -169,6 +184,24 @@ void Boids::initSimulation(int N) {
   gridMinimum.z -= halfGridWidth;
 
   // TODO-2.1 TODO-2.3 - Allocate additional buffers here.
+  cudaMalloc((void**)&dev_pos2, N * sizeof(glm::vec3));
+  checkCUDAErrorWithLine("cudaMalloc dev_pos2 failed!");
+
+  cudaMalloc((void**)&dev_particleArrayIndices, N * sizeof(int));
+  checkCUDAErrorWithLine("cudaMalloc dev_particleArrayIndices failed!");
+
+  cudaMalloc((void**)&dev_particleGridIndices, N * sizeof(int));
+  checkCUDAErrorWithLine("cudaMalloc dev_particleGridIndices failed!");
+
+  cudaMalloc((void**)&dev_gridCellStartIndices, gridCellCount * sizeof(int));
+  checkCUDAErrorWithLine("cudaMalloc dev_gridCellStartIndices failed!");
+
+  cudaMalloc((void**)&dev_gridCellEndIndices, gridCellCount * sizeof(int));
+  checkCUDAErrorWithLine("cudaMalloc dev_gridCellEndIndices failed!");
+
+  dev_thrust_particleArrayIndices = thrust::device_ptr<int>(dev_particleArrayIndices);
+  dev_thrust_particleGridIndices = thrust::device_ptr<int>(dev_particleGridIndices);
+
   cudaDeviceSynchronize();
 }
 
@@ -223,6 +256,113 @@ void Boids::copyBoidsToVBO(float *vbodptr_positions, float *vbodptr_velocities)
 * stepSimulation *
 ******************/
 
+struct VelocityCalculationInfo {
+  const glm::vec3* pos;
+  const glm::vec3* vel;
+  const glm::vec3& selfPos;
+  const glm::vec3& selfVel;
+  glm::vec3 perceivedCenter;
+  glm::vec3 perceivedVelocity;
+  glm::vec3 separation;
+  int perceivedCenterNeighbors;
+  int perceivedVelocityNeighbors;
+};
+
+/**
+* Prepares all needed info for the velocity calculation for this boid.
+*/
+__device__ VelocityCalculationInfo initVelocityCalculationInfo(int N, int iSelf, const glm::vec3* pos, const glm::vec3* vel) {
+  return {
+    pos,
+    vel,
+    pos[iSelf],
+    vel[iSelf],
+    glm::vec3(0.0f),
+    glm::vec3(0.0f),
+    glm::vec3(0.0f),
+    0,
+    0
+  };
+}
+
+/**
+* Accumulates the given boid's contribution to the current boid's velocity.
+*/
+__device__ void accumulateBoidContribution(
+    int iOther, 
+    VelocityCalculationInfo& calculationInfo) {
+
+  glm::vec3 iPos = calculationInfo.pos[iOther];
+  const glm::vec3& iVel = calculationInfo.vel[iOther];
+
+  glm::vec3 difference = iPos - calculationInfo.selfPos;
+
+  // take into account boundary wrapping
+  if (difference.x > scene_scale) {
+    difference.x -= 2.0f * scene_scale;
+    iPos.x -= 2.0f * scene_scale;
+  }
+  else if (difference.x < -scene_scale) {
+    difference.x += 2.0f * scene_scale;
+    iPos.x += 2.0f * scene_scale;
+  }
+
+  if (difference.y > scene_scale) {
+    difference.y -= 2.0f * scene_scale;
+    iPos.y -= 2.0f * scene_scale;
+  }
+  else if (difference.y < -scene_scale) {
+    difference.y += 2.0f * scene_scale;
+    iPos.y += 2.0f * scene_scale;
+  }
+
+  if (difference.z > scene_scale) {
+    difference.z -= 2.0f * scene_scale;
+    iPos.z -= 2.0f * scene_scale;
+  }
+  else if (difference.z < -scene_scale) {
+    difference.z += 2.0f * scene_scale;
+    iPos.z += 2.0f * scene_scale;
+  }
+
+  float distance = glm::length(difference);
+
+  if (distance < rule1Distance) {
+    calculationInfo.perceivedCenter += iPos;
+    ++calculationInfo.perceivedCenterNeighbors;
+  }
+
+  if (distance < rule2Distance) {
+    calculationInfo.separation -= difference;
+  }
+
+  if (distance < rule3Distance) {
+    calculationInfo.perceivedVelocity += iVel;
+    ++calculationInfo.perceivedVelocityNeighbors;
+  }
+}
+
+/**
+* Evaluates the VelocityCalculationInfo to find the change in velocity for the current boid.
+*/
+__device__ glm::vec3 evaluateVelocityCalculationInfo(VelocityCalculationInfo& calculationInfo) {
+  glm::vec3 result(0.0f);
+
+  if (calculationInfo.perceivedCenterNeighbors > 0) {
+    calculationInfo.perceivedCenter /= calculationInfo.perceivedCenterNeighbors;
+    result += rule1Scale * (calculationInfo.perceivedCenter - calculationInfo.selfPos);
+  }
+
+  result += rule2Scale * calculationInfo.separation;
+
+  if (calculationInfo.perceivedVelocityNeighbors > 0) {
+    calculationInfo.perceivedVelocity /= calculationInfo.perceivedVelocityNeighbors;
+    result += rule3Scale * (calculationInfo.perceivedVelocity);// -calculationInfo.selfVel);
+  }
+
+  return result;
+}
+
 /**
 * LOOK-1.2 You can use this as a helper for kernUpdateVelocityBruteForce.
 * __device__ code can be called from a __global__ context
@@ -233,7 +373,15 @@ __device__ glm::vec3 computeVelocityChange(int N, int iSelf, const glm::vec3 *po
   // Rule 1: boids fly towards their local perceived center of mass, which excludes themselves
   // Rule 2: boids try to stay a distance d away from each other
   // Rule 3: boids try to match the speed of surrounding boids
-  return glm::vec3(0.0f, 0.0f, 0.0f);
+
+  // Naive implementation
+  VelocityCalculationInfo calculationInfo = initVelocityCalculationInfo(N, iSelf, pos, vel);
+  for (int i = 0; i < N; ++i) {
+    if (i != iSelf) {
+      accumulateBoidContribution(i, calculationInfo);
+    }
+  }
+  return evaluateVelocityCalculationInfo(calculationInfo);
 }
 
 /**
@@ -245,6 +393,14 @@ __global__ void kernUpdateVelocityBruteForce(int N, glm::vec3 *pos,
   // Compute a new velocity based on pos and vel1
   // Clamp the speed
   // Record the new velocity into vel2. Question: why NOT vel1?
+  int index = threadIdx.x + (blockIdx.x * blockDim.x);
+  if (index >= 0 && index < N) {
+    vel2[index] = vel1[index] + computeVelocityChange(N, index, pos, vel1);
+
+    if (glm::length(vel2[index]) > maxSpeed) {
+      vel2[index] = maxSpeed * glm::normalize(vel2[index]);
+    }
+  }
 }
 
 /**
@@ -289,6 +445,12 @@ __global__ void kernComputeIndices(int N, int gridResolution,
     // - Label each boid with the index of its grid cell.
     // - Set up a parallel array of integer indices as pointers to the actual
     //   boid data in pos and vel1/vel2
+  int index = (blockIdx.x * blockDim.x) + threadIdx.x;
+  if (index >= 0 && index < N) {
+    glm::vec3 gridCoords = glm::floor((pos[index] - gridMin) * inverseCellWidth);
+    gridIndices[index] = gridIndex3Dto1D(gridCoords.x, gridCoords.y, gridCoords.z, gridResolution);
+    indices[index] = index;
+  }
 }
 
 // LOOK-2.1 Consider how this could be useful for indicating that a cell
@@ -300,12 +462,33 @@ __global__ void kernResetIntBuffer(int N, int *intBuffer, int value) {
   }
 }
 
+/**
+* Reorders the input buffer as prescribed by the indices and moves it into
+* the output buffer.
+*/
+template <typename T>
+__global__ void kernArrangeBuffer(int N, const int* indices, T* in, T* out) {
+  int index = (blockIdx.x * blockDim.x) + threadIdx.x;
+  if (index >= 0 && index < N) {
+    out[index] = in[indices[index]];
+  }
+}
+
 __global__ void kernIdentifyCellStartEnd(int N, int *particleGridIndices,
   int *gridCellStartIndices, int *gridCellEndIndices) {
   // TODO-2.1
   // Identify the start point of each cell in the gridIndices array.
   // This is basically a parallel unrolling of a loop that goes
   // "this index doesn't match the one before it, must be a new cell!"
+  int index = (blockIdx.x * blockDim.x) + threadIdx.x;
+  if (index >= 0 && index < N) {
+    int lastGridIndex = (index > 0) ? particleGridIndices[index - 1] : -1;  // entry 0 has no predecessor, so it always opens a cell
+    int gridIndex = particleGridIndices[index];
+    if (lastGridIndex != gridIndex) {
+      if (index > 0) gridCellEndIndices[lastGridIndex] = index;
+      gridCellStartIndices[gridIndex] = index;
+    }
+  }
 }
 
 __global__ void kernUpdateVelNeighborSearchScattered(
@@ -322,6 +505,68 @@ __global__ void kernUpdateVelNeighborSearchScattered(
   // - Access each boid in the cell and compute velocity change from
   //   the boids rules, if this boid is within the neighborhood distance.
   // - Clamp the speed change before putting the new speed in vel2
+  int index = (blockIdx.x * blockDim.x) + threadIdx.x;
+  if (index >= 0 && index < N) {
+    int boidIndex = particleArrayIndices[index];
+    glm::vec3 fGridCoords = (pos[boidIndex] - gridMin) * inverseCellWidth;
+    glm::ivec3 iGridCoords(glm::floor(fGridCoords));
+    glm::ivec3 searchDirection(
+        (fGridCoords.x - iGridCoords.x) > 0.5f ? 1 : -1, 
+        (fGridCoords.y - iGridCoords.y) > 0.5f ? 1 : -1,
+        (fGridCoords.z - iGridCoords.z) > 0.5f ? 1 : -1);
+
+    VelocityCalculationInfo calculationInfo = initVelocityCalculationInfo(N, boidIndex, pos, vel1);
+
+    // will preserve cache locality except in the case we need to wrap around the boundaries
+
+    // CHECK 8 CELLS
+    /**/
+    for (int i = 0; i < 2; ++i) {
+      int z = (iGridCoords.z + i * searchDirection.z + gridResolution) % gridResolution;
+      for (int j = 0; j < 2; ++j) {
+        int y = (iGridCoords.y + j * searchDirection.y + gridResolution) % gridResolution;
+        for (int k = 0; k < 2; ++k) {
+          int x = (iGridCoords.x + k * searchDirection.x + gridResolution) % gridResolution;
+          int gridIndex = gridIndex3Dto1D(x, y, z, gridResolution);
+          int gridStart = gridCellStartIndices[gridIndex];
+          int gridEnd = gridCellEndIndices[gridIndex];
+          // check all boids in this cell
+          for (int l = gridStart; l < gridEnd; ++l) {
+            if (l != index) {
+              accumulateBoidContribution(particleArrayIndices[l], calculationInfo);
+            }
+          }
+        }
+      }
+    }/**/
+
+    // CHECK 27 CELLS
+    /** /
+    for (int dz = -1; dz <= 1; ++dz) {
+      int z = (iGridCoords.z + dz + gridResolution) % gridResolution;
+      for (int dy = -1; dy <= 1; ++dy) {
+        int y = (iGridCoords.y + dy + gridResolution) % gridResolution;
+        for (int dx = -1; dx <= 1; ++dx) {
+          int x = (iGridCoords.x + dx + gridResolution) % gridResolution;
+          int gridIndex = gridIndex3Dto1D(x, y, z, gridResolution);
+          int gridStart = gridCellStartIndices[gridIndex];
+          int gridEnd = gridCellEndIndices[gridIndex];
+          // check all boids in this cell
+          for (int i = gridStart; i < gridEnd; ++i) {
+            if (i != index) {
+              accumulateBoidContribution(particleArrayIndices[i], calculationInfo);
+            }
+          }
+        }
+      }
+    }/**/
+
+    vel2[boidIndex] = vel1[boidIndex] + evaluateVelocityCalculationInfo(calculationInfo);
+
+    if (glm::length(vel2[boidIndex]) > maxSpeed) {
+      vel2[boidIndex] = maxSpeed * glm::normalize(vel2[boidIndex]);
+    }
+  }
 }
 
 __global__ void kernUpdateVelNeighborSearchCoherent(
@@ -341,6 +586,67 @@ __global__ void kernUpdateVelNeighborSearchCoherent(
   // - Access each boid in the cell and compute velocity change from
   //   the boids rules, if this boid is within the neighborhood distance.
   // - Clamp the speed change before putting the new speed in vel2
+  int index = (blockIdx.x * blockDim.x) + threadIdx.x;
+  if (index >= 0 && index < N) {
+    glm::vec3 fGridCoords = (pos[index] - gridMin) * inverseCellWidth;
+    glm::ivec3 iGridCoords(glm::floor(fGridCoords));
+    glm::ivec3 searchDirection(
+      (fGridCoords.x - iGridCoords.x) > 0.5f ? 1 : -1,
+      (fGridCoords.y - iGridCoords.y) > 0.5f ? 1 : -1,
+      (fGridCoords.z - iGridCoords.z) > 0.5f ? 1 : -1);
+
+    VelocityCalculationInfo calculationInfo = initVelocityCalculationInfo(N, index, pos, vel1);
+
+    // will preserve cache locality except in the case we need to wrap around the boundaries
+
+    // CHECK 8 CELLS
+    /**/
+    for (int i = 0; i < 2; ++i) {
+      int z = (iGridCoords.z + i * searchDirection.z + gridResolution) % gridResolution;
+      for (int j = 0; j < 2; ++j) {
+        int y = (iGridCoords.y + j * searchDirection.y + gridResolution) % gridResolution;
+        for (int k = 0; k < 2; ++k) {
+          int x = (iGridCoords.x + k * searchDirection.x + gridResolution) % gridResolution;
+          int gridIndex = gridIndex3Dto1D(x, y, z, gridResolution);
+          int gridStart = gridCellStartIndices[gridIndex];
+          int gridEnd = gridCellEndIndices[gridIndex];
+          // check all boids in this cell
+          for (int l = gridStart; l < gridEnd; ++l) {
+            if (l != index) {
+              accumulateBoidContribution(l, calculationInfo);
+            }
+          }
+        }
+      }
+    }/**/
+
+    // CHECK 27 CELLS
+    /** /
+    for (int dz = -1; dz <= 1; ++dz) {
+      int z = (iGridCoords.z + dz + gridResolution) % gridResolution;
+      for (int dy = -1; dy <= 1; ++dy) {
+        int y = (iGridCoords.y + dy + gridResolution) % gridResolution;
+        for (int dx = -1; dx <= 1; ++dx) {
+          int x = (iGridCoords.x + dx + gridResolution) % gridResolution;
+          int gridIndex = gridIndex3Dto1D(x, y, z, gridResolution);
+          int gridStart = gridCellStartIndices[gridIndex];
+          int gridEnd = gridCellEndIndices[gridIndex];
+          // check all boids in this cell
+          for (int i = gridStart; i < gridEnd; ++i) {
+            if (i != index) {
+              accumulateBoidContribution(i, calculationInfo);
+            }
+          }
+        }
+      }
+    }/**/
+
+    vel2[index] = vel1[index] + evaluateVelocityCalculationInfo(calculationInfo);
+
+    if (glm::length(vel2[index]) > maxSpeed) {
+      vel2[index] = maxSpeed * glm::normalize(vel2[index]);
+    }
+  }
 }
 
 /**
@@ -349,6 +655,14 @@ __global__ void kernUpdateVelNeighborSearchCoherent(
 void Boids::stepSimulationNaive(float dt) {
   // TODO-1.2 - use the kernels you wrote to step the simulation forward in time.
   // TODO-1.2 ping-pong the velocity buffers
+
+  kernUpdateVelocityBruteForce << <fullBlocksPerGrid, blockSize >> > (numObjects, dev_pos, dev_vel1, dev_vel2);
+  checkCUDAErrorWithLine("kernUpdateVelocityBruteForce failed!");
+
+  kernUpdatePos << <fullBlocksPerGrid, blockSize >> > (numObjects, dt, dev_pos, dev_vel2);
+  checkCUDAErrorWithLine("kernUpdatePos failed!");
+
+  std::swap(dev_vel1, dev_vel2);
 }
 
 void Boids::stepSimulationScatteredGrid(float dt) {
@@ -364,6 +678,54 @@ void Boids::stepSimulationScatteredGrid(float dt) {
   // - Perform velocity updates using neighbor search
   // - Update positions
   // - Ping-pong buffers as needed
+
+  // Both start and end point to the end of the boids array. Using this as the default value lets us avoid having
+  // to set the end pointer of the last sorted grid cell.
+  kernResetIntBuffer << <(gridCellCount + blockSize - 1) / blockSize, blockSize >> > (gridCellCount, dev_gridCellStartIndices, numObjects);
+  checkCUDAErrorWithLine("Clearing dev_gridCellStartIndices failed!");
+  kernResetIntBuffer << <(gridCellCount + blockSize - 1) / blockSize, blockSize >> > (gridCellCount, dev_gridCellEndIndices, numObjects);
+  checkCUDAErrorWithLine("Clearing dev_gridCellEndIndices failed!");
+
+  kernComputeIndices << <fullBlocksPerGrid, blockSize >> > (
+      numObjects, 
+      gridSideCount, 
+      glm::vec3(-scene_scale), 
+      1.0f / gridCellWidth, 
+      dev_pos, 
+      dev_particleArrayIndices, 
+      dev_particleGridIndices);
+  checkCUDAErrorWithLine("kernComputeIndices failed!");
+
+  thrust::sort_by_key(
+    dev_thrust_particleGridIndices, 
+    dev_thrust_particleGridIndices + numObjects, 
+    dev_thrust_particleArrayIndices);
+
+  kernIdentifyCellStartEnd << <fullBlocksPerGrid, blockSize >> > (
+      numObjects, 
+      dev_particleGridIndices, 
+      dev_gridCellStartIndices, 
+      dev_gridCellEndIndices);
+  checkCUDAErrorWithLine("kernIdentifyCellStartEnd failed!");
+
+  kernUpdateVelNeighborSearchScattered << <fullBlocksPerGrid, blockSize >> > (
+      numObjects, 
+      gridSideCount, 
+      glm::vec3(-scene_scale), 
+      1.0f / gridCellWidth, 
+      gridCellWidth,
+      dev_gridCellStartIndices,
+      dev_gridCellEndIndices,
+      dev_particleArrayIndices,
+      dev_pos, 
+      dev_vel1, 
+      dev_vel2);
+  checkCUDAErrorWithLine("kernUpdateVelNeighborSearchScattered failed!");
+
+  kernUpdatePos << <fullBlocksPerGrid, blockSize >> > (numObjects, dt, dev_pos, dev_vel2);
+  checkCUDAErrorWithLine("kernUpdatePos failed!");
+
+  std::swap(dev_vel1, dev_vel2);
 }
 
 void Boids::stepSimulationCoherentGrid(float dt) {
@@ -382,6 +744,60 @@ void Boids::stepSimulationCoherentGrid(float dt) {
   // - Perform velocity updates using neighbor search
   // - Update positions
   // - Ping-pong buffers as needed. THIS MAY BE DIFFERENT FROM BEFORE.
+
+  kernResetIntBuffer << <(gridCellCount + blockSize - 1) / blockSize, blockSize >> > (gridCellCount, dev_gridCellStartIndices, numObjects);
+  checkCUDAErrorWithLine("Clearing dev_gridCellStartIndices failed!");
+  kernResetIntBuffer << <(gridCellCount + blockSize - 1) / blockSize, blockSize >> > (gridCellCount, dev_gridCellEndIndices, numObjects);
+  checkCUDAErrorWithLine("Clearing dev_gridCellEndIndices failed!");
+
+  kernComputeIndices << <fullBlocksPerGrid, blockSize >> > (
+    numObjects,
+    gridSideCount,
+    glm::vec3(-scene_scale),
+    1.0f / gridCellWidth,
+    dev_pos,
+    dev_particleArrayIndices,
+    dev_particleGridIndices);
+  checkCUDAErrorWithLine("kernComputeIndices failed!");
+
+  thrust::sort_by_key(
+    dev_thrust_particleGridIndices,
+    dev_thrust_particleGridIndices + numObjects,
+    dev_thrust_particleArrayIndices);
+
+  // now apply the new order to the position and velocity buffers
+  kernArrangeBuffer << <fullBlocksPerGrid, blockSize >> > (numObjects, dev_particleArrayIndices, dev_pos, dev_pos2);
+  checkCUDAErrorWithLine("rearranging position buffer (kernArrangeBuffer) failed!");
+  kernArrangeBuffer << <fullBlocksPerGrid, blockSize >> > (numObjects, dev_particleArrayIndices, dev_vel1, dev_vel2);
+  checkCUDAErrorWithLine("rearranging velocity buffer (kernArrangeBuffer) failed!");
+
+  std::swap(dev_pos, dev_pos2);
+  std::swap(dev_vel1, dev_vel2);
+
+  kernIdentifyCellStartEnd << <fullBlocksPerGrid, blockSize >> > (
+    numObjects,
+    dev_particleGridIndices,
+    dev_gridCellStartIndices,
+    dev_gridCellEndIndices);
+  checkCUDAErrorWithLine("kernIdentifyCellStartEnd failed!");
+
+  kernUpdateVelNeighborSearchCoherent << <fullBlocksPerGrid, blockSize >> > (
+    numObjects,
+    gridSideCount,
+    glm::vec3(-scene_scale),
+    1.0f / gridCellWidth,
+    gridCellWidth,
+    dev_gridCellStartIndices,
+    dev_gridCellEndIndices,
+    dev_pos,
+    dev_vel1,
+    dev_vel2);
+  checkCUDAErrorWithLine("kernUpdateVelNeighborSearchCoherent failed!");
+
+  kernUpdatePos << <fullBlocksPerGrid, blockSize >> > (numObjects, dt, dev_pos, dev_vel2);
+  checkCUDAErrorWithLine("kernUpdatePos failed!");
+
+  std::swap(dev_vel1, dev_vel2);
 }
 
 void Boids::endSimulation() {
@@ -390,6 +806,11 @@ void Boids::endSimulation() {
   cudaFree(dev_pos);
 
   // TODO-2.1 TODO-2.3 - Free any additional buffers here.
+  cudaFree(dev_pos2);
+  cudaFree(dev_particleArrayIndices);
+  cudaFree(dev_particleGridIndices);
+  cudaFree(dev_gridCellStartIndices);
+  cudaFree(dev_gridCellEndIndices);
 }
 
 void Boids::unitTest() {
diff --git a/src/main.cpp b/src/main.cpp
index b82c8c6..d67d91b 100644
--- a/src/main.cpp
+++ b/src/main.cpp
@@ -13,12 +13,12 @@
 // ================
 
 // LOOK-2.1 LOOK-2.3 - toggles for UNIFORM_GRID and COHERENT_GRID
-#define VISUALIZE 1
-#define UNIFORM_GRID 0
-#define COHERENT_GRID 0
+#define VISUALIZE 0
+#define UNIFORM_GRID 1
+#define COHERENT_GRID 1
 
 // LOOK-1.2 - change this to adjust particle count in the simulation
-const int N_FOR_VIS = 5000;
+const int N_FOR_VIS = 25000;
 const float DT = 0.2f;
 
 /**