diff --git a/README.md b/README.md
index d63a6a1..8c81cc8 100644
--- a/README.md
+++ b/README.md
@@ -1,11 +1,105 @@
+
Flocking Simulation w/ CUDA
+
+
+
+A flocking simulation based on the **Reynolds Flocking algorithm**, which mimics the behavior exhibited when a group of birds, called a flock, is foraging or in flight.
+
+Two levels of optimization were added to the implementation to significantly improve performance: **Scattered Uniform Grids** and **Coherent Uniform Grids**.
+
+
+
**University of Pennsylvania, CIS 565: GPU Programming and Architecture,
Project 1 - Flocking**
-* (TODO) YOUR NAME HERE
- * (TODO) [LinkedIn](), [personal website](), [twitter](), etc.
-* Tested on: (TODO) Windows 22, i7-2222 @ 2.22GHz 22GB, GTX 222 222MB (Moore 2222 Lab)
+* Anthony Mansur
+ * https://www.linkedin.com/in/anthony-mansur-ab3719125/
+* Tested on: Windows 10, AMD Ryzen 5 3600, Geforce RTX 2060 Super (personal)
+
+
+# Videos
+![](images/boid-small-2.gif)
+
+High number of boids: https://drive.google.com/file/d/1OR57owe2Tcw4ikUtUmHmEYstSUxrT7kX/view?usp=sharing
+
+# Screenshots
+
+Below are some screenshots of the boids simulation running. The ones with fewer boids were run with the naive implementation at N=50,000. The ones with more boids were run with the coherent uniform grids at N=300,000.
+
+## Few Boids Noisy
+
+![](images/boids-small-noisy.png)
+
+## Few Boids Flocking 1
+![](images/boids-small-flocking-1.png)
+
+## Few Boids Flocking 2
+![](images/boids-small-flocking-2.png)
+
+## Many Boids Noisy
+![](images/boids-big-noisy.png)
+
+## Many Boids Flocking 1
+![](images/boids-big-flocking-1.png)
+
+## Many Boids Flocking 2
+![](images/boids-big-flocking-2.png)
+
+## Many Boids Flocking 3
+![](images/boids-big-flocking-3.png)
+
+# Performance Analysis
+
+To test the performance of the three implementations, CUDA events were used to measure the execution time of each step function in the simulation (see https://developer.nvidia.com/blog/how-implement-performance-metrics-cuda-cc/).
+
+## Baseline
+To compare the speed of the brute-force, scattered uniform grid, and coherent uniform grid implementations, the average fps over consecutive 10-second intervals was calculated for a total of 100 seconds, at N = 100,000, DT = 0.2f, and blockSize = 128. As shown in the graphs below, the coherent uniform grid implementation is significantly faster than the scattered one, and the scattered one is significantly faster than the brute-force implementation.
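The interval averaging described above can be sketched as follows (a minimal CPU sketch; the frame times below are toy stand-ins for measured values, not data from the simulation):

```python
def interval_fps(frame_times_ms, interval_s=10.0):
    """Average fps per consecutive interval, from a list of per-frame times in ms."""
    fps = []
    elapsed, frames = 0.0, 0
    for t in frame_times_ms:
        elapsed += t / 1000.0
        frames += 1
        if elapsed >= interval_s:
            fps.append(frames / elapsed)  # close out this interval
            elapsed, frames = 0.0, 0
    return fps

# 2000 frames at 5 ms each covers one 10-second interval at ~200 fps
result = interval_fps([5.0] * 2000)
```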
+
+![](images/naive-baseline.png)
+
+![](images/scattered-baseline.png)
+
+![](images/coherent-baseline.png)
+
+![](images/comparison-baseline.png)
+
+
+## FPS as a Function of Number of Boids
+Next, we want to see how the fps of each implementation varies as we increase the number of boids in the simulation. To do this, we calculate the average fps over a 10-second run for each implementation, keeping DT = 0.2f and blockSize = 128. For the brute-force implementation, we go from N = 10,000 to N = 100,000. For the scattered uniform grid implementation, we go from N = 100,000 to N = 550,000, and for the coherent uniform grid implementation, we go from N = 100,000 to N = 1,000,000.
+
+Although these conclusions are empirical, we can see how many boids each implementation can handle before the fps drops to very low levels. Fun fact: it took N = 2,750,000 for the coherent implementation to fall to the same fps that the scattered implementation hits at N = 550,000.
+
+![](images/naive-boids.png)
+
+![](images/scattered-boids.png)
+
+![](images/coherent-boids.png)
+
+## FPS as a Function of Block Size
+I ran the coherent uniform grid implementation at N=100,000 and varied the block size from 32 to 1024 by an increment of 32. As shown in the plot below, the performance changed noticeably.
+
+![](images/block-size-performance.png)
+
+## Questions
+
+**For each implementation, how does changing the number of boids affect performance? Why do you think this is?**
+
+For all implementations, increasing the number of boids negatively affects performance, for several reasons. Although we run things in parallel, increasing the number of boids in the simulation gives each thread more work to do. In the naive implementation, each thread iterates over every boid in the simulation, so performance degrades very quickly as a function of N. In the uniform grid implementations, although we limit the number of boids to iterate over, increasing N also increases the average number of neighbors each boid has. That is why we still see a performance decrease as N grows for these two implementations, although the dip is far less extreme since we limit the search for each boid.
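A back-of-the-envelope sketch of the per-boid work makes the gap concrete (the grid resolution and 8-cell neighborhood below are illustrative assumptions, not measured values from this project):

```python
def naive_checks_per_boid(n):
    # brute force: every boid tests every other boid
    return n - 1

def grid_checks_per_boid(n, scene_cells, neighbor_cells=8):
    # uniform grid: only boids in the neighboring cells are tested;
    # assumes boids are spread roughly evenly over the grid
    avg_per_cell = n / scene_cells
    return neighbor_cells * avg_per_cell

n = 100_000
naive = naive_checks_per_boid(n)        # 99,999 distance checks per boid
grid = grid_checks_per_boid(n, 22**3)   # ~75 checks per boid for a 22^3 grid
```

Both grow with N, but the grid's constant factor is orders of magnitude smaller, which matches the much gentler fps dip seen in the plots.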
+
+**For each implementation, how does changing the block count and block size affect performance? Why do you think this is?**
+
+I only tested this for the coherent implementation, since sweeping the block size in increments of 32 took a while. Changing the block count and size affected performance noticeably, but without a very clear trend. The performance varied throughout, though it seemed to peak between 320 and 800 threads per block and was lower outside that range.
+
+The number of threads per block affects the resources used at each streaming multiprocessor (SM). As these resources change, we may make more trips out to device memory, causing latency, and we may end up with fewer resident warps per SM, leading to even more latency. Finding the block size that maximizes performance is quite difficult, and it is actively researched, so these results are not a surprise. If the block size is too small, then, since there is a limit to how many blocks an SM can hold, we will see performance hits. If the number of threads per block grows too large, we cannot add more blocks given the limit of threads per SM, and if that limited number of blocks uses too many resources, the number of warps resident per SM also drops, leading to more performance hits.
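The residency arithmetic behind the small-block penalty can be sketched as follows (the per-SM limits are assumed Turing-class values, and the sketch ignores register and shared-memory pressure, which also cap occupancy):

```python
def resident_threads(block_size, max_threads_per_sm=1024, max_blocks_per_sm=16):
    # an SM is capped both by total threads and by total blocks;
    # whichever limit bites first determines residency
    blocks = min(max_blocks_per_sm, max_threads_per_sm // block_size)
    return blocks * block_size

# Tiny blocks hit the blocks-per-SM cap and leave the SM half empty:
small = resident_threads(32)    # 16 blocks * 32 threads = 512 of 1024
# Mid-sized blocks can fill the SM completely:
full = resident_threads(128)    # 8 blocks * 128 threads = 1024
# Awkward large sizes also waste residency:
odd = resident_threads(736)     # only 1 block fits -> 736 of 1024
```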
+
+**For the coherent uniform grid: did you experience any performance improvements with the more coherent uniform grid? Was this the outcome you expected? Why or why not?**
+
+Yes, there are two main reasons why we see a significant performance improvement. First, there is one less global-memory read per boid, since we remove the middleman index array. Second, because the boids are reordered by cell, neighboring threads read neighboring memory, so those global-memory accesses are much faster than before. Although reordering the data each step is itself time-consuming, the benefit exceeds the cost in this parallel workload.
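The indirection removal can be illustrated on the CPU with toy data (`particle_array_indices` plays the role of the middleman buffer that `kernRearrangeBuffer` bakes into the data itself; the values are made up):

```python
pos = [100.0, 101.0, 102.0, 103.0]
particle_array_indices = [2, 0, 3, 1]  # boid ids in sorted-by-cell order

# Scattered: each read goes through the index array, so accesses jump around.
scattered_read = [pos[particle_array_indices[i]] for i in range(4)]

# Coherent: reshuffle the data once, then read contiguously with no indirection.
pos_coherent = [pos[j] for j in particle_array_indices]
coherent_read = [pos_coherent[i] for i in range(4)]

assert scattered_read == coherent_read  # same values, one fewer lookup per read
```

On the GPU the payoff is larger than the lookup count suggests: the coherent reads from consecutive threads coalesce into fewer memory transactions.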
+
+**Did changing cell width and checking 27 vs 8 neighboring cells affect performance? Why or why not? Be careful: it is insufficient (and possibly incorrect) to say that 27-cell is slower simply because there are more cells to check!**
+
+Yes. I ran the simulation at N = 100,000. When I halved the cell width, the simulation ran much faster, at 2403 fps compared to 1068 fps before the change. This is because smaller grid cells give a more fine-grained search for neighboring boids, and we can discard boids that we would have checked if the cells were larger. So, even though we read more cells, we eliminate distance checks against boids that were never going to be neighbors.
+
+However, when I ran the scattered uniform grid at N = 50,000 (the coherent version was so fast that it broke my simulation at a small N), the fps decreased to 10 fps. This is because at this N the average number of neighbors is smaller, so the extra cell checks are disadvantageous. Plus, for the scattered uniform grid, we still go through the middleman, so these extra cell reads are more expensive.
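One way to see why the finer grid can win, independent of memory effects: compare the worst-case volume scanned per boid, with d the maximum rule distance. Checking 8 cells of width 2d scans 8·(2d)³ = 64d³, while checking 27 cells of width d scans only 27d³:

```python
d = 1.0  # maximum neighborhood distance (units arbitrary)

volume_8_cells = 8 * (2 * d) ** 3   # cell width 2d, 2x2x2 cell search
volume_27_cells = 27 * d ** 3       # cell width d, 3x3x3 cell search

ratio = volume_27_cells / volume_8_cells  # ~0.42: finer cells scan under half the volume
```

So 27-vs-8 is not simply "more cells, more work": fewer candidate boids fall inside the smaller search volume, and whether that wins depends on density, as the N = 50,000 result above shows.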
+
-### (TODO: Your README)
-Include screenshots, analysis, etc. (Remember, this is public, so don't put
-anything here that you don't want to share with the world.)
diff --git a/images/block-size-performance.png b/images/block-size-performance.png
new file mode 100644
index 0000000..3c3cc87
Binary files /dev/null and b/images/block-size-performance.png differ
diff --git a/images/boid-small-2.gif b/images/boid-small-2.gif
new file mode 100644
index 0000000..f7833a7
Binary files /dev/null and b/images/boid-small-2.gif differ
diff --git a/images/boids-big-flocking-1.png b/images/boids-big-flocking-1.png
new file mode 100644
index 0000000..041e21c
Binary files /dev/null and b/images/boids-big-flocking-1.png differ
diff --git a/images/boids-big-flocking-2.png b/images/boids-big-flocking-2.png
new file mode 100644
index 0000000..37b4961
Binary files /dev/null and b/images/boids-big-flocking-2.png differ
diff --git a/images/boids-big-flocking-3.png b/images/boids-big-flocking-3.png
new file mode 100644
index 0000000..91dd7cf
Binary files /dev/null and b/images/boids-big-flocking-3.png differ
diff --git a/images/boids-big-flocking.png b/images/boids-big-flocking.png
new file mode 100644
index 0000000..c038ef2
Binary files /dev/null and b/images/boids-big-flocking.png differ
diff --git a/images/boids-big-noisy.png b/images/boids-big-noisy.png
new file mode 100644
index 0000000..1db7b86
Binary files /dev/null and b/images/boids-big-noisy.png differ
diff --git a/images/boids-small-flocking-1.png b/images/boids-small-flocking-1.png
new file mode 100644
index 0000000..a15ee6a
Binary files /dev/null and b/images/boids-small-flocking-1.png differ
diff --git a/images/boids-small-flocking-2.png b/images/boids-small-flocking-2.png
new file mode 100644
index 0000000..c20ba96
Binary files /dev/null and b/images/boids-small-flocking-2.png differ
diff --git a/images/boids-small-noisy.png b/images/boids-small-noisy.png
new file mode 100644
index 0000000..45c3e58
Binary files /dev/null and b/images/boids-small-noisy.png differ
diff --git a/images/coherent-baseline.png b/images/coherent-baseline.png
new file mode 100644
index 0000000..299eec6
Binary files /dev/null and b/images/coherent-baseline.png differ
diff --git a/images/coherent-boids.png b/images/coherent-boids.png
new file mode 100644
index 0000000..7f1ec1c
Binary files /dev/null and b/images/coherent-boids.png differ
diff --git a/images/comparison-baseline.png b/images/comparison-baseline.png
new file mode 100644
index 0000000..2c4e3b3
Binary files /dev/null and b/images/comparison-baseline.png differ
diff --git a/images/naive-baseline.png b/images/naive-baseline.png
new file mode 100644
index 0000000..c237d2c
Binary files /dev/null and b/images/naive-baseline.png differ
diff --git a/images/naive-boids.png b/images/naive-boids.png
new file mode 100644
index 0000000..6336ab2
Binary files /dev/null and b/images/naive-boids.png differ
diff --git a/images/scattered-baseline.png b/images/scattered-baseline.png
new file mode 100644
index 0000000..23a53e3
Binary files /dev/null and b/images/scattered-baseline.png differ
diff --git a/images/scattered-boids.png b/images/scattered-boids.png
new file mode 100644
index 0000000..6f42cd9
Binary files /dev/null and b/images/scattered-boids.png differ
diff --git a/src/kernel.cu b/src/kernel.cu
index 74dffcb..22f882d 100644
--- a/src/kernel.cu
+++ b/src/kernel.cu
@@ -52,7 +52,7 @@ void checkCUDAError(const char *msg, int line = -1) {
#define maxSpeed 1.0f
/*! Size of the starting area in simulation space. */
-#define scene_scale 100.0f
+#define scene_scale 100.0f // max it out at 400.0f
/***********************************************
* Kernel state (pointers are device pointers) *
@@ -83,8 +83,9 @@ thrust::device_ptr<int> dev_thrust_particleGridIndices;
int *dev_gridCellStartIndices; // What part of dev_particleArrayIndices belongs
int *dev_gridCellEndIndices; // to this cell?
-// TODO-2.3 - consider what additional buffers you might need to reshuffle
-// the position and velocity data to be coherent within cells.
+// For Coherent implementation
+glm::vec3* dev_pos2;
+glm::vec3* dev_vel3;
// LOOK-2.1 - Grid parameters based on simulation parameters.
// These are automatically computed for you in Boids::initSimulation
@@ -168,7 +169,25 @@ void Boids::initSimulation(int N) {
gridMinimum.y -= halfGridWidth;
gridMinimum.z -= halfGridWidth;
- // TODO-2.1 TODO-2.3 - Allocate additional buffers here.
+ // Allocate additional buffers here.
+ cudaMalloc((void**)&dev_particleArrayIndices, N * sizeof(int));
+ checkCUDAErrorWithLine("cudaMalloc dev_particleArrayIndices failed!");
+
+ cudaMalloc((void**)&dev_particleGridIndices, N * sizeof(int));
+ checkCUDAErrorWithLine("cudaMalloc dev_particleGridIndices failed!");
+
+ cudaMalloc((void**)&dev_gridCellStartIndices, N * sizeof(int));
+ checkCUDAErrorWithLine("cudaMalloc dev_gridCellStartIndices failed!");
+
+ cudaMalloc((void**)&dev_gridCellEndIndices, N * sizeof(int));
+ checkCUDAErrorWithLine("cudaMalloc dev_gridCellEndIndices failed!");
+
+ cudaMalloc((void**)&dev_pos2, N * sizeof(glm::vec3));
+ checkCUDAErrorWithLine("cudaMalloc dev_pos2 failed!");
+
+ cudaMalloc((void**)&dev_vel3, N * sizeof(glm::vec3));
+ checkCUDAErrorWithLine("cudaMalloc dev_vel3 failed!");
+
cudaDeviceSynchronize();
}
@@ -229,22 +248,83 @@ void Boids::copyBoidsToVBO(float *vbodptr_positions, float *vbodptr_velocities)
* Compute the new velocity on the body with index `iSelf` due to the `N` boids
* in the `pos` and `vel` arrays.
*/
-__device__ glm::vec3 computeVelocityChange(int N, int iSelf, const glm::vec3 *pos, const glm::vec3 *vel) {
- // Rule 1: boids fly towards their local perceived center of mass, which excludes themselves
- // Rule 2: boids try to stay a distance d away from each other
- // Rule 3: boids try to match the speed of surrounding boids
- return glm::vec3(0.0f, 0.0f, 0.0f);
+__device__ glm::vec3 computeVelocityChange(int N, int iSelf, const glm::vec3 *pos,
+ const glm::vec3 *vel) {
+ // variables related to self boid
+ glm::vec3 selfPos = pos[iSelf];
+ glm::vec3 selfVel = vel[iSelf];
+
+ // variables for rules
+ glm::vec3 perceivedCenter(0.f); // for cohesion
+ int numOfNeighborsForCohesion = 0;
+ glm::vec3 c(0.f); // for separation
+ glm::vec3 perceivedVelocity(0.f); // for alignment
+ int numOfNeighborsForAlignment = 0;
+
+ // Iterate through all boids and update the vectors declared above
+ for (int i = 0; i < N; i++)
+ {
+ if (i == iSelf)
+ continue;
+
+ glm::vec3 boidPos = pos[i];
+ glm::vec3 boidVel = vel[i];
+
+ float distanceToSelf = glm::distance(boidPos, selfPos);
+
+ // Rule 1: boids fly towards their local perceived center of mass, which excludes themselves
+ if (distanceToSelf < rule1Distance)
+ {
+ perceivedCenter += boidPos;
+ numOfNeighborsForCohesion++;
+ }
+
+ // Rule 2: boids try to stay a distance d away from each other
+ if (distanceToSelf < rule2Distance)
+ c -= (boidPos - selfPos);
+
+ // Rule 3: boids try to match the speed of surrounding boids
+ if (distanceToSelf < rule3Distance)
+ {
+ perceivedVelocity += boidVel;
+ numOfNeighborsForAlignment++;
+ }
+ }
+
+ // Average the vectors (excluding c), guarding against empty neighborhoods
+ // to avoid dividing by zero
+ glm::vec3 cohesion(0.f);
+ glm::vec3 alignment(0.f);
+ if (numOfNeighborsForCohesion > 0)
+ {
+ perceivedCenter /= numOfNeighborsForCohesion;
+ cohesion = (perceivedCenter - selfPos) * rule1Scale;
+ }
+ if (numOfNeighborsForAlignment > 0)
+ {
+ perceivedVelocity /= numOfNeighborsForAlignment;
+ alignment = perceivedVelocity * rule3Scale;
+ }
+ glm::vec3 separation = c * rule2Scale;
+
+ return selfVel + cohesion + separation + alignment;
}
/**
-* TODO-1.2 implement basic flocking
+* Basic Flocking
* For each of the `N` bodies, update its position based on its current velocity.
*/
__global__ void kernUpdateVelocityBruteForce(int N, glm::vec3 *pos,
glm::vec3 *vel1, glm::vec3 *vel2) {
+ int index = threadIdx.x + (blockIdx.x * blockDim.x);
+ if (index >= N)
+ {
+ return;
+ }
+
// Compute a new velocity based on pos and vel1
+ glm::vec3 velocity = computeVelocityChange(N, index, pos, vel1);
+
// Clamp the speed
+ float speed = glm::length(velocity);
+ if (speed > maxSpeed)
+ {
+ velocity = maxSpeed * glm::normalize(velocity);
+ }
+
// Record the new velocity into vel2. Question: why NOT vel1?
+ // Answer: we don't want to overwrite the velocities that are still being
+ // used by the other indices as they compute their velocity changes
+ vel2[index] = velocity;
}
/**
@@ -278,17 +358,42 @@ __global__ void kernUpdatePos(int N, float dt, glm::vec3 *pos, glm::vec3 *vel) {
// for(x)
// for(y)
// for(z)? Or some other order?
+// z-y-x order, so that x (no multiplications) varies fastest and z (the most
+// multiplications) changes least often
__device__ int gridIndex3Dto1D(int x, int y, int z, int gridResolution) {
return x + y * gridResolution + z * gridResolution * gridResolution;
}
+__device__ glm::vec3 gridIndex1Dto3D(int i, int gridResolution) {
+ int z = i / (gridResolution * gridResolution);
+ int y = (i - (z * gridResolution * gridResolution)) / gridResolution;
+ int x = i - (z * gridResolution * gridResolution) - (y * gridResolution);
+ return glm::vec3(x, y, z);
+}
+
+__device__ glm::vec3 gridPosToIndex3D(const glm::vec3* gridPos, const glm::vec3* gridMin, float inverseGridWidth)
+{
+ glm::vec3 diff = *gridPos - *gridMin;
+ glm::vec3 gridInx{ 0 };
+ gridInx.x = (int) floor(diff.x * inverseGridWidth);
+ gridInx.y = (int) floor(diff.y * inverseGridWidth);
+ gridInx.z = (int) floor(diff.z * inverseGridWidth);
+ return gridInx;
+}
+
__global__ void kernComputeIndices(int N, int gridResolution,
glm::vec3 gridMin, float inverseCellWidth,
glm::vec3 *pos, int *indices, int *gridIndices) {
- // TODO-2.1
- // - Label each boid with the index of its grid cell.
- // - Set up a parallel array of integer indices as pointers to the actual
- // boid data in pos and vel1/vel2
+ // Label each boid with the index of its grid cell.
+ // Set up a parallel array of integer indices as pointers to the actual
+ // boid data in pos and vel1/vel2
+ int index = (blockIdx.x * blockDim.x) + threadIdx.x;
+ if (index < N)
+ {
+ indices[index] = index;
+
+ glm::vec3 gridLoc = floor((pos[index] - gridMin) * inverseCellWidth); // 3D grid index
+ gridIndices[index] = gridIndex3Dto1D((int)gridLoc.x, (int)gridLoc.y, (int)gridLoc.z, gridResolution);
+ }
}
// LOOK-2.1 Consider how this could be useful for indicating that a cell
@@ -302,10 +407,38 @@ __global__ void kernResetIntBuffer(int N, int *intBuffer, int value) {
__global__ void kernIdentifyCellStartEnd(int N, int *particleGridIndices,
int *gridCellStartIndices, int *gridCellEndIndices) {
- // TODO-2.1
// Identify the start point of each cell in the gridIndices array.
// This is basically a parallel unrolling of a loop that goes
// "this index doesn't match the one before it, must be a new cell!"
+ int index = (blockIdx.x * blockDim.x) + threadIdx.x;
+ int prevIndex = index - 1;
+
+ if (index >= N)
+ return;
+
+ int cellIndex = particleGridIndices[index];
+
+ if (index == 0)
+ {
+ // The first boid always starts the first occupied cell
+ gridCellStartIndices[cellIndex] = index;
+ }
+ else
+ {
+ int prevCellIndex = particleGridIndices[prevIndex];
+ if (prevCellIndex != cellIndex)
+ {
+ // This boid is the start of a new cell
+ gridCellStartIndices[cellIndex] = index;
+
+ // The previous boid was the last boid of the previous cell
+ gridCellEndIndices[prevCellIndex] = prevIndex;
+ }
+ }
+
+ // The last boid always ends the last occupied cell. This is not exclusive
+ // with also starting a new cell, so it must be a separate check.
+ if (index == (N - 1))
+ {
+ gridCellEndIndices[cellIndex] = index;
+ }
}
__global__ void kernUpdateVelNeighborSearchScattered(
@@ -314,14 +447,115 @@ __global__ void kernUpdateVelNeighborSearchScattered(
int *gridCellStartIndices, int *gridCellEndIndices,
int *particleArrayIndices,
glm::vec3 *pos, glm::vec3 *vel1, glm::vec3 *vel2) {
- // TODO-2.1 - Update a boid's velocity using the uniform grid to reduce
+ // Update a boid's velocity using the uniform grid to reduce
// the number of boids that need to be checked.
- // - Identify the grid cell that this particle is in
+ int index = (blockIdx.x * blockDim.x) + threadIdx.x;
+
+ if (index >= N)
+ return;
+
// - Identify which cells may contain neighbors. This isn't always 8.
- // - For each cell, read the start/end indices in the boid pointer array.
- // - Access each boid in the cell and compute velocity change from
- // the boids rules, if this boid is within the neighborhood distance.
- // - Clamp the speed change before putting the new speed in vel2
+ float neighborhoodDistance = glm::max(glm::max(rule1Distance, rule2Distance), rule3Distance);
+ glm::vec3 selfPos = pos[index];
+
+ glm::vec3 minNeighbor = selfPos - neighborhoodDistance;
+ glm::vec3 maxNeighbor = selfPos + neighborhoodDistance;
+
+ glm::vec3 minGridPos = gridPosToIndex3D(&minNeighbor, &gridMin, inverseCellWidth);
+ glm::vec3 maxGridPos = gridPosToIndex3D(&maxNeighbor, &gridMin, inverseCellWidth);
+
+ glm::vec3 selfGridPos = gridPosToIndex3D(&selfPos, &gridMin, inverseCellWidth);
+
+ // Boids
+ glm::vec3 perceivedCenter(0.f);
+ glm::vec3 c(0.f);
+ glm::vec3 perceivedVelocity(0.f);
+ int numOfNeighborsForCohesion = 0;
+ int numOfNeighborsForAlignment = 0;
+ glm::vec3 selfVel = vel1[index];
+
+ for (int z = minGridPos.z; z <= maxGridPos.z; z++)
+ {
+ for (int y = minGridPos.y; y <= maxGridPos.y; y++)
+ {
+ for (int x = minGridPos.x; x <= maxGridPos.x; x++)
+ {
+ // check if out of bounds
+ if (x < 0 || y < 0 || z < 0 || x >= gridResolution || y >= gridResolution || z >= gridResolution)
+ continue;
+
+ // For each cell, read the start/end indices in the boid pointer array.
+ int neighborCellIndex = gridIndex3Dto1D(x, y, z, gridResolution);
+ int startInx = gridCellStartIndices[neighborCellIndex];
+ int endInx = gridCellEndIndices[neighborCellIndex];
+ if (startInx != -1 && endInx != -1)
+ {
+ // Access each boid in the cell and compute velocity change from
+ // the boids rules, if this boid is within the neighborhood distance.
+ // endInx is inclusive (see kernIdentifyCellStartEnd), so use <=
+ for (int i = startInx; i <= endInx; i++)
+ {
+ int locInx = particleArrayIndices[i];
+
+ // skip ourselves (compare boid indices, not sorted-array positions)
+ if (locInx == index)
+ continue;
+
+ glm::vec3 boidPos = pos[locInx];
+ glm::vec3 boidVel = vel1[locInx];
+
+ float distanceToSelf = glm::distance(boidPos, selfPos);
+
+ // Rule 1: boids fly towards their local perceived center of mass, which excludes themselves
+ if (distanceToSelf < rule1Distance)
+ {
+ perceivedCenter += boidPos;
+ numOfNeighborsForCohesion++;
+ }
+
+ // Rule 2: boids try to stay a distance d away from each other
+ if (distanceToSelf < rule2Distance)
+ c -= (boidPos - selfPos);
+
+ // Rule 3: boids try to match the speed of surrounding boids
+ if (distanceToSelf < rule3Distance)
+ {
+ perceivedVelocity += boidVel;
+ numOfNeighborsForAlignment++;
+ }
+ }
+ }
+ }
+ }
+ }
+
+ glm::vec3 cohesionVel{ 0.f };
+ glm::vec3 separationVel{ 0.f };
+ glm::vec3 alignmentVel{ 0.f };
+
+ if (numOfNeighborsForCohesion > 0)
+ {
+ perceivedCenter /= numOfNeighborsForCohesion;
+ cohesionVel = (perceivedCenter - selfPos) * rule1Scale;
+ }
+
+ separationVel = c * rule2Scale;
+
+ if (numOfNeighborsForAlignment > 0)
+ {
+ perceivedVelocity /= numOfNeighborsForAlignment;
+ alignmentVel = perceivedVelocity * rule3Scale;
+ }
+
+ glm::vec3 boidVelocity = selfVel + cohesionVel + separationVel + alignmentVel;
+
+ // Clamp the speed change before putting the new speed in vel2
+ // TODO: refactor
+ float speed = glm::length(boidVelocity);
+ if (speed > maxSpeed)
+ {
+ boidVelocity = maxSpeed * glm::normalize(boidVelocity);
+ }
+
+ vel2[index] = boidVelocity;
}
__global__ void kernUpdateVelNeighborSearchCoherent(
@@ -329,59 +563,278 @@ __global__ void kernUpdateVelNeighborSearchCoherent(
float inverseCellWidth, float cellWidth,
int *gridCellStartIndices, int *gridCellEndIndices,
glm::vec3 *pos, glm::vec3 *vel1, glm::vec3 *vel2) {
- // TODO-2.3 - This should be very similar to kernUpdateVelNeighborSearchScattered,
+ // This should be very similar to kernUpdateVelNeighborSearchScattered,
// except with one less level of indirection.
- // This should expect gridCellStartIndices and gridCellEndIndices to refer
- // directly to pos and vel1.
- // - Identify the grid cell that this particle is in
+
+ int index = (blockIdx.x * blockDim.x) + threadIdx.x;
+
+ if (index >= N)
+ return;
+
// - Identify which cells may contain neighbors. This isn't always 8.
- // - For each cell, read the start/end indices in the boid pointer array.
- // DIFFERENCE: For best results, consider what order the cells should be
- // checked in to maximize the memory benefits of reordering the boids data.
- // - Access each boid in the cell and compute velocity change from
- // the boids rules, if this boid is within the neighborhood distance.
+ float neighborhoodDistance = glm::max(glm::max(rule1Distance, rule2Distance), rule3Distance);
+ glm::vec3 selfPos = pos[index];
+
+ glm::vec3 minNeighbor = selfPos - neighborhoodDistance;
+ glm::vec3 maxNeighbor = selfPos + neighborhoodDistance;
+
+ glm::vec3 minGridPos = gridPosToIndex3D(&minNeighbor, &gridMin, inverseCellWidth);
+ glm::vec3 maxGridPos = gridPosToIndex3D(&maxNeighbor, &gridMin, inverseCellWidth);
+
+ glm::vec3 selfGridPos = gridPosToIndex3D(&selfPos, &gridMin, inverseCellWidth);
+
+ // boids
+ glm::vec3 perceivedCenter(0.f);
+ glm::vec3 c(0.f);
+ glm::vec3 perceivedVelocity(0.f);
+ int numOfNeighborsForCohesion = 0;
+ int numOfNeighborsForAlignment = 0;
+ glm::vec3 selfVel = vel1[index];
+
+ for (int z = minGridPos.z; z <= maxGridPos.z; z++)
+ {
+ for (int y = minGridPos.y; y <= maxGridPos.y; y++)
+ {
+ for (int x = minGridPos.x; x <= maxGridPos.x; x++)
+ {
+ // check if out of bounds
+ if (x < 0 || y < 0 || z < 0 || x >= gridResolution || y >= gridResolution || z >= gridResolution)
+ continue;
+
+ // - For each cell, read the start/end indices in the boid pointer array.
+ int neighborCellIndex = gridIndex3Dto1D(x, y, z, gridResolution);
+
+ int startInx = gridCellStartIndices[neighborCellIndex];
+ int endInx = gridCellEndIndices[neighborCellIndex];
+ if (startInx != -1 && endInx != -1)
+ {
+ // - Access each boid in the cell and compute velocity change from
+ // the boids rules, if this boid is within the neighborhood distance.
+ // TODO: refactor
+ // endInx is inclusive (see kernIdentifyCellStartEnd), so use <=
+ for (int i = startInx; i <= endInx; i++)
+ {
+ if (i == index)
+ continue;
+
+ glm::vec3 boidPos = pos[i];
+ glm::vec3 boidVel = vel1[i];
+
+ float distanceToSelf = glm::distance(boidPos, selfPos);
+
+ // Rule 1: boids fly towards their local perceived center of mass, which excludes themselves
+ if (distanceToSelf < rule1Distance)
+ {
+ perceivedCenter += boidPos;
+ numOfNeighborsForCohesion++;
+ }
+
+ // Rule 2: boids try to stay a distance d away from each other
+ if (distanceToSelf < rule2Distance)
+ c -= (boidPos - selfPos);
+
+ // Rule 3: boids try to match the speed of surrounding boids
+ if (distanceToSelf < rule3Distance)
+ {
+ perceivedVelocity += boidVel;
+ numOfNeighborsForAlignment++;
+ }
+ }
+ }
+ }
+ }
+ }
+
+ glm::vec3 cohesionVel{ 0.f };
+ glm::vec3 separationVel{ 0.f };
+ glm::vec3 alignmentVel{ 0.f };
+
+ if (numOfNeighborsForCohesion > 0)
+ {
+ perceivedCenter /= numOfNeighborsForCohesion;
+ cohesionVel = (perceivedCenter - selfPos) * rule1Scale;
+ }
+
+ separationVel = c * rule2Scale;
+
+ if (numOfNeighborsForAlignment > 0)
+ {
+ perceivedVelocity /= numOfNeighborsForAlignment;
+ alignmentVel = perceivedVelocity * rule3Scale;
+ }
+
+ glm::vec3 boidVelocity = selfVel + cohesionVel + separationVel + alignmentVel;
+
// - Clamp the speed change before putting the new speed in vel2
+ // TODO: refactor
+ float speed = glm::length(boidVelocity);
+ if (speed > maxSpeed)
+ {
+ boidVelocity = maxSpeed * glm::normalize(boidVelocity);
+ }
+
+ vel2[index] = boidVelocity;
+}
+
+__global__ void kernRearrangeBuffer(glm::vec3* dst, glm::vec3* src, int* ref, int size)
+{
+ int index = (blockIdx.x * blockDim.x) + threadIdx.x;
+
+ if (index >= size)
+ return;
+
+ dst[index] = src[ref[index]];
+}
+
+__global__ void kernRevertBuffer(glm::vec3* dst, glm::vec3* src, int* ref, int size)
+{
+ int index = (blockIdx.x * blockDim.x) + threadIdx.x;
+
+ if (index >= size)
+ return;
+
+ dst[ref[index]] = src[index];
}
+
/**
* Step the entire N-body simulation by `dt` seconds.
*/
void Boids::stepSimulationNaive(float dt) {
- // TODO-1.2 - use the kernels you wrote to step the simulation forward in time.
- // TODO-1.2 ping-pong the velocity buffers
+ // Use the kernels you wrote to step the simulation forward in time.
+
+ dim3 fullBlocksPerGrid((numObjects + blockSize - 1) / blockSize);
+
+ // update the velocity vector
+ kernUpdateVelocityBruteForce<<<fullBlocksPerGrid, blockSize>>>(numObjects, dev_pos, dev_vel1, dev_vel2);
+ checkCUDAErrorWithLine("kernel kernUpdateVelocityBruteForce failed!");
+
+ // update the position based on the new velocity vector
+ kernUpdatePos<<<fullBlocksPerGrid, blockSize>>>(numObjects, dt, dev_pos, dev_vel2);
+ checkCUDAErrorWithLine("kernel kernUpdatePos failed!");
+
+ // Ping-pong the velocity buffers
+ glm::vec3* temp = dev_vel1;
+ dev_vel1 = dev_vel2;
+ dev_vel2 = temp;
}
void Boids::stepSimulationScatteredGrid(float dt) {
- // TODO-2.1
// Uniform Grid Neighbor search using Thrust sort.
// In Parallel:
+
// - label each particle with its array index as well as its grid index.
// Use 2x width grids.
- // - Unstable key sort using Thrust. A stable sort isn't necessary, but you
- // are welcome to do a performance comparison.
- // - Naively unroll the loop for finding the start and end indices of each
- // cell's data pointers in the array of boid indices
- // - Perform velocity updates using neighbor search
- // - Update positions
- // - Ping-pong buffers as needed
+ dim3 fullBlocksPerGrid((numObjects + blockSize - 1) / blockSize);
+ kernComputeIndices<<<fullBlocksPerGrid, blockSize>>>(
+ numObjects,
+ gridSideCount,
+ gridMinimum,
+ gridInverseCellWidth,
+ dev_pos,
+ dev_particleArrayIndices,
+ dev_particleGridIndices
+ );
+
+ // Unstable key sort using Thrust. A stable sort isn't necessary, but you
+ // are welcome to do a performance comparison.
+ thrust::device_ptr<int> dev_thrust_keys(dev_particleGridIndices);
+ thrust::device_ptr<int> dev_thrust_values(dev_particleArrayIndices);
+ thrust::sort_by_key(dev_thrust_keys, dev_thrust_keys + numObjects, dev_thrust_values);
+
+ // Naively unroll the loop for finding the start and end indices of each
+ // cell's data pointers in the array of boid indices
+ dim3 fullBlocksPerGridCells((gridCellCount + blockSize - 1) / blockSize);
+ kernResetIntBuffer<<<fullBlocksPerGridCells, blockSize>>>(gridCellCount, dev_gridCellStartIndices, -1);
+ checkCUDAErrorWithLine("kernel kernResetIntBuffer with dev_gridCellStartIndices failed!");
+ kernResetIntBuffer<<<fullBlocksPerGridCells, blockSize>>>(gridCellCount, dev_gridCellEndIndices, -1);
+ checkCUDAErrorWithLine("kernel kernResetIntBuffer with dev_gridCellEndIndices failed!");
+
+ kernIdentifyCellStartEnd<<<fullBlocksPerGrid, blockSize>>>(
+ numObjects,
+ dev_particleGridIndices,
+ dev_gridCellStartIndices,
+ dev_gridCellEndIndices);
+ checkCUDAErrorWithLine("kernel kernIdentifyCellStartEnd failed!");
+
+ // Perform velocity updates using neighbor search
+ kernUpdateVelNeighborSearchScattered<<<fullBlocksPerGrid, blockSize>>>(
+ numObjects, gridSideCount, gridMinimum, gridInverseCellWidth, gridCellWidth,
+ dev_gridCellStartIndices, dev_gridCellEndIndices, dev_particleArrayIndices,
+ dev_pos, dev_vel1, dev_vel2
+ );
+ checkCUDAErrorWithLine("kernel kernUpdateVelNeighborSearchScattered failed!");
+
+ // Update positions
+ kernUpdatePos<<<fullBlocksPerGrid, blockSize>>>(numObjects, dt, dev_pos, dev_vel2);
+ checkCUDAErrorWithLine("kernel kernUpdatePos failed!");
+
+
+ // Ping-pong buffers as needed
+ glm::vec3* temp = dev_vel1;
+ dev_vel1 = dev_vel2;
+ dev_vel2 = temp;
}
void Boids::stepSimulationCoherentGrid(float dt) {
- // TODO-2.3 - start by copying Boids::stepSimulationNaiveGrid
+ // start by copying Boids::stepSimulationNaiveGrid
// Uniform Grid Neighbor search using Thrust sort on cell-coherent data.
// In Parallel:
- // - Label each particle with its array index as well as its grid index.
- // Use 2x width grids
- // - Unstable key sort using Thrust. A stable sort isn't necessary, but you
- // are welcome to do a performance comparison.
- // - Naively unroll the loop for finding the start and end indices of each
- // cell's data pointers in the array of boid indices
- // - BIG DIFFERENCE: use the rearranged array index buffer to reshuffle all
- // the particle data in the simulation array.
- // CONSIDER WHAT ADDITIONAL BUFFERS YOU NEED
- // - Perform velocity updates using neighbor search
- // - Update positions
- // - Ping-pong buffers as needed. THIS MAY BE DIFFERENT FROM BEFORE.
+
+ // Label each particle with its array index as well as its grid index.
+ // Use 2x width grids.
+ dim3 fullBlocksPerGrid((numObjects + blockSize - 1) / blockSize);
+ kernComputeIndices<<<fullBlocksPerGrid, blockSize>>>(
+ numObjects,
+ gridSideCount,
+ gridMinimum,
+ gridInverseCellWidth,
+ dev_pos,
+ dev_particleArrayIndices,
+ dev_particleGridIndices
+ );
+
+ // Unstable key sort using Thrust. A stable sort isn't necessary, but you
+ // are welcome to do a performance comparison.
+ thrust::device_ptr<int> dev_thrust_keys(dev_particleGridIndices);
+ thrust::device_ptr<int> dev_thrust_values(dev_particleArrayIndices);
+ thrust::sort_by_key(dev_thrust_keys, dev_thrust_keys + numObjects, dev_thrust_values);
+
+ // Use dev_particleArrayIndices to Rearrange dev_pos2 and dev_vel2
+ kernRearrangeBuffer<<<fullBlocksPerGrid, blockSize>>>(dev_pos2, dev_pos, dev_particleArrayIndices, numObjects);
+ checkCUDAErrorWithLine("kernel kernRearrangeBuffer from dev_pos to dev_pos2 failed!");
+
+ kernRearrangeBuffer<<<fullBlocksPerGrid, blockSize>>>(dev_vel2, dev_vel1, dev_particleArrayIndices, numObjects);
+ checkCUDAErrorWithLine("kernel kernRearrangeBuffer from dev_vel1 to dev_vel2 failed!");
+
+ // Naively unroll the loop for finding the start and end indices of each
+ // cell's data pointers in the array of boid indices
+ dim3 fullBlocksPerGridCells((gridCellCount + blockSize - 1) / blockSize);
+ kernResetIntBuffer<<<fullBlocksPerGridCells, blockSize>>>(gridCellCount, dev_gridCellStartIndices, -1);
+ checkCUDAErrorWithLine("kernel kernResetIntBuffer with dev_gridCellStartIndices failed!");
+ kernResetIntBuffer<<<fullBlocksPerGridCells, blockSize>>>(gridCellCount, dev_gridCellEndIndices, -1);
+ checkCUDAErrorWithLine("kernel kernResetIntBuffer with dev_gridCellEndIndices failed!");
+
+ kernIdentifyCellStartEnd<<<fullBlocksPerGrid, blockSize>>>(
+ numObjects,
+ dev_particleGridIndices,
+ dev_gridCellStartIndices,
+ dev_gridCellEndIndices);
+ checkCUDAErrorWithLine("kernel kernIdentifyCellStartEnd failed!");
+
+ // Perform velocity updates using neighbor search
+ kernUpdateVelNeighborSearchCoherent<<<fullBlocksPerGrid, blockSize>>>(
+ numObjects, gridSideCount, gridMinimum, gridInverseCellWidth, gridCellWidth,
+ dev_gridCellStartIndices, dev_gridCellEndIndices,
+ dev_pos2, dev_vel2, dev_vel3
+ );
+ checkCUDAErrorWithLine("kernel kernUpdateVelNeighborSearchCoherent failed!");
+
+ // Update positions
+ kernUpdatePos<<<fullBlocksPerGrid, blockSize>>>(numObjects, dt, dev_pos2, dev_vel3);
+ checkCUDAErrorWithLine("kernel kernUpdatePos failed!");
+
+ // Scatter dev_pos2 and dev_vel3 back into dev_pos and dev_vel1
+ kernRevertBuffer<<<fullBlocksPerGrid, blockSize>>>(dev_pos, dev_pos2, dev_particleArrayIndices, numObjects);
+ kernRevertBuffer<<<fullBlocksPerGrid, blockSize>>>(dev_vel1, dev_vel3, dev_particleArrayIndices, numObjects);
+ checkCUDAErrorWithLine("kernel kernRevertBuffer failed!");
}
void Boids::endSimulation() {
@@ -389,7 +842,12 @@ void Boids::endSimulation() {
cudaFree(dev_vel2);
cudaFree(dev_pos);
- // TODO-2.1 TODO-2.3 - Free any additional buffers here.
+ // Free any additional buffers here.
+ cudaFree(dev_particleArrayIndices);
+ cudaFree(dev_particleGridIndices);
+ cudaFree(dev_gridCellStartIndices);
+ cudaFree(dev_gridCellEndIndices);
+ cudaFree(dev_pos2);
+ cudaFree(dev_vel3);
}
void Boids::unitTest() {
diff --git a/src/main.cpp b/src/main.cpp
index b82c8c6..a93e9c1 100644
--- a/src/main.cpp
+++ b/src/main.cpp
@@ -7,6 +7,7 @@
*/
#include "main.hpp"
+#include <iomanip> // std::fixed / std::setprecision for the timing printout
// ================
// Configuration
@@ -16,11 +17,18 @@
#define VISUALIZE 1
#define UNIFORM_GRID 0
#define COHERENT_GRID 0
+#define TIMING_ANALYSIS 1
+#define UNIT_TEST 1
+
+#define FIXED_FLOAT(x) std::fixed << std::setprecision(4) << (x)
+
+  #if TIMING_ANALYSIS
+  if (timePassed > 10000)
+ {
+ std::cout << "\nAverage time step: " << FIXED_FLOAT(timePassed / numberOfSteps) << "(ms)" << std::endl;
+ std::cout << "Average frames per second: " << FIXED_FLOAT(1000 * numberOfSteps / timePassed) << "(fps)" << std::endl;
+ timePassed = 0.f;
+ numberOfSteps = 0;
+ }
+ #endif
#if VISUALIZE
Boids::copyBoidsToVBO(dptrVertPositions, dptrVertVelocities);
#endif
@@ -217,8 +250,10 @@ void initShaders(GLuint * program) {
double timebase = 0;
int frame = 0;
+ #if UNIT_TEST
Boids::unitTest(); // LOOK-1.2 We run some basic example code to make sure
// your CUDA development setup is ready to go.
+ #endif
while (!glfwWindowShouldClose(window)) {
glfwPollEvents();