diff --git a/README.md b/README.md
index d63a6a1..dc83e8d 100644
--- a/README.md
+++ b/README.md
@@ -1,11 +1,74 @@
 **University of Pennsylvania, CIS 565: GPU Programming and Architecture,
 Project 1 - Flocking**
 
-* (TODO) YOUR NAME HERE
-  * (TODO) [LinkedIn](), [personal website](), [twitter](), etc.
-* Tested on: (TODO) Windows 22, i7-2222 @ 2.22GHz 22GB, GTX 222 222MB (Moore 2222 Lab)
+* Ashley Alexander-Lee
+  * [LinkedIn](linkedin.com/in/asalexanderlee), [Personal Website](https://asalexanderlee.myportfolio.com/)
+* Tested on: Windows 10, i7-6700 @ 3.40GHz 16GB, Quadro P1000 (Moore 100A Lab)
 
-### (TODO: Your README)
+### Project Description
 
-Include screenshots, analysis, etc. (Remember, this is public, so don't put
-anything here that you don't want to share with the world.)
+![Flocking Simulation GIF](/images/coherent.gif)
+
+This project involves creating a flocking simulation, where "boids", or particles, behave according to three rules:
+
+1. **cohesion** - boids move towards the perceived center of mass of their neighbors
+2. **separation** - boids avoid getting to close to their neighbors
+3. **alignment** - boids generally try to move with the same direction and speed as their neighbors
+
+These three rules help dictate the velocity of the boids at each timestep. The 3D flocking logic is based off of this [2D implementation in Processing](http://studio.sketchpad.cc/sp/pad/view/ro.9cbgCRcgbPOI6/rev.23). 
+
+I've implemented this simulation using three different methods:
+
+1. **NAIVE** - Each boid checks every other boid in the simulation against its neighborhood to compute its resulting velocity.
+
+2. **UNIFORM GRID** - Each boid is pre-processed into a grid cell that has the width of a neighborhood * 2. The boid just needs to check the boids in 8 surrounding cells against its neighborhood to compute its resulting velocity. Four arrays are used to keep track of this information:
+   * `particleGridIndices`: each boid's corresponding cell; sorted with `particleArrayIndices` so that the cells are contiguous
+   * `particleArrayIndices`: the original boid indices, pointers to a boid's position and velocity -- sorted using `particleGridIndices` as the key
+   * `cellStartIndices`: the start index of each cell
+   * `cellEndIndices`: the end index of each cell
+
+3. **COHERENT GRID** - Like above, except that the boid positions and velocities are sorted so that the cells are contiguous in memory. 
+
+### Performance Analysis
+
+#### Boids vs. FPS
+I've charted below how the FPS declines as the numbers of boids increases (with the visualization on and off).
+
+![How Boids Affect FPS](/images/howBoidsAffectFPS1.png)
+![How Boids Affect FPS](/images/howBoidsAffectFPS2.png)
+
+As you can see, the FPS declines exponentially as the number of boids increases. There is also a significant difference in performance between the naive method and the grid methods. This makes sense, since the naive method runs in `O(n^2)` time, while the grid method runs in `O(n*m)` time, where m refers to some subset of boids (the worst case could be `O(n^2)` if all boids are in the same cells). Finally, you can observe the slight performance improvement in the coherent grid method over the uniform grid method, likely due to the cell data being contiguous in memory. 
+
+#### Block Size vs. FPS
+You can see a mild falloff in FPS as the block size increases, but not a significant one. The **block size** affects how many threads run together, and is guaranteed to run on the same SM. It makes sense that there is no significant performance change, since none of the threads rely on each other. So, it doesn't matter whether 1024 threads run as 32 blocks of 32 threads or 1 block of 1024 threads.
+
+| Block Size | FPS |
+| ---------- |---- |
+| 32 | 1720 |
+| 64 | 1750 |
+| 128 | 1741 |
+| 256 | 1737 |
+| 512 | 1735 |
+| 1024 | 1723 |
+
+#### Question Responses
+
+**Q: For each implementation how does changing the number of boids affect performance? Why do you think this is?**
+
+As you can see in the charts I included above, the framerate decreases exponentially as the number of boids increases. This makes sense, since for each boid, you have to check a number of neighbors. As the number of boids increases, there are more boids to factor into each boid's velocity calculation. It also makes sense that the framerate plunges more quickly for the naive method, since it runs in `O(n^2)` time, while the grid methods (though technically running in `O(n^2)` worst-case time) run in approximately `O(n * 8(n / numGridCells))`. 
+
+**Q: For each implementation, how does changing the block count and block size affect performance? Why do you think this is?**
+
+I explain this briefly above, but I didn't notice much change in the FPS as the block size changed. I believe this is because none of the threads depend on each other, therefore it doesn't matter how they are separated into blocks. 
+
+**Q: For the coherent uniform grid: did you experience any performance improvements with the more coherent uniform grid? Was this the outcome you expected? Why or why not?**
+
+I did -- the performance was better, as expected. I imagine the difference is due to the fact that global memory access is slow, and the sorting of the positions and velocities for the coherent uniform grid leads to more contiguous memory accesses. This means that more global memory is likely to be read at once, saving time.
+
+**Q: Did changing cell width and checking 27 vs 8 neighboring cells affect performance? Why or why not? Be careful: it is insufficient (and possibly incorrect) to say that 27-cell is slower simply because there are more cells to check!**
+
+Interestingly enough, changing the cell width and checking 27 neighbors improved performance slightly. If you break it down, you can see (in the chart below) that decreasing the cell width to just the neighborhood distance greatly increases performance, while changing the number of cells checked from 8 to 27 greatly decreases the performance. The cell width change likely improved performance so much because there were fewer neighbors to check for each boid. Also, increasing the number of cells checked from 8 to 27 likely decreased performance so much due to the increased number of boids each boid has to check. It makes sense that the two changes would even out in performance, since the number of boids the cell width change removed was probably fairly even with the number the 27-cell check added.
+
+| Cell Width | Cell Width / 2 | 8-Cell Check | 27-Cell Check | Cell Width / 2 + 27-Cell Check |
+| ---------- | -------------- | ------------ | ------------- | ------------------------------ |
+| 1740 FPS   | 1912 FPS       | 1740 FPS     | 1330 FPS      | 1756 FPS                       |
diff --git a/images/coherent.gif b/images/coherent.gif
new file mode 100644
index 0000000..967d90f
Binary files /dev/null and b/images/coherent.gif differ
diff --git a/images/coherent.png b/images/coherent.png
new file mode 100644
index 0000000..33756e9
Binary files /dev/null and b/images/coherent.png differ
diff --git a/images/howBoidsAffectFPS1.png b/images/howBoidsAffectFPS1.png
new file mode 100644
index 0000000..fbcf386
Binary files /dev/null and b/images/howBoidsAffectFPS1.png differ
diff --git a/images/howBoidsAffectFPS2.png b/images/howBoidsAffectFPS2.png
new file mode 100644
index 0000000..c700485
Binary files /dev/null and b/images/howBoidsAffectFPS2.png differ
diff --git a/images/naive.gif b/images/naive.gif
new file mode 100644
index 0000000..545802c
Binary files /dev/null and b/images/naive.gif differ
diff --git a/images/naive.png b/images/naive.png
new file mode 100644
index 0000000..581ff4b
Binary files /dev/null and b/images/naive.png differ
diff --git a/images/uniform.gif b/images/uniform.gif
new file mode 100644
index 0000000..6b2e001
Binary files /dev/null and b/images/uniform.gif differ
diff --git a/images/uniform.png b/images/uniform.png
new file mode 100644
index 0000000..e41e638
Binary files /dev/null and b/images/uniform.png differ
diff --git a/src/kernel.cu b/src/kernel.cu
index 74dffcb..d01f71c 100644
--- a/src/kernel.cu
+++ b/src/kernel.cu
@@ -83,8 +83,8 @@ thrust::device_ptr<int> dev_thrust_particleGridIndices;
 int *dev_gridCellStartIndices; // What part of dev_particleArrayIndices belongs
 int *dev_gridCellEndIndices;   // to this cell?
 
-// TODO-2.3 - consider what additional buffers you might need to reshuffle
-// the position and velocity data to be coherent within cells.
+glm::vec3 *dev_sortedPos;
+glm::vec3 *dev_sortedVel1;
 
 // LOOK-2.1 - Grid parameters based on simulation parameters.
 // These are automatically computed for you in Boids::initSimulation
@@ -168,7 +168,18 @@ void Boids::initSimulation(int N) {
   gridMinimum.y -= halfGridWidth;
   gridMinimum.z -= halfGridWidth;
 
-  // TODO-2.1 TODO-2.3 - Allocate additional buffers here.
+  cudaMalloc((void**)&dev_particleArrayIndices, N * sizeof(int));
+  cudaMalloc((void**)&dev_particleGridIndices, N * sizeof(int));
+
+  dev_thrust_particleArrayIndices = thrust::device_ptr<int>(dev_particleArrayIndices);
+  dev_thrust_particleGridIndices = thrust::device_ptr<int>(dev_particleGridIndices);
+
+  cudaMalloc((void**)&dev_gridCellStartIndices, gridCellCount * sizeof(int));
+  cudaMalloc((void**)&dev_gridCellEndIndices, gridCellCount * sizeof(int));
+
+  cudaMalloc((void**)&dev_sortedPos, N * sizeof(glm::vec3));
+  cudaMalloc((void**)&dev_sortedVel1, N * sizeof(glm::vec3));
+
   cudaDeviceSynchronize();
 }
 
@@ -218,7 +229,6 @@ void Boids::copyBoidsToVBO(float *vbodptr_positions, float *vbodptr_velocities)
   cudaDeviceSynchronize();
 }
 
-
 /******************
 * stepSimulation *
 ******************/
@@ -229,22 +239,68 @@ void Boids::copyBoidsToVBO(float *vbodptr_positions, float *vbodptr_velocities)
 * Compute the new velocity on the body with index `iSelf` due to the `N` boids
 * in the `pos` and `vel` arrays.
 */
-__device__ glm::vec3 computeVelocityChange(int N, int iSelf, const glm::vec3 *pos, const glm::vec3 *vel) {
-  // Rule 1: boids fly towards their local perceived center of mass, which excludes themselves
-  // Rule 2: boids try to stay a distance d away from each other
-  // Rule 3: boids try to match the speed of surrounding boids
-  return glm::vec3(0.0f, 0.0f, 0.0f);
+__device__ glm::vec3 computeVelocityChange(int N, int iSelf, const glm::vec3 *pos, const glm::vec3 *vel) 
+{
+	glm::vec3 center = glm::vec3(0.f, 0.f, 0.f);
+	glm::vec3 separate = glm::vec3(0.f, 0.f, 0.f);
+	glm::vec3 cohesion = glm::vec3(0.f, 0.f, 0.f);
+	int neighborCount = 0;
+
+	glm::vec3 boidVel = vel[iSelf];
+	const glm::vec3 &boidPos = pos[iSelf];
+
+	// iterate through all of the boids
+	for (int i = 0; i < N; i++) {
+		if (i == iSelf) continue;
+		const glm::vec3 &neighborPos = pos[i];
+		const glm::vec3 &neighborVel = vel[i];
+
+		// if boid is in this boid's neighborhood, perform flocking calcs
+		float distance = glm::distance(neighborPos, boidPos);
+		if (distance < rule1Distance) {
+			center += neighborPos;
+			neighborCount++;
+
+			if (distance < rule2Distance) {
+				separate -= (neighborPos - boidPos);
+			}
+
+			cohesion += neighborVel;
+		}
+	}
+	if (neighborCount > 0) {
+		center /= neighborCount;
+		boidVel += (center - boidPos) * rule1Scale;
+		boidVel += cohesion * rule3Scale;
+	}
+	boidVel += separate * rule2Scale;
+
+	return boidVel;
 }
 
 /**
-* TODO-1.2 implement basic flocking
+* Implement basic flocking
 * For each of the `N` bodies, update its position based on its current velocity.
 */
 __global__ void kernUpdateVelocityBruteForce(int N, glm::vec3 *pos,
-  glm::vec3 *vel1, glm::vec3 *vel2) {
-  // Compute a new velocity based on pos and vel1
-  // Clamp the speed
-  // Record the new velocity into vel2. Question: why NOT vel1?
+	glm::vec3 *vel1, glm::vec3 *vel2) 
+{
+	int index = threadIdx.x + (blockIdx.x * blockDim.x);
+	if (index >= N) {
+		return;
+	}
+	// Compute a new velocity based on pos and vel1
+	glm::vec3 newVel = computeVelocityChange(N, index, pos, vel1);
+
+	// Clamp the speed
+	float speed = glm::length(newVel);
+	if (speed > maxSpeed) {
+		newVel = (newVel / speed) * maxSpeed;
+	}
+
+	// Record the new velocity into vel2. Question: why NOT vel1? 
+	  // ANS: we still need previous velocities to compute flocking velocities for other boids
+	vel2[index] = newVel;
 }
 
 /**
@@ -272,23 +328,30 @@ __global__ void kernUpdatePos(int N, float dt, glm::vec3 *pos, glm::vec3 *vel) {
   pos[index] = thisPos;
 }
 
-// LOOK-2.1 Consider this method of computing a 1D index from a 3D grid index.
-// LOOK-2.3 Looking at this method, what would be the most memory efficient
-//          order for iterating over neighboring grid cells?
-//          for(x)
-//            for(y)
-//             for(z)? Or some other order?
+// computes a 1D index from a 3D grid index.
 __device__ int gridIndex3Dto1D(int x, int y, int z, int gridResolution) {
   return x + y * gridResolution + z * gridResolution * gridResolution;
 }
 
 __global__ void kernComputeIndices(int N, int gridResolution,
   glm::vec3 gridMin, float inverseCellWidth,
-  glm::vec3 *pos, int *indices, int *gridIndices) {
-    // TODO-2.1
-    // - Label each boid with the index of its grid cell.
-    // - Set up a parallel array of integer indices as pointers to the actual
-    //   boid data in pos and vel1/vel2
+  glm::vec3 *pos, int *indices, int *gridIndices) 
+{
+	int index = threadIdx.x + (blockIdx.x * blockDim.x);
+	if (index >= N) {
+		return;
+	}
+	// label each boid with the index of its grid cell
+	glm::vec3 boidPos = pos[index];
+	int gridIndex_x = floor((boidPos.x - gridMin.x) * inverseCellWidth);
+	int gridIndex_y = floor((boidPos.y - gridMin.y) * inverseCellWidth);
+	int gridIndex_z = floor((boidPos.z - gridMin.z) * inverseCellWidth);
+	int gridIndex = gridIndex3Dto1D(gridIndex_x, gridIndex_y, gridIndex_z, gridResolution);
+	gridIndices[index] = gridIndex;
+
+    // Set up a parallel array of integer indices as pointers to the actual
+    // boid data in pos and vel1/vel2
+	indices[index] = index;
 }
 
 // LOOK-2.1 Consider how this could be useful for indicating that a cell
@@ -300,28 +363,188 @@ __global__ void kernResetIntBuffer(int N, int *intBuffer, int value) {
   }
 }
 
+// Identify the start and end points of each cell
 __global__ void kernIdentifyCellStartEnd(int N, int *particleGridIndices,
-  int *gridCellStartIndices, int *gridCellEndIndices) {
-  // TODO-2.1
-  // Identify the start point of each cell in the gridIndices array.
-  // This is basically a parallel unrolling of a loop that goes
-  // "this index doesn't match the one before it, must be a new cell!"
+	int *gridCellStartIndices, int *gridCellEndIndices) 
+{
+	// get thread index
+	int index = (blockIdx.x * blockDim.x) + threadIdx.x;
+	if (index >= N) return;
+
+	// get the current cell and the next cell (if > length of array, will be -1)
+	int gridCell = particleGridIndices[index];
+	int gridCellAfter = index + 1 >= N ? -1 : particleGridIndices[index + 1];
+
+	// if first in gridIndices array, mark start of cell
+	if (index == 0) {
+		gridCellStartIndices[gridCell] = index;
+	}
+	// otherwise, if the current cell does not equal the cell next in the array,
+	// mark the end index of the current cell and the start index of the next
+	else if (gridCell != gridCellAfter) {
+		gridCellEndIndices[gridCell] = index;
+		if (gridCellAfter != -1) {
+			gridCellStartIndices[gridCellAfter] = index + 1;
+		}
+	}
 }
 
+__device__ glm::vec3 computeVelocityChangeGrid(int x_offset, int y_offset, int z_offset, int gridCell,
+												   const int *startIndices, const int *endIndices, 
+												   int iSelf, const glm::vec3 *pos, const glm::vec3 *vel,
+												   const int *particleArrayIndices, int gridResolution) {
+	glm::vec3 center = glm::vec3(0.f, 0.f, 0.f);
+	glm::vec3 separate = glm::vec3(0.f, 0.f, 0.f);
+	glm::vec3 cohesion = glm::vec3(0.f, 0.f, 0.f);
+	int neighborCount = 0;
+
+	glm::vec3 boidVel = vel[iSelf];
+	const glm::vec3 &boidPos = pos[iSelf];
+
+	int gridCellCount = gridResolution * gridResolution * gridResolution;
+
+	// there are 8 potential neighbor cells
+	for (int i = 0; i < 8; i++) {
+		// find neighbor cell from offset
+		// following calculations set up to go through all combinations of offsets
+		int cell_x = i < 4 ? 0 : x_offset;
+		int cell_y = i % 2 == 0 ? y_offset : 0;
+		int cell_z = i > 1 && i < 6 ? z_offset : 0;
+		int neighborCell = gridCell + gridIndex3Dto1D(cell_x, cell_y, cell_z, gridResolution);
+
+		// if neighbor cell exists, iterate over boids inside
+		if (neighborCell >= 0 && neighborCell < gridCellCount) {
+			int startIndex = startIndices[neighborCell];
+			int endIndex = endIndices[neighborCell];
+
+			for (int j = startIndex; j < endIndex + 1; j++) {
+				if (j == iSelf) continue;
+				int neighborBoid = particleArrayIndices[j];
+				const glm::vec3 &neighborPos = pos[neighborBoid];
+				const glm::vec3 &neighborVel = vel[neighborBoid];
+
+				// if neighbor boid is in neighborhood, perform flocking calcs
+				float distance = glm::distance(neighborPos, boidPos);
+				if (distance < rule1Distance) {
+					center += neighborPos;
+					neighborCount++;
+
+					if (distance < rule2Distance) {
+						separate -= (neighborPos - boidPos);
+					}
+					cohesion += neighborVel;
+				}
+			}
+		}
+	}
+	if (neighborCount > 0) {
+		center /= neighborCount;
+		boidVel += (center - boidPos) * rule1Scale;
+		boidVel += cohesion * rule3Scale;
+	}
+	boidVel += separate * rule2Scale;
+
+	return boidVel;
+}
+
+// Update a boid's velocity using the uniform grid to reduce
+// the number of boids that need to be checked.
 __global__ void kernUpdateVelNeighborSearchScattered(
-  int N, int gridResolution, glm::vec3 gridMin,
-  float inverseCellWidth, float cellWidth,
-  int *gridCellStartIndices, int *gridCellEndIndices,
-  int *particleArrayIndices,
-  glm::vec3 *pos, glm::vec3 *vel1, glm::vec3 *vel2) {
-  // TODO-2.1 - Update a boid's velocity using the uniform grid to reduce
-  // the number of boids that need to be checked.
-  // - Identify the grid cell that this particle is in
-  // - Identify which cells may contain neighbors. This isn't always 8.
-  // - For each cell, read the start/end indices in the boid pointer array.
-  // - Access each boid in the cell and compute velocity change from
-  //   the boids rules, if this boid is within the neighborhood distance.
-  // - Clamp the speed change before putting the new speed in vel2
+	int N, int gridResolution, glm::vec3 gridMin,
+	float inverseCellWidth, float cellWidth,
+	int *gridCellStartIndices, int *gridCellEndIndices,
+	int *particleArrayIndices, 
+	glm::vec3 *pos, glm::vec3 *vel1, glm::vec3 *vel2) 
+{
+	// get thread index
+	int index = (blockIdx.x * blockDim.x) + threadIdx.x;
+	if (index >= N) return;
+
+	int iSelf = particleArrayIndices[index];
+	glm::vec3 boidPos = pos[iSelf];
+
+	// Identify the grid cell that this particle is in
+	int gridCell_x = floor((boidPos.x - gridMin.x) * inverseCellWidth);
+	int gridCell_y = floor((boidPos.y - gridMin.y) * inverseCellWidth);
+	int gridCell_z = floor((boidPos.z - gridMin.z) * inverseCellWidth);
+	int gridCell = gridIndex3Dto1D(gridCell_x, gridCell_y, gridCell_z, gridResolution);
+
+	// get offsets from current cell to find 8 potential neighbor cells
+	float neighborhood = cellWidth / 2.f;
+	// based on quadrant boid is in, cell offset will be -1 or 1 in each dimension
+	int x_offset = boidPos.x - gridCell_x < neighborhood ? -1 : 1;
+	int y_offset = boidPos.y - gridCell_y < neighborhood ? -1 : 1;
+	int z_offset = boidPos.z - gridCell_z < neighborhood ? -1 : 1;
+
+	// compute the velocity change given all potential neighboring cells
+	glm::vec3 newVel = computeVelocityChangeGrid(x_offset, y_offset, z_offset, gridCell, gridCellStartIndices, gridCellEndIndices,
+												 iSelf, pos, vel1, particleArrayIndices, gridResolution);
+	// clamp the speed
+	float speed = glm::length(newVel);
+	if (speed > maxSpeed) {
+		newVel = (newVel / speed) * maxSpeed;
+	}
+
+	vel2[iSelf] = newVel;
+}
+
+__device__ glm::vec3 computeVelocityChangeCoherent(
+	int x_start, int y_start, int z_start, int gridCell,
+	const int *startIndices, const int *endIndices,
+	int iSelf, const glm::vec3 *pos, const glm::vec3 *vel,
+	int gridResolution) {
+	glm::vec3 center = glm::vec3(0.f, 0.f, 0.f);
+	glm::vec3 separate = glm::vec3(0.f, 0.f, 0.f);
+	glm::vec3 cohesion = glm::vec3(0.f, 0.f, 0.f);
+	int neighborCount = 0;
+
+	glm::vec3 boidVel = vel[iSelf];
+	const glm::vec3 &boidPos = pos[iSelf];
+
+	int gridCellCount = gridResolution * gridResolution * gridResolution;
+
+	// iterate through 8 potential neighbor cells
+	// x, y, z serve as offsets to current grid cell
+	for (int z = z_start; z < z_start + 2; z++) {
+		for (int y = y_start; y < y_start + 2; y++) {
+			for (int x = x_start; x < x_start + 2; x++) {
+				int neighborCell = gridCell + gridIndex3Dto1D(x, y, z, gridResolution);
+
+				// if neighbor cell exists, iterate over boids inside
+				if (neighborCell >= 0 && neighborCell < gridCellCount) {
+					int startIndex = startIndices[neighborCell];
+					int endIndex = endIndices[neighborCell];
+
+					for (int j = startIndex; j < endIndex + 1; j++) {
+						if (j == iSelf) continue;
+						const glm::vec3 &neighborPos = pos[j];
+						const glm::vec3 &neighborVel = vel[j];
+
+						// if neighbor boid is in neighborhood, perform flocking calcs
+						float distance = glm::distance(neighborPos, boidPos);
+						if (distance < rule1Distance) {
+							center += neighborPos;
+							neighborCount++;
+
+							if (distance < rule2Distance) {
+								separate -= (neighborPos - boidPos);
+							}
+							cohesion += neighborVel;
+						}
+					}
+				}
+			}
+		}
+	}
+
+	if (neighborCount > 0) {
+		center /= neighborCount;
+		boidVel += (center - boidPos) * rule1Scale;
+		boidVel += cohesion * rule3Scale;
+	}
+	boidVel += separate * rule2Scale;
+
+	return boidVel;
 }
 
 __global__ void kernUpdateVelNeighborSearchCoherent(
@@ -329,59 +552,141 @@ __global__ void kernUpdateVelNeighborSearchCoherent(
   float inverseCellWidth, float cellWidth,
   int *gridCellStartIndices, int *gridCellEndIndices,
   glm::vec3 *pos, glm::vec3 *vel1, glm::vec3 *vel2) {
-  // TODO-2.3 - This should be very similar to kernUpdateVelNeighborSearchScattered,
-  // except with one less level of indirection.
-  // This should expect gridCellStartIndices and gridCellEndIndices to refer
-  // directly to pos and vel1.
-  // - Identify the grid cell that this particle is in
-  // - Identify which cells may contain neighbors. This isn't always 8.
-  // - For each cell, read the start/end indices in the boid pointer array.
-  //   DIFFERENCE: For best results, consider what order the cells should be
-  //   checked in to maximize the memory benefits of reordering the boids data.
-  // - Access each boid in the cell and compute velocity change from
-  //   the boids rules, if this boid is within the neighborhood distance.
-  // - Clamp the speed change before putting the new speed in vel2
+
+	// get thread index
+	int index = (blockIdx.x * blockDim.x) + threadIdx.x;
+	if (index >= N) return;
+
+	glm::vec3 boidPos = pos[index];
+
+	// Identify the grid cell that this particle is in
+	int gridCell_x = floor((boidPos.x - gridMin.x) * inverseCellWidth);
+	int gridCell_y = floor((boidPos.y - gridMin.y) * inverseCellWidth);
+	int gridCell_z = floor((boidPos.z - gridMin.z) * inverseCellWidth);
+	int gridCell = gridIndex3Dto1D(gridCell_x, gridCell_y, gridCell_z, gridResolution);
+
+	// get offsets from current cell to find 8 potential neighbor cells
+	float neighborhood = cellWidth / 2.f;
+	// based on quadrant boid is in, cell start will be the cur cell_dim offset either by -1 or 0
+	int x_start = boidPos.x - gridCell_x < neighborhood ? -1 : 0; 
+	int y_start = boidPos.y - gridCell_y < neighborhood ? -1 : 0;
+	int z_start = boidPos.z - gridCell_z < neighborhood ? -1 : 0;
+
+	// compute the velocity change given all potential neighboring cells
+	glm::vec3 newVel = computeVelocityChangeCoherent(x_start, y_start, z_start, gridCell, gridCellStartIndices, gridCellEndIndices,
+													 index, pos, vel1, gridResolution);
+	// clamp the speed
+	float speed = glm::length(newVel);
+	if (speed > maxSpeed) {
+		newVel = (newVel / speed) * maxSpeed;
+	}
+	vel2[index] = newVel;
+}
+
+// sort the pos and vel1 arrays given their boid positions in particleArrayIndices
+// helper func to stepSimulationCoherent
+__global__ void kernSortPosVelArrays(int *particleArrayIndices, glm::vec3 *sortedPos, glm::vec3 *sortedVel1,
+									 const glm::vec3 *pos, const glm::vec3 *vel1) {
+	int index = (blockIdx.x * blockDim.x) + threadIdx.x;
+	int boid = particleArrayIndices[index];
+	sortedPos[index] = pos[boid];
+	sortedVel1[index] = vel1[boid];
 }
 
 /**
 * Step the entire N-body simulation by `dt` seconds.
 */
 void Boids::stepSimulationNaive(float dt) {
-  // TODO-1.2 - use the kernels you wrote to step the simulation forward in time.
-  // TODO-1.2 ping-pong the velocity buffers
+	// use the kernels you wrote to step the simulation forward in time.
+	dim3 fullBlocksPerGrid((numObjects + blockSize - 1) / blockSize);
+	kernUpdateVelocityBruteForce<<<fullBlocksPerGrid, blockSize>>>(numObjects, dev_pos, dev_vel1, dev_vel2);
+	kernUpdatePos<<<fullBlocksPerGrid, blockSize>>>(numObjects, dt, dev_pos, dev_vel2);
+
+	// ping-pong the velocity buffers
+	glm::vec3 *tmp = dev_vel1;
+	dev_vel1 = dev_vel2;
+	dev_vel2 = tmp;
 }
 
+// Uniform Grid Neighbor search using Thrust sort.
 void Boids::stepSimulationScatteredGrid(float dt) {
-  // TODO-2.1
-  // Uniform Grid Neighbor search using Thrust sort.
-  // In Parallel:
-  // - label each particle with its array index as well as its grid index.
-  //   Use 2x width grids.
-  // - Unstable key sort using Thrust. A stable sort isn't necessary, but you
-  //   are welcome to do a performance comparison.
-  // - Naively unroll the loop for finding the start and end indices of each
-  //   cell's data pointers in the array of boid indices
-  // - Perform velocity updates using neighbor search
-  // - Update positions
-  // - Ping-pong buffers as needed
+  
+  // label each particle with its array index as well as its grid index.
+  // TODO? Use 2x width grids.
+	dim3 fullBlocksPerGrid((numObjects + blockSize - 1) / blockSize);
+	kernComputeIndices<<<fullBlocksPerGrid, blockSize>>>(numObjects, gridSideCount, gridMinimum, gridInverseCellWidth,
+														 dev_pos, dev_particleArrayIndices, dev_particleGridIndices);
+
+  // Unstable key sort using Thrust
+	thrust::sort_by_key(dev_thrust_particleGridIndices, dev_thrust_particleGridIndices + numObjects, 
+			            dev_thrust_particleArrayIndices);
+
+	// fill all cells with -1 (designates empty cell)
+	dim3 blockNumForCells((gridCellCount + blockSize - 1) / blockSize);
+	dim3 blockSize3(blockSize, blockSize, blockSize);
+	kernResetIntBuffer<<<blockNumForCells, blockSize>>>(gridCellCount, dev_gridCellStartIndices, -1);
+	kernResetIntBuffer<<<blockNumForCells, blockSize>>>(gridCellCount, dev_gridCellEndIndices, -1);
+
+	// Naively unroll the loop for finding the start and end indices of each
+	// cell's data pointers in the array of boid indices
+	kernIdentifyCellStartEnd<<<fullBlocksPerGrid, blockSize>>>(numObjects, dev_particleGridIndices,
+															   dev_gridCellStartIndices, dev_gridCellEndIndices);
+	// Perform velocity updates using neighbor search
+	kernUpdateVelNeighborSearchScattered <<<fullBlocksPerGrid, blockSize >>>
+		(numObjects, gridSideCount, gridMinimum, gridInverseCellWidth, gridCellWidth,
+		 dev_gridCellStartIndices, dev_gridCellEndIndices, dev_particleArrayIndices, dev_pos, dev_vel1, dev_vel2);
+
+  // Update positions
+	kernUpdatePos<<<fullBlocksPerGrid, blockSize>>>(numObjects, dt, dev_pos, dev_vel2);
+
+  // Ping-pong buffers as needed
+	glm::vec3 *tmp = dev_vel1;
+	dev_vel1 = dev_vel2;
+	dev_vel2 = tmp;
 }
 
 void Boids::stepSimulationCoherentGrid(float dt) {
-  // TODO-2.3 - start by copying Boids::stepSimulationNaiveGrid
-  // Uniform Grid Neighbor search using Thrust sort on cell-coherent data.
-  // In Parallel:
-  // - Label each particle with its array index as well as its grid index.
-  //   Use 2x width grids
-  // - Unstable key sort using Thrust. A stable sort isn't necessary, but you
-  //   are welcome to do a performance comparison.
-  // - Naively unroll the loop for finding the start and end indices of each
-  //   cell's data pointers in the array of boid indices
-  // - BIG DIFFERENCE: use the rearranged array index buffer to reshuffle all
-  //   the particle data in the simulation array.
-  //   CONSIDER WHAT ADDITIONAL BUFFERS YOU NEED
-  // - Perform velocity updates using neighbor search
-  // - Update positions
-  // - Ping-pong buffers as needed. THIS MAY BE DIFFERENT FROM BEFORE.
+
+  // label each particle with its array index as well as its grid index.
+	dim3 fullBlocksPerGrid((numObjects + blockSize - 1) / blockSize);
+	kernComputeIndices << <fullBlocksPerGrid, blockSize >> > (numObjects, gridSideCount, gridMinimum, gridInverseCellWidth,
+		dev_pos, dev_particleArrayIndices, dev_particleGridIndices);
+
+	// Unstable key sort using Thrust
+	thrust::sort_by_key(dev_thrust_particleGridIndices, dev_thrust_particleGridIndices + numObjects,
+		dev_thrust_particleArrayIndices);
+
+	// fill all cells with -1 (designates empty cell)
+	dim3 blockNumForCells((gridCellCount + blockSize - 1) / blockSize);
+	dim3 blockSize3(blockSize, blockSize, blockSize);
+	kernResetIntBuffer << <blockNumForCells, blockSize >> > (gridCellCount, dev_gridCellStartIndices, -1);
+	kernResetIntBuffer << <blockNumForCells, blockSize >> > (gridCellCount, dev_gridCellEndIndices, -1);
+
+	// Naively unroll the loop for finding the start and end indices of each
+	// cell's data pointers in the array of boid indices
+	kernIdentifyCellStartEnd << <fullBlocksPerGrid, blockSize >> > (numObjects, dev_particleGridIndices,
+		dev_gridCellStartIndices, dev_gridCellEndIndices);
+
+	// sort pos and vel1 into sorted arrays
+	kernSortPosVelArrays <<<fullBlocksPerGrid, blockSize >> > (dev_particleArrayIndices, dev_sortedPos, 
+															   dev_sortedVel1, dev_pos, dev_vel1);
+
+	// Perform velocity updates using neighbor search
+	kernUpdateVelNeighborSearchCoherent << <fullBlocksPerGrid, blockSize >> >
+		(numObjects, gridSideCount, gridMinimum, gridInverseCellWidth, gridCellWidth, dev_gridCellStartIndices, 
+		 dev_gridCellEndIndices, dev_sortedPos, dev_sortedVel1, dev_vel2);
+
+	// Update positions
+	kernUpdatePos << <fullBlocksPerGrid, blockSize >> > (numObjects, dt, dev_sortedPos, dev_vel2);
+
+	// Ping-pong buffers as needed
+	glm::vec3 *tmpPos = dev_pos;
+	dev_pos = dev_sortedPos;
+	dev_sortedPos = tmpPos;
+
+	glm::vec3 *tmp = dev_vel1;
+	dev_vel1 = dev_vel2;
+	dev_vel2 = tmp;
 }
 
 void Boids::endSimulation() {
@@ -389,7 +694,12 @@ void Boids::endSimulation() {
   cudaFree(dev_vel2);
   cudaFree(dev_pos);
 
-  // TODO-2.1 TODO-2.3 - Free any additional buffers here.
+  cudaFree(dev_particleArrayIndices);
+  cudaFree(dev_particleGridIndices);
+  cudaFree(dev_gridCellStartIndices);
+  cudaFree(dev_gridCellEndIndices);
+  cudaFree(dev_sortedPos);
+  cudaFree(dev_sortedVel1);
 }
 
 void Boids::unitTest() {
diff --git a/src/main.cpp b/src/main.cpp
index b82c8c6..3187ed7 100644
--- a/src/main.cpp
+++ b/src/main.cpp
@@ -14,11 +14,12 @@
 
 // LOOK-2.1 LOOK-2.3 - toggles for UNIFORM_GRID and COHERENT_GRID
 #define VISUALIZE 1
-#define UNIFORM_GRID 0
-#define COHERENT_GRID 0
+#define UNIFORM_GRID 1
+#define COHERENT_GRID 1
 
 // LOOK-1.2 - change this to adjust particle count in the simulation
 const int N_FOR_VIS = 5000;
+
 const float DT = 0.2f;
 
 /**