diff --git a/README.md b/README.md
index d63a6a1..ddf2641 100644
--- a/README.md
+++ b/README.md
@@ -1,11 +1,63 @@
 **University of Pennsylvania, CIS 565: GPU Programming and Architecture, Project 1 - Flocking**
+=======================================================================
+* Jie Meng
+  * [LinkedIn](https://www.linkedin.com/in/jie-meng/), [twitter](https://twitter.com/JieMeng6).
+* Tested on: Windows 10, i7-7700HQ @ 2.80GHz, 16GB, GTX 1050 4GB (my personal laptop)
-* (TODO) YOUR NAME HERE
-  * (TODO) [LinkedIn](), [personal website](), [twitter](), etc.
-* Tested on: (TODO) Windows 22, i7-2222 @ 2.22GHz 22GB, GTX 222 222MB (Moore 2222 Lab)
-### (TODO: Your README)
-Include screenshots, analysis, etc. (Remember, this is public, so don't put
-anything here that you don't want to share with the world.)
+### (Result)
+* Flocking (uniform grid, scattered, 2x cell width @ 10k boids & 1024 block size)
+![Running screenshot](images/screenshot1.png)
+* GIF
+![Running gif](images/3.gif)
+
+### (Performance Analysis)
+**Benchmark Method:**
+
+* Use the kernel call frequency as the framerate indicator.
+* Calculation: launch NSight Performance Analysis, then divide the number of kernel calls (under CUDA Summary) by the running time (minus the launch time, usually around 3 s). As a hypothetical example, 14,400 calls over 60 s of simulation would average 240 fps.
+* Each kernel is called exactly once per simulation step, so this yields the average framerate.
+
+**Data and Analysis**
+
+Framerate vs. number of boids (visualization off)
+![Framerate Comparison](images/notvisualized.png)
+* The brute-force method is strongly affected by the number of boids: from 5k to 25k boids its framerate drops by a factor of 20, and at 1 million boids it is almost 0.
+* The scattered uniform-grid search works well: its performance is relatively insensitive (compared to brute force) to the number of boids. From 5k to 25k boids the framerate only fluctuates slightly around 245 fps, and at 1 million boids it still runs at about 3.2 fps.
+* The coherent uniform-grid search also performs well and is similarly insensitive to the number of boids: from 5k to 25k boids the framerate fluctuates slightly around 235 fps, and at 1 million boids it runs at about 5.1 fps.
+* At 5k boids, the brute-force method has a significantly higher framerate than the other two; beyond 10k boids it performs worse than both.
+* The uniform-grid methods deliver stable performance and essentially outperform the brute-force method, because they search only a fixed number of cells rather than all boids.
+* As the number of boids keeps increasing, all methods slow down.
+* At 1 million boids, the coherent method (5.1 fps) beats the scattered method (3.2 fps). This suggests the coherent method scales better with the boid count and overtakes the scattered method past some threshold (this is clearer in the next graph).
+
+Framerate vs. number of boids (visualization on)
+![Framerate Comparison2](images/visualized.png)
+This shows results similar to the above; differences worth pointing out:
+* Even at 5k boids, the uniform-grid methods beat the brute-force method.
+* The coherent method overtakes the scattered method by 50k boids at the latest.
+
+Framerate vs. block size (visualization on)
+![Framerate Comparison3](images/blocksize.png)
+From this graph:
+* The brute-force method is essentially unaffected by block size.
+* The scattered method runs slower at larger block sizes.
+* The coherent method runs faster at larger block sizes.
+* The coherent method overtakes the scattered method at a block size of 512.
+* The uniform-grid data does not change smoothly with block size, but I think this is because the block size increases exponentially between test points.
+
+**Answers to questions:**
+* For all methods the framerate decreases as the number of boids increases, since more boids means more computation.
+The uniform-grid methods beat the brute-force method because they do far less computation. At 5k boids the brute-force method performs better (visualization off) because the boid count is small enough that the extra bookkeeping of the uniform-grid methods outweighs their savings.
+* Performance varies only slightly when the block size changes (with the block count changing accordingly). I am not sure why the methods behave the way they do (see the graph above): the brute-force method is unaffected by block size, the scattered method tends to slow down as block size increases, and the coherent method tends to speed up.
+I did not pin down the cause, but I suspect it has to do with memory access speed.
+* I expected the coherent method to always beat the scattered method, but it does not: at smaller boid counts the coherent method is worse, which I attribute to the time spent copying arrays (reshuffling and shuffling back). Once the boid count grows large, the coherent method wins, since the reshuffling makes much more of the boid data sequential, and the resulting memory-access savings outweigh the copy cost.
+* Under the configuration (uniform grid, scattered @ 10k boids & 1024 block size), checking 27 neighboring cells (1x cell width) is slightly faster than checking 8 cells (2x cell width): 190 fps vs. 180 fps. I believe this is because fewer boids need to be checked with 1x-width cells, so fewer data fetches are required and performance improves.
+
+### (Code Gists)
+* In `kernel.cu`, toggle between **GRID1X** and **GRID2X** for the two grid cell sizes (see the sketch below).
+* Helper functions: `SwapSpeed` ping-pongs the velocity buffers, and `shuffleBufferWithIndices` shuffles the position and velocity arrays.
+`kernUpdateVelNeighborSearchScattered1XGRID` and `kernUpdateVelNeighborSearchCoherent1XGRID` implement the 1x-cell-width search.
+* The 2x-cell-width search is somewhat hard-coded and may look arcane.
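+
+A minimal sketch of one scattered-grid simulation step, using the names from `kernel.cu` in this repo, showing how the toggle and the velocity ping-pong fit together:
+
+```cuda
+// velocity update: GRID1X checks 27 one-width cells,
+// GRID2X checks up to 8 two-width cells
+#if GRID1X
+kernUpdateVelNeighborSearchScattered1XGRID<<<fullBlocksPerGrid, blockSize>>>(
+    numObjects, gridSideCount, gridMinimum, gridInverseCellWidth, gridCellWidth,
+    dev_gridCellStartIndices, dev_gridCellEndIndices, dev_particleArrayIndices,
+    dev_pos, dev_vel1, dev_vel2);
+#endif
+// integrate positions with the freshly computed velocities ...
+kernUpdatePos<<<fullBlocksPerGrid, blockSize>>>(numObjects, dt, dev_pos, dev_vel2);
+// ... then ping-pong: copy vel2 back into vel1 for the next step
+SwapSpeed<<<fullBlocksPerGrid, blockSize>>>(numObjects, dev_vel1, dev_vel2);
+```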
diff --git a/images/1.jpg b/images/1.jpg
new file mode 100644
index 0000000..72a415b
Binary files /dev/null and b/images/1.jpg differ
diff --git a/images/3.gif b/images/3.gif
new file mode 100644
index 0000000..356e2a3
Binary files /dev/null and b/images/3.gif differ
diff --git a/images/blocksize.png b/images/blocksize.png
new file mode 100644
index 0000000..9670c22
Binary files /dev/null and b/images/blocksize.png differ
diff --git a/images/notvisualized.png b/images/notvisualized.png
new file mode 100644
index 0000000..b2e53b1
Binary files /dev/null and b/images/notvisualized.png differ
diff --git a/images/screenshot1.png b/images/screenshot1.png
new file mode 100644
index 0000000..c6b4145
Binary files /dev/null and b/images/screenshot1.png differ
diff --git a/images/visualized.png b/images/visualized.png
new file mode 100644
index 0000000..24705bc
Binary files /dev/null and b/images/visualized.png differ
diff --git a/src/kernel.cu b/src/kernel.cu
index 74dffcb..0839c71 100644
--- a/src/kernel.cu
+++ b/src/kernel.cu
@@ -5,6 +5,11 @@
 #include <glm/glm.hpp>
 #include "utilityCore.hpp"
 #include "kernel.h"
+#include
+
+// 2.1 & 2.3 toggle between 1X & 2X grid size
+#define GRID1X 1
+#define GRID2X 0
 
 // LOOK-2.1 potentially useful for doing grid-based neighbor search
 #ifndef imax
@@ -37,7 +42,7 @@ void checkCUDAError(const char *msg, int line = -1) {
 *****************/
 
 /*! Block size used for CUDA kernel launch. */
-#define blockSize 128
+#define blockSize 1024
 
 // LOOK-1.2 Parameters for the boids algorithm.
 // These worked well in our reference implementation.
@@ -86,6 +91,10 @@
 int *dev_gridCellEndIndices;   // to this cell?
 
 // TODO-2.3 - consider what additional buffers you might need to reshuffle
 // the position and velocity data to be coherent within cells.
+glm::vec3* dev_posBuffer;
+glm::vec3* dev_vel1Buffer;
+glm::vec3* dev_vel2Buffer;
+
 // LOOK-2.1 - Grid parameters based on simulation parameters.
 // These are automatically computed for you in Boids::initSimulation
 int gridCellCount;
@@ -157,7 +166,13 @@ void Boids::initSimulation(int N) {
   checkCUDAErrorWithLine("kernGenerateRandomPosArray failed!");
 
   // LOOK-2.1 computing grid params
+#if GRID2X
   gridCellWidth = 2.0f * std::max(std::max(rule1Distance, rule2Distance), rule3Distance);
+#endif // 2XGRID
+#if GRID1X
+  gridCellWidth = 1.0f * std::max(std::max(rule1Distance, rule2Distance), rule3Distance);
+#endif // 1XGRID
+
   int halfSideCount = (int)(scene_scale / gridCellWidth) + 1;
   gridSideCount = 2 * halfSideCount;
 
@@ -170,6 +185,31 @@
   // TODO-2.1 TODO-2.3 - Allocate additional buffers here.
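+  // The allocations below add the 2.1 grid bookkeeping arrays (per-boid array/grid
+  // index pairs plus per-cell start/end ranges) and the 2.3 shuffle buffers used to
+  // make pos/vel coherent in cell order.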
   cudaDeviceSynchronize();
+  // DO2.1
+  // allocate memory for
+//  int *dev_particleArrayIndices;
+  cudaMalloc((void**)&dev_particleArrayIndices, N * sizeof(int));
+  checkCUDAErrorWithLine("cudaMalloc dev_particleArrayIndices failed!");
+//  int *dev_particleGridIndices;
+  cudaMalloc((void**)&dev_particleGridIndices, N * sizeof(int));
+  checkCUDAErrorWithLine("cudaMalloc dev_particleGridIndices failed!");
+
+//  int *dev_gridCellStartIndices;
+  cudaMalloc((void**)&dev_gridCellStartIndices, gridCellCount * sizeof(int));
+  checkCUDAErrorWithLine("cudaMalloc dev_gridCellStartIndices failed!");
+//  int *dev_gridCellEndIndices;
+  cudaMalloc((void**)&dev_gridCellEndIndices, gridCellCount * sizeof(int));
+  checkCUDAErrorWithLine("cudaMalloc dev_gridCellEndIndices failed!");
+
+  // DO2.3
+  // allocate memory for shuffle buffers
+  cudaMalloc((void**)&dev_posBuffer, N * sizeof(glm::vec3));
+  checkCUDAErrorWithLine("cudaMalloc dev_posBuffer failed!");
+  cudaMalloc((void**)&dev_vel1Buffer, N * sizeof(glm::vec3));
+  checkCUDAErrorWithLine("cudaMalloc dev_vel1Buffer failed!");
+  cudaMalloc((void**)&dev_vel2Buffer, N * sizeof(glm::vec3));
+  checkCUDAErrorWithLine("cudaMalloc dev_vel2Buffer failed!");
+
 }
@@ -210,8 +250,8 @@ __global__ void kernCopyVelocitiesToVBO(int N, glm::vec3 *vel, float *vbo, float
 void Boids::copyBoidsToVBO(float *vbodptr_positions, float *vbodptr_velocities) {
   dim3 fullBlocksPerGrid((numObjects + blockSize - 1) / blockSize);
 
-  kernCopyPositionsToVBO << <fullBlocksPerGrid, blockSize >> >(numObjects, dev_pos, vbodptr_positions, scene_scale);
-  kernCopyVelocitiesToVBO << <fullBlocksPerGrid, blockSize >> >(numObjects, dev_vel1, vbodptr_velocities, scene_scale);
+  kernCopyPositionsToVBO<<<fullBlocksPerGrid, blockSize>>>(numObjects, dev_pos, vbodptr_positions, scene_scale);
+  kernCopyVelocitiesToVBO<<<fullBlocksPerGrid, blockSize>>>(numObjects, dev_vel1, vbodptr_velocities, scene_scale);
 
   checkCUDAErrorWithLine("copyBoidsToVBO failed!");
@@ -243,8 +283,69 @@ __device__ glm::vec3 computeVelocityChange(int N, int iSelf, const glm::vec3 *po
 __global__ void kernUpdateVelocityBruteForce(int N, glm::vec3 *pos,
   glm::vec3 *vel1, glm::vec3 *vel2) {
   // Compute a new velocity based on pos and vel1
+  int index = threadIdx.x + (blockIdx.x * blockDim.x);
+  glm::vec3 vel_change(0.0f, 0.0f, 0.0f);
+  // Rule 1: boids fly towards their local perceived center of mass, which excludes themselves
+  if (index >= N) {
+    return;
+  }
+  // the pos of the current boid
+  glm::vec3 thisPos = pos[index];
+  // compute the perceived center of mass
+  glm::vec3 perceived_CM(0.0f, 0.0f, 0.0f);
+  float perceived_Num1 = 0.0f;
+  // Rule 2: boids try to stay a distance d away from each other
+  glm::vec3 c(0.0f, 0.0f, 0.0f);
+  // Rule 3: boids try to match the speed of surrounding boids
+  glm::vec3 perceived_Vel(0.0f, 0.0f, 0.0f);
+  float perceived_Num3 = 0.0f;
+  for (int index_i = 0; index_i < N; index_i++)
+  {
+    if (index_i != index)
+    {
+      glm::vec3 another_Pos = pos[index_i];
+      float disFromThis = glm::distance(thisPos, another_Pos);
+      if (disFromThis < rule1Distance)
+      {
+        perceived_CM += another_Pos;
+        perceived_Num1 += 1.0f;
+      }
+      if (disFromThis < rule2Distance)
+      {
+        c -= (another_Pos - thisPos);
+      }
+      if (disFromThis < rule3Distance)
+      {
+        perceived_Vel += vel1[index_i];
+        perceived_Num3 += 1.0f;
+      }
+    }
+  }
+  // rule 1
+  if (perceived_Num1 > 0)
+  {
+    perceived_CM = perceived_CM * (1.0f / perceived_Num1);
+    vel_change += (perceived_CM - thisPos) * rule1Scale;
+  }
+  // rule 2
+  vel_change += c * rule2Scale;
+  // rule 3
+  if (perceived_Num3 > 0)
+  {
+    perceived_Vel = perceived_Vel * (1.0f / perceived_Num3);
+    vel_change += perceived_Vel * rule3Scale;
+  }
   // Clamp the speed
+  glm::vec3 newVel = vel1[index] + vel_change;
+  // componentwise clamp to [-maxSpeed, maxSpeed]
+  newVel.x = newVel.x < -maxSpeed ? -maxSpeed : newVel.x;
+  newVel.y = newVel.y < -maxSpeed ? -maxSpeed : newVel.y;
+  newVel.z = newVel.z < -maxSpeed ? -maxSpeed : newVel.z;
+
+  newVel.x = newVel.x > maxSpeed ? maxSpeed : newVel.x;
+  newVel.y = newVel.y > maxSpeed ? maxSpeed : newVel.y;
+  newVel.z = newVel.z > maxSpeed ? maxSpeed : newVel.z;
   // Record the new velocity into vel2. Question: why NOT vel1?
+  vel2[index] = newVel;
 }
 
 /**
@@ -289,6 +390,17 @@ __global__ void kernComputeIndices(int N, int gridResolution,
   // - Label each boid with the index of its grid cell.
   // - Set up a parallel array of integer indices as pointers to the actual
   //   boid data in pos and vel1/vel2
+  int index = threadIdx.x + (blockIdx.x * blockDim.x);
+  if (index >= N)
+  {
+    return;
+  }
+  glm::vec3 currPos = pos[index];
+  // use floor (not int truncation) so boids at negative coordinates land in the right cell
+  int x = (int)floorf(currPos.x * inverseCellWidth) + gridResolution / 2;
+  int y = (int)floorf(currPos.y * inverseCellWidth) + gridResolution / 2;
+  int z = (int)floorf(currPos.z * inverseCellWidth) + gridResolution / 2;
+  gridIndices[index] = gridIndex3Dto1D(x, y, z, gridResolution);
+  indices[index] = index;
 }
 
 // LOOK-2.1 Consider how this could be useful for indicating that a cell
@@ -306,6 +418,34 @@ __global__ void kernIdentifyCellStartEnd(int N, int *particleGridIndices,
   // Identify the start point of each cell in the gridIndices array.
   // This is basically a parallel unrolling of a loop that goes
   //   "this index doesn't match the one before it, must be a new cell!"
+  int indexThread = (blockIdx.x * blockDim.x) + threadIdx.x;
+  if (indexThread >= N)
+  {
+    return;
+  }
+  int gridIdx = particleGridIndices[indexThread];
+
+  // the first boid always starts its cell
+  if (indexThread == 0)
+  {
+    gridCellStartIndices[gridIdx] = indexThread;
+  }
+
+  if (indexThread > 0)
+  {
+    int gridIdxPre = particleGridIndices[indexThread - 1];
+    // the last boid always ends its cell
+    if (indexThread == N - 1)
+    {
+      gridCellEndIndices[gridIdx] = N - 1;
+    }
+    // this cell is different from the previous cell, so a new cell must start here
+    if (gridIdx != gridIdxPre)
+    {
+      gridCellStartIndices[gridIdx] = indexThread;
+      gridCellEndIndices[gridIdxPre] = indexThread - 1;
+    }
+  }
+
 }
 
 __global__ void kernUpdateVelNeighborSearchScattered(
@@ -316,12 +456,369 @@
   glm::vec3 *pos, glm::vec3 *vel1, glm::vec3 *vel2) {
   // TODO-2.1 - Update a boid's velocity using the uniform grid to reduce
   // the number of boids that need to be checked.
+  int index = threadIdx.x + (blockIdx.x * blockDim.x);
+  if (index >= N)
+  {
+    return;
+  }
   // - Identify the grid cell that this particle is in
   // - Identify which cells may contain neighbors. This isn't always 8.
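+  // With 2x-width cells the neighborhood radius (the max rule distance) is half a
+  // cell, so the search sphere can overlap at most 8 cells: the home cell, the 3
+  // face neighbors on the boid's side of the cell, and up to 3 edge + 1 corner
+  // cells whose distance checks pass below; gridstoSee[i] = -1 marks a skipped cell.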
+ int gridstoSee[8]; + //neighboring cells + int xflag, yflag, zflag; + int boidPosVelInd = particleArrayIndices[index]; + glm::vec3 posCurr = pos[boidPosVelInd]; + int gridX = int(floor(posCurr.x * inverseCellWidth)); + int gridY = int(floor(posCurr.y * inverseCellWidth)); + int gridZ = int(floor(posCurr.z * inverseCellWidth)); + float xfloor = gridX * cellWidth; + float yfloor = gridY * cellWidth; + float zfloor = gridZ * cellWidth; + + float xDecimalRatio = (posCurr.x - xfloor) / cellWidth; + float yDecimalRatio = (posCurr.y - yfloor) / cellWidth; + float zDecimalRatio = (posCurr.z - zfloor) / cellWidth; + + int halfSideCount = gridResolution / 2; + int gridSideCount = gridResolution; + + //the first grid is the current grid + gridstoSee[0] = gridIndex3Dto1D(gridX + halfSideCount, gridY + halfSideCount, gridZ + halfSideCount, gridSideCount); + //////////////x + if (xDecimalRatio <= 0.5) + { + if (gridX >= -halfSideCount + 1) + { + gridstoSee[1] = gridIndex3Dto1D(gridX + halfSideCount - 1, gridY + halfSideCount, gridZ + halfSideCount, gridSideCount); + } + else + { + gridstoSee[1] = gridIndex3Dto1D(gridSideCount - 1, gridY + halfSideCount, gridZ + halfSideCount, gridSideCount); + } + xflag= 0; + } + else + { + if (gridX < halfSideCount-1) + { + gridstoSee[1] = gridIndex3Dto1D(gridX + halfSideCount + 1, gridY + halfSideCount, gridZ + halfSideCount, gridSideCount); + } + else + { + gridstoSee[1] = gridIndex3Dto1D(0, gridY + halfSideCount, gridZ + halfSideCount, gridSideCount); + } + xflag = 1; + } + ////////////////y + if (yDecimalRatio <= 0.5) + { + if (gridY >= -halfSideCount + 1) + { + gridstoSee[2] = gridIndex3Dto1D(gridX + halfSideCount, gridY + halfSideCount - 1, gridZ + halfSideCount, gridSideCount); + } + else + { + gridstoSee[2] = gridIndex3Dto1D(gridX + halfSideCount, gridSideCount - 1, gridZ + halfSideCount, gridSideCount); + } + yflag = 0; + } + else + { + if (gridY < halfSideCount - 1) + { + gridstoSee[2] = gridIndex3Dto1D(gridX + halfSideCount, gridY + halfSideCount + 1, gridZ + halfSideCount, gridSideCount); + } + else + { + gridstoSee[2] = gridIndex3Dto1D(gridX + halfSideCount, 0, gridZ + halfSideCount, gridSideCount); + } + yflag = 1; + } + /////////////////////z + if (zDecimalRatio <= 0.5) + { + if (gridZ >= -halfSideCount + 1) + { + gridstoSee[3] = gridIndex3Dto1D(gridX + halfSideCount, gridY + halfSideCount, gridZ + halfSideCount - 1, gridSideCount); + } + else + { + gridstoSee[3] = gridIndex3Dto1D(gridX + halfSideCount, gridY + halfSideCount, gridSideCount - 1, gridSideCount); + } + zflag = 0; + } + else + { + if (gridZ < halfSideCount - 1) + { + gridstoSee[3] = gridIndex3Dto1D(gridX + halfSideCount, gridY + halfSideCount, gridZ + halfSideCount + 1, gridSideCount); + } + else + { + gridstoSee[3] = gridIndex3Dto1D(gridX + halfSideCount, gridY + halfSideCount, 0, gridSideCount); + } + zflag = 1; + } + // lets check for diagonal directions + //xy + + float xyIndicator = (xDecimalRatio - xflag) * (xDecimalRatio - xflag) + (yDecimalRatio - yflag) * (yDecimalRatio - yflag); + if (xyIndicator <= 0.25) + { + if (xflag == 0) + xflag = -1; + if (yflag == 0) + yflag = -1; + gridstoSee[4] = gridIndex3Dto1D((gridX + halfSideCount + xflag + gridSideCount) % gridSideCount, + (gridY + halfSideCount + yflag + gridSideCount) % gridSideCount, + gridZ + halfSideCount, gridSideCount); + } + else + gridstoSee[4] = -1; + //xz + + if (xflag < 0) + xflag = 0; + float xzIndicator = (xDecimalRatio - xflag) * (xDecimalRatio - xflag) + (zDecimalRatio - zflag) * (zDecimalRatio - zflag); + if 
(xzIndicator <= 0.25) + { + if (!xflag) + xflag = -1; + if (!zflag) + zflag = -1; + gridstoSee[5] = gridIndex3Dto1D((gridX + halfSideCount + xflag + gridSideCount) % gridSideCount, gridY + halfSideCount, (gridZ + halfSideCount + zflag + gridSideCount) % gridSideCount, gridSideCount); + } + else + gridstoSee[5] = -1; + + //yz + + if (yflag < 0) + yflag = 0; + if (zflag < 0) + zflag = 0; + float yzIndicator = (zDecimalRatio - zflag) * (zDecimalRatio - zflag) + (yDecimalRatio - yflag) * (yDecimalRatio - yflag); + if (yzIndicator <= 0.25) + { + if (!zflag) + zflag = -1; + if (!yflag) + yflag = -1; + gridstoSee[6] = gridIndex3Dto1D(gridX + halfSideCount, (gridY + halfSideCount + yflag + gridSideCount) % gridSideCount, (gridZ + halfSideCount + zflag + gridSideCount) % gridSideCount, gridSideCount); + } + else + gridstoSee[6] = -1; + + //xyz + xflag = xflag < 0 ? 0 : xflag; + yflag = yflag < 0 ? 0 : yflag; + zflag = zflag < 0 ? 0 : zflag; + float xyzIndicator = (xDecimalRatio - xflag) * (xDecimalRatio - xflag) + + (yDecimalRatio - yflag) * (yDecimalRatio - yflag) + + (zDecimalRatio - zflag) * (zDecimalRatio - zflag); + if (xyzIndicator <= 0.25) + { + if (!zflag) + zflag = -1; + if (!yflag) + yflag = -1; + if (!xflag) + xflag = -1; + gridstoSee[7] = gridIndex3Dto1D((gridX + halfSideCount + xflag + gridSideCount) % gridSideCount, (gridY + halfSideCount + yflag + gridSideCount) % gridSideCount, (gridZ + halfSideCount + zflag + gridSideCount) % gridSideCount, gridSideCount); + + } + else + gridstoSee[7] = -1; + // - For each cell, read the start/end indices in the boid pointer array. // - Access each boid in the cell and compute velocity change from // the boids rules, if this boid is within the neighborhood distance. + glm::vec3 vel_change(0.0f, 0.0f, 0.0f); + // Rule 1: boids fly towards their local perceived center of mass, which excludes themselves + //compute the perceived center of mass + glm::vec3 perceived_CM(0.0f, 0.0f, 0.0f); + float perceived_Num1 = 0.0f; + // Rule 2: boids try to stay a distance d away from each other + glm::vec3 c(0.0f, 0.0f, 0.0f); + // Rule 3: boids try to match the speed of surrounding boids + glm::vec3 perceived_Vel(0.0f, 0.0f, 0.0f); + float perceived_Num3 = 0.0f; + + for (int tocheckInd = 0; tocheckInd < 8; tocheckInd++) + { + int gridToCheck = gridstoSee[tocheckInd]; + if (gridToCheck >= 0) + { + //read the start/end indices + int startInd = gridCellStartIndices[gridToCheck]; + int endInd = gridCellEndIndices[gridToCheck]; + //access boids and compute velocity change + //use boidPtrInd>=0 to surpass those cells with no boids inside + + for (int boidPtrInd = startInd; boidPtrInd <= endInd && boidPtrInd >= 0 ; boidPtrInd++) + { + int boidInd = particleArrayIndices[boidPtrInd]; + if (boidInd != boidPosVelInd) + { + glm::vec3 another_Pos = pos[boidInd]; + float disFromThis = glm::distance(posCurr, another_Pos); + if (disFromThis < rule1Distance) + { + perceived_CM += another_Pos; + perceived_Num1 += 1.0f; + } + if (disFromThis < rule2Distance) + { + c -= (another_Pos - posCurr); + } + if (disFromThis < rule3Distance) + { + perceived_Vel += vel1[boidInd]; + perceived_Num3 += 1.0f; + } + } + } + } + } + //rule1 + if (perceived_Num1 > 0) + { + perceived_CM = perceived_CM * (1.0f / perceived_Num1); + vel_change += (perceived_CM - posCurr) * rule1Scale; + } + //rule2 + vel_change += c * rule2Scale; + //rule3 + if (perceived_Num3 > 0) + { + perceived_Vel = perceived_Vel * (1.0f / perceived_Num3); + vel_change += perceived_Vel * rule3Scale; + } + glm::vec3 newVel = 
vel1[boidPosVelInd] + vel_change; // - Clamp the speed change before putting the new speed in vel2 + newVel.x = newVel.x < -maxSpeed ? -maxSpeed : newVel.x; + newVel.y = newVel.y < -maxSpeed ? -maxSpeed : newVel.y; + newVel.z = newVel.z < -maxSpeed ? -maxSpeed : newVel.z; + + newVel.x = newVel.x > maxSpeed ? maxSpeed : newVel.x; + newVel.y = newVel.y > maxSpeed ? maxSpeed : newVel.y; + newVel.z = newVel.z > maxSpeed ? maxSpeed : newVel.z; + + vel2[boidPosVelInd] = newVel; +} + +__global__ void kernUpdateVelNeighborSearchScattered1XGRID( + int N, int gridResolution, glm::vec3 gridMin, + float inverseCellWidth, float cellWidth, + int *gridCellStartIndices, int *gridCellEndIndices, + int *particleArrayIndices, + glm::vec3 *pos, glm::vec3 *vel1, glm::vec3 *vel2) { + // This is for 1xGRID case & scattered position + int index = threadIdx.x + (blockIdx.x * blockDim.x); + if (index >= N) + { + return; + } + // - Identify the grid cell that this particle is in + // - Identify which cells may contain neighbors. In this case there are 27 cells + int halfSideCount = gridResolution / 2; + int gridSideCount = gridResolution; + int boidPosVelInd = particleArrayIndices[index]; + glm::vec3 posCurr = pos[boidPosVelInd]; + int gridX = int(floor(posCurr.x * inverseCellWidth)) + halfSideCount; + int gridY = int(floor(posCurr.y * inverseCellWidth)) + halfSideCount; + int gridZ = int(floor(posCurr.z * inverseCellWidth)) + halfSideCount; + // Then the current boid is located inside the grid of (gridX, gridY, gridZ) + // Just need to check the surrounding 3x3x3 cells + // - Access each boid in the cell and compute velocity change from + // the boids rules, if this boid is within the neighborhood distance. + glm::vec3 vel_change(0.0f, 0.0f, 0.0f); + // Rule 1: boids fly towards their local perceived center of mass, which excludes themselves + //compute the perceived center of mass + glm::vec3 perceived_CM(0.0f, 0.0f, 0.0f); + float perceived_Num1 = 0.0f; + // Rule 2: boids try to stay a distance d away from each other + glm::vec3 c(0.0f, 0.0f, 0.0f); + // Rule 3: boids try to match the speed of surrounding boids + glm::vec3 perceived_Vel(0.0f, 0.0f, 0.0f); + float perceived_Num3 = 0.0f; + //debugging + + for (int zDir = -1; zDir <= 1; zDir++) + { + for (int yDir = -1; yDir <= 1; yDir++) + { + for (int xDir = -1; xDir <= 1; xDir++) + { + // use module to get desired position + // compute the grid index + int gridInd = gridIndex3Dto1D((gridX + xDir + gridSideCount) % gridSideCount, + (gridY + yDir + gridSideCount) % gridSideCount, + (gridZ + zDir + gridSideCount) % gridSideCount, + gridSideCount); + //read the start/end indices + int startInd = gridCellStartIndices[gridInd]; + int endInd = gridCellEndIndices[gridInd]; + + //access boids and compute velocity change + //use boidPtrInd>=0 to surpass those cells with no boids inside + + for (int boidPtrInd = startInd; boidPtrInd <= endInd && boidPtrInd >= 0; boidPtrInd++) + { + int boidInd = particleArrayIndices[boidPtrInd]; + /////following is basically copy from 1.2 + if (boidInd != boidPosVelInd) + { + glm::vec3 another_Pos = pos[boidInd]; + float disFromThis = glm::distance(posCurr, another_Pos); + if (disFromThis < rule1Distance) + { + perceived_CM += another_Pos; + perceived_Num1 += 1.0f; + } + if (disFromThis < rule2Distance) + { + c -= (another_Pos - posCurr); + } + if (disFromThis < rule3Distance) + { + perceived_Vel += vel1[boidInd]; + perceived_Num3 += 1.0f; + } + } + } + } + } + } + //rule1 + if (perceived_Num1 > 0) + { + perceived_CM = perceived_CM * (1.0f 
/ perceived_Num1); + vel_change += (perceived_CM - posCurr) * rule1Scale; + } + //rule2 + vel_change += c * rule2Scale; + //rule3 + if (perceived_Num3 > 0) + { + perceived_Vel = perceived_Vel * (1.0f / perceived_Num3); + vel_change += perceived_Vel * rule3Scale; + } + + //debugging + glm::vec3 newVel = vel1[boidPosVelInd] + vel_change; + //glm::vec3 newVel = vel1[boidPosVelInd]; + // - Clamp the speed change before putting the new speed in vel2 + newVel.x = newVel.x < -maxSpeed ? -maxSpeed : newVel.x; + newVel.y = newVel.y < -maxSpeed ? -maxSpeed : newVel.y; + newVel.z = newVel.z < -maxSpeed ? -maxSpeed : newVel.z; + + newVel.x = newVel.x > maxSpeed ? maxSpeed : newVel.x; + newVel.y = newVel.y > maxSpeed ? maxSpeed : newVel.y; + newVel.z = newVel.z > maxSpeed ? maxSpeed : newVel.z; + + vel2[boidPosVelInd] = newVel; + } __global__ void kernUpdateVelNeighborSearchCoherent( @@ -333,55 +830,510 @@ __global__ void kernUpdateVelNeighborSearchCoherent( // except with one less level of indirection. // This should expect gridCellStartIndices and gridCellEndIndices to refer // directly to pos and vel1. - // - Identify the grid cell that this particle is in - // - Identify which cells may contain neighbors. This isn't always 8. - // - For each cell, read the start/end indices in the boid pointer array. - // DIFFERENCE: For best results, consider what order the cells should be - // checked in to maximize the memory benefits of reordering the boids data. - // - Access each boid in the cell and compute velocity change from - // the boids rules, if this boid is within the neighborhood distance. - // - Clamp the speed change before putting the new speed in vel2 + int index = threadIdx.x + (blockIdx.x * blockDim.x); + if (index >= N) + { + return; + } + // - Identify the grid cell that this particle is in + // - Identify which cells may contain neighbors. This isn't always 8. 
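+  // Same 8-cell selection as the scattered kernel, but pos and vel1 have already
+  // been reshuffled into cell order, so this thread's own boid lives directly at
+  // index and neighbors are read straight from their sorted positions (no
+  // particleArrayIndices indirection).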
+ int gridstoSee[8]; + //neighboring cells + int xflag, yflag, zflag; + int boidPosVelInd = index; + glm::vec3 posCurr = pos[boidPosVelInd]; + int gridX = int(floor(posCurr.x * inverseCellWidth)); + int gridY = int(floor(posCurr.y * inverseCellWidth)); + int gridZ = int(floor(posCurr.z * inverseCellWidth)); + float xfloor = gridX * cellWidth; + float yfloor = gridY * cellWidth; + float zfloor = gridZ * cellWidth; + + float xDecimalRatio = (posCurr.x - xfloor) / cellWidth; + float yDecimalRatio = (posCurr.y - yfloor) / cellWidth; + float zDecimalRatio = (posCurr.z - zfloor) / cellWidth; + + int halfSideCount = gridResolution / 2; + int gridSideCount = gridResolution; + + //the first grid is the current grid + gridstoSee[0] = gridIndex3Dto1D(gridX + halfSideCount, gridY + halfSideCount, gridZ + halfSideCount, gridSideCount); + + + //////////////x + if (xDecimalRatio <= 0.5) + { + if (gridX >= -halfSideCount + 1) + { + gridstoSee[1] = gridIndex3Dto1D(gridX + halfSideCount - 1, gridY + halfSideCount, gridZ + halfSideCount, gridSideCount); + } + else + { + gridstoSee[1] = gridIndex3Dto1D(gridSideCount - 1, gridY + halfSideCount, gridZ + halfSideCount, gridSideCount); + } + xflag = 0; + } + else + { + if (gridX < halfSideCount - 1) + { + gridstoSee[1] = gridIndex3Dto1D(gridX + halfSideCount + 1, gridY + halfSideCount, gridZ + halfSideCount, gridSideCount); + } + else + { + gridstoSee[1] = gridIndex3Dto1D(0, gridY + halfSideCount, gridZ + halfSideCount, gridSideCount); + } + xflag = 1; + } + //y + if (yDecimalRatio <= 0.5) + { + if (gridY >= -halfSideCount + 1) + { + gridstoSee[2] = gridIndex3Dto1D(gridX + halfSideCount, gridY + halfSideCount - 1, gridZ + halfSideCount, gridSideCount); + } + else + { + gridstoSee[2] = gridIndex3Dto1D(gridX + halfSideCount, gridSideCount - 1, gridZ + halfSideCount, gridSideCount); + } + yflag = 0; + } + else + { + if (gridY < halfSideCount - 1) + { + gridstoSee[2] = gridIndex3Dto1D(gridX + halfSideCount, gridY + halfSideCount + 1, gridZ + halfSideCount, gridSideCount); + } + else + { + gridstoSee[2] = gridIndex3Dto1D(gridX + halfSideCount, 0, gridZ + halfSideCount, gridSideCount); + } + yflag = 1; + } + //z + if (zDecimalRatio <= 0.5) + { + if (gridZ >= -halfSideCount + 1) + { + gridstoSee[3] = gridIndex3Dto1D(gridX + halfSideCount, gridY + halfSideCount, gridZ + halfSideCount - 1, gridSideCount); + } + else + { + gridstoSee[3] = gridIndex3Dto1D(gridX + halfSideCount, gridY + halfSideCount, gridSideCount - 1, gridSideCount); + } + zflag = 0; + } + else + { + if (gridZ < halfSideCount - 1) + { + gridstoSee[3] = gridIndex3Dto1D(gridX + halfSideCount, gridY + halfSideCount, gridZ + halfSideCount + 1, gridSideCount); + } + else + { + gridstoSee[3] = gridIndex3Dto1D(gridX + halfSideCount, gridY + halfSideCount, 0, gridSideCount); + } + zflag = 1; + } + // lets check for diagonal directions + //xy + float xyIndicator = (xDecimalRatio - xflag) * (xDecimalRatio - xflag) + (yDecimalRatio - yflag) * (yDecimalRatio - yflag); + if (xyIndicator <= 0.25) + { + if (xflag == 0) + xflag = -1; + if (yflag == 0) + yflag = -1; + gridstoSee[4] = gridIndex3Dto1D((gridX + halfSideCount + xflag + gridSideCount) % gridSideCount, + (gridY + halfSideCount + yflag + gridSideCount) % gridSideCount, + gridZ + halfSideCount, gridSideCount); + } + else + gridstoSee[4] = -1; + //xz + if (xflag < 0) + xflag = 0; + float xzIndicator = (xDecimalRatio - xflag) * (xDecimalRatio - xflag) + (zDecimalRatio - zflag) * (zDecimalRatio - zflag); + if (xzIndicator <= 0.25) + { + if (!xflag) + xflag = -1; + if 
(!zflag) + zflag = -1; + gridstoSee[5] = gridIndex3Dto1D((gridX + halfSideCount + xflag + gridSideCount) % gridSideCount, gridY + halfSideCount, (gridZ + halfSideCount + zflag + gridSideCount) % gridSideCount, gridSideCount); + } + else + gridstoSee[5] = -1; + + //yz + if (yflag < 0) + yflag = 0; + if (zflag < 0) + zflag = 0; + float yzIndicator = (zDecimalRatio - zflag) * (zDecimalRatio - zflag) + (yDecimalRatio - yflag) * (yDecimalRatio - yflag); + if (yzIndicator <= 0.25) + { + if (!zflag) + zflag = -1; + if (!yflag) + yflag = -1; + gridstoSee[6] = gridIndex3Dto1D(gridX + halfSideCount, (gridY + halfSideCount + yflag + gridSideCount) % gridSideCount, (gridZ + halfSideCount + zflag + gridSideCount) % gridSideCount, gridSideCount); + } + else + gridstoSee[6] = -1; + + //xyz + xflag = xflag < 0 ? 0 : xflag; + yflag = yflag < 0 ? 0 : yflag; + zflag = zflag < 0 ? 0 : zflag; + float xyzIndicator = (xDecimalRatio - xflag) * (xDecimalRatio - xflag) + + (yDecimalRatio - yflag) * (yDecimalRatio - yflag) + + (zDecimalRatio - zflag) * (zDecimalRatio - zflag); + if (xyzIndicator <= 0.25) + { + if (!zflag) + zflag = -1; + if (!yflag) + yflag = -1; + if (!xflag) + xflag = -1; + gridstoSee[7] = gridIndex3Dto1D((gridX + halfSideCount + xflag + gridSideCount) % gridSideCount, (gridY + halfSideCount + yflag + gridSideCount) % gridSideCount, (gridZ + halfSideCount + zflag + gridSideCount) % gridSideCount, gridSideCount); + + } + else + gridstoSee[7] = -1; + // - For each cell, read the start/end indices in the boid pointer array. + // DIFFERENCE: For best results, consider what order the cells should be + // checked in to maximize the memory benefits of reordering the boids data. + // for(z) + // for(y) + // for(x) + // would be fastest according to the way we compute 1D indices from 3D indices + // here to maximize performance we need to rearrange the gridstoSee array + // this would be clearer in 1x cell size case + int temp = 0; + temp = gridstoSee[3]; + gridstoSee[3] = gridstoSee[4]; + gridstoSee[4] = temp; + // after rearrangement, it would be : self,x,y,xy,z,zx,yz,xyz which is the unrolled sequence of zyx-loop + // - Access each boid in the cell and compute velocity change from + // the boids rules, if this boid is within the neighborhood distance. 
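+  // Cell indices grow as x + y*res + z*res^2, so visiting cells in z-outer/
+  // y/x-inner order tends to walk the coherent pos/vel buffers in increasing
+  // address order, which is what the reordering above is trying to exploit.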
+ glm::vec3 vel_change(0.0f, 0.0f, 0.0f); + // Rule 1: boids fly towards their local perceived center of mass, which excludes themselves + //compute the perceived center of mass + glm::vec3 perceived_CM(0.0f, 0.0f, 0.0f); + float perceived_Num1 = 0.0f; + // Rule 2: boids try to stay a distance d away from each other + glm::vec3 c(0.0f, 0.0f, 0.0f); + // Rule 3: boids try to match the speed of surrounding boids + glm::vec3 perceived_Vel(0.0f, 0.0f, 0.0f); + float perceived_Num3 = 0.0f; + + for (int tocheckInd = 0; tocheckInd < 8; tocheckInd++) + { + int gridToCheck = gridstoSee[tocheckInd]; + if (gridToCheck >= 0) + { + //read the start/end indices + int startInd = gridCellStartIndices[gridToCheck]; + int endInd = gridCellEndIndices[gridToCheck]; + //access boids and compute velocity change + //use boidPtrInd>=0 to surpass those cells with no boids inside + + for (int boidPtrInd = startInd; boidPtrInd <= endInd && boidPtrInd >= 0; boidPtrInd++) + { + int boidInd = boidPtrInd; + /////following is basically copy from 1.2 + if (boidInd != boidPosVelInd) + { + glm::vec3 another_Pos = pos[boidInd]; + float disFromThis = glm::distance(posCurr, another_Pos); + if (disFromThis < rule1Distance) + { + perceived_CM += another_Pos; + perceived_Num1 += 1.0f; + } + if (disFromThis < rule2Distance) + { + c -= (another_Pos - posCurr); + } + if (disFromThis < rule3Distance) + { + perceived_Vel += vel1[boidInd]; + perceived_Num3 += 1.0f; + } + } + } + } + } + //rule1 + if (perceived_Num1 > 0) + { + perceived_CM = perceived_CM * (1.0f / perceived_Num1); + vel_change += (perceived_CM - posCurr) * rule1Scale; + } + //rule2 + vel_change += c * rule2Scale; + //rule3 + if (perceived_Num3 > 0) + { + perceived_Vel = perceived_Vel * (1.0f / perceived_Num3); + vel_change += perceived_Vel * rule3Scale; + } + + glm::vec3 newVel = vel1[boidPosVelInd] + vel_change; + // - Clamp the speed change before putting the new speed in vel2 + newVel.x = newVel.x < -maxSpeed ? -maxSpeed : newVel.x; + newVel.y = newVel.y < -maxSpeed ? -maxSpeed : newVel.y; + newVel.z = newVel.z < -maxSpeed ? -maxSpeed : newVel.z; + + newVel.x = newVel.x > maxSpeed ? maxSpeed : newVel.x; + newVel.y = newVel.y > maxSpeed ? maxSpeed : newVel.y; + newVel.z = newVel.z > maxSpeed ? maxSpeed : newVel.z; + + vel2[boidPosVelInd] = newVel; +} + + +// 1X Grid Size Case, use this function +__global__ void kernUpdateVelNeighborSearchCoherent1XGRID( + int N, int gridResolution, glm::vec3 gridMin, + float inverseCellWidth, float cellWidth, + int *gridCellStartIndices, int *gridCellEndIndices, + glm::vec3 *pos, glm::vec3 *vel1, glm::vec3 *vel2) { + // If the gridcellwidth is 1x search distance, we need to check 27 cells + // the computation process is actually easier than 2x grid case + int index = threadIdx.x + (blockIdx.x * blockDim.x); + if (index >= N) + { + return; + } + // - Identify the grid cell that this particle is in + // - Identify which cells may contain neighbors. 
Here there are 27 cells + int halfSideCount = gridResolution / 2; + int gridSideCount = gridResolution; + glm::vec3 posCurr = pos[index]; + int gridX = int(floor(posCurr.x * inverseCellWidth)) + halfSideCount; + int gridY = int(floor(posCurr.y * inverseCellWidth)) + halfSideCount; + int gridZ = int(floor(posCurr.z * inverseCellWidth)) + halfSideCount; + // Then the current boid is located inside the grid of (gridX, gridY, gridZ) + // Just need to check the surrounding 3x3x3 cells + // - Access each boid in the cell and compute velocity change from + // the boids rules, if this boid is within the neighborhood distance. + glm::vec3 vel_change(0.0f, 0.0f, 0.0f); + // Rule 1: boids fly towards their local perceived center of mass, which excludes themselves + //compute the perceived center of mass + glm::vec3 perceived_CM(0.0f, 0.0f, 0.0f); + float perceived_Num1 = 0.0f; + // Rule 2: boids try to stay a distance d away from each other + glm::vec3 c(0.0f, 0.0f, 0.0f); + // Rule 3: boids try to match the speed of surrounding boids + glm::vec3 perceived_Vel(0.0f, 0.0f, 0.0f); + float perceived_Num3 = 0.0f; + + for (int zDir = -1; zDir <= 1; zDir++) + { + for (int yDir = -1; yDir <= 1; yDir++) + { + for (int xDir = -1; xDir <= 1; xDir++) + { + // use module to get desired position + // compute the grid index + int gridInd = gridIndex3Dto1D((gridX + xDir + gridSideCount) % gridSideCount, + (gridY + yDir + gridSideCount) % gridSideCount, + (gridZ + zDir + gridSideCount) % gridSideCount, + gridSideCount); + //read the start/end indices + int startInd = gridCellStartIndices[gridInd]; + int endInd = gridCellEndIndices[gridInd]; + + //access boids and compute velocity change + //use boidPtrInd>=0 to surpass those cells with no boids inside + + for (int boidPtrInd = startInd; boidPtrInd <= endInd && boidPtrInd >= 0; boidPtrInd++) + { + int boidInd = boidPtrInd; + /////following is basically copy from 1.2 + if (boidInd != index) + { + glm::vec3 another_Pos = pos[boidInd]; + float disFromThis = glm::distance(posCurr, another_Pos); + if (disFromThis < rule1Distance) + { + perceived_CM += another_Pos; + perceived_Num1 += 1.0f; + } + if (disFromThis < rule2Distance) + { + c -= (another_Pos - posCurr); + } + if (disFromThis < rule3Distance) + { + perceived_Vel += vel1[boidInd]; + perceived_Num3 += 1.0f; + } + } + } + } + } + } + //rule1 + if (perceived_Num1 > 0) + { + perceived_CM = perceived_CM * (1.0f / perceived_Num1); + vel_change += (perceived_CM - posCurr) * rule1Scale; + } + //rule2 + vel_change += c * rule2Scale; + //rule3 + if (perceived_Num3 > 0) + { + perceived_Vel = perceived_Vel * (1.0f / perceived_Num3); + vel_change += perceived_Vel * rule3Scale; + } + + glm::vec3 newVel = vel1[index] + vel_change; + // - Clamp the speed change before putting the new speed in vel2 + newVel.x = newVel.x < -maxSpeed ? -maxSpeed : newVel.x; + newVel.y = newVel.y < -maxSpeed ? -maxSpeed : newVel.y; + newVel.z = newVel.z < -maxSpeed ? -maxSpeed : newVel.z; + + newVel.x = newVel.x > maxSpeed ? maxSpeed : newVel.x; + newVel.y = newVel.y > maxSpeed ? maxSpeed : newVel.y; + newVel.z = newVel.z > maxSpeed ? maxSpeed : newVel.z; + + vel2[index] = newVel; } + +/** +* ping-pong speed buffer +*/ +__global__ void SwapSpeed(int N, glm::vec3 *vel1, glm::vec3 *vel2) +{ + int index = threadIdx.x + (blockIdx.x * blockDim.x); + if (index >= N) { + return; + } + vel1[index].x = vel2[index].x; + vel1[index].y = vel2[index].y; + vel1[index].z = vel2[index].z; + return; +} /** * Step the entire N-body simulation by `dt` seconds. 
*/
void Boids::stepSimulationNaive(float dt) {
-  // TODO-1.2 - use the kernels you wrote to step the simulation forward in time.
-  // TODO-1.2 ping-pong the velocity buffers
+  dim3 fullBlocksPerGrid((numObjects + blockSize - 1) / blockSize);
+  // TODO-1.2 - use the kernels you wrote to step the simulation forward in time.
+  kernUpdateVelocityBruteForce<<<fullBlocksPerGrid, blockSize>>>(numObjects, dev_pos, dev_vel1, dev_vel2);
+  kernUpdatePos<<<fullBlocksPerGrid, blockSize>>>(numObjects, dt, dev_pos, dev_vel2);
+  // TODO-1.2 ping-pong the velocity buffers
+  SwapSpeed<<<fullBlocksPerGrid, blockSize>>>(numObjects, dev_vel1, dev_vel2);
 }
 
 void Boids::stepSimulationScatteredGrid(float dt) {
   // TODO-2.1
   // Uniform Grid Neighbor search using Thrust sort.
   // In Parallel:
+  dim3 fullBlocksPerGrid((numObjects + blockSize - 1) / blockSize);
   // - label each particle with its array index as well as its grid index.
   //   Use 2x width grids.
+  kernComputeIndices<<<fullBlocksPerGrid, blockSize>>>(numObjects, gridSideCount, gridMinimum, gridInverseCellWidth, dev_pos, dev_particleArrayIndices, dev_particleGridIndices);
   // - Unstable key sort using Thrust. A stable sort isn't necessary, but you
   //   are welcome to do a performance comparison.
+  dev_thrust_particleArrayIndices = thrust::device_ptr<int>(dev_particleArrayIndices);
+  dev_thrust_particleGridIndices = thrust::device_ptr<int>(dev_particleGridIndices);
+  thrust::sort_by_key(dev_thrust_particleGridIndices, dev_thrust_particleGridIndices + numObjects, dev_thrust_particleArrayIndices);
   // - Naively unroll the loop for finding the start and end indices of each
   //   cell's data pointers in the array of boid indices
+
+  // before we do this, we need to set default values (-1 = empty cell) for the start and end indices:
+  dim3 fullBlocksPerGrid2((gridCellCount + blockSize - 1) / blockSize);
+  kernResetIntBuffer<<<fullBlocksPerGrid2, blockSize>>>(gridCellCount, dev_gridCellStartIndices, -1);
+  kernResetIntBuffer<<<fullBlocksPerGrid2, blockSize>>>(gridCellCount, dev_gridCellEndIndices, -1);
+  // then compute the start and end indices
+  kernIdentifyCellStartEnd<<<fullBlocksPerGrid, blockSize>>>(numObjects, dev_particleGridIndices, dev_gridCellStartIndices, dev_gridCellEndIndices);
   // - Perform velocity updates using neighbor search
-  // - Update positions
-  // - Ping-pong buffers as needed
+#if GRID1X
+  kernUpdateVelNeighborSearchScattered1XGRID<<<fullBlocksPerGrid, blockSize>>>(numObjects, gridSideCount, gridMinimum, gridInverseCellWidth, gridCellWidth,
+    dev_gridCellStartIndices, dev_gridCellEndIndices, dev_particleArrayIndices, dev_pos, dev_vel1, dev_vel2);
+#endif
+
+#if GRID2X
+  kernUpdateVelNeighborSearchScattered<<<fullBlocksPerGrid, blockSize>>>(numObjects, gridSideCount, gridMinimum, gridInverseCellWidth, gridCellWidth,
+    dev_gridCellStartIndices, dev_gridCellEndIndices, dev_particleArrayIndices, dev_pos, dev_vel1, dev_vel2);
+#endif
+  // - Update positions
+  kernUpdatePos<<<fullBlocksPerGrid, blockSize>>>(numObjects, dt, dev_pos, dev_vel2);
+  // - Ping-pong buffers as needed
+  SwapSpeed<<<fullBlocksPerGrid, blockSize>>>(numObjects, dev_vel1, dev_vel2);
+}
+
+// function to shuffle (back == false) or unshuffle (back == true) two vec3 arrays by particleArrayIndices
+__global__ void shuffleBufferWithIndices(int N, bool back, glm::vec3* originalData1, glm::vec3* shuffledData1, glm::vec3* originalData2, glm::vec3* shuffledData2, int* particleArrayIndices)
+{
+  int index = threadIdx.x + (blockIdx.x * blockDim.x);
+  if (index >= N)
+  {
+    return;
+  }
+  int realPos = particleArrayIndices[index];
+  if (!back)
+  {
+    shuffledData1[index] = originalData1[realPos];
+    shuffledData2[index] = originalData2[realPos];
+  }
+  else
+  {
+    originalData1[realPos] = shuffledData1[index];
+    originalData2[realPos] = shuffledData2[index];
+  }
+}
+
 void Boids::stepSimulationCoherentGrid(float dt) {
   // TODO-2.3 - start by copying Boids::stepSimulationNaiveGrid
   // Uniform Grid Neighbor search using Thrust sort on cell-coherent data.
   // In Parallel:
+  dim3 fullBlocksPerGrid((numObjects + blockSize - 1) / blockSize);
   // - Label each particle with its array index as well as its grid index.
   //   Use 2x width grids
+  kernComputeIndices<<<fullBlocksPerGrid, blockSize>>>(numObjects, gridSideCount, gridMinimum, gridInverseCellWidth, dev_pos, dev_particleArrayIndices, dev_particleGridIndices);
   // - Unstable key sort using Thrust. A stable sort isn't necessary, but you
   //   are welcome to do a performance comparison.
+  dev_thrust_particleArrayIndices = thrust::device_ptr<int>(dev_particleArrayIndices);
+  dev_thrust_particleGridIndices = thrust::device_ptr<int>(dev_particleGridIndices);
+  thrust::sort_by_key(dev_thrust_particleGridIndices, dev_thrust_particleGridIndices + numObjects, dev_thrust_particleArrayIndices);
   // - Naively unroll the loop for finding the start and end indices of each
   //   cell's data pointers in the array of boid indices
+  // Note - before we do this, we need to set default values (-1 = empty cell) for the start and end indices:
+  dim3 fullBlocksPerGrid2((gridCellCount + blockSize - 1) / blockSize);
+  kernResetIntBuffer<<<fullBlocksPerGrid2, blockSize>>>(gridCellCount, dev_gridCellStartIndices, -1);
+  kernResetIntBuffer<<<fullBlocksPerGrid2, blockSize>>>(gridCellCount, dev_gridCellEndIndices, -1);
+  // then compute the start and end indices
+  kernIdentifyCellStartEnd<<<fullBlocksPerGrid, blockSize>>>(numObjects, dev_particleGridIndices, dev_gridCellStartIndices, dev_gridCellEndIndices);
   // - BIG DIFFERENCE: use the rearranged array index buffer to reshuffle all
   //   the particle data in the simulation array.
   //   CONSIDER WHAT ADDITIONAL BUFFERS YOU NEED
+  shuffleBufferWithIndices<<<fullBlocksPerGrid, blockSize>>>(numObjects, false, dev_pos, dev_posBuffer, dev_vel1, dev_vel1Buffer, dev_particleArrayIndices);
+  // (vel2 does not need shuffling; it is fully overwritten by the velocity update)
   // - Perform velocity updates using neighbor search
+#if GRID1X
+  kernUpdateVelNeighborSearchCoherent1XGRID<<<fullBlocksPerGrid, blockSize>>>(numObjects, gridSideCount, gridMinimum, gridInverseCellWidth, gridCellWidth,
+    dev_gridCellStartIndices, dev_gridCellEndIndices, dev_posBuffer, dev_vel1Buffer, dev_vel2Buffer);
+#endif
+
+#if GRID2X
+  kernUpdateVelNeighborSearchCoherent<<<fullBlocksPerGrid, blockSize>>>(numObjects, gridSideCount, gridMinimum, gridInverseCellWidth, gridCellWidth,
+    dev_gridCellStartIndices, dev_gridCellEndIndices, dev_posBuffer, dev_vel1Buffer, dev_vel2Buffer);
+#endif
   // - Update positions
+  kernUpdatePos<<<fullBlocksPerGrid, blockSize>>>(numObjects, dt, dev_posBuffer, dev_vel2Buffer);
   // - Ping-pong buffers as needed. THIS MAY BE DIFFERENT FROM BEFORE.
+  SwapSpeed<<<fullBlocksPerGrid, blockSize>>>(numObjects, dev_vel1Buffer, dev_vel2Buffer);
+  // need to put pos and vel1 back into the original (unsorted) order
+  shuffleBufferWithIndices<<<fullBlocksPerGrid, blockSize>>>(numObjects, true, dev_pos, dev_posBuffer, dev_vel1, dev_vel1Buffer, dev_particleArrayIndices);
 }
 
 void Boids::endSimulation() {
@@ -390,6 +1342,14 @@ void Boids::endSimulation() {
   cudaFree(dev_pos);
 
   // TODO-2.1 TODO-2.3 - Free any additional buffers here.
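+  // free the 2.1 grid bookkeeping arrays and the 2.3 shuffle buffers
+  // allocated in Boids::initSimulation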
+ cudaFree(dev_particleArrayIndices); + cudaFree(dev_particleGridIndices); + cudaFree(dev_gridCellEndIndices); + cudaFree(dev_gridCellStartIndices); + + cudaFree(dev_posBuffer); + cudaFree(dev_vel1Buffer); + cudaFree(dev_vel2Buffer); } void Boids::unitTest() { diff --git a/src/main.cpp b/src/main.cpp index b82c8c6..fd957ed 100644 --- a/src/main.cpp +++ b/src/main.cpp @@ -14,11 +14,11 @@ // LOOK-2.1 LOOK-2.3 - toggles for UNIFORM_GRID and COHERENT_GRID #define VISUALIZE 1 -#define UNIFORM_GRID 0 +#define UNIFORM_GRID 1 #define COHERENT_GRID 0 // LOOK-1.2 - change this to adjust particle count in the simulation -const int N_FOR_VIS = 5000; +const int N_FOR_VIS = 10000; const float DT = 0.2f; /**