Project 1: Xiao Zhang #3

Open · wants to merge 3 commits into master
4 changes: 3 additions & 1 deletion CMakeLists.txt
@@ -13,7 +13,9 @@ if(${CMAKE_SYSTEM_NAME} MATCHES "Darwin")
elseif(${CMAKE_SYSTEM_NAME} MATCHES "Linux")
set(EXTERNAL_LIB_PATH "${EXTERNAL}/lib/linux" "/usr/lib64")
elseif(WIN32)
if(${MSVC_VERSION} MATCHES "1900")
if(${MSVC_VERSION} MATCHES "1915")
set(EXTERNAL_LIB_PATH "${EXTERNAL}/lib/win/vc2015")
elseif(${MSVC_VERSION} MATCHES "1900")
set(EXTERNAL_LIB_PATH "${EXTERNAL}/lib/win/vc2015")
elseif(${MSVC_VERSION} MATCHES "1800")
set(EXTERNAL_LIB_PATH "${EXTERNAL}/lib/win/vc2013")
57 changes: 51 additions & 6 deletions README.md
@@ -1,11 +1,56 @@
**University of Pennsylvania, CIS 565: GPU Programming and Architecture,
Project 1 - Flocking**

* (TODO) YOUR NAME HERE
* (TODO) [LinkedIn](), [personal website](), [twitter](), etc.
* Tested on: (TODO) Windows 22, i7-2222 @ 2.22GHz 22GB, GTX 222 222MB (Moore 2222 Lab)
* Xiao Zhang
* [LinkedIn](https://www.linkedin.com/in/xiao-zhang-674bb8148/)
* Tested on: Windows 10, i7-7700K @ 4.20GHz 16.0GB, GTX 1080 15.96GB (my own PC)

### (TODO: Your README)
### Screenshot

Include screenshots, analysis, etc. (Remember, this is public, so don't put
anything here that you don't want to share with the world.)
The screenshot below shows a simulation using the uniform grid with coherent storage of position and velocity. 20,000 boids are simulated, the CUDA block size is set to 128, and vertical synchronization is turned on.

![](images/grid_coherent_20000_128_vsync.gif)

### Analysis

#### The chart below shows the framerate of all three simulation methods under different block-size and boid-count configurations.

![](images/chart1.JPG)

#### The graph below shows the framerate of the naive simulation method under different block-size and boid-count configurations.

![](images/graph1.JPG)

#### The graph below shows the framerate of the uniform grid simulation method, with separate memory storage for position and velocity, under different block-size and boid-count configurations.

![](images/graph2.JPG)

#### The graph below shows the framerate of the uniform grid simulation method, with coherent memory storage for position and velocity, under different block-size and boid-count configurations.

![](images/graph3.JPG)

#### The chart below shows the framerate of all three simulation methods under different boid counts when visualization is turned on and the CUDA block size is set to 128.

![](images/chart2.JPG)

#### The graph below plots the data from the chart above.

![](images/graph4.JPG)

### Q&A

#### 1. For each implementation, how does changing the number of boids affect performance? Why do you think this is?

Increasing the number of boids lowers the framerate for every implementation. This is mainly because more data has to be processed and transferred each frame. According to the analysis, the framerate does not increase with a larger block size, which means parallelization is not the bottleneck and every boid already runs on its own thread; the next most likely limiter is data throughput.

#### 2. For each implementation, how does changing the block count and block size affect performance? Why do you think this is?

Changing the block count and block size hardly affects performance. The way the workload is assigned guarantees that every boid runs on its own thread: when the block size shrinks, the block count grows so that everything still finishes in one grid launch.

#### 3. For the coherent uniform grid: did you experience any performance improvements with the more coherent uniform grid? Was this the outcome you expected? Why or why not?

Yes, and yes, it was the expected outcome. Instead of accessing position and velocity through an indirection many times during each neighbor search, one extra kernel reshuffles position and velocity into cell order so that later accesses are contiguous. The reshuffle itself still performs scattered accesses, but it only has to happen once per frame; after reshuffling, the data can be accessed more efficiently because of the GPU cache and the principle of locality.

#### 4. Did changing cell width and checking 27 vs 8 neighboring cells affect performance? Why or why not? Be careful: it is insufficient (and possibly incorrect) to say that 27-cell is slower simply because there are more cells to check!

Not significantly. Checking 27 cells does add work: more cells means potentially more boids to account for, and that inner loop is not parallelized within a thread. On the other hand, smaller cells give each thread a better chance of doing an equal amount of work (the extreme case, every boid checking every cell, is basically the naive method). On a GPU this matters because a warp always waits for its slowest thread to finish, so a balanced per-thread workload improves performance.
Binary file added images/chart1.JPG
Binary file added images/chart2.JPG
Binary file added images/graph1.JPG
Binary file added images/graph2.JPG
Binary file added images/graph3.JPG
Binary file added images/graph4.JPG
Binary file added images/grid_coherent_20000_128_vsync.gif
2 changes: 1 addition & 1 deletion src/CMakeLists.txt
@@ -10,5 +10,5 @@ set(SOURCE_FILES

cuda_add_library(src
${SOURCE_FILES}
OPTIONS -arch=sm_20
OPTIONS -arch=sm_60
)