From 846f25c1065f071478edf9464d0600a0f83d21bb Mon Sep 17 00:00:00 2001
From: Palani Johnson <palani.johnson@usu.com>
Date: Fri, 10 Dec 2021 00:42:35 -0700
Subject: [PATCH] readme done

---
 README.md | 36 ++++++++++++++++++++++++++++++++++--
 1 file changed, 34 insertions(+), 2 deletions(-)

diff --git a/README.md b/README.md
index c85885a..0699c97 100644
--- a/README.md
+++ b/README.md
@@ -27,8 +27,40 @@ All other approaches to this problem were mostly modifications to this general a
 
 The solution I found for this problem was incresing the size of the width and height by 2 so that the "real" board sat on the inside of a larger board. While I didn't end up implementing the torous topology for all implementations, I was still able to use this setup to prevent the need for bounds checking. A definition in `game_of_life.h` called `DO_TORIS` enables or disables this feature for any implementations that support it.
 
-The next issue I came accros was an I/O bottleneck. Initialy I was writing the bytes one at a time and periodicly flusing. This, of course, was exceptionaly slow. I eventualy found that writing the bytes to an intermediate video buffer and then using `fwrite` to write them all at once was much faster and completly overcame the I/O bottleneck.
+The next issue I came across was an I/O bottleneck. Initially I was writing the bytes one at a time and periodically flushing. This, of course, was exceptionally slow. I eventually found that writing the bytes to an intermediate video buffer and then using `fwrite` to write them all at once was much faster and mostly overcame the I/O bottleneck.
 
-Below are some of the timing plots for this implementation. All the timing studies that are contained here were run on my laptop with an 8 core, 16 thread Intel i7-10875H 2.300GH cpu and a NVIDIA Quadro RTX 5000 Mobile gpu.
+Below are some of the timing plots for this implementation. All the timing studies contained here were run on my laptop, which has an 8-core, 16-thread Intel i7-10875H 2.30 GHz CPU and an NVIDIA Quadro RTX 5000 Mobile GPU. All studies were performed over 600 iterations (10 seconds of video at 60 fps).
+
+![Timings for serial with I/O](media/timing_serial.png)
+![Timings for serial without I/O](media/timing_serial_no_io.png)
+
+As the board size gets larger, the I/O backlog becomes more noticeable, but this was the best I could achieve.
 
 ### Shared memory approach with OpenMP: `omp_game.c`
+For this implementation I used OpenMP to help manage the shared memory environment. This was the easiest version to implement, as the loops were easy to parallelize using the `for` pragmas.
+
+![Timings for OpenMP with I/O](media/timing_openmp.png)
+![Timings for OpenMP without I/O](media/timing_openmp_no_io.png)
+
+One thing that immediately stood out to me was how much faster the single-threaded OpenMP implementation was than the serial implementation. I expected them to take about the same amount of time. This implementation was about 10 times faster, but I don't know why this was the case.
+
+![Speedups for OpenMP with I/O](media/speedup_openmp.png)
+![Speedups for OpenMP without I/O](media/speedup_openmp_no_io.png)
+
+### GPU approach with CUDA: `cuda_game.cu`
+For this implementation I used CUDA to program GPU kernels. Managing the memory in this implementation was difficult and required me to be very careful with how I passed struct pointers around.
+
+![Timings for CUDA with I/O](media/timing_cuda.png)
+![Timings for CUDA without I/O](media/timing_cuda_no_io.png)
+
+Overall this implementation wasn't as fast as I had hoped and, after looking at these graphs, it is painfully clear that I am hitting an I/O bottleneck.
+
+### Distributed memory approach with MPI: `mpi_game.c`
+
+This implementation was by far the most difficult to implement. My initial approach involved tiling the board into smaller mini boards. This approach turned into a mess of confusing pointer arithmetic that made debugging exceptionally difficult. The approach I ended up landing on iterates in steps, splits the computation by rows, and calls `MPI_Allreduce` at the end of each step. I also could not figure out how to fill the video buffers using any distributed method, so a huge bottleneck occurs during the video buffer writing step of the process.
+
+![Timings for MPI with I/O](media/timing_mpi.png)
+![Timings for MPI without I/O](media/timing_mpi_no_io.png)
+
+![Speedups for MPI with I/O](media/speedup_mpi.png)
+![Speedups for MPI without I/O](media/speedup_mpi_no_io.png)