This repository has been archived by the owner on Jan 18, 2025. It is now read-only.

Commit

readme done
Palani Johnson committed Dec 10, 2021
1 parent e48d48d commit 846f25c
Showing 1 changed file with 34 additions and 2 deletions.
36 changes: 34 additions & 2 deletions README.md
@@ -27,8 +27,40 @@ All other approaches to this problem were mostly modifications to this general a

The solution I found for this problem was increasing the width and height by 2 so that the "real" board sat inside a larger board. While I didn't end up implementing the torus topology for all implementations, I was still able to use this setup to avoid bounds checking. A definition in `game_of_life.h` called `DO_TORIS` enables or disables this feature for any implementations that support it.
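The padded layout described above can be sketched roughly as follows. This is a minimal illustration, not the code from `game_of_life.h`: the `board_t` struct and the `cell`, `board_new`, `board_wrap`, and `neighbours` helpers are hypothetical names, and one byte per cell is an assumption. Copying the opposite edges into the padding ring gives the torus behaviour; skipping that copy leaves a dead border, and either way the neighbour count needs no bounds checks.

```c
#include <stdlib.h>

/* Hypothetical layout: a (height + 2) x (width + 2) byte grid whose
   outer ring is padding around the "real" board. */
typedef struct {
    int width, height;     /* size of the real board */
    unsigned char *cells;  /* (height + 2) * (width + 2) bytes */
} board_t;

/* row and col are padded coordinates: 0..height+1 and 0..width+1. */
static inline unsigned char *cell(const board_t *b, int row, int col) {
    return &b->cells[row * (b->width + 2) + col];
}

board_t board_new(int width, int height) {
    board_t b = { width, height,
                  calloc((size_t)(width + 2) * (size_t)(height + 2), 1) };
    return b;
}

/* Copy opposite edges into the padding ring so the board wraps around. */
void board_wrap(board_t *b) {
    int w = b->width, h = b->height;
    for (int c = 1; c <= w; c++) {
        *cell(b, 0, c)     = *cell(b, h, c);
        *cell(b, h + 1, c) = *cell(b, 1, c);
    }
    for (int r = 0; r <= h + 1; r++) {  /* also fills the four corners */
        *cell(b, r, 0)     = *cell(b, r, w);
        *cell(b, r, w + 1) = *cell(b, r, 1);
    }
}

/* Live neighbours of a real cell (1 <= row <= height, 1 <= col <= width).
   The padding ring guarantees every index below is in range. */
int neighbours(const board_t *b, int row, int col) {
    int n = 0;
    for (int dr = -1; dr <= 1; dr++)
        for (int dc = -1; dc <= 1; dc++)
            if (dr || dc) n += *cell(b, row + dr, col + dc);
    return n;
}
```

In this sketch `board_wrap` would be called once per generation before counting neighbours, which is the part a switch like `DO_TORIS` could plausibly turn on or off.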

The next issue I came across was an I/O bottleneck. Initially I was writing the bytes one at a time and periodically flushing. This, of course, was exceptionally slow. I eventually found that writing the bytes to an intermediate video buffer and then using `fwrite` to write them all at once was much faster and completely overcame the I/O bottleneck.
The next issue I came across was an I/O bottleneck. Initially I was writing the bytes one at a time and periodically flushing. This, of course, was exceptionally slow. I eventually found that writing the bytes to an intermediate video buffer and then using `fwrite` to write them all at once was much faster and mostly overcame the I/O bottleneck.
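To make the difference concrete, here is a small sketch of the two output patterns. The function names and the mapping of a live cell to a 255-valued grayscale pixel are assumptions for illustration; the repository's actual frame format is not shown in this diff.

```c
#include <stdio.h>

/* Slow pattern: one library call per output byte. */
size_t write_frame_bytewise(FILE *out, const unsigned char *cells, size_t n) {
    size_t written = 0;
    for (size_t i = 0; i < n; i++)
        if (fputc(cells[i] ? 255 : 0, out) != EOF) written++;
    return written;
}

/* Faster pattern: fill an in-memory frame first, then hand the whole
   frame to the C library in a single fwrite call. */
size_t write_frame_buffered(FILE *out, const unsigned char *cells,
                            unsigned char *frame, size_t n) {
    for (size_t i = 0; i < n; i++)
        frame[i] = cells[i] ? 255 : 0;  /* assumed 8-bit grayscale pixel */
    return fwrite(frame, 1, n, out);    /* number of bytes written */
}
```

The `frame` buffer is passed in by the caller so it can be allocated once and reused for every generation instead of being reallocated per frame.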

Below are some of the timing plots for this implementation. All the timing studies contained here were run on my laptop with an 8-core, 16-thread Intel i7-10875H 2.30 GHz CPU and an NVIDIA Quadro RTX 5000 Mobile GPU.
Below are some of the timing plots for this implementation. All the timing studies contained here were run on my laptop with an 8-core, 16-thread Intel i7-10875H 2.30 GHz CPU and an NVIDIA Quadro RTX 5000 Mobile GPU. All studies were performed over 600 iterations (10 seconds of video at 60 fps).

![Timings for serial with I/O](media/timing_serial.png)
![Timings for serial without I/O](media/timing_serial_no_io.png)

As the board size gets larger the I/O backup becomes more noticeable, but this was the best I could get.

### Shared memory approach with OpenMP: `omp_game.c`
For this implementation I used OpenMP to help manage the shared memory environment. This was the easiest to implement as the loops were easy to make parallel using the `for` pragmas.
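As a rough idea of what this looks like, the sketch below applies a `parallel for` pragma to the row loop of one generation. It is an assumption-laden illustration rather than the contents of `omp_game.c`: the function name `step_omp`, the one-byte-per-cell storage, and the padded `(height + 2) x (width + 2)` layout are all hypothetical. Each row of `next` is written by exactly one thread and `cur` is only read, so no locking is needed.

```c
/* One generation over a padded (height + 2) x (width + 2) grid.
   Only the real cells (rows 1..height, cols 1..width) are updated. */
void step_omp(const unsigned char *cur, unsigned char *next,
              int width, int height) {
    int stride = width + 2;

    /* Rows are independent, so they can be shared between threads.
       Without OpenMP enabled the pragma is ignored and the loop runs
       serially with the same result. */
    #pragma omp parallel for
    for (int r = 1; r <= height; r++) {
        for (int c = 1; c <= width; c++) {
            const unsigned char *p = cur + r * stride + c;
            int n = p[-stride - 1] + p[-stride] + p[-stride + 1]
                  + p[-1]                       + p[1]
                  + p[stride - 1]  + p[stride]  + p[stride + 1];
            /* Standard Life rule: birth on 3, survival on 2 or 3. */
            next[r * stride + c] = (n == 3) || (n == 2 && *p);
        }
    }
}
```

With GCC or Clang the pragma takes effect when compiling with `-fopenmp`, and the thread count can be set through the `OMP_NUM_THREADS` environment variable.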

![Timings for OpenMP with I/O](media/timing_openmp.png)
![Timings for OpenMP without I/O](media/timing_openmp_no_io.png)

One thing that immediately stood out to me was how much faster the single-threaded OpenMP implementation was than the serial implementation. I expected them to take about the same amount of time. The OpenMP implementation was about 10 times faster, but I don't know why this was the case.

![Speedups for OpenMP with I/O](media/speedup_openmp.png)
![Speedups for OpenMP without I/O](media/speedup_openmp_no_io.png)

### GPU approach with CUDA: `cuda_game.cu`
For this implementation I used CUDA to program GPU kernels. Managing the memory in this implementation was difficult and required me to be very careful with how I passed struct pointers around.

![Timings for CUDA with I/O](media/timing_cuda.png)
![Timings for CUDA without I/O](media/timing_cuda_no_io.png)

Overall this implementation wasn't as fast as I had hoped and, after looking at these graphs, it is painfully clear that I am hitting an I/O bottleneck.

### Distributed memory approach with MPI: `mpi_game.c`

This implementation was by far the most difficult to implement. My initial approach involved tiling the board into smaller mini boards. This approach turned into a mess of confusing pointer arithmetic that made debugging exceptionally difficult. The approach I ended up landing on iterates in steps, splits the computation of each step by rows, and uses `MPI_Allreduce` at the end of the step. I also could not figure out how to fill the video buffers using any distributed methods, so a huge bottleneck occurs during the video buffer writing step of the process.
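The row split can be sketched with a small helper that tells each rank which rows it owns. This is an illustration only: `row_range` is a hypothetical name, and the even split with leftover rows going to the lowest ranks is an assumption, not necessarily what `mpi_game.c` does. The helper is plain C so it can be checked without an MPI installation.

```c
/* Half-open range [*first, *last) of real rows owned by `rank` out of
   `size` ranks. The first (height % size) ranks get one extra row. */
void row_range(int rank, int size, int height, int *first, int *last) {
    int base  = height / size;  /* rows every rank gets */
    int extra = height % size;  /* leftover rows to hand out */
    *first = rank * base + (rank < extra ? rank : extra);
    *last  = *first + base + (rank < extra ? 1 : 0);
}
```

One way to combine the pieces, assuming each rank writes only its own rows into an otherwise zeroed copy of the next board, is a single `MPI_Allreduce` over the whole board per step with a reduction such as `MPI_SUM` or `MPI_BOR`, after which every rank holds the complete next generation.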

![Timings for MPI with I/O](media/timing_mpi.png)
![Timings for MPI without I/O](media/timing_mpi_no_io.png)

![Speedups for MPI with I/O](media/speedup_mpi.png)
![Speedups for MPI without I/O](media/speedup_mpi_no_io.png)
