This project is the course project for Parallel Programming at NYCU CSIE. We implement parallel optimizations of sparse matrix multiplication (SpMM) as a PyTorch extension, and we provide a benchmark tool to compare the performance of the different implementations. Currently, we have implemented the following methods:
- PyTorch serial implementation
- Parallel-friendly structure implementation (still serial)
- OpenMP implementation by Frankie and DandinPower
- OpenMP + memory-efficient implementation by Leo
- std::thread implementation by Frankie and DandinPower
The OpenMP + memory-efficient implementation uses the least memory. As for speed, no single implementation wins in every scenario: the std::thread, OpenMP, and OpenMP + memory-efficient implementations each have their advantages depending on the workload. You can check the benchmark results in the logs folder.
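As a point of reference, the PyTorch serial baseline can be expressed with the built-in torch.sparse.mm on a COO sparse tensor. The snippet below is only a minimal sketch of that baseline; the sizes and density are placeholder values, and the extension's own entry points are not shown here because their module and function names depend on how the extension is built.

```python
import torch

# Minimal sketch of the PyTorch serial SpMM baseline.
# Assumption: SpMM here means sparse (M x K) times dense (K x N).
M, K, N, density = 512, 512, 512, 0.1

dense_a = torch.rand(M, K) * (torch.rand(M, K) < density)  # ~10% non-zeros
a_sparse = dense_a.to_sparse()                             # COO sparse tensor
b = torch.rand(K, N)

c = torch.sparse.mm(a_sparse, b)  # sparse @ dense -> dense
print(c.shape)                    # torch.Size([512, 512])
```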
Our evaluation platform is:
- AMD Ryzen 9 5950X 16-Core Processor (16 cores)
- Ubuntu 22.04 LTS
- Python 3.10.12
Before you can compile the PyTorch extension, you need to install the necessary requirements. Run the following command in your terminal:
pip install -r requirements.txt
After installing the prerequisites, navigate to the pytorch_extension directory and run the setup file to compile the PyTorch extension:
cd pytorch_extension
bash run.sh
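run.sh wraps the usual PyTorch C++ extension build. The following is a minimal sketch of what such a setup.py can look like; the module name, source file, and compiler flags are placeholders for illustration, not the exact contents of this repository's setup file.

```python
# setup.py -- minimal sketch of a C++ PyTorch extension build with OpenMP.
# The module and source names below are placeholders, not the project's actual ones.
from setuptools import setup
from torch.utils.cpp_extension import CppExtension, BuildExtension

setup(
    name="spmm_extension",
    ext_modules=[
        CppExtension(
            name="spmm_extension",
            sources=["spmm.cpp"],
            extra_compile_args=["-fopenmp", "-O3"],
            extra_link_args=["-fopenmp"],
        )
    ],
    cmdclass={"build_ext": BuildExtension},
)
```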
This project uses the pytest module for unit testing. To test the PyTorch extension implementation, run the following command:
pytest ./test
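The tests check the extension's output against a PyTorch reference. Below is a minimal sketch of such a test; the commented-out extension call uses a hypothetical module and function name, not necessarily the ones this project exposes, and the live code falls back to torch.sparse.mm as a stand-in so the sketch runs on its own.

```python
# Minimal sketch of a pytest unit test; the extension call shown in the
# comment is hypothetical.
import torch
# import spmm_extension  # hypothetical compiled extension module

def test_spmm_matches_dense_reference():
    a = torch.rand(64, 64)
    a[a < 0.9] = 0.0                   # make the matrix ~10% dense
    b = torch.rand(64, 32)

    reference = a @ b                  # dense reference result
    result = torch.sparse.mm(a.to_sparse(), b)      # stand-in for the extension
    # result = spmm_extension.spmm(a.to_sparse(), b)  # hypothetical extension call

    assert torch.allclose(result, reference, atol=1e-5)
```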
After compiling the PyTorch extension and running the tests, you can use our benchmark code to compare the different sparse matrix multiplication implementations. We provide two benchmark strategies, plus a script that runs both:
- Benchmarking the SpMM function end to end, across different densities, thread counts, and matrix sizes:
  bash benchmark.sh
- Benchmarking the SpMM function on the MNIST test dataset, across different densities and thread counts:
  bash mnist_benchmark.sh
- Running all benchmarks:
  bash all_benchmark.sh
For each benchmark, you can change the parameters in the script file; descriptions of the parameters are given in the script itself.
Note: the number of threads must not exceed the number of CPU cores, and you should also consider whether all cores have the same performance. For example, on Intel CPUs with a hybrid architecture, the performance cores are faster than the efficiency cores.
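The sketch below illustrates the kind of end-to-end timing loop these scripts perform and how to keep the thread count within your core count. The matrix sizes, densities, thread counts, and repeat count are placeholder values, not the parameters used by benchmark.sh, and torch.sparse.mm stands in for the extension's kernels; the extension's OpenMP code may instead read OMP_NUM_THREADS or take a thread-count argument.

```python
# Minimal sketch of an end-to-end SpMM timing loop over densities and thread
# counts; all numeric parameters are placeholders.
import os
import time
import torch

max_threads = os.cpu_count()  # do not request more threads than this
M, K, N = 1024, 1024, 1024

for density in (0.01, 0.05, 0.1):
    a = torch.rand(M, K) * (torch.rand(M, K) < density)
    a_sparse = a.to_sparse()
    b = torch.rand(K, N)

    for threads in (1, 2, 4, 8):
        if threads > max_threads:
            continue
        torch.set_num_threads(threads)  # intra-op threads for PyTorch ops

        start = time.perf_counter()
        for _ in range(10):             # repeat to smooth out noise
            torch.sparse.mm(a_sparse, b)
        elapsed = (time.perf_counter() - start) / 10
        print(f"density={density:.2f} threads={threads} time={elapsed * 1000:.2f} ms")
```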
- For PyTorch built-in symbols that torch/extension.h does not expose, you need to include the corresponding header yourself. For example, to use at::native::StridedRandomAccessor, add:
  #include <ATen/native/StridedRandomAccessor.h>
You are welcome to contribute to this project. If you have any questions, please feel free to contact us; if you are not sure where to start, check the issues page.