
# A Highly Efficient GPU Framework for Fast Fourier Transform across Mixed Nodes

## About

This library is being developed to support highly efficient distributed FFT computation on multicore CPU and GPU architectures.

## DistributedFFT for Docker (quick start)

Please refer to distributedfft-docker.

Dependencies and libraries are pre-installed in the Docker image, but the increased communication overhead inside the container leads to some unavoidable performance loss.
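As a rough sketch, a ROCm-capable container can usually be started with the device flags below once the image from distributedfft-docker has been built or pulled; the image tag and mount path here are placeholders, not names defined by this project:

```bash
# Sketch only: <image:tag> and the host path are placeholders.
# The --device/--group-add/--security-opt flags are the usual ones needed for ROCm GPU access.
docker run -it --rm \
  --device=/dev/kfd --device=/dev/dri \
  --group-add video --security-opt seccomp=unconfined \
  -v /path/to/workdir:/workspace \
  <image:tag> /bin/bash
```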

## Structure

- templateFFT (template-based FFT library on a single GPU)
- 3dmpifft_roc (optimized distributed FFT; backend: rocFFT)
- 3dmpifft_opt (optimized distributed FFT; backend: template-based FFT)
- heffte (the current state-of-the-art distributed FFT library heFFTe)

## Build environment

- Compiler (GCC and hipcc)
- CMake (v3.0)
- ROCm platform (v4.2)
- rocFFT (v4.2)
- OpenMPI (v4.0.0) and UCX (v1.11.0)
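Before building, it can help to confirm that these tools are visible in the environment; a quick sanity check using standard commands (nothing project-specific):

```bash
# Quick sanity check of the build environment
gcc --version        # GCC
hipcc --version      # ROCm HIP compiler
cmake --version      # CMake (>= 3.0)
mpirun --version     # OpenMPI 4.x
ucx_info -v          # UCX
```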

## Getting Started Guide

1. Compile and run batched FFTs on a single GPU:

```bash
cd template/batchTest && make
bash runTest1D_opt.sh    # performance test of batched 1D template-based FFT
bash runTest1D_roc.sh    # performance test of batched 1D rocFFT
```

Evaluation results from the manuscript are recorded in the csv folder, and sample kernels are presented in the kernel folder.

2. A typical CMake build of the distributed FFT follows these steps:

```bash
cd 3dmpifft_opt    # or 3dmpifft_roc
mkdir build && cd build
cmake -DMPI_DIR=$OMPI_DIR -DCMAKE_BUILD_TYPE=RELEASE ..
make -j
```
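The cmake line above reads the MPI location from OMPI_DIR; if your environment or module system does not already set it, export it first (the path below is only an example):

```bash
# Example only: point OMPI_DIR at the OpenMPI installation prefix on your system
export OMPI_DIR=/opt/openmpi-4.0.0
```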

Run the script below (customize <MPI_DIR> in line 2 of the script, and the node list for multi-node runs); sample output follows:

```bash
sh speedTest.sh <MPI-RANK> <X> <Y> <Z>
```

```
t0: 0.003835, t1: 0.003627, t2: 0.016155, t3: 0.005500, total: 0.029118
t0: 0.004212, t1: 0.004101, t2: 0.014490, t3: 0.005405, total: 0.028208
t0: 0.004113, t1: 0.003239, t2: 0.014504, t3: 0.005496, total: 0.027352
t0: 0.004251, t1: 0.003561, t2: 0.014023, t3: 0.005489, total: 0.027324
-----------------------------------------------------------------------------
distributed FFT performance test
-----------------------------------------------------------------------------
Size:             512x512x512
MPI ranks:        4
Forward FFT time: 0.0281308 (s)
Performance:      644.112 GFlops/s
Max error:        4.21468e-15
```

t0, t1, t2, and t3 are, respectively, the batched 2D FFT over the YZ dimensions, the local transpose, the all-to-all communication, and the batched 1D FFT over the X dimension. The all-to-all time depends mainly on the MPI library and the PCIe interconnect, so t2 may fluctuate between runs. Despite our optimization of t0 and t3, the overall GFlops/s can still vary considerably: the all-to-all step takes only about 0.01 s, so small fluctuations in it have a large relative effect on the total time. The data in the manuscript are therefore reference values obtained from repeated experiments.
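The printed Performance line appears to follow the conventional 5·N·log2(N) flop count for a complex-to-complex FFT of N = X·Y·Z points, divided by the forward FFT time; this convention is an assumption on our part, but it reproduces both the 644.112 GFlops/s above and the 324.41 GFlops/s of the heFFTe run below:

```bash
# Recompute the GFlops/s from the sample output above, assuming the conventional
# 5*N*log2(N) flop count for a complex-to-complex FFT of N = 512*512*512 points.
awk 'BEGIN { n = 512*512*512; t = 0.0281308;
             printf "%.3f GFlops/s\n", 5 * n * log(n)/log(2) / t / 1e9 }'
# prints ~644.112; with t = 0.0558535 the same formula gives ~324.41 (heFFTe run below)
```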

3. Comparison with heFFTe (no need to compile the whole project):

```bash
cd heffte/heffteBenchmark/benchmarks && make
sh heffteSpeed.sh <MPI-RANK> <X> <Y> <Z>
```

```
-----------------------------------------------------------------------------
heFFTe performance test
-----------------------------------------------------------------------------
Backend:      rocfft
Size:         512x512x512
MPI ranks:    4
Grids:        (1, 2, 2)  (2, 1, 2)  (2, 2, 1)  (1, 2, 2)
Time per run: 0.0558535 (s)
Performance:  324.41 GFlops/s
Memory usage: 2560MB/rank
Tolerance:    1e-11
Max error:    4.88555e-15
```