vlp4d_mpi

MPI+Kokkos/OpenACC/OpenMP4.5/stdpar implementation of vlp4d

About

The vlp4d code solves the Vlasov-Poisson equations in 4D phase space (2D in space, 2D in velocity). From the numerical point of view, vlp4d is based on a semi-Lagrangian scheme. The Vlasov solver uses a directional Strang splitting, and the Poisson equation is solved with 2D Fourier transforms. For the sake of simplicity, all directions are, for the moment, handled with periodic boundary conditions. As a major update from the previous version, the code is parallelized with MPI and the interpolation scheme is upgraded from Lagrange to spline.
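
Schematically, the system being solved is the Vlasov equation for the distribution function f(x, y, vx, vy, t) coupled to the Poisson equation for the self-consistent electric field; the signs and normalization below follow the usual dimensionless convention and may differ from those used in the code:

\partial_t f + \mathbf{v} \cdot \nabla_{\mathbf{x}} f + \mathbf{E} \cdot \nabla_{\mathbf{v}} f = 0,
-\Delta \phi = \rho - 1, \quad \mathbf{E} = -\nabla \phi, \quad \rho(\mathbf{x}, t) = \int f \, d\mathbf{v}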

The Vlasov solver is based on the following advection operators (a schematic sketch of one iteration is given after the list):

  • Halo exchange of the local distribution function (P2P communications)
  • Compute spline coefficients along (x, y)
  • 2D advection along (x, y)
  • Poisson solver -> compute the electric fields Ex and Ey
  • Compute spline coefficients along (vx, vy)
  • 4D advection along the (x, y, vx, vy) directions
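
As a rough illustration, one iteration of the main loop chains these operators as sketched below. This is a minimal C++ sketch only; all type and function names are hypothetical and do not correspond to identifiers in this repository.

// Schematic sketch of one iteration of the Strang-split, semi-Lagrangian
// scheme described above. All names below are placeholders; sub-step
// fractions of dt are omitted for brevity.
struct Distrib {};   // placeholder for the 4D distribution function f(x, y, vx, vy)
struct Efield  {};   // placeholder for the electric fields Ex, Ey

void halo_exchange(Distrib&);                      // pack / P2P comm / unpack of halos
void spline_coeff_xy(Distrib&);                    // spline coefficients along (x, y)
void advect_2d(Distrib&, double);                  // 2D advection along (x, y)
void poisson_solve(const Distrib&, Efield&);       // 2D FFT-based Poisson solve
void spline_coeff_vxvy(Distrib&);                  // spline coefficients along (vx, vy)
void advect_4d(Distrib&, const Efield&, double);   // 4D advection along (x, y, vx, vy)

void one_step(Distrib& f, Efield& ef, double dt) {
  halo_exchange(f);
  spline_coeff_xy(f);
  advect_2d(f, dt);
  poisson_solve(f, ef);
  spline_coeff_vxvy(f);
  advect_4d(f, ef, dt);
}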

Detailed descriptions of the test cases can be found in

For questions or comments, please contact the authors listed in the AUTHORS file.

HPC

From the viewpoint of high performance computing (HPC), the code is parallelized with MPI + "X", where "X" is one of mixed OpenMP3.0/OpenACC, mixed OpenMP3.0/OpenMP4.5, Kokkos, or C++ parallel algorithms (stdpar, experimental). We have investigated optimization strategies applicable to a kinetic plasma simulation code using the MPI + "X" implementations listed above. The details are presented in the P3HPC workshop 2021 paper. Our previous results for the non-MPI version can be found in

Test environments

We have tested the code on the following environments.

  • Nvidia Tesla P100 on Tsubame3.0 (Tokyo Tech, Japan)
    Compilers: cuda/10.2.48 + openmpi3.1.4 (Kokkos), pgi19.1 + openmpi3.1.4 (OpenACC)

  • Nvidia Tesla V100 on Marconi100 (Cineca, Italy)
    Compilers: cuda/10.2 + spectrum_mpi10.3.1 (Kokkos), Nvidia HPC SDK 20.11-0 (OpenACC)

  • Intel Skylake on JFRS-1 (IFERC-CSC, Japan)
    Compilers: Intel compiler 18.0.2

  • Fujitsu A64FX on Flow (Nagoya Univ., Japan)
    Compilers: Fujitsu compiler 1.2.27

Usage

Compile

First, clone the repository on your environment:

git clone https://github.com/yasahi-hpc/vlp4d_mpi.git

Depending on your configuration, you may have to modify the Makefile. You may add your own configuration in the same way, for example:

ifneq (,$(findstring p100,$(DEVICES)))
CXX      = mpicxx
CXXFLAGS = -O3 -ta=nvidia:cc60 -Minfo=accel -Mcudalib=cufft,cublas -std=c++11 -DENABLE_OPENACC -DLAYOUT_LEFT
LDFLAGS  = -Mcudalib=cufft,cublas -ta=nvidia:cc60 -acc
TARGET   = vlp4d.tsubame3.0_p100_openacc
endif

Before compiling, you need to load the appropriate modules for an MPI + CUDA/OpenACC/OpenMP4.5 environment. CUDA-aware MPI is necessary for this application. For the CPU versions, it is also necessary to make sure that FFTW is available in your configuration. The OpenMP4.5 and stdpar versions are experimental and do not appear in the workshop paper; we have only tested them with nvc++ from the Nvidia HPC SDK.

OpenACC version

export DEVICE=device_name # choose the device_name from "p100", "v100", "a100", "bdw", "knl", "skx", "a64fx"
cd src_openacc
make

OpenMP4.5 version

This is an experimental version (not covered in the workshop paper).

export DEVICE=device_name # choose the device_name from "v100", "a100"
cd src_openmp4.5
make

Kokkos version

First of all, you need to install Kokkos on your environment. Instructions can be found at https://github.com/kokkos/kokkos. In the following example, it is assumed that Kokkos is located at "your_kokkos_path".

export KOKKOS_PATH=your_kokkos_path # set your_kokkos_path
export OMPI_CXX=$KOKKOS_PATH/bin/nvcc_wrapper # Assuming OpenMPI as the MPI library (GPU only)
export DEVICE=device_name # choose the device_name from "p100", "v100", "a100", "bdw", "skx", "a64fx"
cd src_kokkos
make

C++ parallel algorithm (stdpar) version

This is an experimental version (not covered in the workshop paper). Performance tests have been made on an A100 GPU. Further optimizations are needed for this version.

export DEVICE=device_name # choose the device_name from "p100", "v100", "a100", "icelake"
cd src_stdpar
make

Run

Depending on your configuration, you may have to modify job.sh in wk and the sub_*.sh scripts in wk/batch_scripts.

cd wk
./job.sh

Experiment workflow

In order to evaluate the impact of the optimizations, one has to compile the code on each environment. Here, device_name is the name of the device used in the Makefile and job scripts. We assume that the installation has already been completed successfully. The impact of the optimizations can then be evaluated by comparing the standard output of the different versions.

OpenACC version

export DEVICE=device_name
export OPTIMIZATION=STEP1 # choose from STEP0-2 for CPUs and choose from STEP0-1 for GPUs
cd src_openacc
make
cd ../wk
./job.sh

OpenMP4.5 version

This is an experimental version (not covered in the workshop paper). It seems important to map GPUs to MPI processes before calling MPI_Init. See wrapper.sh and sub_Wisteria_A100_omp4.5.sh.
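
For illustration, the same idea can be sketched directly in code. This is a minimal, hypothetical example (not taken from this repository, which handles the mapping via the scripts mentioned above), assuming OpenMPI, which exports OMPI_COMM_WORLD_LOCAL_RANK at launch:

// Hypothetical sketch: select the GPU for this rank before MPI_Init,
// so that CUDA-aware MPI initializes on the intended device.
#include <cstdlib>
#include <mpi.h>
#include <omp.h>

int main(int argc, char* argv[]) {
  // Local rank on the node, exported by the OpenMPI launcher before MPI_Init.
  const char* env = std::getenv("OMPI_COMM_WORLD_LOCAL_RANK");
  const int local_rank = env ? std::atoi(env) : 0;

  // Bind this rank to one of the visible GPUs (round-robin).
  const int num_devices = omp_get_num_devices();
  if (num_devices > 0) omp_set_default_device(local_rank % num_devices);

  MPI_Init(&argc, &argv); // MPI starts with the device already selected

  // ... simulation ...

  MPI_Finalize();
  return 0;
}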

export DEVICE=device_name
export OPTIMIZATION=STEP1 # choose from STEP0-1 for GPUs
cd src_openmp4.5
make
cd ../wk
./job.sh

Kokkos version

export DEVICE=device_name
export OMPI_CXX=$KOKKOS_PATH/bin/nvcc_wrapper # Only for the OpenMPI + GPU case
export OPTIMIZATION=STEP1 # choose from STEP0-2 for CPUs and choose from STEP0-1 for GPUs
cd src_kokkos
make
cd ../wk
./job.sh

C++ parallel algorithm (stdpar) version

This is an experimental version (not covered in the workshop paper). As with the OpenMP4.5 version, it is recommended to map GPUs to MPI processes before calling MPI_Init. See sub_Wisteria_A100_stdpar.sh.

export DEVICE=device_name
cd src_stdpar
make
cd ../wk
./job.sh

Expected result

If the code works correctly, one may find the timings at the bottom of the standard output file (in ASCII format).
The timings look like the following (though not aligned in the standard output file):

total             4.57123   [s], 1 calls
MainLoop          4.56027   [s], 40 calls
pack              0.14395   [s], 40 calls
comm              0.705633  [s], 40 calls
unpack            0.0556184 [s], 40 calls
advec2D           0.258894  [s], 40 calls
advec4D           1.38773   [s], 40 calls
field             0.0474864 [s], 80 calls
all_reduce        0.116563  [s], 80 calls
Fourier           0.0296469 [s], 80 calls
diag              0.0992476 [s], 40 calls
splinecoeff_xy    0.805345  [s], 40 calls
splinecoeff_vxvy  0.907955  [s], 40 calls

The columns give the kernel name, the total elapsed time in seconds, and the number of calls. The elapsed time s of a given kernel for a single iteration can be computed as

total elapsed time [s] / number of call counts
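
For example, using the sample output above, a single advec4D call takes about 1.38773 [s] / 40 ≈ 0.035 [s].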

The Flops and memory bandwidth are computed with the following formulas

Flops = N * f / s,
Bytes/s = N * b / s

where f and b denote the number of floating-point operations and memory accesses per grid point, respectively, N represents the total number of grid points, and s is the elapsed time of a given kernel for a single iteration. The values of f and b presented in Table V of the paper (section VI) are analytical estimates from the source code.
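
To illustrate with purely hypothetical numbers (not taken from the paper): a kernel performing f = 100 floating-point operations per grid point on N = 128^4 grid points in s = 0.035 [s] would achieve Flops = 128^4 * 100 / 0.035 ≈ 7.7 * 10^11, i.e. about 0.77 TFlops.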