Ryoanji is a Barnes-Hut N-body solver for gravity and electrostatics. It employs EXAFMM multipole kernels and a Barnes-Hut tree-traversal algorithm inspired by Bonsai. Octrees and domain decomposition are handled by Cornerstone Octree, see Ref. [1].
Ryoanji is optimized to run efficiently on both AMD and NVIDIA GPUs, though a CPU implementation is provided as well.
Ryoanji.git
├── README.md
├── cstone - Cornerstone library: octree building and domain decomposition
│ (git subtree of https://github.com/sekelle/cornerstone-octree)
│
└── ryoanji - Ryoanji: N-body solver
├── src
└── test
├── demo.cu - single-rank demonstrator app
├── demo_mpi.cpp - multi-rank demonstrator app
├── interface
│ └─── global_forces_gpu.cpp - multi-rank correctness check vs. direct sum
├── nbody
└── test_main.cpp
Ryoanji is written in C++ and CUDA. The host .cpp
translation units require a C++20 compiler
(GCC 11 and later, Clang 14 and later), while .cu
translation units are compiled in the C++17 standard.
CUDA version: 11.6 or later, HIP version 5.2 or later.
NVIDIA CUDA, A100
CC=mpicc CXX=mpicxx cmake -DCMAKE_CUDA_ARCHITECTURES=80 -DCMAKE_CUDA_FLAGS=-ccbin=mpicxx -DGPU_DIRECT=<ON/OFF> <GIT_SOURCE_DIR>
make -j
AMD HIP, MI250x
In order to build the code for AMD GPUs, the source code (.cuh
and .cu
only) must be hipified.
CMake files are already prepared to work with HIP and should not need to be modified.
cd <GIT_SOURCE_DIR>
hipify-perl -inplace `find -name *.cuh -o -name *.cu` && find -name *.prehip -delete
CC=mpicc CXX=mpicxx cmake -DCMAKE_HIP_ARCHITECTURES=gfx90a -DGPU_DIRECT=<ON/OFF> <GIT_SOURCE_DIR> && make -j
One particle-particle (P2P) interaction counts as 23 flops, a multipole-particle (M2P) interaction with
spherical hexadecapoles (P=4)
counts as 2 * P^3 = 128
flops. The performance numbers given below only
take P2P and M2P into account. Additional floating point operations due to tree node evaluations
(multipole acceptance criteria, MAC) or warp-padding overheads are not taken into account.
The opening angle theta
was set to 0.5
.
-
1 x NVIDIA A100: 10.4 TFlop/s (FP32) per GPU, 62.2 million particles / second per GPU, 67 million particles total
-
4 x NVIDIA A100: 10.9 TFlop/s (FP32) per GPU, 35.5 million particles / second per GPU, 3 billion particles total
-
1x AMD MI250X: 15.1 TFlops/s (FP32) per GPU (2 GCDs), 60.0 million particles / second per GPU, 67 million particles total
-
4x AMD MI250X: 15.4 TFlops/s (FP32) per GPU (2 GCDs), 50.0 million particles / second per GPU, 3 billion particles total
-
8208x AMD MI250X (LUMI-G): ~107 PFlops/s (FP64), 44.3 million particles / second per GPU (2 GCDs), 8 trillion particles total (in 22.2 seconds) [1]
Note: the multi-rank demonstrator app provided here initializes random particles on all MPI ranks for the same spatial domain. This requires all-to-all communication to construct the sub-domains of each rank and is not feasible for large number of ranks. In order to construct domains for trillions of particles such as in Ref. [1], optimized initialization strategies are required that places particles into the correct sub-domains. This is possible for Space-Filling-Curve (SFC) sorted input files or for in-situ initialization for particle ensembles with known (density) distribution functions. An application front-end that implements this capability, in addition to I/O and a time-stepping loop is available as part of the SPH-EXA project.
The demonstrator apps are configured by default to use an opening angle of theta = 0.5
and
spherical hexadecapole (P=4)
expansions. This yields a 1st-percentile error of ~5e-4
in the accelerations.
$$$ mpiexec -np 8 ./interface/global_forces_gpu
rank 0 1st-percentile acc error 0.000410922, max acc error 0.00267019
rank 1 1st-percentile acc error 0.000501579, max acc error 0.00327092
rank 2 1st-percentile acc error 0.000362208, max acc error 0.00280561
rank 3 1st-percentile acc error 0.000481996, max acc error 0.0251728
rank 4 1st-percentile acc error 0.000579059, max acc error 0.0110242
rank 5 1st-percentile acc error 0.000442119, max acc error 0.00426394
rank 6 1st-percentile acc error 0.000470549, max acc error 0.00187002
rank 7 1st-percentile acc error 0.000527458, max acc error 0.00332407
global reference potential -0.706933, BH global potential -0.706931
[1] S. Keller et al. 2023, Cornerstone: Octree Construction Algorithms for Scalable Particle Simulations
- Sebastian Keller
- Rio Yokota