Building KHARMA
First, be sure to check out all of KHARMA's submodules by running

```bash
$ git submodule update --init --recursive
```
This will grab KHARMA's two main dependencies (as well as some incidental things):
- The Parthenon AMR framework from LANL (accompanying documentation). Note KHARMA actually uses a fork of Parthenon, see here.
- The Kokkos performance-portability library, originally from SNL. Many common questions and problems not covered here can be answered by the Kokkos wiki and tutorials. Parthenon includes a list of the Parthenon-specific wrappers for Kokkos functions in its developer guide.
The dependencies KHARMA needs from the system are the same as Parthenon and Kokkos:
- A C++17 compliant compiler with OpenMP (tested with `gcc` >= 11, Intel `icpc`/`icpx` >= 22, `nvc++` >= 22.7, and `clang++` and derivatives >= 12)
- An MPI implementation
- Parallel HDF5 compiled against this MPI implementation. `make.sh` can compile this for you.
And optionally one of:

- CUDA >= 11.5 and a CUDA-supported C++ compiler
- ROCm >= 5.3
- The most recent Intel oneAPI release (SYCL/oneAPI support is experimental)
If necessary, KHARMA can also be compiled without MPI; with dedication, you might be able to compile it without HDF5. The results of omitting either are quite useless, though.
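Before configuring, it can save time to sanity-check the toolchain your environment currently exposes. A minimal sketch, assuming typical wrapper names (`mpicxx` and `h5pcc` are assumptions here and may be named differently, or absent, on your system):

```bash
# Check the compiler, MPI, and HDF5 visible in the current environment.
g++ --version | head -n1          # or icpx/nvc++/clang++, whichever you intend to use
mpicxx --version | head -n1       # your MPI's C++ compiler wrapper
which h5pcc && h5pcc -showconfig | grep -i parallel   # parallel HDF5, if a module provides it
nvcc --version | tail -n1         # only relevant if building with CUDA
```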
KHARMA uses `cmake` for building, and has a small set of `bash` scripts to handle loading the correct modules and giving the correct arguments to `cmake` on specific systems. Configurations for new machines are welcome; existing example scripts are in `machines/`.
Generally, on systems with a parallel HDF5 module, one can then run the following to compile KHARMA. Note that `clean` here specifies to do a clean build, not to clean an existing one:

```bash
./make.sh clean [hip, cuda]
```
If your system does not have an HDF5 module, KHARMA can attempt to compile one for you. Just add `hdf5` to the arguments of `make.sh`:

```bash
./make.sh clean [hip, cuda] hdf5
```
When switching compilers, architectures, or devices, you may additionally need to add `cleanhdf5`. So, at worst:

```bash
./make.sh clean [hip, cuda] hdf5 cleanhdf5
```
When using KHARMA to compile HDF5, `cmake` will print a scary red error message about the HDF5 folder being a subfolder of the source directory. This can be safely ignored, as all build files are still generated successfully. We'll revisit the HDF5 compile if this becomes a real problem.
After the CMake configuration step has completed successfully at least once (you see `-- Build files have been written to:` and start getting `[X%] Building CXX object` messages), you generally will not need to specify `clean` any longer, which can save you a lot of time. After some reconfigurations, git updates, etc., CMake may want to reconfigure itself and builds without `clean` may start to fail; just specify `clean` again to reset them.
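Putting this together, a typical day-to-day workflow looks something like the sketch below (the `cuda` argument is only an example; use `hip`, or nothing at all, as appropriate for your hardware):

```bash
# First build on a new machine, or after switching compilers/architectures:
./make.sh clean cuda

# Subsequent rebuilds after editing source: skip 'clean' to reuse the existing configuration
./make.sh cuda

# If incremental builds start failing after a git pull or an option change,
# do a clean configure again
./make.sh clean cuda
```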
If you run into issues when compiling, remember to check the "Known Incompatibilities" section of this page, as well as the open issues. If the compile breaks on a supported machine, please open a new issue.
As mentioned above, there are two additional arguments to `make.sh` specifying dependencies:

- `hdf5` will compile a version of HDF5 inline with building KHARMA, using the same compiler and options. This is an easy way to get a compatible and fast HDF5 implementation, at the cost of extra compile time. The HDF5 build may not work on all systems.
- `nompi` will compile without MPI support, for running on a single GPU or CPU.
There are several more useful options:

- `debug` will enable the `DEBUG` flag in the code, and more importantly enable bounds-checking in all Kokkos arrays. Useful for very weird undesired behavior and segfaults. Note, however, that most KHARMA checks, prints, and debugging output are actually enabled at runtime, under the `<debug>` section of the input deck.
- `trace` will print each part of a step to `stderr` as it is being run (technically, anywhere with a `Flag()` call in the code). This is useful for pinning down where segfaults are occurring, without manually bisecting the whole code with print statements.
- `noimplicit` will skip compiling the implicit solver. This is really only useful if an update to Parthenon/Kokkos breaks something in `kokkos-kernels`.
- `nocleanup` will skip compiling the B field cleanup/simulation resizing support. This is useful if a Parthenon update breaks the `BiCGStab` solver.
The most up-to-date option listing can be found at the top of the `make.sh` source. Machine files may provide additional options (e.g. for choosing a compiler with `gcc`, `icc`, etc.) -- read the relevant machine file for those.
When compiling on new machines, it's likely that there are specific quirks, library locations, modules, etc. which are needed to make KHARMA compile and run correctly and efficiently. These can be listed together in a single shell script to make the lives of future users (including yourself) much easier. The goal of a machine file is that `./make.sh clean [cuda, hip]` should work for anyone using the same machine and a default clone of KHARMA.
New machine files should generally start from the examples already present in the `machines/` folder. Let's take a look at one of the more involved scripts, for OLCF's Frontier supercomputer:
```bash
# Config for OLCF Frontier
if [[ $HOST == *".frontier.olcf.ornl.gov" ]]
then
  HOST_ARCH=ZEN3
  DEVICE_ARCH=VEGA90A
  MPI_EXE=srun
  NPROC=64

  if [[ $ARGS == *"hip"* ]]; then
    # HIP compile for AMD GPUs
    if [[ $ARGS == *"cray"* ]]; then
      module load PrgEnv-cray
      module load craype-accel-amd-gfx90a
      module load amd-mixed
    else
      module load PrgEnv-amd
      module load craype-accel-amd-gfx90a
    fi
    module load cray-hdf5-parallel

    if [[ $ARGS == *"hipcc"* ]]; then
      CXX_NATIVE=hipcc
      C_NATIVE=hipcc
      export CXXFLAGS="-I$CRAY_HDF5_PARALLEL_PREFIX/include -L$CRAY_HDF5_PARALLEL_PREFIX/lib -l:libhdf5_parallel.a"
    else
      CXX_NATIVE=CC
      C_NATIVE=cc
      export CXXFLAGS="-noopenmp -mllvm -amdgpu-function-calls=false $CXXFLAGS"
    fi

    # Runtime
    MPI_NUM_PROCS=8
    MPI_EXTRA_ARGS="-c1 --gpus-per-node=8 --gpu-bind=closest"
    export MPICH_GPU_SUPPORT_ENABLED=1
    export FI_CXI_RX_MATCH_MODE=software
  else
    # CPU Compile with the defaults
    MPI_NUM_PROCS=1
  fi
fi
```
This file illustrates all of the design patterns for machine files:

- They should contain a hostname check (all machine files are parsed when compiling or running, so it is important each one is a no-op except on its intended host). Check the results of `hostname -f` on both the login and compute nodes, and write the strictest possible check to avoid interfering with others.
- Machine files can implement their own options, as well as reacting to the `cuda`, `hip`, or other common command-line options. The parsing can be loose, as in this file, so long as options do not contain others as sub-strings.
- Machine files are responsible for loading any non-default modules, to make the compile consistent and easy for different users.
- Machine files can set both internal variables for `make.sh`, and global environment variables such as `CXXFLAGS`.
- Machine files are parsed at compile time and runtime, so they can also set any necessary environment variables for running (so long as you use `run.sh`!). Any arguments passed at compile time are remembered and set identically at runtime.
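As a starting point, a skeleton machine file following these patterns might look like the sketch below. The hostname, module names, and architecture strings are placeholders for a hypothetical cluster, not values that will work anywhere as-is:

```bash
# Config for the hypothetical cluster "examplia" (all names below are placeholders)
if [[ $HOST == *".examplia.example.edu" ]]; then
  # Compile-time settings
  HOST_ARCH=ZEN3                       # CPU architecture string understood by Kokkos
  NPROC=32                             # parallel make jobs on the compile node

  if [[ $ARGS == *"cuda"* ]]; then
    DEVICE_ARCH=AMPERE80               # GPU architecture string understood by Kokkos
    module load cuda hdf5-parallel     # placeholder module names
    # Runtime defaults for run.sh
    MPI_EXE=srun
    MPI_NUM_PROCS=4
    MPI_EXTRA_ARGS="--gpus-per-task=1 --gpu-bind=closest"
  else
    module load hdf5-parallel          # placeholder module name
    MPI_EXE=srun
    MPI_NUM_PROCS=1
  fi
fi
```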
The full list of environment variables understood by `make.sh`:

- `NPROC`: number of `make` jobs to launch in parallel, generally the number of processors on a node. Only used for compiling, not for running! If you want particular numbers of OpenMP threads, set those manually.
- `C_NATIVE`, `CXX_NATIVE`: manually specify the host C and C++ compilers to use. `make.sh` will still call the CUDA compiler & Kokkos wrappers correctly, but will direct them to these compilers under the hood.
- `HOST_ARCH`, `DEVICE_ARCH`: for manually setting the CPU and GPU architectures. Usually not needed anymore unless cross-compiling or optimizing; see the next section.
- `PREFIX_PATH`: for directing CMake to any libraries it does not find automatically, e.g. HDF5. Also uncommon now, thanks to the `hdf5` option.
- `EXTRA_FLAGS`: any more CMake options, for example to override Parthenon compile-time options.
- `MPI_EXE`, `MPI_NUM_PROCS`, `MPI_EXTRA_ARGS`: these control how `run.sh` calls MPI by default. E.g., on Frontier we set `run.sh` to use 8 MPI processes by default, and to bind each of them to a GPU. Then, instead of `srun -n 8 -c1 --gpus-per-node=8 --gpu-bind=closest ./kharma.hip -i input.par`, we can simply use `./run.sh -i input.par`.
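These variables can also be set for a one-off build without touching a machine file, by prefixing the `make.sh` invocation. A sketch (the compiler names and architecture strings are illustrative, and a machine file matching your host may still override them):

```bash
# Override a few make.sh variables from the environment for a single build
NPROC=16 \
C_NATIVE=gcc-12 CXX_NATIVE=g++-12 \
HOST_ARCH=SKX DEVICE_ARCH=VOLTA70 \
./make.sh clean cuda
```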
And of course, machine files can export variables read by CMake, the OS, or the MPI or OpenMP implementation, e.g.:

- `CXXFLAGS`
- `OMP_NUM_THREADS`, `OMP_PLACES`, `OMP_PROC_BIND`
- Anything your computing center tells you to add to your batch scripts, e.g. to work around quirks in Cray's MPI implementation
If `make.sh` sees a file `$HOME/.config/kharma.sh`, it will parse that file instead of using the files in `machines/`. This way, you can write a file for a private machine once, and any copy of KHARMA on that machine will use your custom file without you having to constantly remember to copy it into `machines/`. Private machine files also don't require a hostname check. Here's an example from an M1 Mac with `homebrew`:
```bash
# Kokkos doesn't auto-detect this for some reason
HOST_ARCH=ARMV81
# *actually* use GCC
export PATH="/opt/homebrew/opt/make/libexec/gnubin:$PATH"
PREFIX_PATH="$HOME/Code/kharma/external/hdf5;/opt/homebrew/"
C_NATIVE=/opt/homebrew/bin/gcc-13
CXX_NATIVE=/opt/homebrew/bin/g++-13
CXXFLAGS="-Wl,-ld_classic"
```
Below are various tricks for making KHARMA go faster (or in some cases, fast at all). Remember that there are lots of examples of making KHARMA go fast on existing machines, in `scripts/batch/` and `machines/`. These may be useful when troubleshooting your own performance issues!
The build script `make.sh` defaults to compiling for the architecture of the host CPU (and GPU, if enabled). However, you can manually specify a host and/or device architecture. For example, when compiling for CUDA:

```bash
HOST_ARCH=CPUVER DEVICE_ARCH=GPUVER ./make.sh clean cuda
```

Here `CPUVER` and `GPUVER` are the strings used by Kokkos to denote a particular architecture & set of compile flags, e.g. "SKX" for Skylake-X, "HSW" for Haswell, or "AMDAVX" for Ryzen/EPYC processors, and VOLTA70, TURING75, or AMPERE80 for Nvidia GPUs. A list of a few common architecture strings is provided in `make.sh`, and a full (usually) up-to-date list is kept in the Kokkos documentation. (Note `make.sh` needs only the portion of the flag after `Kokkos_ARCH_`.)
If deploying KHARMA to a machine with GPUs, be careful that the MPI stack you use is CUDA-aware -- this allows direct communication from GPUs to the network without involving the CPU and RAM, which is much faster. The software for achieving this is system-specific and very low-level: there are some troubleshooting notes below, and examples for particular systems in `machines/`, but generally the most reliable option is to contact your system administrators or consult your cluster documentation for help. Generally, KHARMA behaves like any other GPU-aware MPI program -- there's not much we can do to make the process of enabling GPU-aware MPI any easier.
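If your cluster's MPI is OpenMPI, one quick (though not definitive) check is whether the library advertises CUDA support; MPICH-based stacks instead typically rely on an environment variable such as `MPICH_GPU_SUPPORT_ENABLED=1`, as in the Frontier machine file above:

```bash
# OpenMPI only: prints "true" if the library was built with CUDA support
ompi_info --parsable --all | grep mpi_built_with_cuda_support:value
```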
Generally, when running with GPUs enabled, KHARMA does almost no work in parallel across CPU cores. Thus each host process should only require 1-2 CPU cores to perform optimally. However, bad process binding might allocate cores which are not near the accompanying GPUs, or worse, allocate all cores to all MPI ranks, causing contention among all ranks for core 0 while all the others sit idle.
To avoid the latter catastrophe, which can hamper performance by 10x, you can simply refrain from setting `OMP_PROC_BIND` and `OMP_PLACES` (even though Kokkos tells you to do so when starting up KHARMA).
If using `srun` on a modern machine, the options `--gpus-per-task=1 --gpu-bind=closest` may be available to you -- if so, they are the most reliable way to obtain KHARMA's preferred process binding. Any number of CPU cores can be requested with `-c`, and Slurm will ensure that each process receives the cores closest to its assigned GPU.
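For example, a launch using these options might look like the following sketch; the node, rank, and core counts are placeholders for a hypothetical 4-GPU node, and the binary name depends on your build (`kharma.cuda`, `kharma.hip`, etc.):

```bash
# 2 nodes x 4 GPUs each: one MPI rank per GPU, 8 cores per rank, bound to the nearest GPU
srun -N 2 -n 8 -c 8 --gpus-per-task=1 --gpu-bind=closest ./kharma.cuda -i input.par
```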
If those options are not available, it is sometimes preferable to use the `mpirun` command even where `srun` is available. If using `mpirun`, there is an option `--map-by` which controls process mapping across nodes, though not strictly the GPU or core mapping. Various machine files use `--map-by`; for example, the Delta script defaults to `--map-by ppr:4:node:pe=16`, which allocates 4 MPI ranks per node (1 per GPU), each with 16 CPU cores, so as to map the closest CPU cores to each GPU.
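A full `mpirun` line in that style might look like the sketch below, again for a hypothetical 4-GPU, 64-core node; adjust the `ppr` and `pe` counts to your hardware and the binary name to your build:

```bash
# 4 ranks per node, 16 cores ("processing elements") per rank,
# so each rank's cores sit near its GPU
mpirun --map-by ppr:4:node:pe=16 ./kharma.cuda -i input.par
```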
If MPI mapping is insufficient, one can control the visible CPU cores and GPU devices completely manually using wrapper scripts. Hopefully, if this is necessary, your machine's documentation will have an example, or at least a diagram of which CPU cores to pair with which GPU IDs. The scripts related to "Polaris" and "Chicoma," in `bin/` and `scripts/batch/`, will provide two examples if you want to attempt this.
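The core of such a wrapper is usually just a few lines: use the launcher's local-rank variable to restrict each rank to one GPU, then `exec` the real command. A minimal sketch for Slurm, assuming GPU IDs happen to match local rank IDs (real machines often need a hand-written mapping instead):

```bash
#!/bin/bash
# select_gpu.sh -- use as: srun -n 8 ./select_gpu.sh ./kharma.cuda -i input.par
# Expose only one GPU to each MPI rank, chosen by Slurm's local rank ID.
# The identity mapping below is an assumption; check your machine's docs
# for the correct core/GPU pairing.
export CUDA_VISIBLE_DEVICES=$SLURM_LOCALID
exec "$@"
```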
Technically, KHARMA runs most efficiently when each MPI process uses a single CPU core, and no OpenMP parallelism is used. However, since each core must then have a separate simulation block, the resulting mesh will contain hundreds or thousands of meshblocks, which are inefficient to read and process and generally a pain to deal with if one is used to single-block simulations.
Thus for convenience, KHARMA supports OpenMP hybrid parallelism. Generally, one block per NUMA domain (i.e., 1/socket on multi-CPU machines, or sometimes 4/socket on AMD processors) provides the best performance by a small margin. Optimal performance with OpenMP requires setting the `OMP_PROC_BIND` and `OMP_PLACES` variables, along with suitable MPI or Slurm options to mask each MPI process away from the rest of the processor (generally, `--map-by` or some variation, see above). This varies a lot by machine and processor type, and is generally covered in a cluster's documentation (unlike the above GPU+MPI tricks, which seemingly remain arcane almost everywhere).
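As a sketch, a hybrid CPU run on a hypothetical two-socket, 64-core node (one rank per socket, 32 threads each) might be launched as follows; the Slurm flags, thread counts, and binary name will differ on your machine:

```bash
# One MPI rank per socket, 32 OpenMP threads per rank, threads pinned to cores
export OMP_NUM_THREADS=32
export OMP_PROC_BIND=spread
export OMP_PLACES=threads
srun -N 1 -n 2 -c 32 ./kharma.host -i input.par
```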
Generally, the best way to troubleshoot problems/segfaults/slowness that appear with two GPUs (but not one) is to contact your system administrator. The solutions will often be specific to both what MPI implementation is used (OpenMPI vs MPICH vs Cray), and potentially what “transport”/“fabric” is used (UCX vs different OFI implementations). Additionally, it can depend on compiler stacks (with NVHPC being the most reliable) and how Slurm is configured (it can have knowledge of the GPUs or not).
One thing to note when troubleshooting with cluster admins is that KHARMA uses device-side MPI buffers, aka Remote Direct Memory Access (RDMA), sometimes called CUDA-aware/GPU-aware MPI. If the cluster doesn't support this, it can be turned off by adding `-DPARTHENON_ENABLE_HOST_COMM_BUFFERS=ON` to the `EXTRA_FLAGS` variable, described above. Usually, clusters can support RDMA within one node without too much effort (Nvidia's NVHPC stack in particular usually does this well), but setting up RDMA across a network is much more rare -- some ACCESS resources even lack this facility. Within one node, there might be an environment variable one can use to enable the fast one-node transport with RDMA -- look for "shared memory" or "shm".
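For example, to rebuild with host-side communication buffers (the `cuda` argument is illustrative):

```bash
# Stage MPI buffers through host memory instead of using device-side (RDMA) buffers
EXTRA_FLAGS="-DPARTHENON_ENABLE_HOST_COMM_BUFFERS=ON" ./make.sh clean cuda
```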
Remember that there are a ton of examples of running KHARMA on different machines in `scripts/batch/`. Some of those files are basically just wrappers around `run.sh`, in which case all the relevant intelligence is in the associated machine file `machines/name.sh`. KHARMA has been run on nearly every architecture, software stack, and cluster fabric imaginable, from cellphones to the exascale, so it's likely you can get it running in some capacity!
Generally, the compiler versions present on modern supercomputers or operating systems will work fine to compile KHARMA, but be careful using older compilers and software. Here's an incomplete list of known bad combinations:
- CUDA >= 12.2 does not compile Extended MHD correctly in KHARMA 2024.9+. If you are planning to use specifically the `emhd` package on Nvidia (`implicit` by itself is fine), then either use KHARMA 2024.5.1 or CUDA 11.8 or 12.0, which are confirmed working. This is likely a Kokkos issue; you might have luck upgrading it.
- There is also a known issue with new Kokkos versions >= 4.1 and CUDA-aware MPI on Nvidia cards: KHARMA should default to Kokkos 4.0 now to avoid it, but this may pop up when updating Kokkos.
- When compiling with CUDA 11, there can be an internal `nvcc` error, `PHINode should have one entry for each predecessor of its parent basic block!`. CUDA 12 does not show this issue.
- If you attempt to compile KHARMA with a version of CUDA before 11.2, `nvcc` will crash during compilation with the error `Error: Internal Compiler Error (codegen): "there was an error in verifying the lgenfe output!"` (see the relevant Kokkos bug). This is a bug in `nvcc`'s support of `constexpr` in C++17, fixed in 11.2. This appears to be independent of which host compiler is used, but be aware that on some systems, the compiler choice affects which CUDA version is loaded.
- CentOS 7 and derivatives ship an extremely old default version of `gcc` and `libstdc++`. If possible on such machines, load a newer `gcc` as a module, which might bring with it a more recent standard library as well (other compilers, such as `icc` or `clang`, rely on the system version of `libstdc++`, and thus even new versions of these compilers may have trouble compiling KHARMA on old operating systems).
- GCC version 7.3.0 exactly has a bug making it incapable of compiling a particular Parthenon function, fixed in 7.3.1 and 8+. It is, for unfathomable reasons, very widely deployed as the default compiler on various machines, but if any other stack is available it should be preferred. Alternatively, the function contents can be commented out, as the function isn't necessary in order to compile KHARMA.
- NVHPC toolkit versions prior to 23.1 can have one of two issues: 21.3 to 21.7 have trouble compiling Parthenon's C++14 constructs, and 21.9 through 22.11 may try to import a header, `pmmintrinsics.h`, which they cannot compile. The latter is uncommon, but the newest available NVHPC is always preferred.
- Generally only the very most recent release of Intel oneAPI is "supported," which is to say, has any chance of compiling KHARMA. SYCL is still a moving target and impossible to really support without access to working hardware.
- IBM XLC is unsupported in modern versions of KHARMA. YMMV with XLC's C++17 support; check out older KHARMA releases if you need to revert to C++14.
- The `mpark::variant` submodule in `external/variant/` often requires patches to compile on devices (HIP/CUDA/SYCL). These should be automatically applied, but check `external/patches/` for the relevant patch if you encounter errors compiling it.
KHARMA uses a lot of resources per process, and by default uses a lot of processes to compile (`NPROC` in `make.sh` or machine files, which defaults to the total number of threads present on the system). This is generally fine for workstations and single nodes; however, on popular login nodes or community machines you might see the following (e.g. on Frontera):

```
...
icpc: error #10103: can't fork process: Resource temporarily unavailable
make[2]: *** [external/parthenon/src/CMakeFiles/parthenon.dir/driver/multistage.cpp.o] Error 1
...
```
This means that `make` can't fork new compile processes, which of course ruins the compile. You can find a less popular node (e.g. with a development job), turn down the `NPROC` variable at the top of `make.sh`, or wait until the node is not so in-demand.
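For example, to limit a single build to a handful of parallel jobs (the `cuda hdf5` arguments are just an example):

```bash
# Compile with only 4 parallel jobs to stay under login-node process limits
NPROC=4 ./make.sh cuda hdf5
```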
When compiling a version of KHARMA that is also being run, the OS will (logically) not replace the binary file `kharma.x` being used by the running program. The error is usually something like `cp: cannot create regular file '../kharma.host': Text file busy`. To correct this, invoke `make.sh` again when the run is finished or stopped (or manually run `cp build/kharma/kharma.host .`).