KOKKOS Branch

Luke Shulenburger edited this page Jun 10, 2019 · 8 revisions

Build instructions

To build with Kokkos, first clone https://github.com/kokkos/kokkos.git. Let's assume you've placed the top-level Kokkos directory in ${KOKKOS_ROOT}. Then navigate to the miniqmc/build directory. The following cmake command builds for a POWER8-based CPU with OpenMP threading, assuming the GCC compiler is used:

> cmake -DQMC_USE_KOKKOS=1 \
      -DKOKKOS_PREFIX=${KOKKOS_ROOT} \
      -DKOKKOS_ARCH="Power8" \
      -DKOKKOS_ENABLE_OPENMP=true \
      -DKOKKOS_ENABLE_EXPLICIT_INSTANTIATION=false \
      -DCMAKE_CXX_FLAGS="-Drestrict=__restrict__ -D__forceinline=inline" .. 

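The CMAKE_CXX_FLAGS definitions handle the "restrict" and "__forceinline" keywords that appear in miniqmc: -Drestrict=__restrict__ maps restrict to the spelling GCC accepts, and -D__forceinline=inline replaces __forceinline with plain inline. For other compilers, consult the compiler's reference for the analogous keywords.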

For CUDA, assuming a POWER8 host and an NVIDIA P100 card, we use the following:

> cmake -DQMC_USE_KOKKOS=1 \
      -DKOKKOS_PREFIX=${KOKKOS_ROOT} \
      -DKOKKOS_ENABLE_CUDA=true \
      -DKOKKOS_ENABLE_OPENMP=false \
      -DKOKKOS_ARCH="Power8;Pascal60" \
      -DKOKKOS_ENABLE_CUDA_UVM=true \
      -DKOKKOS_ENABLE_CUDA_LAMBDA=true \
      -DKOKKOS_ENABLE_EXPLICIT_INSTANTIATION=false \
      -DCMAKE_CXX_COMPILER=${KOKKOS_ROOT}/bin/nvcc_wrapper \
      -DCMAKE_CXX_FLAGS="-Drestrict=__restrict__ -D__forceinline=inline " .. 

KOKKOS_ENABLE_CUDA=true and KOKKOS_ENABLE_CUDA_UVM=true must both be set. Notice also that CMAKE_CXX_COMPILER points to the Kokkos nvcc wrapper rather than to a host compiler.
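
Because the CUDA configure step depends on the wrapper script shipped inside the Kokkos checkout, it can help to confirm the script is present before running cmake. The following is a minimal sketch, not part of miniqmc or Kokkos: the function name check_nvcc_wrapper is hypothetical, and it assumes only that the wrapper lives at bin/nvcc_wrapper under the Kokkos root, as in the command above.

```shell
# Hypothetical helper: succeed only if the Kokkos nvcc_wrapper script
# exists under the given Kokkos root and is executable.
check_nvcc_wrapper() {
    wrapper="$1/bin/nvcc_wrapper"
    if [ -x "$wrapper" ]; then
        echo "using $wrapper"
        return 0
    fi
    echo "nvcc_wrapper not found or not executable: $wrapper" >&2
    return 1
}
```

A possible use is to run check_nvcc_wrapper "${KOKKOS_ROOT}" first and start the cmake configure only if it succeeds.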

Runtime Instructions

OpenMP

It is recommended that the following environment variables be set at run time:

  • export OMP_PROC_BIND=spread
  • export OMP_PLACES=threads
  • export OMP_NUM_THREADS=[put available/desired number of threads here]
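
The settings above can be collected in a small snippet that is sourced before launching miniqmc. This is only a sketch: the NTHREADS variable is hypothetical, and falling back to the number of online logical CPUs (via getconf) is an assumption made for illustration, not a recommendation from this page. On machines with hardware multithreading that count may be higher than the number of threads you actually want.

```shell
# Bind threads and spread them across the machine, as recommended above.
export OMP_PROC_BIND=spread
export OMP_PLACES=threads

# Assumption: if the (hypothetical) NTHREADS variable is not set by the
# caller, fall back to the number of online logical CPUs.
NTHREADS="${NTHREADS:-$(getconf _NPROCESSORS_ONLN)}"
export OMP_NUM_THREADS="$NTHREADS"

echo "OMP_NUM_THREADS=$OMP_NUM_THREADS OMP_PROC_BIND=$OMP_PROC_BIND OMP_PLACES=$OMP_PLACES"
```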

Moreover, for asynchronous multi-walker moves using the partition_master construct, nested threading must be enabled. This is done with the following:

  • export OMP_NESTED=true
  • export OMP_NUM_THREADS=[comma separated list]

See https://gcc.gnu.org/onlinedocs/libgomp/OMP_005fNUM_005fTHREADS.html for the format of the comma-separated list.
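
As an illustration of the comma-separated form, the list gives one thread count per nesting level, outermost first. The sketch below uses made-up values (4 outer threads, each with 2 inner threads); the variable names OUTER_THREADS and INNER_THREADS are hypothetical, and the right split for a given machine is not specified on this page. Note that newer OpenMP runtimes deprecate OMP_NESTED in favour of OMP_MAX_ACTIVE_LEVELS; the sketch keeps the variable named above.

```shell
# Hypothetical split: 4 threads at the outer level, 2 at the inner level.
OUTER_THREADS=4
INNER_THREADS=2

# Enable nested parallel regions.
export OMP_NESTED=true
# One entry per nesting level, outermost first.
export OMP_NUM_THREADS="${OUTER_THREADS},${INNER_THREADS}"

echo "OMP_NUM_THREADS=$OMP_NUM_THREADS"   # prints OMP_NUM_THREADS=4,2
```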

Instructions for the new global_batched_kokkos branch

To build, clone Kokkos and check out its develop branch, then clone miniqmc and check out its global_batched_kokkos branch. For example:

> mkdir kokkos
> cd kokkos
> git clone https://github.com/kokkos/kokkos.git .
> git checkout develop
> cd ..
> mkdir miniqmc
> cd miniqmc
> git clone https://github.com/QMCPACK/miniqmc.git .
> git checkout global_batched_kokkos
> cd build

For a CPU build, first identify the architecture of the CPU you will use. In the list below, the Kokkos architecture name is on the left and the corresponding CPU type is on the right.

AMDAVX	AMD CPU
ARMv80	ARMv8.0 Compatible CPU
ARMv81	ARMv8.1 Compatible CPU
ARMv8-ThunderX	ARMv8 Cavium ThunderX CPU
BGQ	IBM Blue Gene Q
Power7	IBM POWER7 and POWER7+ CPUs
Power8	IBM POWER8 CPUs
Power9	IBM POWER9 CPUs
WSM	Intel Westmere CPUs
SNB	Intel Sandy/Ivy Bridge CPUs
HSW	Intel Haswell CPUs
BDW	Intel Broadwell Xeon E-class CPUs
SKX	Intel Sky Lake Xeon E-class HPC CPUs (AVX512)
KNC	Intel Knights Corner Xeon Phi
KNL	Intel Knights Landing Xeon Phi

Now use this architecture name in a cmake command such as the following, after setting KOKKOS_ROOT to the directory containing your Kokkos checkout:

cmake -DQMC_USE_KOKKOS=1 \
      -DQMC_MIXED_PRECISION=1 \
      -DKOKKOS_PREFIX=${KOKKOS_ROOT} \
      -DKOKKOS_ENABLE_CUDA=false \
      -DKOKKOS_ENABLE_OPENMP=true \
      -DKOKKOS_ARCH="SKX" \
      -DKOKKOS_ENABLE_EXPLICIT_INSTANTIATION=false \
      -DCMAKE_CXX_FLAGS="-Drestrict=__restrict__ -D__forceinline=inline" \
      -DCMAKE_CXX_COMPILER="icpc" \
      ..

Replace SKX with the appropriate architecture name from the list above. Now build with:

make miniqmc_sync_move_noref

The code can now be run as:

bin/miniqmc_sync_move_noref -g "2 1 1" -n 5 -r 0.99 -16
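
To avoid hand-editing the architecture name each time, the CPU configure command above can be generated from a small shell function. This is a sketch under stated assumptions: print_cpu_configure is a hypothetical helper, it only prints the command rather than running it, and defaulting the compiler to g++ is an illustrative choice (the example above uses icpc). The options themselves are copied from the command on this page.

```shell
# Hypothetical helper: print (do not run) the CPU configure command
# for one architecture name from the list above.
print_cpu_configure() {
    arch="$1"              # e.g. SKX, HSW, Power9
    compiler="${2:-g++}"   # assumption: g++ unless a compiler is given
    # ${KOKKOS_ROOT} is printed literally so it expands when the
    # generated command is eventually run.
    cat <<EOF
cmake -DQMC_USE_KOKKOS=1 \\
      -DQMC_MIXED_PRECISION=1 \\
      -DKOKKOS_PREFIX=\${KOKKOS_ROOT} \\
      -DKOKKOS_ENABLE_CUDA=false \\
      -DKOKKOS_ENABLE_OPENMP=true \\
      -DKOKKOS_ARCH="${arch}" \\
      -DKOKKOS_ENABLE_EXPLICIT_INSTANTIATION=false \\
      -DCMAKE_CXX_FLAGS="-Drestrict=__restrict__ -D__forceinline=inline" \\
      -DCMAKE_CXX_COMPILER="${compiler}" \\
      ..
EOF
}

# Example: show the command for a Haswell CPU with the default compiler.
print_cpu_configure HSW
```

Printing instead of executing keeps the helper safe to try anywhere; the output can be reviewed and then pasted into the miniqmc/build directory.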

A GPU build is similar. The available GPU architecture names include:

Kepler30	NVIDIA Kepler generation CC 3.0
Kepler32	NVIDIA Kepler generation CC 3.2
Kepler35	NVIDIA Kepler generation CC 3.5
Kepler37	NVIDIA Kepler generation CC 3.7
Maxwell50	NVIDIA Maxwell generation CC 5.0
Maxwell52	NVIDIA Maxwell generation CC 5.2
Maxwell53	NVIDIA Maxwell generation CC 5.3
Pascal60	NVIDIA Pascal generation CC 6.0
Pascal61	NVIDIA Pascal generation CC 6.1
Volta70	NVIDIA Volta generation CC 7.0
Volta72	NVIDIA Volta generation CC 7.2
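
If you know your card's CUDA compute capability (CC), the list above can be read as a lookup from CC to Kokkos architecture name. The sketch below encodes exactly the entries listed on this page; the function name cc_to_kokkos_arch is hypothetical, and how you obtain the compute capability (for example from NVIDIA's documentation for your card) is left to you.

```shell
# Hypothetical helper: map a CUDA compute capability to the Kokkos
# architecture name, using only the entries listed above.
cc_to_kokkos_arch() {
    case "$1" in
        3.0) echo Kepler30 ;;
        3.2) echo Kepler32 ;;
        3.5) echo Kepler35 ;;
        3.7) echo Kepler37 ;;
        5.0) echo Maxwell50 ;;
        5.2) echo Maxwell52 ;;
        5.3) echo Maxwell53 ;;
        6.0) echo Pascal60 ;;
        6.1) echo Pascal61 ;;
        7.0) echo Volta70 ;;
        7.2) echo Volta72 ;;
        *)   echo "no architecture listed for compute capability $1" >&2
             return 1 ;;
    esac
}

# Example: a P100 has compute capability 6.0.
cc_to_kokkos_arch 6.0   # prints Pascal60
```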

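On a POWER8 + P100 system, use the following, replacing the author-specific paths in KOKKOS_PREFIX and CMAKE_CXX_COMPILER with the location of your own Kokkos checkout: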

cmake -DQMC_USE_KOKKOS=1 \
      -DQMC_MIXED_PRECISION=1 \
      -DKOKKOS_PREFIX=/ascldap/users/lshulen/new-sandbox/kokkos \
      -DKOKKOS_ENABLE_CUDA=true \
      -DKOKKOS_ENABLE_OPENMP=false \
      -DKOKKOS_ARCH="Power8;Pascal60" \
      -DKOKKOS_ENABLE_CUDA_UVM=false \
      -DKOKKOS_ENABLE_CUDA_LAMBDA=true \
      -DKOKKOS_ENABLE_DEBUG=true \
      -DCMAKE_CXX_COMPILER=/ascldap/users/lshulen/sandbox/kokkos/bin/nvcc_wrapper \
      -DCMAKE_CXX_FLAGS="-G0 -g -Drestrict=__restrict__ -D__forceinline=inline " ..

From there, the code is run in the same way as for the CPU build.