Kokkos
Kokkos is a programming model aiming to provide performance portability -- that is, code which can be written once, and compiled to make use of a variety of architectures in an efficient way. This means more than just compiling for any architecture, as what was efficient for CPUs is rarely efficient on GPUs, and vice versa. Thus Kokkos improves on straight portability frameworks like OpenACC, by adapting not just the compilation process but the whole implementation.
"Adapting the implementation" is naturally complex -- it involves abstracting operations, so that the programmer no longer determines things like the layout of arrays in memory, or the particular ordering of loops. Instead, these things (along with the syntax transformations for e.g. CUDA vs OpenMP) are handled by Kokkos, through the use of C++ templates.
If this sounds hacky, that's because ultimately it is. Kokkos stretches even an already-powerful language like C++ to its limit, and benefits from the newer C++ standards (C++17 in KHARMA, and additionally from C++20). Ultimately the point is to hide all that complexity: Kokkos lets you write what you want to happen, then uses C++ templates to transform that request into the details of what actually happens. Generally, when compiling KHARMA, if you get long, winding errors involving templates, they are issues with how you are using Kokkos.
Luckily the portion of the Kokkos API that we use is very simple, so there are a limited number of potential errors we'll cover at the end of this article. Generally, if you keep good hygiene about how you are calling loops, your code should end up with a structure very similar to a standard C for loop, with the advantage that it will be able to compile efficiently for any architecture.
The major caveat to the above statement, and the most confusing part of working with Kokkos, is the extensive use of so-called "lambda" functions. These are functions defined on the fly inside other code -- they act like variables. That is, they can be named, assigned, defined, and changed like variables. Most importantly for us, though, they can be passed to functions. When using Kokkos, the body of every loop is a lambda function, which is passed to the enclosing function Kokkos::parallel_for (or its Parthenon wrapper, pmb->par_for). This enclosing function handles copying to the device anything the function needs, and running the function over the given range of parameters.
The great feature of lambda functions, for us, is their ability to "capture" names from the enclosing space. That is, if I define a variable `gam` inside my (host-side, enclosing) function, I can use it both in the function code (which runs on the CPU or "host"), and in the lambda code (which might be running on the GPU, or "device"). This happens without ever explicitly indicating that I want it to be moved to the GPU! This comes as a relief to veterans of OpenMP offloading, OpenACC, and CUDA, which require very explicit indications of what should happen to enclosing variables when a parallel section begins.
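To make the capture concrete, here is a minimal Kokkos sketch (purely illustrative names, not KHARMA code): the host-side scalar `gam` is captured by value into the device lambda, with no explicit transfer.

```c++
#include <Kokkos_Core.hpp>

// Illustrative only: scale an array by a host-defined scalar
void scale_example(Kokkos::View<double*> u, const int n)
{
    const double gam = 5./3.;  // defined on the host...
    Kokkos::parallel_for("scale_u", n,
        KOKKOS_LAMBDA(const int i) {
            u(i) *= gam;       // ...used on the device, with no explicit copy
        }
    );
}
```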
Copying a lot of data to and from the GPU can be slow, but KHARMA always avoids copying large, grid-based data. Only individual scalars and small structs are ever copied directly, with all the important data left resident on the GPU at all times, in the form of Kokkos's multidimensional arrays, `View`s.
`View` objects are host-side handles for device-side N-dimensional arrays. They are both a way of organizing device memory (by providing a function `()` to access a buffer like an N-dimensional array in e.g. `numpy`) and a way of managing memory (in that they are reference-counted, and deallocate the attached GPU memory when destroyed). There are host-side `View` objects as well, but KHARMA does not make much use of them. Generally, `View`s in KHARMA can only be accessed on the device. However, as long as you access them only within `par_for` or `par_reduce` loops, you should be able to forget you're using anything more complex than a `numpy` `ndarray` object. Full documentation on `View`s is here.
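As a minimal illustration (not KHARMA code; the names and shape are invented), the following sketch allocates a 3D `View`, fills it inside a `parallel_for`, and relies on reference counting to free the memory:

```c++
#include <Kokkos_Core.hpp>

// Illustrative only: allocate, fill, and automatically free a 3D View
void view_example()
{
    // Allocates device memory for an array of shape (4, 8, 8), labeled "rho"
    Kokkos::View<double***> rho("rho", 4, 8, 8);

    Kokkos::parallel_for("fill_rho", 4,
        KOKKOS_LAMBDA(const int k) {
            for (int j = 0; j < 8; ++j)
                for (int i = 0; i < 8; ++i)
                    rho(k, j, i) = 1.0;  // numpy-like indexing; the layout is chosen by Kokkos
        }
    );
    // The device memory is freed when the last copy of rho goes out of scope
}
```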
Let's look at an example function, `Flux::AddGeoSource` from `kharma/flux/flux.cpp`. This function adds the source term on the RHS of the GRMHD equations. In most HARM implementations, this is done as a single step alongside the calculation of the flux divergence at zone faces. However, to maintain flexibility in KHARMA, we use a Parthenon function to update the zone centers (`Update::FluxDivergence`), and then apply the source term afterward in `Flux::AddGeoSource`. It simply applies one function to each zone, making it an ideal starting case to explain the basic conventions involved in building a Kokkos kernel in KHARMA.

Don't worry about all of the names here just yet. We'll return to this function and describe the Parthenon pieces (`TaskStatus`, `MeshData`, etc.) later on.
```c++
TaskStatus GRMHD::AddSource(MeshData<Real> *md, MeshData<Real> *mdudt)
{
    FLAG("Adding GRMHD source");
    // Pointers
    auto pmesh = md->GetMeshPointer();
    auto pmb0 = md->GetBlockData(0)->GetBlockPointer();
    // Options
    const Real gam = pmb0->packages.Get("GRMHD")->Param<Real>("gamma");

    // Pack variables
    PackIndexMap prims_map, cons_map;
    auto P = GRMHD::PackMHDPrims(md, prims_map);
    auto dUdt = GRMHD::PackMHDCons(mdudt, cons_map);
    const VarMap m_u(cons_map, true), m_p(prims_map, false);

    // Get sizes
    IndexDomain domain = IndexDomain::interior;
    auto ib = md->GetBoundsI(domain);
    auto jb = md->GetBoundsJ(domain);
    auto kb = md->GetBoundsK(domain);
    auto block = IndexRange{0, P.GetDim(5)-1};

    const auto& G = dUdt.coords;

    pmb0->par_for("grmhd_source", block.s, block.e, kb.s, kb.e, jb.s, jb.e, ib.s, ib.e,
        KOKKOS_LAMBDA_MESH_3D {
            // Then calculate and add the GRMHD source term
            FourVectors Dtmp;
            GRMHD::calc_4vecs(G(b), P(b), m_p, k, j, i, Loci::center, Dtmp);
            GRMHD::add_source(G(b), P(b), m_p, Dtmp, gam, k, j, i, dUdt(b), m_u);
        }
    );

    FLAG("Added");
    return TaskStatus::complete;
}
```
The big thing here is the call `pmb0->par_for()`, which maps to the Kokkos function `Kokkos::parallel_for()`. Parthenon wraps it as a part of the class `MeshBlockData`, so that each meshblock can be given a different "stream" -- a limited way to run functions on different blocks in parallel on the GPU, which is occasionally useful but isn't being used here. Let's unpack the call a bit.

The first argument is a string, `"grmhd_source"`, which simply names the function. Kokkos and Parthenon allow naming both functions and arrays, which is surprisingly useful, as profiling and error messages can include these names. Generally, KHARMA uses snake_case descriptive names mirroring the enclosing function (e.g. `"grmhd_source"` for `GRMHD::AddSource`). Since many functions have more than one kernel, these names cannot always map 1:1 onto function names. Also note that while naming two kernels with the same string will not produce a compiler error, it is incredibly confusing and interferes with profiling, so please be careful when copy/pasting Kokkos kernel calls.

Next, the call includes some even number of integers, representing the desired index bounds. These are inclusive -- that is, serially, the last index of this call would map to the construction `for (int i = ib.s; i <= ib.e; ++i)`. As Parthenon arrays are zero-indexed with inclusive bounds, a full-array loop over a dimension of size N1 would look like `for (int i = 0; i <= N1-1; ++i)`. This will require some adjustment for people used to C conventions based around `<`, but a standard had to be chosen, and this was it.
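Spelled out serially, the bounds in the `par_for` call above correspond (conceptually -- Kokkos decides the actual execution order and mapping to threads) to four nested loops with inclusive upper bounds:

```c++
// Conceptual serial equivalent of the par_for bounds above (a sketch only)
for (int b = block.s; b <= block.e; ++b)
    for (int k = kb.s; k <= kb.e; ++k)
        for (int j = jb.s; j <= jb.e; ++j)
            for (int i = ib.s; i <= ib.e; ++i) {
                // the lambda body runs here with indices (b, k, j, i)
            }
```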
The next argument is a lambda function, consisting of a signature (`KOKKOS_LAMBDA_MESH_3D`) and a body (everything between the curly braces). In KHARMA, we name the various possible function signatures using descriptive macros defined near the end of `decs.hpp`, and I'd recommend consulting that file quickly to get a sense of the naming conventions for indices[^1]. Generally, the last three indices run over the mesh in the order `N3`, `N2`, `N1`, as the indices `k`, `j`, `i` respectively. Primitives or variables are indexed with `p`, vectors are indexed with `mu`, and `MeshBlock`s inside a `MeshBlockPack` are indexed with `b` (more on these later). In this case, we want a loop over the entire `MeshData` in 3D (this reduces automatically in 1D and 2D, where the unused ranges simply collapse, i.e. `kb.s == kb.e` and/or `jb.s == jb.e`). This gives us a function with four indices: `b`, `k`, `j`, `i`.
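The real signature macros live in `decs.hpp`; based on the conventions above (and footnote 1), the 3D mesh version presumably expands to something like this sketch:

```c++
// Sketch only -- see decs.hpp for the actual definitions
#define KOKKOS_LAMBDA_MESH_3D KOKKOS_LAMBDA (const int& b, const int& k, const int& j, const int& i)
```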
Following the declaration is the lambda function body, which can make use of any variables from the enclosing function, as long as the names are valid -- that is, in order to use the coordinate object `G` inside the lambda function, it must first be defined in the enclosing function. This admittedly leads to a lot of boilerplate, and makes it very important to keep to standard naming schemes for variables! That is, if we remember to always call the coordinates `G`, the primitives `P`, and so on, KHARMA will stay much more readable. This will be referred to as "pulling out" names, as generally it consists of defining standard names for pieces of the fluid state variable passed to each function in KHARMA. Luckily, for similar operations you can copy/paste existing "preamble" code and generally have the variables you need. Don't worry about the types used here (or the extensive use of `auto`[^2]) just yet -- these are all pieces of Parthenon, which I'll describe in the following section.

Finally, note that function calls inside the lambda are in `snake_case`. These calls can only be made to special functions defined in KHARMA's headers. More on the overall structure later, but you should remember that anything you call inside lambda functions (that is, on the device) should be in `snake_case`, and anything you call outside, on the host, should be in `CamelCase`. A full list of such KHARMA-specific conventions can be found on the wiki here.
Finally, I should mention another feature, though it's not present in this function: Kokkos calls can also perform reductions (e.g. sum, max, etc). These are indicated by calling `pmb->par_reduce()` instead of `par_for`, and adding a Kokkos reducer (e.g. `Kokkos::Sum`) as the last argument. The lambda function macros for these operations end in `_REDUCE` and add an argument `local_result`, to which the lambda should assign values that fit the reduction. This is simpler than it seems: an example can be found in the function `CheckNaN` in `debug.cpp`, and a more advanced example in `GRMHD::EstimateTimestep` in `grmhd/grmhd.cpp`.
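For a feel of the mechanics, here is a minimal pure-Kokkos sketch of a max reduction (illustrative names, and the bare `Kokkos::parallel_reduce` rather than the Parthenon `par_reduce` wrapper); the `_REDUCE` macros supply the extra `local_result` argument shown here:

```c++
#include <Kokkos_Core.hpp>

// Illustrative only: find the maximum value in an array
double max_example(Kokkos::View<double*> ctop, const int n)
{
    double result = 0.;
    Kokkos::parallel_reduce("max_ctop", n,
        KOKKOS_LAMBDA(const int i, double& local_result) {
            // Assign to local_result in a way that "fits" the reduction:
            // for a max, keep the largest value this thread has seen
            if (ctop(i) > local_result) local_result = ctop(i);
        }
    , Kokkos::Max<double>(result));
    return result;
}
```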
[^1]: For anyone curious, the bare `KOKKOS_LAMBDA` macro used in these definitions now maps to a standard C++ capture-by-value lambda, `[=]`. There's nothing magic in these definitions, just syntactic sugar to avoid me typing `const` another 400+ times.

[^2]: The use of `auto` for all of these definitions ensures that KHARMA can accommodate name and structure changes to Parthenon's types more easily, as long as new types share the same usage conventions (e.g. `operator()`) as old types.
While there is a lot of complexity in `GRMHD::AddSource`, it doesn't, you know, actually add the source term. It exists purely to efficiently map the function `GRMHD::add_source` to every element of a whole mesh. This is common in KHARMA, as it separates the boilerplate C++ code from the more C-like physics implementations. For completeness, I'll go over how a device-side function is implemented in KHARMA, including both the parts dealing with Kokkos generally, and the pieces specific to KHARMA.
```c++
KOKKOS_INLINE_FUNCTION void add_source(const GRCoordinates &G, const VariablePack<Real>& P, const VarMap& m_p, const FourVectors& D,
                                        const Real& gam, const int& k, const int& j, const int& i,
                                        const VariablePack<Real>& dUdt, const VarMap& m_u)
{
    // Get stuff we don't want to recalculate every loop iteration
    // This is basically a manual version of GRMHD::calc_tensor but saves recalculating e.g. dot(bcon, bcov) 4 times
    Real pgas = (gam - 1) * P(m_p.UU, k, j, i);
    Real bsq = dot(D.bcon, D.bcov);
    Real eta = pgas + P(m_p.RHO, k, j, i) + P(m_p.UU, k, j, i) + bsq;
    Real ptot = pgas + 0.5 * bsq;

    // Contract mhd stress tensor with connection, and multiply by metric determinant
    Real new_du[GR_DIM] = {0};
    DLOOP2 {
        Real Tmunu = (eta * D.ucon[mu] * D.ucov[nu] +
                      ptot * (mu == nu) -
                      D.bcon[mu] * D.bcov[nu]);
        for (int lam = 0; lam < GR_DIM; ++lam)
            new_du[lam] += Tmunu * G.conn(j, i, nu, lam, mu);
    }

    dUdt(m_u.UU, k, j, i) += new_du[0] * G.gdet(Loci::center, j, i);
    VLOOP dUdt(m_u.U1 + v, k, j, i) += new_du[1 + v] * G.gdet(Loci::center, j, i);
}
```
Since this function doesn't rely on too much from Parthenon, we can describe it more or less in its entirety using what you already know. If you want to follow along with the physics, `Tmunu` corresponds to the MHD stress-energy tensor given for example in Gammie et al. (2003) eqn. (11), and the function overall computes the source term at the end of eqn. (4).
We start with the declaration: `KOKKOS_INLINE_FUNCTION void fname()`. The macro is defined by Kokkos, and ensures that this function is compiled/defined for both the host and device. When compiling for CUDA, this means declaring the function `__host__ __device__`, and of course also `inline`. All such functions in KHARMA are declared in header files, to provide the best opportunity for compiler optimizations and inlining. Generally in KHARMA, we never use functions declared `KOKKOS_INLINE_FUNCTION` on the host side, but it is possible to do so.
Other options for this declaration:

- `KOKKOS_FUNCTION`, which removes the `inline` keyword, allowing the function to be called between object files on the device side. This is generally not recommended.
- `KOKKOS_FORCEINLINE_FUNCTION`, which adds the appropriate compiler tags to force inlining of the following function. May be necessary in places for performance.
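For concreteness, here is a minimal sketch of how a device-side helper is typically declared in a KHARMA-style header (the file and function names are hypothetical):

```c++
// some_package/some_functions.hpp (hypothetical file)
#pragma once
#include "decs.hpp"

// Compiled for both host and device, and inlined: the KHARMA default
KOKKOS_INLINE_FUNCTION Real square(const Real& x) { return x * x; }

// Alternatives described above:
// KOKKOS_FUNCTION             -> no 'inline'; callable across object files on the device
// KOKKOS_FORCEINLINE_FUNCTION -> inlining forced via compiler-specific attributes
```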
I'll provide some further detail on the arguments as we get into KHARMA specifics later on, but for now, notice there are three main kinds of arguments:

- `VariablePack` objects. These act like the 4D arrays `U`, `P` from `iharm3D` -- the first index indicates a variable, the latter three a zone in the mesh. However, they are constructed on the fly by calls like `PackMHDPrims`, which you might have seen in the host-side function above. They are indexed with `(p, k, j, i)`, which you can think of as an equivalent of `[p][k][j][i]` for Kokkos arrays: the function call syntax allows the memory layout to be flexible (and allows bounds-checking in debug builds!).
- `VarMap` structs. The `VarMap` is a KHARMA-specific struct for indexing where all the variables in a `VariablePack` actually ended up -- that is, it's a collection of indices just like the macros `RHO`, `UU`, etc. in `iharm3D`, except that in KHARMA these indices have to be variables, because packs are constructed on the fly. As you can glean from the function definitions, the members of `VarMap` have their old, macro-style names. You can find the full list of possible names in the file `types.hpp`. Note that any variable not present in the pack will have index -1.
- `FourVectors` structs. A struct of four arrays, each four elements long: `ucon`, `ucov`, `bcon`, `bcov`, representing what you'd expect.
The remaining arguments are the specific heat ratio `gam` and the zone `k`, `j`, `i` on which to operate.
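To tie these together, here is a sketch of a hypothetical device-side helper (not a function in KHARMA) using the same argument types and indexing conventions as `add_source` above:

```c++
// Hypothetical helper, mirroring lines from add_source above
KOKKOS_INLINE_FUNCTION Real total_pressure(const VariablePack<Real>& P, const VarMap& m_p,
                                           const FourVectors& D, const Real& gam,
                                           const int& k, const int& j, const int& i)
{
    // VarMap members are runtime pack indices, replacing iharm3D's RHO/UU macros
    const Real pgas = (gam - 1) * P(m_p.UU, k, j, i);  // P(p, k, j, i) ~ prims[p][k][j][i]
    const Real bsq  = dot(D.bcon, D.bcov);             // FourVectors members are plain 4-element arrays
    return pgas + 0.5 * bsq;                           // gas plus magnetic pressure
}
```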
The main thing to note (hopefully) is that the implementation of device-side functions looks pretty similar to the C-based functions in e.g. `iharm3D` or `iharm2d_v4`[^3]. Except for the translations `[]` -> `()`, `RHO` -> `m_p.RHO`, `NDIM` -> `GR_DIM`, etc., the operations remain the same as in those codes.
The macros `DLOOPN` are defined in `decs.hpp`, and loop the variables `mu`, `nu`, `lam`, `kap` through indices 0-3, in that order. `VLOOP` does the same for indices 0-2 on the index `v` (in addition to adhering to the usual convention of Greek indices for 4-vectors and Latin for 3-vectors, I've tried to indicate 3-vector loops with "V" and 4-vector loops with "D" where possible).
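Based on that description, the loop macros presumably read roughly as follows (a sketch only; the real definitions are in `decs.hpp` and may differ in detail):

```c++
// Sketch only -- loops over 4-vector indices mu, nu...
#define DLOOP1 for (int mu = 0; mu < GR_DIM; ++mu)
#define DLOOP2 DLOOP1 for (int nu = 0; nu < GR_DIM; ++nu)
// ...and a 3-vector loop over v = 0, 1, 2
#define VLOOP  for (int v = 0; v < 3; ++v)
```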
[^3]: In this case, I've manually inlined the call to `mhd_calc` to save a few operations, so the effect is a bit less dramatic. Comparing e.g. `grmhd/mhd_functions.hpp` vs `phys.c` should give a clearer picture of the translations (un)necessary here.
For more advanced Kokkos usage, there is a set of tutorials at the kokkos-tutorials GitHub page, working through Kokkos features in a more studied and comprehensive way than I can do here.