
Purpose

Kokkos is a programming model aiming to provide performance portability -- that is, code which can be written once and compiled to run efficiently on a variety of architectures. This means more than just compiling for any architecture: what is efficient on CPUs is rarely efficient on GPUs, and vice versa. Thus Kokkos improves on straight portability frameworks like OpenACC by adapting not just the compilation process but the whole implementation.

"Adapting the implementation" is naturally complex -- it involves abstracting operations, so that the programmer no longer determines things like the layout of arrays in memory, or the particular ordering of loops. Instead, these things (along with the syntax transformations for e.g. CUDA vs OpenMP) are handled by Kokkos, through the use of C++ templates.

If this sounds hacky, that's because ultimately it is. Kokkos stretches even an already-powerful language like C++ to its limit, and benefits from the newer C++ standards (C++17 in KHARMA, with further gains under C++20). Ultimately the point is to hide all that complexity: Kokkos lets you write what you want to happen, then uses C++ templates to transform that request into the details of what actually happens. Generally, when compiling KHARMA, if you get long, winding errors involving templates, they are issues with how you are using Kokkos.

Luckily the portion of the Kokkos API that we use is very simple, so there are a limited number of potential errors, which we'll cover at the end of this article. Generally, if you keep good hygiene about how you call loops, your code should end up with a structure very similar to a standard C for loop, with the advantage that it will compile efficiently for any architecture.

Lambdas

The major caveat to the above statement, and the most confusing part of working with Kokkos, is the extensive use of so-called "lambda" functions. These are functions defined on the fly inside other code, and they act like variables: they can be named, assigned, defined, and changed just as variables can. Most importantly for us, though, they can be passed to other functions. When using Kokkos, the body of every loop is a lambda function, which is passed to the enclosing function Kokkos::parallel_for (or its Parthenon wrapper, pmb->par_for). This enclosing function handles copying to the device anything the lambda needs, and running the lambda over the given range of parameters.
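
If lambdas are new to you, here is a minimal standalone sketch of those behaviors in plain C++, with no Kokkos involved (the names square and apply are purely illustrative):

#include <iostream>

// Apply any callable f to x. Kokkos::parallel_for receives its loop body
// the same way: as a callable argument.
template <typename F>
double apply(F f, double x) { return f(x); }

int main()
{
    // A lambda defined on the fly and assigned to a variable
    auto square = [](double x) { return x * x; };
    // ...then passed to a function like any other value
    std::cout << apply(square, 3.0) << std::endl; // prints 9
    return 0;
}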

The great feature of lambda functions, for us, is their ability to "capture" names from the enclosing space. That is, if I define a variable gam inside my (host-side, enclosing) function, I can use it both in the function code (which runs on the CPU or "host"), and in the lambda code (which might be running on the GPU, or "device"). This happens without ever explicitly indicating that I want it to be moved to the GPU! This comes as a relief to veterans of OpenMP offloading, OpenACC, and CUDA, which require very explicit indications of what should happen to enclosing variables when a parallel section begins.
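
As a sketch of how painless this is in practice (the names here are illustrative, not from KHARMA): the host-side variable gam below is captured by value into the device kernel simply by being used there.

#include <Kokkos_Core.hpp>

int main(int argc, char* argv[])
{
    Kokkos::ScopeGuard kokkos(argc, argv); // initialize/finalize Kokkos
    const double gam = 5. / 3.;            // defined on the host...
    Kokkos::View<double*> u("u", 100);     // device-side array (see Views below)
    Kokkos::parallel_for("scale_u", 100,
        KOKKOS_LAMBDA(const int i) {
            u(i) = gam * i; // ...and captured into device code, with no explicit copy
        }
    );
    return 0;
}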

Copying a lot of data to and from the GPU can be slow, but KHARMA always avoids copying large, grid-based data. Only individual scalars and small structs are ever copied directly; all the important data is left resident on the GPU at all times, in the form of Kokkos's multidimensional arrays, Views.

Views

View objects are host-side handles for device-side N-dimensional arrays. They are both a way of organizing device memory (by providing an operator() to access a buffer like an N-dimensional array in e.g. numpy) and a way of managing memory (in that they are reference-counted, and deallocate the attached GPU memory when destroyed). There are host-side View objects as well, but KHARMA does not make much use of them. Generally, Views in KHARMA can only be accessed on the device. However, as long as you access them only within par_for or par_reduce loops, you should be able to forget you're using anything more complex than a numpy ndarray object. Full documentation on Views is here.
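
A minimal sketch of allocating and filling a 3D View (the name rho and the sizes are illustrative):

#include <Kokkos_Core.hpp>

int main(int argc, char* argv[])
{
    Kokkos::ScopeGuard kokkos(argc, argv);
    const int n3 = 4, n2 = 8, n1 = 16;
    // A named, reference-counted 3D array, allocated on the device
    Kokkos::View<double***> rho("rho", n3, n2, n1);
    // Views are accessed with operator(), from inside device kernels
    Kokkos::parallel_for("fill_rho",
        Kokkos::MDRangePolicy<Kokkos::Rank<3>>({0, 0, 0}, {n3, n2, n1}),
        KOKKOS_LAMBDA(const int k, const int j, const int i) {
            rho(k, j, i) = 1.0;
        }
    );
    return 0;
}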

Example host-side function

Let's look at an example function, Flux::AddGeoSource from kharma/flux/flux.cpp. This function adds the source term on the RHS of the GRMHD equations, $\sqrt{-g} T_\lambda^\kappa \Gamma^\lambda_{\nu \kappa}$. In most HARM implementations, this is done as a single step alongside the calculation of the flux divergence at zone faces. However, to maintain flexibility in KHARMA, we use a Parthenon function to update the zone centers (Update::FluxDivergence), and then apply the source term afterward (Flux::AddGeoSource). The function simply applies a single operation to each zone, making it an ideal first case for explaining the basic conventions involved in building a Kokkos kernel in KHARMA.

Don't worry about all of the names here just yet. We'll return to this function and describe the pieces unique to Parthenon (TaskStatus, MeshData, etc) later on.

void Flux::AddGeoSource(MeshData<Real> *md, MeshData<Real> *mdudt, IndexDomain domain)
{
    // Pointers
    auto pmesh = md->GetMeshPointer();
    auto pmb0  = md->GetBlockData(0)->GetBlockPointer();
    auto pkgs = pmb0->packages;
    // Options
    const auto& pars = pkgs.Get("GRMHD")->AllParams();
    const Real gam   = pars.Get<Real>("gamma");

    // All connection coefficients are zero in Cartesian Minkowski space
    // TODO do we know this fully in init?
    if (pmb0->coords.coords.is_cart_minkowski()) return;

    // Pack variables
    PackIndexMap prims_map, cons_map;
    auto P    = md->PackVariables(std::vector<MetadataFlag>{Metadata::GetUserFlag("Primitive")}, prims_map);
    auto dUdt = mdudt->PackVariables(std::vector<MetadataFlag>{Metadata::Conserved}, cons_map);
    const VarMap m_p(prims_map, false), m_u(cons_map, true);

    // EMHD params
    const EMHD::EMHD_parameters& emhd_params = EMHD::GetEMHDParameters(pmb0->packages);
    
    // Get sizes
    IndexRange3 bd = KDomain::GetRange(md, domain);
    auto block = IndexRange{0, P.GetDim(5)-1};

    pmb0->par_for("tmunu_source", block.s, block.e, bd.ks, bd.ke, bd.js, bd.je, bd.is, bd.ie,
        KOKKOS_LAMBDA (const int& b, const int &k, const int &j, const int &i) {
            const auto& G = dUdt.GetCoords(b);
            FourVectors D;
            GRMHD::calc_4vecs(G, P(b), m_p, k, j, i, Loci::center, D);
            // Call Flux::calc_tensor which will in turn call the right calc_tensor based on the number of primitives
            Real Tmu[GR_DIM]    = {0};
            Real new_du[GR_DIM] = {0};
            for (int mu = 0; mu < GR_DIM; ++mu) {
                Flux::calc_tensor(P(b), m_p, D, emhd_params, gam, k, j, i, mu, Tmu);
                for (int nu = 0; nu < GR_DIM; ++nu) {
                    // Contract mhd stress tensor with connection, and multiply by metric determinant
                    for (int lam = 0; lam < GR_DIM; ++lam) {
                        new_du[lam] += Tmu[nu] * G.gdet_conn(j, i, nu, lam, mu);
                    }
                }
            }

            dUdt(b, m_u.UU, k, j, i)           += new_du[0];
            VLOOP dUdt(b, m_u.U1 + v, k, j, i) += new_du[1 + v];
        }
    );
}

The thing we'll focus on here is the call pmb0->par_for(), which is a wrapper around the Kokkos function Kokkos::parallel_for(). Let's unpack it a bit.

Name

The first argument is a string, "tmunu_source", which simply names the kernel. Kokkos and Parthenon allow naming both kernels and arrays (Views), which is surprisingly useful: profiling and error messages can include these names. Generally, KHARMA uses snake_case descriptive names mirroring the enclosing function (e.g. "grmhd_source" for GRMHD::AddSource). Since many functions contain more than one kernel, these names cannot always be 1:1 with function names. Also note that while naming two kernels with the same string will not produce a compiler error, it is incredibly confusing and messes with profiling, so please be careful when copy/pasting Kokkos kernel calls!

Bounds

Next, the call includes some even number of integers, representing the desired index bounds. These are inclusive -- that is, serially, the last index of this call would map to the construction for(int i=bd.is; i <= bd.ie; ++i). As Parthenon arrays are zero-indexed with inclusive bounds, a full-array loop over size N1 would look like for(int i=0; i <= N1-1; ++i). This will require some adjustment for people used to C conventions based around <, but a standard had to be chosen, and this was it.
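
Run serially, the par_for call in the example above would therefore be equivalent to the following nest (a sketch reusing the bd and block names from the example):

// Serial equivalent of the example par_for: every bound is inclusive
for (int b = block.s; b <= block.e; ++b)
    for (int k = bd.ks; k <= bd.ke; ++k)
        for (int j = bd.js; j <= bd.je; ++j)
            for (int i = bd.is; i <= bd.ie; ++i) {
                // loop body, i.e. the contents of the KOKKOS_LAMBDA
            }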

Function

The next argument is a lambda function, consisting of a signature (KOKKOS_LAMBDA(const int& b, const int &k, const int &j, const int &i)) and a body (everything between the curly braces)[^1]. Generally, the last three indices are ordered N3, N2, N1 over the mesh as the indices k, j, i respectively. Primitives or variables are indexed with p, 4-vectors are indexed with mu, and MeshBlocks inside a MeshBlockPack are indexed with b. In this case, we want a loop over the entire MeshData in 3D (this reduces automatically in lower dimensions: in 2D bd.ks = bd.ke = 0, and in 1D additionally bd.js = bd.je = 0). This gives us a function of 4 indices: b, k, j, i.

Following the declaration is the lambda function body, which can make use of any variables from the enclosing function, as long as the names are valid -- that is, in order to use the coordinate object G inside the lambda function, it must first be defined in the enclosing function. This admittedly leads to a lot of boilerplate, and makes it very important to keep to standard naming schemes for variables! That is, if we remember always to call the coordinates G, the primitives P, and so on, KHARMA will stay much more readable. This will be referred to as "pulling out" names, as generally it consists of defining standard names for pieces of the fluid state variable passed to each function in KHARMA. Luckily, for similar operations you can copy/paste the existing boilerplate code from a similar function, and tune it to the variables you need. Don't worry about the types used here (or the extensive use of auto[^2]) just yet -- these are all pieces of Parthenon, described here.

The objects P and dUdt are Kokkos Views (technically, Parthenon VariablePack<Real> objects, which are Views of Views). They can be accessed and even assigned like an N-dimensional version of the traditional array, albeit using () instead of []. Generally, KHARMA (and Parthenon) uses 4- or 5-dimensional Views/Packs: three spatial indices on the right, a variable index in front of them indicating which variable of the pack is being accessed, and sometimes a leading block index, when a kernel is written to treat every block on an MPI rank. More on this in the Parthenon section.
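
For example, inside a kernel over (b, k, j, i), reading a primitive and updating a conserved variable might look like the following sketch (source_term is a hypothetical placeholder):

// Inside a KOKKOS_LAMBDA over (b, k, j, i):
const Real rho = P(b, m_p.RHO, k, j, i);   // primitive density in zone (k, j, i) of block b
dUdt(b, m_u.UU, k, j, i) += source_term;   // accumulate into a conserved variable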

NB: most Views in KHARMA cannot be accessed outside of kernels, and arrays declared outside kernels cannot be accessed inside them. This is one of the most common causes of non-portable code: when compiled for CPU, accessing Views and arrays anywhere works fine, but this won't hold true for GPU code. This most often comes up when passing a lot of values from the host to the device. Generally, we want to avoid copying a lot of data to/from the GPU for speed. However, if you need to copy a whole array, you have a few options:

  1. Declare a new View and fill it using device code instead of on the host,
  2. Fill a View on the host and deep_copy it to the device (see the sketch below), or
  3. Pass the values individually, or in a struct rather than an array.

Generally, we use (1) when possible, or (3) if the array isn't large. There are some places (2) is not avoidable: sometimes at initialization (see kharma/prob/resize_restart.cpp) or with arrays of unknown size (see CountFlags in kharma/reductions/reductions.cpp).
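
A minimal sketch of option (2), using a host mirror and Kokkos::deep_copy (the names coeffs and n are illustrative):

#include <Kokkos_Core.hpp>

// Fill a View on the host, then copy it to the device in one explicit step
void copy_to_device_example(const int n)
{
    Kokkos::View<double*> coeffs("coeffs", n);          // device memory
    auto coeffs_h = Kokkos::create_mirror_view(coeffs); // matching host memory
    for (int i = 0; i < n; ++i)
        coeffs_h(i) = 2.0 * i;                          // host-side fill
    Kokkos::deep_copy(coeffs, coeffs_h);                // one host->device copy
}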

Finally, note that function calls inside the lambda are in snake_case. These calls can only be made to special functions defined in KHARMA's headers. More on the overall structure later, but you should remember that anything you call inside lambda functions (that is, on the device) should be in snake_case, and anything you call outside them, on the host, should be in CamelCase. This is a convention specific to KHARMA -- a full list of its conventions can be found on the wiki here.

Reductions

I should also mention another feature, though it's not present in this function: Kokkos calls can also perform reductions (e.g. sum, max, etc). These are indicated by calling pmb->par_reduce() instead of par_for, and adding a Kokkos reduction (e.g. Kokkos::Sum) as the last argument. The lambda function macros for these operations end in _REDUCE and add an argument local_result, to which the lambda should assign values which fit the reduction. This is simpler than it seems: an example can be found in GRMHD::EstimateTimestep in grmhd/grmhd.cpp.
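
A minimal sketch of a reduction in raw Kokkos, which pmb->par_reduce wraps (the names dt_zone and min_dt are illustrative, not from KHARMA):

#include <Kokkos_Core.hpp>

int main(int argc, char* argv[])
{
    Kokkos::ScopeGuard kokkos(argc, argv);
    Kokkos::View<double*> dt_zone("dt_zone", 1000);
    // ... fill dt_zone in a kernel ...
    double min_dt = 0.;
    Kokkos::parallel_reduce("min_timestep", 1000,
        KOKKOS_LAMBDA(const int i, double& local_result) {
            // Assign values which fit the reduction; Kokkos combines them across threads
            if (dt_zone(i) < local_result) local_result = dt_zone(i);
        },
        Kokkos::Min<double>(min_dt)
    );
    return 0;
}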


[^1]: For anyone curious, the KOKKOS_LAMBDA macro currently just maps to a standard C++ capture-by-value lambda, [=]. The macro is used for forward- and backward-compatibility.

[^2]: The use of auto for all of these definitions ensures that KHARMA can accommodate name and structure changes to Parthenon's types more easily, as long as new types share the same usage conventions (e.g. operator()) as old types.

Example device-side function

While there is a lot of complexity in enclosing functions, sometimes we want to define an operation we can compose into anything else we need to do, and re-use between many different kernels and loop types. Thus the device-side function. This example is GRMHD::calc_4vecs, which determines the contravariant magnetic field and velocity 4-vectors $b^\mu$, $u^\mu$ and their covariant versions based on the primitive variables (see Gammie et al. for the definitions).

KOKKOS_INLINE_FUNCTION void calc_4vecs(const GRCoordinates& G, const VariablePack<Real>& P, const VarMap& m,
                                      const int& k, const int& j, const int& i, const Loci loc, FourVectors& D)
{
    const Real gamma = lorentz_calc(G, P, m, k, j, i, loc);
    const Real alpha = 1. / m::sqrt(-G.gcon(loc, j, i, 0, 0));

    D.ucon[0] = gamma / alpha;
    VLOOP D.ucon[v+1] = P(m.U1 + v, k, j, i) - gamma * alpha * G.gcon(loc, j, i, 0, v+1);

    G.lower(D.ucon, D.ucov, k, j, i, loc);

    if (m.B1 >= 0) {
        D.bcon[0] = 0;
        VLOOP D.bcon[0]  += P(m.B1 + v, k, j, i) * D.ucov[v+1];
        VLOOP D.bcon[v+1] = (P(m.B1 + v, k, j, i) + D.bcon[0] * D.ucon[v+1]) / D.ucon[0];

        G.lower(D.bcon, D.bcov, k, j, i, loc);
    } else {
        DLOOP1 D.bcon[mu] = D.bcov[mu] = 0.;
    }
}

Since this function doesn't rely on too much from Parthenon, we can describe it more or less in its entirety using what you already know. If you want to follow along with the physics, the 4-vectors computed here are those defined in Gammie et al. (2003); in the previous example, Tmu corresponds to the MHD stress-energy tensor of eqn. (11), and Flux::AddGeoSource as a whole computes the source term at the end of eqn. (4).

Declaration

We start with the declaration: KOKKOS_INLINE_FUNCTION void fname(). The macro is defined by Kokkos, and ensures that this function is compiled/defined for both the host and device. When compiling for CUDA, this means declaring the function "__host__ __device__", and of course also inline. All such functions in KHARMA are declared in header files, to provide the best opportunity for compiler optimizations and inlining. Generally in KHARMA, we almost never call KOKKOS_INLINE_FUNCTION functions on the host side, but it is possible to do so.

Other options for this declaration:

  • KOKKOS_FUNCTION, which removes the inline keyword, allowing the function to be called between object files on the device side. This is generally not recommended.
  • KOKKOS_FORCEINLINE_FUNCTION, which adds the appropriate compiler tags to force inlining of the following function. This may be necessary in places for performance.

As of recent versions of Kokkos, there are also declarations for member functions of objects, which previously were very hard to make work. These are worth a look if you are writing new classes which need to operate on the device for some reason.

Arguments

You'll better understand all the names after reading about Parthenon, but for now, notice there are four main kinds of arguments:

  1. GRCoordinates object. G provides cell locations, connection coefficients, and the covariant and contravariant metric gcov/gcon (as well as the metric determinant and other neat coordinate tools and info). As seen in the example, gcon is indexed by cell location, indices j and i corresponding to X2/X1 location (the metric is assumed symmetric in X3/$\phi$), and lastly the particular element of the 4x4 array. KHARMA currently stores all 16 values at each i, j zone despite the redundancy -- even if this is changed in the future, the interface will not be affected.
  2. VariablePack object. This is basically a Kokkos View, as described above, and behaves like an array. While these are templated on the type used in the pack, you will only ever see VariablePack<Real> (see the convention).
  3. VarMap structs. The VarMap is a KHARMA-specific struct recording the starting index of each variable inside a VariablePack -- that is, it's a collection of indices just like the macros RHO, UU, etc. in iharm3D, except that, since packs are constructed on the fly, these indices can change from pack to pack. The VarMap of primitive variables is traditionally called m_p and that of conserved variables m_u. The naming convention for variables dates from the macros found in earlier versions of HARM: RHO, UU, U1, U2, U3, B1, B2, B3. Preserving continuity with these versions, the same names are used for the conserved variables -- even though, for example, m_u.UU is actually $T^0_0$, the total stress-energy tensor component, not just the internal energy. Plenty more variables need indices when additional physics is enabled: you can find the full list of possible names in the file types.hpp. Any variable not present in the pack will have index -1, so we can check whether a variable is present in a given pack by testing that its index is >= 0.
  4. FourVectors struct. Consists of 4 arrays 4 elements long, ucon, ucov, bcon, bcov, representing what you'd expect.

The remaining arguments are the zone number k, j, i on which to operate (indexed within the current mesh block), and the particular location within the zone (Loci loc) where the calculation takes place (usually the zone center, sometimes a zone face when computing fluxes).

Implementation

The main thing to note (hopefully) is that the implementation of device-side functions looks pretty similar to the C-based functions in e.g. iharm3D or iharm2d_v4. Except for the translation []->(), RHO->m_p.RHO, NDIM->GR_DIM, etc., the operations remain the same as in those codes.

The macros DLOOPN are defined in decs.h, and loop the variables mu, nu, lam, kap over indices 0-3, in that order. VLOOP does the same for indices 0-2 on the index v. G.lower() is just a convenience function for contracting a 4-vector with the covariant metric.
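
For reference, these macros expand to ordinary for loops. Plausible definitions look like the following sketch (check decs.h for the exact forms):

// Loop mu over the 4 spacetime indices; DLOOP2 nests a loop over nu, and so on
#define DLOOP1 for (int mu = 0; mu < GR_DIM; ++mu)
#define DLOOP2 DLOOP1 for (int nu = 0; nu < GR_DIM; ++nu)
// Loop v over the 3 spatial/vector indices 0-2
#define VLOOP for (int v = 0; v < NVEC; ++v)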

Tutorials

For more advanced Kokkos usage, there is a set of tutorials at the kokkos-tutorials GitHub page, working through Kokkos features in a more studied and comprehensive way than I can do here.