# Kokkos
Kokkos is a programming model aiming to provide performance portability -- that is, code which can be written once and compiled to run efficiently on a variety of architectures. This means more than just compiling for any architecture: what is efficient on CPUs is rarely efficient on GPUs, and vice versa. Thus Kokkos improves on straight portability frameworks like OpenACC by adapting not just the compilation process but the whole implementation.
"Adapting the implementation" is naturally complex -- it involves abstracting operations, so that the programmer no longer determines things like the layout of arrays in memory, or the particular ordering of loops. Instead, these things (along with the syntax transformations for e.g. CUDA vs OpenMP) are handled by Kokkos, through the use of C++ templates.
If this sounds hacky, that's because ultimately it is (the Kokkos team maintains a close relationship with the C++ standards committee for exactly this reason). Kokkos lets you write what you want to happen, then uses C++ templates to transform that request into the details of what actually happens. Generally, when compiling KHARMA, long, winding errors involving templates are issues with how you are using Kokkos.
Luckily, the portion of the Kokkos API we use is very simple, so there are a limited number of potential errors, which we'll cover at the end of this article. Generally, if you keep good hygiene when calling loops, your code will end up with a structure very similar to a standard C for loop, with the advantage that it can compile efficiently for any architecture.
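To make that concrete, here's a minimal sketch of a plain C-style loop and its Kokkos counterpart (`x`, `y`, `a`, and `N` are illustrative, and the Kokkos version assumes the arrays are `Kokkos::View`s):

```c++
// A plain serial loop...
for (int i = 0; i < N; ++i) {
    y[i] = a * x[i] + y[i];
}

// ...and the Kokkos version: the same body, but dispatched to whatever
// backend (OpenMP, CUDA, ...) Kokkos was compiled for
Kokkos::parallel_for("axpy", N,
    KOKKOS_LAMBDA (const int i) {
        y(i) = a * x(i) + y(i);
    }
);
```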
The major caveat to the above statement, and the most confusing part of working with Kokkos, is the extensive use of so-called "lambda" functions. These are functions defined on the fly inside other code -- they act like variables. That is, they can be named, assigned, defined, and changed like variables. Most importantly for us, though, they can be passed to functions. When using Kokkos, the body of every loop is a lambda function, which is passed to the enclosing function `Kokkos::parallel_for` (or its Parthenon wrapper, `pmb->par_for`). This enclosing function handles copying any captured variables and launching the lambda on the device, over the requested index range.
The great feature of lambda functions, for us, is their ability to "capture" names from the enclosing scope. That is, if I define a variable `gam` inside my function, I can use it both in the function code (which runs on the CPU or "host") and in the lambda code (which might be running on the GPU, or "device"). This happens without ever explicitly indicating that the variable should be moved to the GPU! This comes as a relief to veterans of OpenMP offloading, OpenACC, and CUDA, which require very explicit indications of what should happen to enclosing variables when a parallel section begins.
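A minimal sketch of a capture in action (the variable names here are illustrative):

```c++
const Real gam = 5./3; // defined on the host, in the enclosing function

Kokkos::parallel_for("apply_eos", N,
    KOKKOS_LAMBDA (const int i) {
        // gam is captured by value and usable here, even if this body
        // is executing on the GPU -- no explicit copy or directive needed
        press(i) = (gam - 1.) * u(i);
    }
);
```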
If you're thinking "isn't it slow to copy things to the device before every kernel?", you're not wrong! We try to keep the copying to a minimum with a variety of techniques, which I'll save for more advanced Kokkos lessons later on.
Let's look at an example function, `GRMHD::AddSource` from `grmhd/source.cpp`. This function adds the source term on the RHS of the GRMHD equations. In most HARM implementations, this is done as a single step alongside the calculation of the flux divergence at zone faces. However, to maintain flexibility in KHARMA, we use a Parthenon function to update the zone centers (`Update::FluxDivergence`), and then apply the source term afterward in `GRMHD::AddSource`. The function is a simple application of a single operation to each zone, making it an ideal starting case to explain the basic conventions involved in building a Kokkos kernel in KHARMA.
Don't worry about all of the names here just yet. We'll return to this function and describe the Parthenon pieces (`TaskStatus`, `MeshData`, etc.) later on.
```c++
TaskStatus GRMHD::AddSource(MeshData<Real> *md, MeshData<Real> *mdudt)
{
    FLAG("Adding GRMHD source");
    // Pointers
    auto pmesh = md->GetMeshPointer();
    auto pmb0 = md->GetBlockData(0)->GetBlockPointer();
    // Options
    const Real gam = pmb0->packages.Get("GRMHD")->Param<Real>("gamma");

    // Pack variables
    PackIndexMap prims_map, cons_map;
    auto P = GRMHD::PackMHDPrims(md, prims_map);
    auto dUdt = GRMHD::PackMHDCons(mdudt, cons_map);
    const VarMap m_u(cons_map, true), m_p(prims_map, false);

    // Get sizes
    IndexDomain domain = IndexDomain::interior;
    auto ib = md->GetBoundsI(domain);
    auto jb = md->GetBoundsJ(domain);
    auto kb = md->GetBoundsK(domain);
    auto block = IndexRange{0, P.GetDim(5)-1};

    const auto& G = dUdt.coords;

    pmb0->par_for("grmhd_source", block.s, block.e, kb.s, kb.e, jb.s, jb.e, ib.s, ib.e,
        KOKKOS_LAMBDA_MESH_3D {
            // Then calculate and add the GRMHD source term
            FourVectors Dtmp;
            GRMHD::calc_4vecs(G(b), P(b), m_p, k, j, i, Loci::center, Dtmp);
            GRMHD::add_source(G(b), P(b), m_p, Dtmp, gam, k, j, i, dUdt(b), m_u);
        }
    );

    FLAG("Added");
    return TaskStatus::complete;
}
```
The big thing here is the call `pmb0->par_for()`, which maps to the Kokkos function `Kokkos::parallel_for()`. Parthenon wraps it as a part of the class `MeshBlock`, so that each meshblock can be given a different "stream" -- a limited way to run functions on different blocks in parallel on the GPU, which is occasionally useful but isn't being used here. Let's unpack the call a bit.
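For the curious, the call above corresponds roughly to the following raw Kokkos (a sketch only, eliding what Parthenon adds; note that `MDRangePolicy` takes *exclusive* upper bounds, hence the `+ 1`):

```c++
Kokkos::parallel_for("grmhd_source",
    Kokkos::MDRangePolicy<Kokkos::Rank<4>>({block.s, kb.s, jb.s, ib.s},
                                           {block.e + 1, kb.e + 1, jb.e + 1, ib.e + 1}),
    KOKKOS_LAMBDA (const int b, const int k, const int j, const int i) {
        // ...the same body as above...
    }
);
```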
The first argument is a string, `"grmhd_source"`, which simply names the kernel. Kokkos and Parthenon allow naming both functions and arrays, which is surprisingly useful, as profiling and error messages can include these names. Generally, KHARMA uses snake_case descriptive names mirroring the enclosing function (e.g. `"grmhd_source"` for `GRMHD::AddSource`). Since many functions have more than one kernel, these names cannot always be 1:1 with function names. Also note that while naming two kernels with the same string will not produce a compiler error, it is incredibly confusing and messes with profiling code, so please be careful when copy/pasting Kokkos kernel calls.
Next, the call includes some even number of integers, representing the desired index bounds. These are inclusive -- that is, run serially, the last index pair of this call would map to the construction `for (int i = ib.s; i <= ib.e; ++i)`. As Parthenon arrays are zero-indexed with inclusive bounds, a full-array loop over a dimension of size N1 would look like `for (int i = 0; i <= N1 - 1; ++i)`. This will require some adjustment for people used to C conventions based around `<`, but a standard had to be chosen, and this was it.
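That is, the kernel above is equivalent, run serially, to the following nest of loops:

```c++
for (int b = block.s; b <= block.e; ++b)
    for (int k = kb.s; k <= kb.e; ++k)
        for (int j = jb.s; j <= jb.e; ++j)
            for (int i = ib.s; i <= ib.e; ++i) {
                // ...loop body...
            }
```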
The next argument is a lambda function, consisting of a signature (`KOKKOS_LAMBDA_MESH_3D`) and a body (everything between the curly braces). In KHARMA, we name the various possible function signatures using descriptive macros defined near the end of `decs.hpp`, and I'd recommend consulting that file quickly to get a sense of the naming conventions for indices[^1]. Generally, the last three indices are ordered `N3`, `N2`, `N1` over the mesh, as the indices `k`, `j`, `i` respectively. Primitives or variables are indexed with `p`, vectors are indexed with `mu`, and `MeshBlock`s inside a `MeshBlockPack` are indexed with `b` (more on these later). In this case, we want a loop over the entire MeshData in 3D (this reduces automatically in 1D and 2D, which just set `kb.s == kb.e` and `jb.s == jb.e` respectively). This gives us a function with 4 indices: `b`, `k`, `j`, `i`.
[^1]: For anyone curious, the `KOKKOS_LAMBDA` macro now maps to a standard C++ capture-by-value lambda, `[=]`.
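For a sense of what these signature macros look like, the 3D mesh version plausibly expands along these lines (a sketch only; consult `decs.hpp` for the real definitions):

```c++
// Sketch: the indices b, k, j, i match the conventions described above
#define KOKKOS_LAMBDA_MESH_3D KOKKOS_LAMBDA (const int& b, const int& k, const int& j, const int& i)
```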
Following the declaration is the lambda function body, which can make use of any variables from the enclosing function, as long as the names are valid -- that is, in order to use the coordinate object `G` inside the lambda function, it must first be defined in the enclosing function. This admittedly leads to a lot of boilerplate, and makes it very important to keep to standard naming schemes for variables! That is, if we remember to always call the coordinates `G`, the primitives `P`, and so on, KHARMA will stay much more readable. This will be referred to as "pulling out" names, as generally it consists of defining standard names for pieces of the fluid state variable passed to each function in KHARMA. Luckily, for similar operations you can copy/paste existing "preamble" code and generally have the variables you need. Don't worry about the types used here (or the extensive use of `auto`[^2]) just yet -- these are all pieces of Parthenon, which I'll describe in the following section.
[^2]: The use of `auto` for all of these definitions ensures that KHARMA can accommodate name and structure changes to Parthenon's types more easily, as long as new types share the same usage conventions (e.g. `operator()`) as the old.
Note also that function calls inside the lambda are in `snake_case`. These calls can only be made to special functions defined in KHARMA's headers. More on the overall structure here later, but you should remember that anything you call inside loops (that is, on the device) should be in `snake_case`, and anything you call on the host should be in `CamelCase`.
Finally, I should mention another feature, though it's not present in this function: Kokkos calls can also perform reductions (e.g. sum, max, etc.). These are indicated by calling `pmb->par_reduce()` instead of `par_for`, and adding a Kokkos reducer (e.g. `Kokkos::Sum`) as the last argument. The lambda function macros for these operations end in `_REDUCE` and add an argument `local_result`, to which the lambda should assign values which fit the reduction. This is simpler than it sounds: a simple example can be found in the function `CheckNaN` in `debug.cpp`, and a more advanced example in `GRMHD::EstimateTimestep` in `grmhd/grmhd.cpp`.
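To sketch the shape of such a call (the reduction macro name and the quantity `q` are assumptions here; check `decs.hpp` and the examples above for the real conventions):

```c++
Real max_q = 0.;
pmb0->par_reduce("max_q", kb.s, kb.e, jb.s, jb.e, ib.s, ib.e,
    KOKKOS_LAMBDA_3D_REDUCE {
        // assign to local_result values which "fit" the reduction:
        // here, each thread keeps a running maximum
        if (q(k, j, i) > local_result) local_result = q(k, j, i);
    }
, Kokkos::Max<Real>(max_q));
```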