Batched BLAS and LAPACK
The Kokkos batched interface provides multiple functor-level interfaces for dense linear algebra (DLA) that correspond to Kokkos hierarchical parallelism. Unlike other batched BLAS and LAPACK interfaces [0], we do not provide a front-level (subroutine) interface that launches a streaming parallel kernel. Instead, we provide functor-level interfaces that can be used within Kokkos parallel expressions (e.g., parallel_for, parallel_reduce, and parallel_scan). The advantage of this approach is that a user can compose various batched DLA kernels while exploiting temporal locality via the functor-level interfaces. For example, consider a case where small element matrices are created via dgemm and those matrices are then factorized and triangular-solved via lu and trsv. An approach using conventional batched BLAS interfaces would look like the following pseudo code:
int N = 1000; /// # of batch items
int m = 8; /// a square matrix size
Kokkos::View<double***> AA("AA", N, m, m); /// element matrices
Kokkos::View<double**> LL("LL", N, m); /// left basis vectors
Kokkos::View<double**> RR("RR", N, m); /// right basis vectors
Kokkos::View<double**> BB("BB", N, m); /// right-hand side vectors, overwritten by the solution
batched_dgemm(LL, RR, AA); /// construct element matrices via batched dgemm
batched_dgetrf(AA); /// perform lu decomposition of each instance of A matrix array
batched_dtrsv('Lower', AA, BB); /// perform forward substitution
batched_dtrsv('Upper', AA, BB); /// perform backward substitution
Clearly, the performance problem with the above code comes from the fact that the sequence of batched functions does not exploit temporal locality between DLA operations: each batched function sweeps the entire set of batched matrices or vectors in parallel for a single DLA operation. Kokkos batched APIs, on the other hand, provide functor-level interfaces so that a user can compose a new batched function. The following example has the same functionality as the code above:
int N = 1000; /// # of batch items
int m = 8; /// a square matrix size
Kokkos::View<double***> AA("AA", N, m, m); /// element matrices
Kokkos::View<double**> LL("LL", N, m); /// left basis vectors
Kokkos::View<double**> RR("RR", N, m); /// right basis vectors
Kokkos::View<double**> BB("BB", N, m); /// right-hand side vectors, overwritten by the solution
using namespace KokkosBatched;
const double one(1), zero(0);
Kokkos::parallel_for(N, KOKKOS_LAMBDA(const int i) {
  auto A = Kokkos::subview(AA, i, Kokkos::ALL(), Kokkos::ALL()); /// ith matrix
  auto L = Kokkos::subview(LL, i, Kokkos::ALL()); /// ith left vector
  auto R = Kokkos::subview(RR, i, Kokkos::ALL()); /// ith right vector
  auto B = Kokkos::subview(BB, i, Kokkos::ALL()); /// ith right-hand side/solution vector
  SerialGemm<Trans::NoTranspose,Trans::NoTranspose,Algo::Gemm::Unblocked>
    ::invoke(one, L, R, zero, A);
  SerialLU<Algo::LU::Unblocked>
    ::invoke(A);
  SerialTrsv<Uplo::Lower,Trans::NoTranspose,Diag::UnitDiag,Algo::Trsv::Unblocked>
    ::invoke(one, A, B); /// forward substitution
  SerialTrsv<Uplo::Upper,Trans::NoTranspose,Diag::NonUnitDiag,Algo::Trsv::Unblocked>
    ::invoke(one, A, B); /// backward substitution
});
In this example, a single parallel_for is launched to compute a sequence of DLA operations, i.e., gemm, lu, and trsv. One may then ask, "How is this different from putting BLAS and LAPACK functions inside a parallel loop?" The main difference is that Kokkos batched APIs are lightweight, generic implementations focused on small matrix sizes (kernels are developed and tuned from the application context). Most vendor-provided DLA libraries such as Intel MKL and NVIDIA cuBLAS perform well for large problem sizes. For tiny problems such as 3x3 or 5x5 matrices, it is not feasible to use vendor-optimized DLA libraries, as even error checking alone adds considerable overhead at such sizes. Furthermore, cuBLAS (and GPU numeric libraries in general) cannot be nested inside a parallel region, which is why cuBLAS provides separate batched APIs. Kokkos batched APIs provide generic, header-only, functor-level implementations that can be embedded in a parallel region. The APIs map to Kokkos hierarchical parallelism and provide various algorithm implementations selected via a template argument, which allows users to choose (or customize) the batched routines for their application context.
In this section, we describe the functor-level parallel interfaces and the Kokkos view interface. The following subsections include short pseudo code examples; complete working examples are documented separately in the example section.
Kokkos provides a multidimensional array object called a View. A Kokkos view has a template argument, called an array layout, that determines the access pattern of the view. The following example shows how a 3d view is indexed under different array layouts.
/// AA(i,j,k) is mapped to AA.data()[i*AA.stride(0)+j*AA.stride(1)+k*AA.stride(2)]
Kokkos::View<double***,Kokkos::LayoutRight> AA("AA", p, m, n);
/// The rightmost dimension has the unit stride (default on HostSpace)
/// AA.stride(0) = m*n; AA.stride(1) = n; AA.stride(2) = 1;
Kokkos::View<double***,Kokkos::LayoutLeft> AA("AA", p, m, n);
/// The leftmost dimension has the unit stride (default on GPUs)
/// AA.stride(0) = 1; AA.stride(1) = p; AA.stride(2) = p*m;
To extract a matrix object, we use a subview (a slice) of a view.
auto A = Kokkos::subview(AA, i, Kokkos::ALL(), Kokkos::ALL());
When Kokkos::LayoutRight is used, A follows a row-major order where A.stride(0) = n and A.stride(1) = 1. On the other hand, when Kokkos::LayoutLeft is used, the slice A is strided with A.stride(0) = p and A.stride(1) = p*m. The standard BLAS and LAPACK interface requires that one of the two strides be one, so views with arbitrary strides must be repacked before they can be passed to BLAS and LAPACK libraries. As this example shows, an arbitrarily strided 2d view (matrix) naturally arises from slicing a multidimensional array. Thus, the Kokkos batched functor-level interfaces allow arbitrary column and row strides [1].
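As a quick illustration, the strides of such slices can be inspected directly. The following minimal sketch (assuming the dimensions p, m, n and a batch index i from above) shows that the LayoutRight slice is row major while the LayoutLeft slice has no unit stride at all.
/// Inspect the strides of batch slices under both layouts (illustrative)
Kokkos::View<double***,Kokkos::LayoutRight> AR("AR", p, m, n);
Kokkos::View<double***,Kokkos::LayoutLeft>  AL("AL", p, m, n);
auto Ar = Kokkos::subview(AR, i, Kokkos::ALL(), Kokkos::ALL());
auto Al = Kokkos::subview(AL, i, Kokkos::ALL(), Kokkos::ALL());
printf("LayoutRight slice: stride(0)=%d stride(1)=%d\n",
       int(Ar.stride(0)), int(Ar.stride(1))); /// prints n and 1 (row major)
printf("LayoutLeft  slice: stride(0)=%d stride(1)=%d\n",
       int(Al.stride(0)), int(Al.stride(1))); /// prints p and p*m (no unit stride)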
As we target small matrices, different algorithms and implementations can be more beneficial for specific problems. To embrace such cases, Kokkos batched functions take a tag parameter that selects an algorithm. For tiny problem sizes, any overhead (including function calls and error checking) degrades performance when a third-party library is interfaced; in that case, a hand-written version of the code can be used [2]. If the problem size is bigger than the cache size, using the standard BLAS and LAPACK within a parallel region would perform better than our inlined version. The provided algorithm tags are Unblocked (a naive translation of the math), Blocked (an attempt to exploit register blocking), MKL (an interface to the standard MKL functions), and CompactMKL (an interface to the compact MKL functions) [3], as sketched below.
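Switching between two variants is only a change of the template tag. The following minimal sketch assumes views A, B, C and scalars alpha, beta as in the serial example further below.
/// The implementation variant is selected by the Algo tag alone;
/// the call site is otherwise identical.
SerialGemm<Trans::NoTranspose,Trans::NoTranspose,Algo::Gemm::Unblocked>
  ::invoke(alpha, A, B, beta, C); /// naive translation of the math
SerialGemm<Trans::NoTranspose,Trans::NoTranspose,Algo::Gemm::Blocked>
  ::invoke(alpha, A, B, beta, C); /// register-blocking variant for small sizes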
An execution policy describes how a parallel pattern should be executed. Kokkos provides two parallel execution policies, i.e., Kokkos::RangePolicy and Kokkos::TeamPolicy. A range policy executes a functor object in parallel for each work item in the range described by the policy; a single thread is mapped to a single work item. A team policy maps a group of threads (called a team) to a single work item, and all member threads in a team run concurrently. Team threads can access team shared memory (scratchpad memory) and perform team collective operations such as barriers and reductions. The Kokkos team policy provides two nested parallel interfaces (a 2D thread grid): team and vector. Corresponding to the various thread-mapping cases, Kokkos batched APIs provide the interfaces listed below; first, a minimal sketch of constructing the two policies.
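In this sketch, SpT denotes an execution space and N, ts, and vs are illustrative sizes consistent with the examples below.
/// A range policy maps one thread to one work item:
Kokkos::RangePolicy<SpT> range_policy(0, N);
/// A team policy forms N teams of ts x vs threads, one team per work item:
Kokkos::TeamPolicy<SpT> team_policy(N, ts, vs);
typedef Kokkos::TeamPolicy<SpT>::member_type member_type;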
1. The Serial interface can be used together with a range policy, where each thread processes a single work item. The interface does not require a member type and mimics the standard BLAS and LAPACK interfaces. An example usage is described as follows.
/// Let's do a batched gemm C(:,i,j) = A(:,i,k)*B(:,k,j) using the serial interface
typedef Kokkos::DefaultExecutionSpace SpT;
const double alpha(1), beta(0);
Kokkos::View<double***,SpT> AA("AA", N, m, m);
Kokkos::View<double***,SpT> BB("BB", N, m, m);
Kokkos::View<double***,SpT> CC("CC", N, m, m);
using namespace KokkosBatched::Experimental;
Kokkos::parallel_for(Kokkos::RangePolicy<SpT>(0, N), KOKKOS_LAMBDA(const int &idx) {
auto A = Kokkos::subview(AA, idx, Kokkos::ALL(), Kokkos::ALL());
auto B = Kokkos::subview(BB, idx, Kokkos::ALL(), Kokkos::ALL());
auto C = Kokkos::subview(CC, idx, Kokkos::ALL(), Kokkos::ALL());
SerialGemm<Trans::NoTranspose,Trans::NoTranspose,Algo::Gemm::Unblocked>
::invoke(alpha, A, B, beta, C);
});
With the Kokkos polymorphic array layout (a different default layout is used for CPUs and GPUs), the range policy maps each thread to a contiguous block of memory on CPUs while allowing coalesced memory access on GPUs. Since Kokkos batched APIs support arbitrary strides, the serial interface works in both cases. Note that the conventional BLAS and LAPACK interfaces, which support only column-major and row-major ordering, do not work with Kokkos layout left. The default layouts can be checked at compile time as sketched below.
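The check below is illustrative (the Cuda branch applies only when that backend is enabled).
/// Host backends default to LayoutRight (row major):
static_assert(std::is_same<Kokkos::View<double***,Kokkos::DefaultHostExecutionSpace>::array_layout,
                           Kokkos::LayoutRight>::value, "host default is LayoutRight");
#if defined(KOKKOS_ENABLE_CUDA)
/// The Cuda backend defaults to LayoutLeft (coalescing-friendly):
static_assert(std::is_same<Kokkos::View<double***,Kokkos::Cuda>::array_layout,
                           Kokkos::LayoutLeft>::value, "Cuda default is LayoutLeft");
#endif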
2. When a team policy is used, there are three nested parallel APIs, i.e., Team, Vector, and TeamVector. In providing nested team parallel interfaces, we must consider how they interact with users' code. For instance, vector-level APIs can be nested within a user's team parallel region, and team-level APIs can be nested within a user's vector parallel region. The following manufactured example shows the different use cases of our team-level interfaces from a user perspective.
/// Let's do a batched gemm C(:,i,j) = A(:,i,k)*B(:,k,j) using different interfaces
typedef Kokkos::DefaultExecutionSpace SpT;
const double alpha(1), beta(0);
/// Now multiple threads are mapped to a single problem.
/// Threads in a team should be properly mapped to a blocked memory space.
Kokkos::View<double***,Kokkos::LayoutRight,SpT> AA("A", N, m, m);
Kokkos::View<double***,Kokkos::LayoutRight,SpT> BB("B", N, m, m);
Kokkos::View<double***,Kokkos::LayoutRight,SpT> CC("C", N, m, m);
using namespace KokkosBatched::Experimental;
const int num_leagues = N; /// N teams are formed
const int team_size = ts;
const int vector_size = vs; /// ts * vs concurrent threads are associated within a team
Kokkos::TeamPolicy<SpT> policy(num_leagues, team_size, vector_size);
typedef Kokkos::TeamPolicy<SpT>::member_type member_type;
/// case a) TeamVector interface already uses two-level nested parallelism internally.
/// Users cannot put this inside of their parallel loop.
{
Kokkos::parallel_for(policy, KOKKOS_LAMBDA(const member_type &member) {
const int idx = member.league_rank();
auto A = Kokkos::subview(AA, idx, Kokkos::ALL(), Kokkos::ALL());
auto B = Kokkos::subview(BB, idx, Kokkos::ALL(), Kokkos::ALL());
auto C = Kokkos::subview(CC, idx, Kokkos::ALL(), Kokkos::ALL());
/// Internally it uses both team/vector parallel loop.
/// This cannot be nested inside of users' parallel loop.
TeamVectorGemm<member_type,Trans::NoTranspose,Trans::NoTranspose,Algo::Gemm::Unblocked>
::invoke(member, alpha, A, B, beta, C);
});
}
/// case b) Vector interface is nested in users' team parallel loop.
{
Kokkos::parallel_for(policy, KOKKOS_LAMBDA(const member_type &member) {
const int idx = member.league_rank();
auto A = Kokkos::subview(AA, idx, Kokkos::ALL(), Kokkos::ALL());
auto B = Kokkos::subview(BB, idx, Kokkos::ALL(), Kokkos::ALL());
auto C = Kokkos::subview(CC, idx, Kokkos::ALL(), Kokkos::ALL());
Kokkos::parallel_for(Kokkos::TeamThreadRange(member, m), [&](const int &k1) {
auto x = Kokkos::subview(B, Kokkos::ALL(), k1);
auto y = Kokkos::subview(C, Kokkos::ALL(), k1);
Kokkos::parallel_for(Kokkos::ThreadVectorRange(member, m), [&](const int &k0) {
x(k0) *= user_want_do_something;
});
/// Internally it uses a vector parallel loop, so it cannot be nested inside a user's vector parallel loop
VectorGemv<member_type,Trans::NoTranspose,Algo::Gemv::Unblocked>
::invoke(member, alpha, A, x, beta, y);
});
}); /// close the outer parallel_for over teams
}
/// case c) Team interface is nested in users' vector loop.
{
Kokkos::parallel_for(policy, KOKKOS_LAMBDA(const member_type &member) {
const int idx = member.league_rank();
auto A = Kokkos::subview(AA, idx, Kokkos::ALL(), Kokkos::ALL());
auto B = Kokkos::subview(BB, idx, Kokkos::ALL(), Kokkos::ALL());
auto C = Kokkos::subview(CC, idx, Kokkos::ALL(), Kokkos::ALL());
Kokkos::parallel_for(Kokkos::ThreadVectorRange(member, m), [&](const int &k0) {
auto xt = Kokkos::subview(B, Kokkos::ALL(), k0);
auto yt = Kokkos::subview(C, Kokkos::ALL(), k0);
Kokkos::parallel_for(Kokkos::TeamThreadRange(member, m), [&](const int &k1) {
xt(k1) *= user_want_do_something;
});
/// Internally it uses a team parallel loop, so it cannot be nested inside a user's team parallel loop
TeamGemv<member_type,Trans::Transpose,Algo::Gemv::Unblocked>
::invoke(member, alpha, A, xt, beta, yt);
});
}); /// close the outer parallel_for over teams
}
Cases (a) and (b) are natural use cases that arise in applications. Case (c), which reverses the nesting order of team and vector parallel loops, may seem strange. We explain it in the context of warp-parallel SIMD (one can also think of it as SIMT) in the next section.
Kokkos batched APIs include a SIMD type. Our SIMD type is simple: the object contains a fixed-size array matching the hardware vector units. All arithmetic operations between SIMD and built-in data types are overloaded with vector intrinsics such as Intel AVX2 and AVX-512. The main reason we take this approach is to effectively vectorize code for tiny matrix problems that cannot easily be vectorized by a compiler (see the Intel compact page and paper). A SIMD type can be defined and used as follows.
using namespace KokkosBatched::Experimental;
constexpr int vl = DefaultVectorLength<double,Kokkos::DefaultHostExecutionSpace>::value;
typedef Vector<SIMD<double>,vl> simd_type;
/// A view with simd type
Kokkos::View<simd_type***,Kokkos::DefaultHostExecutionSpace> AA("A", N/vl+(N%vl>0), m, m);
/// use a helper to access the view as like a normal view
auto Ac = createSimdViewAccess<PackDim<0> >(AA); /// Ac(i,j,k) is equivalent to AA(i/vl,j,k)[i%vl]
/// Parallel launch with SIMD view (thread parallel)
const double epsilon(0); /// optional tiny value passed to LU
Kokkos::parallel_for(Kokkos::RangePolicy<Kokkos::DefaultHostExecutionSpace>(0, N/vl+(N%vl>0)),
                     KOKKOS_LAMBDA(const int &i) {
auto A = Kokkos::subview(AA, i, Kokkos::ALL(), Kokkos::ALL());
/// The following code is purely vectorized
SerialLU<Algo::LU::Unblocked>::invoke(A, epsilon);
});
Using a SIMD type enables effective vectorization for tiny matrix problems. However, this approach is not portable to GPUs as-is. While code on most host architectures is vectorized by the compiler, vectorization on GPUs must be expressed explicitly: GPU code is vectorized by mapping contiguous thread IDs to a contiguous memory span. Since the SIMD type already provides the contiguous data array, all that remains is to map contiguous thread IDs onto a SIMD object.
using namespace KokkosBatched::Experimental;
/// By default, we provide a default vector length but any vector length works fine on GPUs
/// Note that we distinctly use vector length (SIMD length) from vector size (team policy)
constexpr int vl = DefaultVectorLength<double,Kokkos::Cuda>::value;
typedef Vector<SIMD<double>,vl> simd_type;
/// A simd 3d view for CPUs and its scalar 4d view wrapping for GPUs
Kokkos::View<simd_type***,Kokkos::LayoutRight,Kokkos::Cuda> AA("AA", N/vl+(N%vl>0), m, m);
Kokkos::View<double****,Kokkos::LayoutRight,Kokkos::Cuda> As(reinterpret_cast<double*>(AA.data()),
  AA.extent(0), AA.extent(1), AA.extent(2), vl);
/// use a helper to access the view as like a normal view
auto Ac = createSimdViewAccess<PackDim<0> >(AA); /// Ac(i,j,k) is equivalent to AA(i/vl,j,k)[i%vl]
/// Parallel launch with SIMD view (thread parallel)
const int num_leagues = N/vl+(N%vl>0); /// we use Kokkos::AUTO for team_size
const int vs = any_number_smaller_than_warpsize; /// you can also pick vl
Kokkos::TeamPolicy<Kokkos::Cuda> policy(num_leagues, Kokkos::AUTO, vs);
typedef Kokkos::TeamPolicy<Kokkos::Cuda>::member_type member_type;
const double epsilon(0); /// optional tiny value passed to LU
/// The scalar 4d view As wraps AA; each vector lane v owns one matrix
Kokkos::parallel_for(policy, KOKKOS_LAMBDA(const member_type &member) {
  const int i = member.league_rank();
  Kokkos::parallel_for(Kokkos::ThreadVectorRange(member, vl), [&](const int &v) {
    auto A = Kokkos::subview(As, i, Kokkos::ALL(), Kokkos::ALL(), v);
    /// The vector parallel_for maps contiguous thread IDs to the SIMD memory span
    /// As we are already in a vector parallel region, we use the team interface
    TeamLU<member_type,Algo::LU::Unblocked>::invoke(member, A, epsilon);
  });
});
This GPU code is vectorized and allows coalesced memory access even for tiny problem sizes through the SIMD type. The last question is whether we can maintain a single version of the code. The above code can be used for both CPUs and GPUs with small modifications; a unified version is illustrated as follows.
using namespace KokkosBatched::Experimental;
typedef Kokkos::DefaultExecutionSpace SpT; /// either OpenMP or Cuda
constexpr int vl = DefaultVectorLength<double,SpT>::value;
typedef Vector<SIMD<double>,vl> simd_type;
/// A simd 3d view for CPUs and its scalar 4d view wrapping for GPUs
Kokkos::View<simd_type***,Kokkos::LayoutRight,SpT> AA("A", N/vl+(N%vl>0), m, m);
Kokkos::View<double****,Kokkos::LayoutRight,SpT> As(reinterpret_cast<double*>(AA.data()),
  AA.extent(0), AA.extent(1), AA.extent(2), vl);
/// Fill interface to use the view as like a standard view
auto Ac = createSimdViewAccess<PackDim<0> >(AA);
/// Parallel launch with SIMD view
const int num_leagues = N/vl+(N%vl>0);
const int vs = std::is_same<SpT,Kokkos::Cuda>::value ? any_number_smaller_than_warpsize : 1;
Kokkos::TeamPolicy<SpT> policy(num_leagues, Kokkos::AUTO, vs);
typedef Kokkos::TeamPolicy<SpT>::member_type member_type;
const double epsilon(0); /// optional tiny value passed to LU
Kokkos::parallel_for(policy, KOKKOS_LAMBDA(const member_type &member) {
const int i = member.league_rank();
Kokkos::parallel_for(Kokkos::ThreadVectorRange(member, vl), [&](const int &v) {
/// variable definition (matrix extraction)
#if defined( KOKKOS_ACTIVE_EXECUTION_MEMORY_SPACE_HOST )
/// SIMD view extraction
auto A = Kokkos::subview(AA, i, Kokkos::ALL(), Kokkos::ALL());
#else
/// scalar interleaved view extraction
auto A = Kokkos::subview(As, i, Kokkos::ALL(), Kokkos::ALL(), v);
#endif
/// use the same numeric implementation
TeamLU<member_type,Algo::LU::Unblocked>::invoke(member, A, epsilon);
});
});
First, we need to choose a different vector size for each device. Since host architectures rely on vector instructions (here provided by the SIMD type), we use vs = 1. On GPUs, a user should pick a specific vector size, which corresponds to blockDim.x in CUDA notation (FYI: a Kokkos team size corresponds to blockDim.y). Next, the functor body is separated into two parts. One is the variable (matrix and vector) definition, which extracts either a SIMD view or a strided scalar view. The other is the numeric operation, which remains the same for both CPUs and GPUs. One might argue that this is not truly a single version of the code. However, the per-architecture modifications in the functor body are restricted to variable definitions, while the algorithm expressions (where we would be more likely to make mistakes if we maintained different versions for different architectures) remain the same. Although a single version of the numeric code is used, it is properly vectorized, since a SIMD type is used on both devices. If desired, the per-device vector-size choice itself can be factored into a small helper, as sketched below.
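The following is a hypothetical sketch; simd_vector_size is not a library function, merely an illustration of how the per-device choice can be hidden.
/// Hypothetical helper (illustration only): pick the team-policy vector size.
/// On Cuda the vector dimension carries the SIMD lanes; on hosts the SIMD
/// type itself provides vectorization, so the vector size is 1.
template <typename ExecSpace>
int simd_vector_size(const int vector_length) {
#if defined(KOKKOS_ENABLE_CUDA)
  if (std::is_same<ExecSpace,Kokkos::Cuda>::value)
    return vector_length; /// any value up to the warp size works
#endif
  return 1;
}
/// usage: Kokkos::TeamPolicy<SpT> policy(num_leagues, Kokkos::AUTO, simd_vector_size<SpT>(vl));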
[0]: As an example of the standard batched BLAS interface, we refer to Intel batched gemm and cuBLAS batched gemm.
[1]: The design choice of the standard BLAS and LAPACK libraries, which support column- or row-major ordering only, may need to be revisited to serve broader applications (e.g., tensors) more effectively. There also exist libraries, such as libflame and BLIS, that already support arbitrary strides.
[2]: We provide DLA functions based on application needs. At this moment, our code is used to compute small block matrix problems where the entire matrix fits into cache memory.
[3]: See the Intel Compact Routines and the SC17 paper.
SAND2018-14051 W