
Kokkos Kernels strives to make most BLAS capabilities available to users, and even more through extensions. Our native implementations, built on the capabilities of the Kokkos Core library, provide portability and performance, which allows users to explore the performance of early hardware for which no other implementation is available. Additionally, we support a large third-party layer that wraps specialized, high-performance implementations of the BLAS specification.

Kokkos objects for linear algebra

Our implementations heavily leverage the capabilities of Kokkos::View to simplify the interface compared to the standard BLAS specification. A Kokkos::View can be thought of as a tensor: depending on its rank it represents a scalar value, a vector, or a matrix. Each dimension has an associated extent, which records the number of values stored along that dimension, and a stride, which plays the role of the leading dimension in BLAS. Because these are encoded in the View, Kokkos Kernels does not require the user to pass the dimensions of vectors or matrices, nor information such as the leading dimension of a matrix. The View also carries the notion of increment, used to step through a vector with a given spacing; that increment is encoded in the layout of the View using a LayoutStride. Below we show the signatures of the gemv function in standard BLAS and in Kokkos Kernels to illustrate these points:

dgemv(character TRANS,
      integer M,
      integer N,
      double precision ALPHA,
      double precision, dimension(lda,*) A,
      integer LDA,
      double precision, dimension(*) X,
      integer INCX,
      double precision BETA,
      double precision, dimension(*) Y,
      integer INCY)
template <class ExecutionSpace, class AViewType, class XViewType, class YViewType>
void gemv(const ExecutionSpace& space,
          const char trans[],
          typename AViewType::const_value_type& alpha,
          const AViewType& A,
          const XViewType& x,
          typename YViewType::const_value_type& beta,
          const YViewType& y)

Now we provide an example of LayoutStride usage that computes the nrm1 of the even entries and of the odd entries of a vector; the example can of course be adapted to larger increments.

#include "Kokkos_Core.hpp"
#include "KokkosBlas1_nrm1.hpp"

int main(int argc, char* argv[]) {
  Kokkos::initialize(argc, argv);
  {
    using vector = Kokkos::View<double*, Kokkos::DefaultExecutionSpace>;

    // Create a vector of length 20.
    // Fill it with the value 2.
    // Compute its norm1: 20*2 = 40.
    vector a("a", 20);
    Kokkos::deep_copy(a, 2);
    auto nrm1_a = KokkosBlas::nrm1(a);

    // Create a strided layout of length 10 and stride 2
    // Make a new view using the data from vector a and the new layout
    // Set the values in the even vector to 1
    // Compute the norm1 of the even vector: 10*1 = 10
    Kokkos::LayoutStride layout(10, 2);
    Kokkos::View<double*, Kokkos::LayoutStride, Kokkos::DefaultExecutionSpace> even_a(a.data(), layout);
    Kokkos::deep_copy(even_a, 1);
    auto nrm1_even_a = KokkosBlas::nrm1(even_a);

    // Repeat the operations to create an odd vector using a pointer offset and the layout
    // Set the values of the odd vector to 3
    // Compute the norm1 of the odd vector: 10*3 = 30
    Kokkos::View<double*, Kokkos::LayoutStride, Kokkos::DefaultExecutionSpace> odd_a(a.data()+1, layout);
    Kokkos::deep_copy(odd_a, 3);
    auto nrm1_odd_a = KokkosBlas::nrm1(odd_a);

    // Print all the norms out and the values in the original vector.
    // Note that to do this with device views you will need to create
    // mirror views and copy data from device to host with deep_copy.
    std::cout << "nrm1(a)=" << nrm1_a << ", nrm1(even_a)=" << nrm1_even_a << ", nrm1(odd_a)=" << nrm1_odd_a << std::endl;
    std::cout << "a={ ";
    for(int idx = 0; idx < a.extent_int(0); ++idx) {
      std::cout << a(idx) << " ";
    }
    std::cout << "}" << std::endl;
  }
  Kokkos::finalize();
}

Scalar types and precision

Kokkos Kernels relies on template parameters to specify the types of input and output variables, which allows us to provide generic implementations that are not precision specific. This is reflected in the names of our functions, which do not include the usual S, D, C and Z letters that identify the parameter types in reference BLAS. Besides making the naming scheme a little simpler, the use of template parameters allows us to naturally support mixed-precision operations. For instance, the gemv function above can be called with vectors of double and a matrix of float.
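
Below is a minimal sketch of such a mixed-precision call. It assumes a build where the generic (non-TPL) gemv path is instantiated for this combination of types; the header name KokkosBlas2_gemv.hpp follows the usual KokkosBlas naming pattern.

#include "Kokkos_Core.hpp"
#include "KokkosBlas2_gemv.hpp"

int main(int argc, char* argv[]) {
  Kokkos::initialize(argc, argv);
  {
    // Matrix stored in single precision, vectors in double precision.
    Kokkos::View<float**> A("A", 10, 10);
    Kokkos::View<double*> x("x", 10);
    Kokkos::View<double*> y("y", 10);
    Kokkos::deep_copy(A, 1.0f);
    Kokkos::deep_copy(x, 1.0);

    // The same gemv entry point handles the mixed-precision case; no
    // S/D/C/Z-specific name is needed. alpha follows the value type of A
    // (float) and beta follows the value type of y (double).
    KokkosBlas::gemv("N", 1.0f, A, x, 0.0, y);
  }
  Kokkos::finalize();
}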

We should settle on a rule for which precision is used for accumulation and apply it consistently in our implementations. There are at least two possibilities: use the precision of the output View, or use the highest precision among all input parameters and downcast, if appropriate, when storing the result.

Execution spaces, non-blocking behavior and thread safety

All Kokkos Kernels BLAS calls can be made with an execution space instance as the first parameter. This allows users to attach a stream/queue to the space instance, which is then used to launch the BLAS kernel in the appropriate stream/queue. For this strategy to be useful, the BLAS functions are implemented in a non-blocking fashion, except when a return value is expected, for instance when computing a norm or a dot product, since the return value must be available before the following kernel can start. Finally, while we strive for a thread-safe BLAS implementation, when an execution space instance is used to select a particular stream/queue for a kernel, selecting that queue is not a thread-safe operation. More detail can be found in the vendors' documentation, for instance that of cuBLAS.
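
The sketch below shows how to pass an execution space instance to a BLAS call, using the gemv signature shown earlier. For simplicity it uses a default-constructed instance of the default execution space; attaching a specific stream/queue to an instance is backend dependent and not shown here.

#include "Kokkos_Core.hpp"
#include "KokkosBlas2_gemv.hpp"

int main(int argc, char* argv[]) {
  Kokkos::initialize(argc, argv);
  {
    using exec_space = Kokkos::DefaultExecutionSpace;
    Kokkos::View<double**, exec_space> A("A", 100, 50);
    Kokkos::View<double*, exec_space> x("x", 50);
    Kokkos::View<double*, exec_space> y("y", 100);
    Kokkos::deep_copy(A, 1.0);
    Kokkos::deep_copy(x, 1.0);

    // The execution space instance is the first argument; since gemv has
    // no return value the call is non-blocking on that instance.
    exec_space space{};
    KokkosBlas::gemv(space, "N", 1.0, A, x, 0.0, y);

    // Fence the instance before using y elsewhere.
    space.fence();
  }
  Kokkos::finalize();
}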

BLAS support

Below are tables summarizing the currently supported function calls and third party libraries in Kokkos Kernels. The tables are updated with each release of the library to reflect recently added support.

BLAS 1

| BLAS Call | Kokkos Kernels Call | Reference | TPL BLAS | TPL CUBLAS | TPL ROCBLAS | TPL oneMKL | Complex Special |
|---|---|---|---|---|---|---|---|
| SROTG | rotg(a, b, c, s) | X | X | X | X | -- | -- |
| SROTMG | rotmg(d1, d2, x1, y1, param) | X | X | X | X | -- | NC |
| SROT | rot(X, Y, c, s) | X | X | X | X | -- | -- |
| SROTM | rotm(X, Y, param) | X | X | X | -- | -- | NC |
| SSWAP | swap(X, Y) | X | X | X | X | -- | N/A |
| SSCAL | scal(y,a,x) | X | X | X | X | -- | N/A |
| CSSCAL | scal(y,a,x) | X | X | X | X | -- | OC |
| SCOPY | deep_copy(y,x) | X | -- | -- | -- | -- | N/A |
| SAXPY | axpy(a,x,y) | X | X | X | -- | -- | N/A |
| SDOT* | dot(x,y) | X | X | X | X | X | -- |
| SDSDOT* | dot(x,y) | X | X | X | X | X | NC |
| CDOTU | -- | -- | -- | -- | -- | -- | OC |
| CDOTC* | dot(x,y) | X | X | X | X | X | OC |
| SNRM2 | nrm2(x) | X | X | X | X | X | NC |
| SCNRM2 | nrm2(x) | X | X | X | X | X | OC |
| SASUM | nrm1(x) | X | X | X | X | X | N/A |
| ISAMAX | iamax(x) | X | X | X | X | -- | N/A |

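As an illustration of the BLAS 1 calls above, the sketch below computes a dot product and a 2-norm; both functions return their result, so these calls block until the value is available. The headers are assumed to follow the KokkosBlas1_<name>.hpp pattern used earlier on this page.

#include "Kokkos_Core.hpp"
#include "KokkosBlas1_dot.hpp"
#include "KokkosBlas1_nrm2.hpp"
#include <iostream>

int main(int argc, char* argv[]) {
  Kokkos::initialize(argc, argv);
  {
    Kokkos::View<double*> x("x", 100);
    Kokkos::View<double*> y("y", 100);
    Kokkos::deep_copy(x, 2.0);
    Kokkos::deep_copy(y, 3.0);

    // dot(x,y) = 100 * 2 * 3 = 600, nrm2(x) = sqrt(100 * 4) = 20
    const double d = KokkosBlas::dot(x, y);
    const double n = KokkosBlas::nrm2(x);
    std::cout << "dot=" << d << ", nrm2=" << n << std::endl;
  }
  Kokkos::finalize();
}
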
BLAS 1 extensions

| Kokkos Kernels Call | Reference | TPL BLAS | TPL CUBLAS | TPL ROCBLAS | TPL oneMKL | Complex Special |
|---|---|---|---|---|---|---|
| axpby | -- | -- | -- | -- | -- | -- |
| fill | -- | -- | -- | -- | -- | -- |
| mult | -- | -- | -- | -- | -- | -- |
| nrminf | -- | -- | -- | -- | -- | -- |
| nrm2_squared | -- | -- | -- | -- | -- | -- |
| nrm2w | -- | -- | -- | -- | -- | -- |
| nrm2w_squared | -- | -- | -- | -- | -- | -- |
| reciprocal | -- | -- | -- | -- | -- | -- |
| sum | -- | -- | -- | -- | -- | -- |
| update | -- | -- | -- | -- | -- | -- |

BLAS 2

| BLAS Call | Kokkos Kernels Call | Reference | TPL BLAS | TPL CUBLAS | TPL ROCBLAS | TPL oneMKL | Complex Special |
|---|---|---|---|---|---|---|---|
| SGEMV | gemv(trans,a,A,x,b,y) | X | X | X | X | X | N/A |
| SGBMV | -- | -- | -- | -- | -- | -- | -- |
| SSYMV | -- | -- | -- | -- | -- | -- | -- |
| SSBMV | -- | -- | -- | -- | -- | -- | -- |
| SSPMV | -- | -- | -- | -- | -- | -- | -- |
| STRMV | trmm(side,uplo,trans,diag,alpha,A,B) | X | X | X | -- | -- | -- |
| STBMV | -- | -- | -- | -- | -- | -- | -- |
| STPMV | -- | -- | -- | -- | -- | -- | -- |
| STRSV | trsm(side,uplo,trans,diag,alpha,A,B) | X | X | X | -- | -- | -- |
| STBSV | -- | -- | -- | -- | -- | -- | -- |
| STPSV | -- | -- | -- | -- | -- | -- | -- |
| SGER | ger(trans,a,x,y,A) | X | X | X | X | -- | NC |
| CGERU | ger(trans,a,x,y,A) | X | X | X | X | -- | OC |
| CGERC | ger(trans,a,x,y,A) | X | X | X | X | -- | OC |
| SSYR | syr(trans,uplo,a,x,A) | X | X | X | X | -- | -- |
| SSPR | -- | -- | -- | -- | -- | -- | -- |
| SSYR2 | syr2(trans,uplo,a,x,y,A) | X | X | X | X | -- | -- |
| SSPR2 | -- | -- | -- | -- | -- | -- | -- |

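As a BLAS 2 usage sketch, the example below performs the rank-1 update listed in the table as ger(trans,a,x,y,A). The header name KokkosBlas2_ger.hpp is assumed by analogy with the other KokkosBlas headers, and for real scalars the "T" mode gives the usual A = A + alpha * x * y^T update.

#include "Kokkos_Core.hpp"
#include "KokkosBlas2_ger.hpp"

int main(int argc, char* argv[]) {
  Kokkos::initialize(argc, argv);
  {
    Kokkos::View<double**> A("A", 4, 3);
    Kokkos::View<double*> x("x", 4);
    Kokkos::View<double*> y("y", 3);
    Kokkos::deep_copy(x, 1.0);
    Kokkos::deep_copy(y, 2.0);

    // Rank-1 update: A = A + alpha * x * y^T
    KokkosBlas::ger("T", 3.0, x, y, A);
  }
  Kokkos::finalize();
}
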
BLAS 3

| BLAS Call | Kokkos Kernels Call | Reference | TPL BLAS | TPL CUBLAS | TPL ROCBLAS | TPL oneMKL | Complex Special |
|---|---|---|---|---|---|---|---|
| SGEMM | gemm(transA,transB,a,A,B,b,C) | X | X | X | X | -- | N/A |
| SSYMM | -- | -- | -- | -- | -- | -- | -- |
| SSYRK | -- | -- | -- | -- | -- | -- | -- |
| SSYR2K | -- | -- | -- | -- | -- | -- | -- |
| CHEMM | -- | -- | -- | -- | -- | -- | OC |
| CHERK | -- | -- | -- | -- | -- | -- | OC |
| CHER2K | -- | -- | -- | -- | -- | -- | OC |
| STRMM | trmm(side, uplo, trans, diag, a, A, B) | X | X | X | -- | -- | -- |
| STRSM | trsm(side, uplo, trans, diag, a, A, B) | X | X | X | -- | -- | -- |
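
A BLAS 3 usage sketch for the gemm(transA,transB,a,A,B,b,C) entry listed above; the header name KokkosBlas3_gemm.hpp follows the usual KokkosBlas naming pattern.

#include "Kokkos_Core.hpp"
#include "KokkosBlas3_gemm.hpp"

int main(int argc, char* argv[]) {
  Kokkos::initialize(argc, argv);
  {
    const int m = 50, n = 40, k = 30;
    Kokkos::View<double**> A("A", m, k);
    Kokkos::View<double**> B("B", k, n);
    Kokkos::View<double**> C("C", m, n);
    Kokkos::deep_copy(A, 1.0);
    Kokkos::deep_copy(B, 1.0);

    // C = 1.0 * A * B + 0.0 * C; each entry of C becomes k = 30.
    KokkosBlas::gemm("N", "N", 1.0, A, B, 0.0, C);
  }
  Kokkos::finalize();
}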

Team Level Functions:

Functions which use a full execution space instance:

  • abs Element wise absolute value. y[i] = |x[i]|
  • axpy Vector addition y[i] = y[i] + alpha * x[i]
  • axpby Vector addition y[i] = beta * y[i] + alpha * x[i] (see the sketch after this list)
  • dot Inner product. delta = SUM_i(y[i] * x[i])
  • fill Setting each element of a vector. x[i] = alpha
  • mult Element wise multiplication of two vectors y[i] = beta * y[i] + alpha * a[i] * x[i]
  • nrm1 1-Norm of a vector. nrm1 = SUM_i( |x[i]| )
  • nrm2 2-Norm of a vector. nrm2 = SQRT(SUM_i( |x[i]| * |x[i]| ))
  • nrm2_squared Square of a 2-Norm of a vector. nrm2 = SUM_i( |x[i]| * |x[i]| )
  • nrm2w Weighted 2-Norm of a vector. nrm2 = SQRT(SUM_i( |x[i]| / |w[i]| * |x[i]| / |w[i]| ))
  • nrm2w_squared Square of a weighted 2-Norm of a vector. nrm2 = SUM_i( |x[i]| / |w[i]| * |x[i]| / |w[i]| )
  • nrminf Inf-Norm of a vector. nrminf = MAX_i( |x[i]| )
  • reciprocal Element wise reciprocal of a vector. y[i] = 1/x[i]
  • scal Element wise multiplication of a vector with a scalar. y[i] = alpha * x[i]
  • sum Sum of all elements in a vector. sum = SUM_i( x[i] )
  • update Vector addition z[i] = gamma * z[i] + alpha * x[i] + beta * y[i]
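
As an example of the functions listed above, the sketch below uses axpby to combine two vectors; the header name KokkosBlas1_axpby.hpp follows the usual pattern.

#include "Kokkos_Core.hpp"
#include "KokkosBlas1_axpby.hpp"

int main(int argc, char* argv[]) {
  Kokkos::initialize(argc, argv);
  {
    Kokkos::View<double*> x("x", 10);
    Kokkos::View<double*> y("y", 10);
    Kokkos::deep_copy(x, 1.0);
    Kokkos::deep_copy(y, 2.0);

    // y[i] = beta * y[i] + alpha * x[i] = 0.5 * 2 + 3 * 1 = 4
    KokkosBlas::axpby(3.0, x, 0.5, y);
  }
  Kokkos::finalize();
}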

TPL Support for BLAS/LAPACK functions using MAGMA

KokkosKernels allows calling MAGMA functions through KokkosKernels interfaces. The currently supported functions are:

  • gesv Solves a system of linear equations A*X = B with multiple right-hand sides using LU factorization (partial-pivoting and no-pivoting variants).
    • Equivalent MAGMA functions: magma_sgesv_gpu, magma_dgesv_gpu, magma_cgesv_gpu, magma_zgesv_gpu, magma_sgesv_nopiv_gpu, magma_dgesv_nopiv_gpu, magma_cgesv_nopiv_gpu, magma_zgesv_nopiv_gpu
  • trtri Triangular matrix inverse.
    • Equivalent MAGMA functions: magma_strtri_gpu, magma_dtrtri_gpu, magma_ctrtri_gpu, magma_ztrtri_gpu.

Usage: Enable the use of the MAGMA TPL by adding KOKKOSKERNELS_ENABLE_TPLS=magma to the Makefile.
