README > Fundamental Types
CUTLASS defies several fundamental numeric and container classes upon which computations and algorithms algorithms for linear algebra computations are implemented.
Where possible, CUTLASS fundamental types mirror the C++ Standard Library. However, there are circumstances that necessitate divergence from the Standard Library's specification. In such cases, the CUTLASS implementation adopts unique capitalization to distinguish that standard vocabulary types may not be safely substituted in all cases.
Most types in CUTLASS are usable in both host code and device code. Moreover, they are functional regardless of compute capability, but they may only be efficient when hardware support is present.
CUTLASS defines classes for the following numeric data types.
half_t
: IEEE half-precision floating point (exponent: 5b, mantissa: 10b; literal suffix_hf
)bfloat16_t
: BFloat16 data type (exponent: 8b, mantissa: 7b; literal suffix_bf16
)tfloat32_t
: Tensor Float 32 data type (exponent: 8b, mantissa: 10b; literal suffix_tf32
)int4_t
,uint4_t
: 4b signed and unsigned integer (literal suffx_s4
,_u4
)bin1_t
: 1b binary numeric type (literal suffix_b1
)complex<T>
: defines complex-valued data type based on the supplied real-valued numeric type
Numeric types in CUTLASS may be used in both host and device code and are intended to function like any other plain-old-data type.
If CUTLASS is compiled with CUTLASS_F16C_ENABLED
, then hardware conversion is used for
half-precision types in host code. Regardless, cutlass::half_t
uses the most efficient
NVIDIA GPU hardware instructions available in device code.
Example:
#include <iostream>
#include <cutlass/numeric_types.h>
__global__ void kernel(cutlass::half_t x) {
printf("Device: %f\n", float(x * 2.0_hf));
}
int main() {
cutlass::half_t x = 0.5_hf;
std::cin >> x;
std::cout << "Host: " << 2.0_hf * x << std::endl;
kernel<<< dim3(1,1), dim3(1,1,1) >>>(x);
return 0;
}
CUTLASS uses the following containers extensively for implementing efficient CUDA kernels.
template <
typename T, // element type
int N // number of elements
>
class Array;
Array<class T, int N>
defines a statically sized array of elements of type T and size N. This class is similar to
std::array<>
in the Standard Library with two notable exceptions:
- constructors for each element may not be called
- partial specializations exist to pack or unpack elements smaller than one byte.
Array<>
is intended to be a convenient and uniform container class to store arrays of numeric elements regardless of data type or vector length. The storage needed is expected to be the minimum necessary given the logical size of each numeric type in bits (numeric types smaller than one byte are densely packed). Nevertheless, the size reported by sizeof(Array<T, N>)
is always an integer multiple of bytes.
Storing numeric elements in a C++ STL-style container class enables useful modern C++ mechanisms such as range-based for loops. For example, to print the elements of Array<>
, the following range-based for loop syntax is always valid regardless of numeric data type, compute capability, or context in host or device code.
Example:
int const kN;
Array<T, kN> elements;
CUTLASS_PRAGMA_UNROLL // required to ensure array remains in registers
for (auto x : elements) {
printf("%d, %f", int64_t(x), double(x)); // explictly convert to int64_t or double
}
When copying Array<>
objects or passing them as arguments to methods, it is best to avoid accessing individual elements. This enables the use of vector instructions to perform the operation more efficiently. For example, setting all elements to zero is best performed by calling the clear()
method. Copies should be performed by assigning the entire object.
Example:
#include <cutlass/array.h>
int const kN;
Array<T, kN> source;
Array<T, kN> destination;
source.clear(); // set all elements to value of zero
destination = source; // copy to `destination`
Array<>
may be used to store elements smaller than one byte such as 4b integers.
Array<int4b_t, 2> packed_integers;
static_assert(
sizeof(packed_integers) == 1,
"Packed storage of sub-byte data types is compact.");
// Access array elements using usual indirection and assignment operators
packed_integers[0] = 2_s4;
packed_integers[1] = 3_s4;
CUTLASS_PRAGMA_UNROLL
for (auto x : elements) {
printf("%d", int(x)); // access elements normally
}
template <
typename T, // element type
int N, // number of elements
int Alignment // alignment requirement in bytes
>
class AlignedArray;
AlignedArray
is derived from Array<T, N>
and supports an optional alignment field. Pointers to objects of type AlignedArray<>
reliably yield vectorized memory accesses when dereferenced.
Example:
int const kN = 8;
ArrayAligned<half_t, kN> source;
ArrayAligned<half_t, kN> const *ptr = ...;
source = *ptr; // 128b aligned memory access
template <
typename T, // element type
int N, // number of elements
int Alignment // alignment requirement in bytes
>
class AlignedBuffer;
AlignedBuffer
provides a uniform way to define aligned memory allocations for all data types. This is particularly
useful in defining allocations within shared memory with guaranteed memory alignment needed for vectorized access.
Note, constructors of the elements within AlignedBuffer<> are not called, and so the elements are initially in an
undefined state.
Use AlignedBuffer<>::data()
to obtain a pointer to the first element of the buffer.
Example: Guaranteed aligned shared memory allocation. Note, shared memory contents are uninitialized.
int const kN = 32;
int const kAlignment = 16; // alignment in bytes
// Define a shared memory allocation in device code
__shared__ AlignedBuffer<complex<half_t>, kN, kAlignment> matrix_tile;
complex<half_t> *ptr = matrix_tile.data(); // ptr is guaranteed to have 128b (16 Byte) alignment
Note, AlignedBuffer<>
only guarantees that its internal memory allocation is aligned, obtained by AlignedBuffer<>::data()
. There is no guarantee that the AlignedBuffer<>
object itself satisfies alignment constraints or that its internal memory allocation is contiguous. Device code performing vectorized memory accesses should use the AlignedArray<>
type.
Example: Vectorized memory access to shared memory allocations.
int const kN = 1024;
__shared__ AlignedBuffer<half_t, kN> smem_buffer;
AlignedArray<half_t, 8> *ptr = reinterpret_cast<AlignedArray<half_t, 8> *>(smem_buffer.data());
AlignedArray<half_t, 8> x = ptr[threadIdx.x]; // 128b shared memory load
CUTLASS defines procedures for performing numeric conversion between data types in cutlass/numeric_conversion.h
.
Where possible, these target hardware acceleration on the target architecture and support multiple rounding modes.
#include "cutlass/numeric_conversion.h"
#include "cutlass/numeric_types.h"
NumericConverter<half_t, float> convert_f32_to_f16;
NumericConverter<tfloat32_t, float> convert_f32_to_tf32;
half_t x = convert_f32_to_f16(3.14159f);
tfloat32_t y = convert_f32_to_tf32(3.14159f);
Recent GPU architectures such as NVIDIA Turing and Ampere combine numeric conversion with efficient packing
into bit vectors. Consequently, CUTLASS defines conversion on both scalars and Array<>
objects to implement
the optimal code sequence on all architectures.
//
// Example: convert and pack 32b signed integers to a vector of packed signed 8-bit integers.
//
int const kN = 16;
Array<int8_t, kN> destination;
Array<int, kN> source;
NumericConverter<descltype(destination), decltype(source)> convert;
destination = convert(source);
template <
int Rank,
typename Index = int
>
class Coord;
Coord<Rank, class T = int>
is a container used explicitly for defining logical coordinates in tensors of known rank. Traditional vector operators are defined such as +
, -
, and scalar multiplication *
to simplify the creation of vector-valued expressions on tensor coordinates.
Example: Vector operations on coordinates.
Coord<2> compute_offset(Corod<2> const & base) {
Coord<2> stride = make_Coord(1, kM);
return base + stride * make_Coord(threadIdx.x, threadIdx.y);
}
Instances of Coord<>
are used throughout CUTLASS to compute indices into tensors. Frequently, the dimensions of tensors of known layouts may be given names such as "rows" or "columns". To clarify the code, we have implemented several classes derived from Coord<>
with accessors for each coordinate member.
Such classes include:
struct MatrixCoord : public Coord<2> {
Index & row();
Index & column();
};
and
struct Tensor4DCoord : public Coord<4> {
Index & n();
Index & h();
Index & w();
Index & c();
};
PredicateVector<int Bits>
contains a statically sized array of hardware predicates packed into registers to enable efficient access within unrolled loops.
This container is optimized for sequential access through iterators, though these are only efficient when used within fully unrolled loops.
Moreover, instances of PredicateVector<>
are not guaranteed to be updated until any non-const iterator objects have gone out of scope. This is because iterators are effectively caches that update the PredicateVector<>
instance's internal storage as a batch.
Example: Managing an array of predicates.
unsigned mask;
PredicateVector<kBits> predicates;
// Nested scope to update predicates via an iterator
{
auto pred_it = predicates.begin();
CUTLASS_PRAGMA_UNROLL
for (int bit = 0; bit < kBits; ++bit, ++pred_it) {
bool guard = (mask & (1u << bit));
pred_it.set(guard);
}
}
// Efficient use of predicates to guard memory instructions
T *ptr;
Array<T, kAccesses> fragment;
auto pred_it = predicates.const_begin();
for (int access = 0; access < kAccesses; ++access, ++pred_it) {
if (*pred_it) {
fragment[access] = ptr[access];
}
}
Note: PredicateVector<>
is not efficient when accessed via dynamic random access. If an array of bits is needed with dynamic random access (in contrast with access via constexpr indices), then Array<bin1_t, N>
should be used instead.
CUTLASS defines function objects corresponding to basic arithmetic operations modeled after C++ Standard Library's <functional>
header.
CUTLASS extends this by defining multiply_add<T>
which computes d = a * b + c
. The partial specialization multiply_add<complex<T>>
computes complex-valued multiplication and addition using four real-valued multiply-add operations; these may correspond to native hardware instructions.
Example:
complex<float> a;
complex<float> b;
complex<float> c;
complex<float> d;
multiply_add<complex<float>> mad_op;
d = mad_op(a, b, c); // four single-precision multiply-add instructions
CUTLASS defines partial specializations for type Array<T, N>
, performing elementwise operations on each element. A further partial specialization for Array<half_t, N>
targets may target native SIMD instructions for compute capability SM60 and beyond.
Example: Fused multiply-add of arrays of half-precision elements.
static int const kN = 8;
Array<half_t, kN> a;
Array<half_t, kN> b;
Array<half_t, kN> c;
Array<half_t, kN> d;
multiply_add<Array<half_t, kN>> mad_op;
d = mad_op(a, b, c); // efficient multiply-add for Array of half-precision elements
Operators are define to convert between numeric types in numeric_conversion.h
. Conversion operators are defined in
terms of individual numeric elements and on arrays which enable the possibility of efficient hardware
support on current and future NVIDIA GPUs.
Example: Converting between 32-b and 8-b integers.
Copyright (c) 2017-2021, NVIDIA CORPORATION. All rights reserved.
Redistribution and use in source and binary forms, with or without modification, are permitted
provided that the following conditions are met:
* Redistributions of source code must retain the above copyright notice, this list of
conditions and the following disclaimer.
* Redistributions in binary form must reproduce the above copyright notice, this list of
conditions and the following disclaimer in the documentation and/or other materials
provided with the distribution.
* Neither the name of the NVIDIA CORPORATION nor the names of its contributors may be used
to endorse or promote products derived from this software without specific prior written
permission.
THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR
IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND
FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL NVIDIA CORPORATION BE LIABLE
FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING,
BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS;
OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT,
STRICT LIABILITY, OR TOR (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.