CUTLASS 3.0 is now available! #787

thakkarV · 2023-01-24T17:45:44Z

thakkarV
Jan 24, 2023
Collaborator

CUTLASS 3.0, as the next major version of the CUTLASS API, brings with it CuTe, a new programming model and backend designed for massively parallel heterogeneous agents. Using CuTe, CUTLASS 3.0 provides implementations of GEMM kernels for the NVIDIA Hopper architecture.

CuTe-based layouts and layout algebra
A new GEMM template API that eschews the architecture-centric hierarchy of 2.x in favour of a new conceptual framing. Read more in the 3.0 design documentation.
Support for 4th-generation Hopper Tensor Core instructions (WGMMA) through CuTe.
Support for Hopper asynchronous Tensor Memory Accelerator (TMA) instructions and associated transaction barriers through CuTe.
New warp-specialized GEMM kernels targeting Hopper TMA + WGMMA for speed-of-light GEMMs.
New warp-specialized persistent GEMM kernels targeting Hopper TMA + WGMMA.
Support for CUDA Threadblock Clusters and programmatic TMA multicast for greater execution and data locality.
A new way to instantiate default GEMM kernels using CollectiveBuilders, which supersede the 2.x DefaultXConfiguration types in favour of a metaprogramming-based kernel generator. See example 49.
Extensions to the CUTLASS library and profiler to support CUTLASS 3.0 Hopper kernels, and a new format for kernel procedural names.
Announcement: CUTLASS plans to rename the GitHub branch master to main with a future release.

CUTLASS 3.0 introduces a new core library, CuTe, to describe and manipulate tensors of threads and data. CuTe is a collection of C++ CUDA template abstractions for defining and operating on hierarchically multidimensional layouts of threads and data. CuTe provides Layout and Tensor objects that compactly package the type, shape, memory space, and layout of data, while performing the complicated indexing for the user. This lets programmers focus on the logical descriptions of their algorithms while CuTe does the mechanical bookkeeping for them. With these tools, we can quickly design, implement, and modify all dense linear algebra operations.

The core abstractions of CuTe are hierarchically multidimensional layouts which can be composed with data arrays to represent tensors. The representation of layouts is powerful enough to represent nearly everything we need to implement efficient dense linear algebra. Layouts can also be combined and manipulated via functional composition, on which we build a large set of common operations such as tiling and partitioning.
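To make the idea of "a layout composed with a data array" concrete, here is a minimal sketch in plain C++. It is not the CuTe API: `Layout2D`, `TensorView`, `tile_view`, and `tile_element` are hypothetical names invented for illustration, and the sketch assumes a flat rank-2 layout with runtime shape and stride, whereas CuTe's own layouts are hierarchical. It only shows the underlying principle that a layout is a function from a logical coordinate to a linear offset, and that tiling can be expressed by reusing that function.

```cpp
#include <cassert>
#include <vector>

// Hypothetical stand-in for a rank-2 layout; this is NOT the CuTe API.
// A layout is a function from a logical coordinate (i, j) to a linear offset.
struct Layout2D {
    int rows, cols;              // shape
    int row_stride, col_stride;  // stride

    constexpr int operator()(int i, int j) const {
        return i * row_stride + j * col_stride;
    }
};

constexpr Layout2D column_major(int m, int n) { return Layout2D{m, n, 1, m}; }
constexpr Layout2D row_major(int m, int n)    { return Layout2D{m, n, n, 1}; }

// A tensor view pairs a data pointer with a layout.
template <class T>
struct TensorView {
    T* data;
    Layout2D layout;
    T& operator()(int i, int j) const { return data[layout(i, j)]; }
};

// Tiling: select the (bm x bn) tile at tile coordinate (ti, tj).
// The tile keeps the parent's strides; only the base pointer and shape change.
template <class T>
TensorView<T> tile_view(TensorView<T> t, int bm, int bn, int ti, int tj) {
    return TensorView<T>{ t.data + t.layout(ti * bm, tj * bn),
                          Layout2D{bm, bn, t.layout.row_stride, t.layout.col_stride} };
}

// Fill a matrix so that element (r, c) holds 10*r + c, then read element
// (i, j) of the 2x3 tile at tile coordinate (ti, tj). The algorithm is
// written once against logical coordinates and works for any layout.
inline int tile_element(Layout2D layout, int ti, int tj, int i, int j) {
    std::vector<int> storage(layout.rows * layout.cols);
    TensorView<int> A{storage.data(), layout};
    for (int r = 0; r < layout.rows; ++r)
        for (int c = 0; c < layout.cols; ++c)
            A(r, c) = 10 * r + c;
    TensorView<int> tile = tile_view(A, 2, 3, ti, tj);
    return tile(i, j);
}
```

Calling `tile_element(column_major(4, 6), 1, 1, 1, 2)` and `tile_element(row_major(4, 6), 1, 1, 1, 2)` both read logical element (3, 5) of the matrix, even though the two layouts place that element at different offsets in memory. The indexing arithmetic lives in the layout, so the code that uses it does not change.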

CUTLASS 3.0 adopts CuTe throughout the GEMM hierarchy in its templates. This greatly simplifies the design and improves code composability and readability. More documentation specific to CuTe can be found in its dedicated documentation directory.
