[2.6] What is New #315

hwu36 (Maintainer) · 2021-09-07T17:57:29Z

2.6/2.6.1 is finally tagged. This release adds many new features and improves performance across the board. The README and CHANGELOG list all of the changes; a few more details follow below.

Two new complex epilogue fusion patterns: broadcast and reduction. Broadcast generates two matrices: the first is gemm+bias_vector and the second is gemm+bias_vector+relu. Reduction generates a normal result matrix with a customizable elementwise operation and, in addition, reduces the result along the N dimension to produce an Mx1 vector. Both can be used in HugeCTR.
Batched GEMV. It works better if the matrix is column-major.
Much better Clang support. Clang 13 with CUDA 11.4 can build and run kernels on Pascal and Ampere. Check the instructions.
We use a 128-byte L2 prefetch hint in all global memory loads.
A way to avoid fully unrolling the epilogue. This is useful when the elementwise operation is complex (e.g. gelu), because it significantly reduces I-Cache misses. Just set the compile-time constant kIsHeavy to true. See this example.
The GEMM stride is now 64-bit instead of 32-bit; the extent remains 32-bit.
Affine-2 GEMM, in which neither dimension of the matrix is contiguous.
Quaternion (a + bi + cj + dk) GEMM and convolution.
The CMake flag -DCUTLASS_NAMESPACE=xxx can be used to customize the top-level namespace and avoid namespace conflicts with other NVIDIA libraries.
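
To make the two epilogue fusion patterns concrete, here is a host-side reference sketch in plain C++ of what they compute. It is not the CUTLASS kernel or its API: `Matrix`, `reference_gemm`, `epilogue_broadcast`, and `epilogue_reduce_n` are hypothetical names made up for this illustration. It also assumes the bias vector has one entry per output row (length M); if the bias is per column instead, only the index into `bias` changes.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical helper type for this sketch: a dense row-major float matrix.
struct Matrix {
  int rows;
  int cols;
  std::vector<float> data;

  Matrix(int r, int c)
      : rows(r), cols(c), data(static_cast<std::size_t>(r) * c, 0.0f) {}

  float& at(int r, int c) { return data[static_cast<std::size_t>(r) * cols + c]; }
  float at(int r, int c) const { return data[static_cast<std::size_t>(r) * cols + c]; }
};

// Plain triple-loop GEMM: returns A (M x K) times B (K x N).
Matrix reference_gemm(const Matrix& a, const Matrix& b) {
  Matrix d(a.rows, b.cols);
  for (int m = 0; m < a.rows; ++m) {
    for (int n = 0; n < b.cols; ++n) {
      float acc = 0.0f;
      for (int k = 0; k < a.cols; ++k) acc += a.at(m, k) * b.at(k, n);
      d.at(m, n) = acc;
    }
  }
  return d;
}

struct BroadcastOutputs {
  Matrix z;  // gemm + bias_vector
  Matrix t;  // relu(gemm + bias_vector)
};

// Broadcast pattern: one pass over the GEMM result produces both outputs.
// Assumption: bias[m] is added to every element of row m.
BroadcastOutputs epilogue_broadcast(const Matrix& gemm, const std::vector<float>& bias) {
  BroadcastOutputs out{Matrix(gemm.rows, gemm.cols), Matrix(gemm.rows, gemm.cols)};
  for (int m = 0; m < gemm.rows; ++m) {
    for (int n = 0; n < gemm.cols; ++n) {
      const float z = gemm.at(m, n) + bias[m];
      out.z.at(m, n) = z;
      out.t.at(m, n) = std::max(z, 0.0f);
    }
  }
  return out;
}

// Reduction pattern: sum each row over the N dimension, giving an M x 1 vector.
std::vector<float> epilogue_reduce_n(const Matrix& d) {
  std::vector<float> reduced(d.rows, 0.0f);
  for (int m = 0; m < d.rows; ++m)
    for (int n = 0; n < d.cols; ++n) reduced[m] += d.at(m, n);
  return reduced;
}
```

A reference like this is mainly useful for checking a fused kernel's outputs on small problem sizes.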
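
The 64-bit stride change and the Affine-2 layout can be pictured together with a small addressing sketch. `Affine2Layout` below is a hypothetical type written for this post, not the CUTLASS layout class. It only illustrates the idea of an element offset computed from two independent strides, where neither stride is assumed to be 1, with 32-bit coordinates widened to 64 bits before the multiply.

```cpp
#include <cstdint>

// Hypothetical layout type for illustration only.
// Both strides are arbitrary, so neither rows nor columns need be contiguous.
struct Affine2Layout {
  std::int64_t stride_row;  // elements between consecutive rows
  std::int64_t stride_col;  // elements between consecutive columns

  // Coordinates (bounded by the 32-bit extent) are widened to 64 bits
  // before multiplying, so the offset cannot wrap around in 32-bit math.
  std::int64_t offset(std::int32_t row, std::int32_t col) const {
    return static_cast<std::int64_t>(row) * stride_row +
           static_cast<std::int64_t>(col) * stride_col;
  }
};
```

For example, with `stride_row = 70000` and `stride_col = 3`, the element at (50000, 50000) sits at offset 3500150000, which does not fit in a signed 32-bit integer even though both coordinates do.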
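
For the quaternion item, the scalar multiply-accumulate inside the GEMM becomes the Hamilton product, which is not commutative. The sketch below shows that arithmetic in plain C++. `Quat` and `quat_dot` are hypothetical names for this illustration and not the library's own types; a quaternion GEMM computes one such dot product per output element.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical quaternion type for illustration: a + b*i + c*j + d*k.
struct Quat {
  float a, b, c, d;
};

Quat operator+(const Quat& x, const Quat& y) {
  return Quat{x.a + y.a, x.b + y.b, x.c + y.c, x.d + y.d};
}

// Hamilton product. Operand order matters: x * y != y * x in general.
Quat operator*(const Quat& x, const Quat& y) {
  return Quat{
      x.a * y.a - x.b * y.b - x.c * y.c - x.d * y.d,
      x.a * y.b + x.b * y.a + x.c * y.d - x.d * y.c,
      x.a * y.c - x.b * y.d + x.c * y.a + x.d * y.b,
      x.a * y.d + x.b * y.c - x.c * y.b + x.d * y.a};
}

// Sum of lhs[k] * rhs[k] over k, keeping the left/right operand order.
// Assumes both vectors have the same length.
Quat quat_dot(const std::vector<Quat>& lhs, const std::vector<Quat>& rhs) {
  Quat acc{0.0f, 0.0f, 0.0f, 0.0f};
  for (std::size_t k = 0; k < lhs.size(); ++k) acc = acc + lhs[k] * rhs[k];
  return acc;
}
```

Because the product is non-commutative, a reference implementation has to keep the A operand on the left and the B operand on the right throughout the accumulation.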
