2.6/2.6.1 is finally tagged. This release adds many new features and improves performance across the board. The README and CHANGELOG list all of them; below are a few more details.
Two new complex epilogue fusion patterns: broadcast and reduction. Broadcast generates two matrices: the first is gemm+bias_vector and the second is gemm+bias_vector+relu. Reduction generates a normal result matrix with a customizable elementwise operation and, in addition, reduces the result along the N dimension to produce an Mx1 vector. Both patterns can be used in HugeCTR.
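To make the two patterns concrete, here is a host-side C++ reference of what each fused epilogue computes. It is a sketch, not CUTLASS's API. The function names are hypothetical, matrices are row-major, and the bias vector is assumed to hold one entry per column (length N). The reduction is assumed to be a sum, and the elementwise operation in the reduction case is assumed to be a simple scale by alpha.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical reference (not CUTLASS code). acc is the M x N GEMM
// accumulator A*B, stored row-major.
struct BroadcastOutputs {
  std::vector<float> gemm_bias;       // acc + bias
  std::vector<float> gemm_bias_relu;  // relu(acc + bias)
};

// Broadcast pattern: the bias vector (assumed length N, one entry per column)
// is broadcast across all M rows, and two matrices are written out.
BroadcastOutputs broadcast_epilogue(const std::vector<float>& acc,
                                    const std::vector<float>& bias, int M, int N) {
  BroadcastOutputs out{std::vector<float>(acc.size()),
                       std::vector<float>(acc.size())};
  for (int m = 0; m < M; ++m) {
    for (int n = 0; n < N; ++n) {
      const std::size_t idx = std::size_t(m) * N + n;
      const float v = acc[idx] + bias[n];
      out.gemm_bias[idx] = v;
      out.gemm_bias_relu[idx] = std::max(v, 0.0f);
    }
  }
  return out;
}

// Reduction pattern: write the usual M x N result of an elementwise op
// (assumed here to be alpha * acc) and also reduce each row over the N
// dimension (assumed to be a sum), giving an M x 1 vector.
void reduction_epilogue(const std::vector<float>& acc, float alpha, int M, int N,
                        std::vector<float>& result, std::vector<float>& row_sums) {
  result.assign(acc.size(), 0.0f);
  row_sums.assign(std::size_t(M), 0.0f);
  for (int m = 0; m < M; ++m) {
    for (int n = 0; n < N; ++n) {
      const std::size_t idx = std::size_t(m) * N + n;
      result[idx] = alpha * acc[idx];
      row_sums[std::size_t(m)] += result[idx];
    }
  }
}
```

On the GPU the point of fusing these steps into the epilogue is that the accumulator does not have to be written to global memory and read back by a separate kernel.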
Batched GEMV. It performs better when the matrix is column-major.
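A host-side reference may help explain the column-major remark. In this hypothetical sketch (the function name and the batch layout are assumptions), each column-major A[b] is walked one column at a time, so the inner loop reads contiguous memory. This is one plausible reason for the layout preference, not a description of the CUTLASS kernel.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical reference for batched GEMV: y[b] = A[b] * x[b] for b in [0, batch).
// Each A[b] is M x N, column-major with leading dimension lda (>= M), and
// consecutive batches start batch_stride elements apart in A.
// x holds batch vectors of length N back to back; y gets batch vectors of length M.
void batched_gemv_colmajor(const std::vector<float>& A, const std::vector<float>& x,
                           std::vector<float>& y, int M, int N, int lda,
                           std::ptrdiff_t batch_stride, int batch) {
  y.assign(std::size_t(batch) * M, 0.0f);
  for (int b = 0; b < batch; ++b) {
    const float* Ab = A.data() + b * batch_stride;
    const float* xb = x.data() + std::size_t(b) * N;
    float* yb = y.data() + std::size_t(b) * M;
    // Column-major: column n of A[b] is contiguous, so the inner loop over m
    // is a unit-stride update y += x[n] * A(:, n).
    for (int n = 0; n < N; ++n) {
      for (int m = 0; m < M; ++m) {
        yb[m] += Ab[std::size_t(n) * lda + m] * xb[n];
      }
    }
  }
}
```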
Much better Clang support: Clang 13 with CUDA 11.4 can build and run kernels for Pascal and Ampere. Check the instructions.
All global memory loads now use the 128-byte L2 prefetch hint.
Provides a way to avoid fully unrolling the epilogue. This significantly reduces I-cache misses when the elementwise operation is complex (e.g. GELU). Just set the compile-time constant kIsHeavy to true. See this example.
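The idea behind kIsHeavy can be sketched in plain C++17. Everything here is hypothetical: GeluOp, ScaleOp, is_heavy, and apply_epilogue are made-up names, and only the kIsHeavy member mirrors the text. A light op is expanded into one copy of its body per element (full unrolling). An op that declares kIsHeavy = true is applied in a rolled loop instead, which keeps the generated code small.

```cpp
#include <array>
#include <cmath>
#include <cstddef>
#include <type_traits>
#include <utility>

// Hypothetical heavy op: GELU via erf. kIsHeavy marks it as expensive.
struct GeluOp {
  static constexpr bool kIsHeavy = true;
  float operator()(float x) const {
    return 0.5f * x * (1.0f + std::erf(x / std::sqrt(2.0f)));
  }
};

// Hypothetical light op: it does not declare kIsHeavy at all.
struct ScaleOp {
  float alpha;
  float operator()(float x) const { return alpha * x; }
};

// is_heavy<Op> equals Op::kIsHeavy when that member exists, false otherwise.
template <typename Op, typename = void>
struct is_heavy : std::false_type {};

template <typename Op>
struct is_heavy<Op, std::void_t<decltype(Op::kIsHeavy)>>
    : std::integral_constant<bool, Op::kIsHeavy> {};

// Full unroll: the fold expression emits one call of op per fragment element.
template <typename Op, std::size_t N, std::size_t... I>
void apply_unrolled(const Op& op, std::array<float, N>& frag,
                    std::index_sequence<I...>) {
  ((frag[I] = op(frag[I])), ...);
}

// Toy epilogue over one fragment: rolled loop for heavy ops, full unroll otherwise.
template <typename Op, std::size_t N>
void apply_epilogue(const Op& op, std::array<float, N>& frag) {
  if constexpr (is_heavy<Op>::value) {
    for (std::size_t i = 0; i < N; ++i) frag[i] = op(frag[i]);
  } else {
    apply_unrolled(op, frag, std::make_index_sequence<N>{});
  }
}
```

In CUDA device code the unroll factor is usually controlled with `#pragma unroll`. The fold expression here only makes the two strategies visible in portable C++, and an optimizing compiler remains free to unroll or re-roll either path.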
The GEMM stride is now 64-bit instead of 32-bit; the extent remains 32-bit.
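The stride needs 64 bits while the extents do not because each extent bounds one dimension on its own. An element offset, however, is the product row * stride, and that product can exceed INT32_MAX even when both factors fit in 32 bits. Here is a minimal sketch with a hypothetical MatrixView type (not a CUTLASS class):

```cpp
#include <cstdint>

// Hypothetical view: extents stay 32-bit, the stride is 64-bit.
struct MatrixView {
  std::int32_t rows;     // extent (32-bit)
  std::int32_t columns;  // extent (32-bit)
  std::int64_t stride;   // row stride / leading dimension (64-bit)
};

// Offset of (row, col) in a row-major view. Widening row to 64-bit before the
// multiply keeps row * stride from overflowing; with a 32-bit stride the
// product would be computed in 32-bit signed arithmetic, which overflows
// (undefined behavior) once it passes INT32_MAX.
std::int64_t element_offset(const MatrixView& v, std::int32_t row, std::int32_t col) {
  return std::int64_t(row) * v.stride + col;
}
```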
Affine-2 GEMM, in which neither dimension of the matrix is required to be contiguous.
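An affine-2 layout addresses element (i, j) as i * stride_row + j * stride_col with both strides arbitrary. This means a strided slice or a transposed view can be used directly. The sketch below is a hypothetical host reference, not CUTLASS's layout class; Affine2View and gemm_affine2 are made-up names.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical affine-2 view: element (i, j) lives at
// data[i * stride_row + j * stride_col]; neither stride has to be 1.
struct Affine2View {
  const float* data;
  std::int64_t stride_row;
  std::int64_t stride_col;
  float operator()(int i, int j) const {
    return data[i * stride_row + j * stride_col];
  }
};

// Reference GEMM C = A * B with A (M x K) and B (K x N) given as affine-2
// views; the output C is a dense row-major M x N matrix.
std::vector<float> gemm_affine2(Affine2View A, Affine2View B, int M, int N, int K) {
  std::vector<float> C(std::size_t(M) * N, 0.0f);
  for (int i = 0; i < M; ++i) {
    for (int j = 0; j < N; ++j) {
      float acc = 0.0f;
      for (int k = 0; k < K; ++k) acc += A(i, k) * B(k, j);
      C[std::size_t(i) * N + j] = acc;
    }
  }
  return C;
}
```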
Quaternion (a + bi + cj + dk) GEMM and convolution.
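A quaternion GEMM replaces the real multiply-add with the Hamilton product. That product is not commutative, so operand order inside each inner product matters. Below is a hypothetical host reference (the type and function names are assumptions). A quaternion convolution would likewise substitute this multiply-add for the scalar one.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical quaternion type q = a + b*i + c*j + d*k.
struct Quaternion {
  float a, b, c, d;
};

// Hamilton product (i*i = j*j = k*k = i*j*k = -1). Not commutative: i*j = k, j*i = -k.
Quaternion operator*(const Quaternion& p, const Quaternion& q) {
  return {p.a * q.a - p.b * q.b - p.c * q.c - p.d * q.d,
          p.a * q.b + p.b * q.a + p.c * q.d - p.d * q.c,
          p.a * q.c - p.b * q.d + p.c * q.a + p.d * q.b,
          p.a * q.d + p.b * q.c - p.c * q.b + p.d * q.a};
}

Quaternion operator+(const Quaternion& p, const Quaternion& q) {
  return {p.a + q.a, p.b + q.b, p.c + q.c, p.d + q.d};
}

// Reference GEMM over quaternions, row-major: C = A * B with A (M x K), B (K x N).
// Each term is A(m, k) * B(k, n) in that order, because the product does not commute.
std::vector<Quaternion> quaternion_gemm(const std::vector<Quaternion>& A,
                                        const std::vector<Quaternion>& B,
                                        int M, int N, int K) {
  std::vector<Quaternion> C(std::size_t(M) * N, Quaternion{0.0f, 0.0f, 0.0f, 0.0f});
  for (int m = 0; m < M; ++m) {
    for (int n = 0; n < N; ++n) {
      Quaternion acc{0.0f, 0.0f, 0.0f, 0.0f};
      for (int k = 0; k < K; ++k)
        acc = acc + A[std::size_t(m) * K + k] * B[std::size_t(k) * N + n];
      C[std::size_t(m) * N + n] = acc;
    }
  }
  return C;
}
```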
The CMake flag -DCUTLASS_NAMESPACE=xxx customizes the top-level namespace to avoid namespace conflicts with other NVIDIA libraries.
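A build flag like this typically takes effect by becoming a preprocessor definition that names the top-level namespace. The sketch below shows that generic pattern with a made-up MYLIB_NAMESPACE macro. It is an illustrative assumption, not CUTLASS's actual source.

```cpp
// Hypothetical sketch of a macro-configurable top-level namespace. Compiling
// with e.g. -DMYLIB_NAMESPACE=mylib_private (a made-up name) renames the
// namespace, so two builds of the library can coexist in one binary.
#ifndef MYLIB_NAMESPACE
#define MYLIB_NAMESPACE mylib  // default name when no definition is given
#endif

namespace MYLIB_NAMESPACE {
inline int answer() { return 42; }
}  // namespace MYLIB_NAMESPACE
```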