We have picked up the pace and released 2.7 shortly after 2.6.1. The README and CHANGELOG list all of the new additions in 2.7. Below are a few more details:
A smoking fast strided dgrad kernel that smartly cuts down unnecessary computations. This kernel is used when any stride is larger than 1. To use it, set the StrideSupport template argument to StrideSupport::kStrided (see the sketch below). Now that we have optimized implementations for all convolution kernels, we have changed the default convolution algorithm from kAnalytic to kOptimized. The profiler only generates the optimized convolution kernels by default.
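A minimal sketch of opting in to the strided dgrad kernel, loosely following the pattern in CUTLASS's conv unit tests. The tile shapes, element types, and the threadblock swizzle below are illustrative assumptions, and the exact template parameter list can differ between CUTLASS versions, so check the headers in your tree:

```cpp
#include "cutlass/conv/kernel/default_conv2d_dgrad.h"
#include "cutlass/conv/device/implicit_gemm_convolution.h"

// Illustrative fp16 NHWC dgrad on SM80; the last two template arguments select
// the optimized iterator algorithm and the new strided-dgrad support.
using DgradKernel = typename cutlass::conv::kernel::DefaultConv2dDgrad<
    cutlass::half_t, cutlass::layout::TensorNHWC,   // A: output gradient
    cutlass::half_t, cutlass::layout::TensorNHWC,   // B: filter
    cutlass::half_t, cutlass::layout::TensorNHWC,   // C: input gradient
    float,                                          // accumulator
    cutlass::arch::OpClassTensorOp,                 // tensor cores
    cutlass::arch::Sm80,
    cutlass::gemm::GemmShape<128, 128, 32>,         // threadblock tile
    cutlass::gemm::GemmShape<64, 64, 32>,           // warp tile
    cutlass::gemm::GemmShape<16, 8, 16>,            // MMA instruction shape
    cutlass::epilogue::thread::LinearCombination<
        cutlass::half_t, 8, float, float>,
    cutlass::conv::threadblock::StridedDgradIdentityThreadblockSwizzle<1>,
    3,                                              // pipeline stages
    cutlass::arch::OpMultiplyAdd,
    cutlass::conv::IteratorAlgorithm::kOptimized,   // the new default
    cutlass::conv::StrideSupport::kStrided          // enable strided dgrad
>::Kernel;

using Dgrad = cutlass::conv::device::ImplicitGemmConvolution<DgradKernel>;
```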
In the convolution kernels, we no longer require the channel count to be 128-bit aligned to use tensor cores, so you no longer have to pad your tensors to meet this requirement. That said, 128-bit alignment still delivers the best performance. This was implemented by @mengchihe from the community. Thank you very much!
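For example, a problem whose fp16 channel count is not a multiple of 8 elements (128 bits) can now run on tensor cores; the concrete sizes below are made up for illustration:

```cpp
#include "cutlass/conv/conv2d_problem_size.h"

// fp16 with C = 2 is only 32-bit aligned; before 2.7 a tensor-core kernel
// would have required padding C up to a multiple of 8 elements.
cutlass::conv::Conv2dProblemSize problem(
    {16, 224, 224, 2},   // input NHWC, C = 2
    {64, 7, 7, 2},       // filter KRSC
    {3, 3, 3, 3},        // padding
    {2, 2},              // conv stride
    {1, 1});             // dilation
```

The kernel itself still has to be instantiated with a matching (smaller) operand alignment; what was removed is the hard 128-bit floor.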
We added a mainloop fusion kernel in example 23. This kernel can reduce one of the GEMM operands along the GEMM-k dimension while doing the GEMM. It produces an additional 1xM or 1xN vector output, depending on which operand is reduced. The additional reduction operation adds almost no runtime overhead. This can be used in Megatron.
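To make the extra output concrete, here is a plain C++ reference of what the fused kernel computes when operand A is the one reduced (so the extra output is a length-M vector). This is reference semantics only, not the CUTLASS kernel, and the function name is made up:

```cpp
#include <vector>

// D = A * B as usual, plus v[i] = sum over k of A[i][k], produced in one pass.
void gemm_with_operand_a_reduction(
    int M, int N, int K,
    const std::vector<float> &A,   // M x K, row-major
    const std::vector<float> &B,   // K x N, row-major
    std::vector<float> &D,         // M x N, row-major
    std::vector<float> &v) {       // length M: reduction of A along k
  for (int i = 0; i < M; ++i) {
    float row_sum = 0.f;
    for (int k = 0; k < K; ++k) {
      row_sum += A[i * K + k];
    }
    v[i] = row_sum;                // the extra 1xM vector output
    for (int j = 0; j < N; ++j) {
      float acc = 0.f;
      for (int k = 0; k < K; ++k) {
        acc += A[i * K + k] * B[k * N + j];
      }
      D[i * N + j] = acc;
    }
  }
}
```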
We added fp16 acceleration for gelu_taylor. The same idea can be applied to other activation functions.
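The gist of the fp16 path is to keep the whole activation in packed half2 arithmetic so each instruction processes two elements at once. A hedged CUDA sketch of that idea, using the tanh-based GELU approximation; tanh is built from h2exp here, while CUTLASS's actual implementation may use a different fast-tanh path:

```cpp
#include <cuda_fp16.h>

// tanh(x) = (exp(2x) - 1) / (exp(2x) + 1), computed on two halves at once.
__device__ __half2 tanh_half2(__half2 x) {
  __half2 one = __float2half2_rn(1.0f);
  __half2 e = h2exp(__hadd2(x, x));   // exp(2x)
  return __h2div(__hsub2(e, one), __hadd2(e, one));
}

// GELU(x) ~= 0.5 * x * (1 + tanh(sqrt(2/pi) * x * (1 + 0.044715 * x^2))),
// evaluated entirely in fp16 on a packed pair of values.
__device__ __half2 gelu_taylor_half2(__half2 x) {
  const __half2 k0    = __float2half2_rn(0.7978845608f);  // sqrt(2/pi)
  const __half2 k1    = __float2half2_rn(0.044715f);
  const __half2 khalf = __float2half2_rn(0.5f);
  const __half2 one   = __float2half2_rn(1.0f);
  __half2 x2    = __hmul2(x, x);
  __half2 inner = __hmul2(k0, __hmul2(x, __hadd2(one, __hmul2(k1, x2))));
  __half2 t     = tanh_half2(inner);
  return __hmul2(khalf, __hmul2(x, __hadd2(one, t)));
}
```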
We sped up the convolution unit tests by 40x without losing coverage.