How does |
The best compiler so far for Ampere.
CUTLASS wgrad kernels improve by 14% in geomean across the layers of ResNet-50. The maximum improvement is 37%, and no layer regresses.
CUDA 11.4 adds many new features to the `ld` and `cp.async` PTX instructions (https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#ptx-isa-version-7-4). The upcoming CUTLASS 2.6 will support `prefetch_size`, which can slightly improve the performance of many kernels. If you cannot wait, you can add the qualifiers yourself to https://github.com/NVIDIA/cutlass/blob/master/include/cutlass/arch/memory.h and https://github.com/NVIDIA/cutlass/blob/master/include/cutlass/arch/memory_sm80.h, for example `cp.async.ca.shared.global.L2::128B`. Also, please let us know if you find the new cache eviction policy feature helpful for your applications. We can consider supporting it in future releases.
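For anyone who wants to try the `L2::128B` qualifier before CUTLASS 2.6 lands, here is a minimal standalone sketch of what such a copy might look like with inline PTX. It is not the CUTLASS implementation: the helper names `cp_async_16B_l2_128B` and `cp_async_wait_all`, the kernel, and the tile sizes are all made up for illustration. It assumes CUDA 11.4 or newer and an sm_80 target (for example `nvcc -arch=sm_80`).

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical helper (not a CUTLASS API): asynchronously copy 16 bytes from
// global to shared memory, asking L2 to prefetch 128 bytes. Both pointers
// must be 16-byte aligned.
__device__ __forceinline__ void cp_async_16B_l2_128B(void* smem_ptr, const void* gmem_ptr) {
#if defined(__CUDA_ARCH__) && __CUDA_ARCH__ >= 800
  unsigned smem_addr = static_cast<unsigned>(__cvta_generic_to_shared(smem_ptr));
  size_t gmem_addr = __cvta_generic_to_global(gmem_ptr);
  asm volatile("cp.async.ca.shared.global.L2::128B [%0], [%1], 16;\n"
               :: "r"(smem_addr), "l"(gmem_addr) : "memory");
#else
  // Fallback for older architectures: plain synchronous 16-byte copy.
  *reinterpret_cast<uint4*>(smem_ptr) = *reinterpret_cast<const uint4*>(gmem_ptr);
#endif
}

// Hypothetical helper: wait until all cp.async copies issued by this thread finish.
__device__ __forceinline__ void cp_async_wait_all() {
#if defined(__CUDA_ARCH__) && __CUDA_ARCH__ >= 800
  asm volatile("cp.async.wait_all;\n" ::: "memory");
#endif
}

constexpr int kThreads = 128;        // threads per block (illustrative choice)
constexpr int kFloatsPerThread = 4;  // 4 floats = 16 bytes per cp.async

// Stages each block's tile through shared memory, then writes it back out.
__global__ void copy_through_smem(const float* in, float* out, int n) {
  __shared__ __align__(16) float tile[kThreads * kFloatsPerThread];
  int local = threadIdx.x * kFloatsPerThread;
  int global = (blockIdx.x * blockDim.x + threadIdx.x) * kFloatsPerThread;
  bool in_range = global + kFloatsPerThread <= n;

  if (in_range) {
    cp_async_16B_l2_128B(&tile[local], &in[global]);
  }
  cp_async_wait_all();
  __syncthreads();

  if (in_range) {
    for (int k = 0; k < kFloatsPerThread; ++k) {
      out[global + k] = tile[local + k];
    }
  }
}

int main() {
  const int n = 1024;  // multiple of kThreads * kFloatsPerThread
  std::vector<float> host_in(n), host_out(n, 0.0f);
  for (int i = 0; i < n; ++i) host_in[i] = static_cast<float>(i);

  float *dev_in = nullptr, *dev_out = nullptr;
  cudaMalloc(&dev_in, n * sizeof(float));
  cudaMalloc(&dev_out, n * sizeof(float));
  cudaMemcpy(dev_in, host_in.data(), n * sizeof(float), cudaMemcpyHostToDevice);

  int blocks = n / (kThreads * kFloatsPerThread);
  copy_through_smem<<<blocks, kThreads>>>(dev_in, dev_out, n);
  cudaError_t err = cudaDeviceSynchronize();
  cudaMemcpy(host_out.data(), dev_out, n * sizeof(float), cudaMemcpyDeviceToHost);

  int mismatches = 0;
  for (int i = 0; i < n; ++i) mismatches += (host_out[i] != host_in[i]);
  printf("status=%s mismatches=%d\n", cudaGetErrorString(err), mismatches);

  cudaFree(dev_in);
  cudaFree(dev_out);
  return (err == cudaSuccess && mismatches == 0) ? 0 : 1;
}
```

The prefetch size is only a hint, so the copy result should be the same with or without the qualifier. Whether it actually helps is something to measure on your own kernels, for example by swapping `.L2::128B` for `.L2::64B` or `.L2::256B`, or removing it entirely.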