The question is just as in the title. For instance:

auto tiled_mma = make_tiled_mma(SM80_16x8x16_F32F16F16F32_TN{}, Layout<Shape<_1, _1, _2>>{});

Intuitively, the number of threads in each CTA will be doubled, but does each thread's register fragment then hold only the reduced sum of half of the

Thanks!
Answered by ccecka, Mar 28, 2024
That's right! The kernel would probably want to perform some kind of reduction in the epilogue or atomically update the global memory tile.
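To make the partial-sum point concrete, here is a minimal host-side sketch in plain C++. It is not CuTe code and does not model the real thread-to-register mapping: it only assumes that the `_2` in the K mode gives two groups, each accumulating one half of K into its own fragment. All names (`partial_gemm`, `reduce_partials`, `make_a`, `make_b`) and the tile sizes are hypothetical, chosen for illustration.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Hypothetical sizes: one 16x8 output tile with K = 32, split into two
// K-slices of 16 (standing in for the two groups created by the K mode).
constexpr std::size_t M = 16, N = 8, K = 32;

using Tile = std::array<float, M * N>;   // accumulator tile, row-major M x N
using MatA = std::array<float, M * K>;   // A stored as M x K, row-major
using MatB = std::array<float, N * K>;   // B stored as N x K, row-major

// Small integer-valued inputs so that float sums are exact.
MatA make_a() {
    MatA a{};
    for (std::size_t m = 0; m < M; ++m)
        for (std::size_t k = 0; k < K; ++k)
            a[m * K + k] = static_cast<float>((m + k) % 3);
    return a;
}

MatB make_b() {
    MatB b{};
    for (std::size_t n = 0; n < N; ++n)
        for (std::size_t k = 0; k < K; ++k)
            b[n * K + k] = static_cast<float>((n + 2 * k) % 5);
    return b;
}

// Accumulate only k in [k_begin, k_end): this is what one group's
// fragment would hold if it sees only its own slice of K.
Tile partial_gemm(const MatA& a, const MatB& b,
                  std::size_t k_begin, std::size_t k_end) {
    assert(k_begin <= k_end && k_end <= K);
    Tile c{};
    for (std::size_t m = 0; m < M; ++m)
        for (std::size_t n = 0; n < N; ++n)
            for (std::size_t k = k_begin; k < k_end; ++k)
                c[m * N + n] += a[m * K + k] * b[n * K + k];
    return c;
}

// Epilogue-style reduction: sum the two partial tiles element-wise.
Tile reduce_partials(const Tile& p0, const Tile& p1) {
    Tile c{};
    for (std::size_t i = 0; i < c.size(); ++i)
        c[i] = p0[i] + p1[i];
    return c;
}
```

Calling `partial_gemm(a, b, 0, K / 2)` and `partial_gemm(a, b, K / 2, K)` gives two tiles, neither of which is the full product; `reduce_partials` of the two matches `partial_gemm(a, b, 0, K)`. On the GPU, the same combination step would be the epilogue reduction or the atomic update of the global memory tile mentioned in the answer.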
Answer selected by
hyhieu