Replies: 2 comments 3 replies
-
You are correct on everything you said. The load/store size has to match the memory alignment. For example, an fp16 align1 tensor is only 2B (= sizeof(fp16)) aligned, so we need to use ld.2B to load it. An fp16 align8 tensor is 16B aligned, so we can use ld.16B, which is more efficient.
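To make the arithmetic above concrete, here is a minimal host-side sketch. The helper names `access_bytes` and `loads_needed` are hypothetical and are not part of CUTLASS; the sketch only assumes that the `align` value counts elements, as the fp16 example above implies.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper (not a CUTLASS API): bytes covered by one vectorized
// access when a tensor is aligned to `align_elems` elements of `elem_bytes` each.
constexpr std::size_t access_bytes(std::size_t align_elems, std::size_t elem_bytes) {
    return align_elems * elem_bytes;
}

// Hypothetical helper: number of load instructions needed to move
// `total_bytes` when each load moves `width_bytes`.
inline std::size_t loads_needed(std::size_t total_bytes, std::size_t width_bytes) {
    assert(width_bytes != 0 && total_bytes % width_bytes == 0);
    return total_bytes / width_bytes;
}

constexpr std::size_t kHalfBytes = 2;  // sizeof(fp16)

// align1 fp16 -> 2B accesses, align8 fp16 -> 16B accesses.
static_assert(access_bytes(1, kHalfBytes) == 2, "align1 fp16 uses 2B loads");
static_assert(access_bytes(8, kHalfBytes) == 16, "align8 fp16 uses 16B loads");
```

Under these assumptions, moving 128B takes 64 loads at 2B each but only 8 loads at 16B each, which is one way to see why the align8 kernels can be noticeably faster.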
-
The problem is that an arbitrary pointer may not always be aligned to 256B. If it is only aligned to 8B, you cannot use ld.16B on that pointer.
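A rough way to check this for a given pointer is sketched below. `is_aligned` and `max_access_bytes` are hypothetical helpers written for illustration, not CUTLASS or CUDA functions, and the 16B cap is an assumption taken from the ld.16B example above.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// True if `ptr` is aligned to `bytes`; `bytes` must be a power of two.
inline bool is_aligned(const void* ptr, std::size_t bytes) {
    assert(bytes != 0 && (bytes & (bytes - 1)) == 0);
    return (reinterpret_cast<std::uintptr_t>(ptr) & (bytes - 1)) == 0;
}

// Widest power-of-two access width, capped at `max_bytes` (itself a power
// of two), that the address of `ptr` allows.
inline std::size_t max_access_bytes(const void* ptr, std::size_t max_bytes = 16) {
    std::size_t width = max_bytes;
    while (width > 1 && !is_aligned(ptr, width)) {
        width /= 2;  // fall back to the next narrower access
    }
    return width;
}
```

For example, if a buffer starts on a 16B boundary, a pointer 8 bytes into it is only 8B aligned, so `max_access_bytes` returns 8 and a 16B load on that pointer would be misaligned.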
-
Hi all,
I was wondering what `align` is in cutlass_profiler generated kernel names. Is it the alignment of the global arrays? Or is it something else? Seems like it affects the performance severely.