What is "align" CL argument in cutlass_profiler? #226

navdeepkk · 2021-04-06T13:07:43Z

Hi all,
I was wondering what align means in the kernel names generated by cutlass_profiler. Is it the alignment of the global arrays, or something else? It seems to affect performance severely.

hwu36 · 2021-04-06T14:04:45Z

Maintainer

You are correct on everything you said. The load/store size has to match the memory alignment. For example, an fp16 align1 tensor is 2B (= sizeof(fp16)) aligned, so we need to use ld.2B to load it. An fp16 align8 tensor is 16B aligned, so we can use ld.16B, which is more efficient.
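The arithmetic in this reply can be sketched in a few lines of host-side C++. This is only an illustration, not CUTLASS code: `half_storage_t` is a hypothetical 2-byte stand-in for fp16, and `access_bytes` is a made-up helper name.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical stand-in for fp16 storage (assumption: 2 bytes per element).
using half_storage_t = std::uint16_t;

// Width in bytes of one vectorized access that moves `align_elems`
// contiguous elements of type T.
template <typename T>
inline std::size_t access_bytes(std::size_t align_elems) {
    assert(align_elems > 0);
    return align_elems * sizeof(T);
}
```

Under these assumptions, `access_bytes<half_storage_t>(1)` gives 2 (the ld.2B case) and `access_bytes<half_storage_t>(8)` gives 16 (the ld.16B case).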

1 reply

navdeepkk Apr 6, 2021
Author

Thanks for the quick reply. This makes me think that align actually translates to the vector load/store width, since the CUDA programming guide states: "Any address of a variable residing in global memory or returned by one of the memory allocation routines from the driver or runtime API is always aligned to at least 256 bytes." Is that correct? In that case, align=1, 2, 4, and 8 are all always valid and differ only in efficiency.

hwu36 · 2021-04-06T16:28:11Z

Maintainer

The problem is that an arbitrary pointer may not always be aligned to 256B. If it is only aligned to 8B, you cannot use ld.16B on that pointer.
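A minimal host-side sketch of the check implied here, assuming a plain byte buffer; `is_aligned_to` is a hypothetical helper, not a CUTLASS or CUDA API.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Returns true when `ptr` lies on a multiple of `bytes`.
// Assumption: `bytes` is a nonzero power of two (2, 4, 8, 16, ...).
inline bool is_aligned_to(const void* ptr, std::size_t bytes) {
    assert(bytes != 0 && (bytes & (bytes - 1)) == 0);
    return (reinterpret_cast<std::uintptr_t>(ptr) & (bytes - 1)) == 0;
}
```

For a buffer whose base is 16B aligned, `base + 8` passes the 8B check but fails the 16B check, which is the case where a 16-byte load is not allowed.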

2 replies

navdeepkk Jun 2, 2021
Author

Thanks for the reply. However, I cannot come up with a scenario for the previous statement. I think you mean that if there are 128 elements of 1B each to copy, then issuing ld.16B on the 9th element would be invalid? Does the program crash in this case? Also, in mixed-precision matmul, I don't see this happening while copying tiles from global to shared memory, as long as the threadblock tile sizes are multiples of 8 elements in the leading dimension. Is there a case in which the scenario you mentioned may occur?

Thanks!!

hwu36 Jun 2, 2021
Maintainer

One scenario is running GEMM on two sub-matrices located inside a bigger matrix. The start address can then be arbitrary. For example, if the start address of one sub-matrix is 0x2, then you have to use ld.2B to do the loading.
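The sub-matrix scenario can be sketched as a small host-side calculation. Everything here is an assumption for illustration: `max_align_elems` is a hypothetical helper, the cap of 8 mirrors the align values mentioned in this thread, and the leading-dimension check is added on the assumption that every row start must be as aligned as the first.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Largest align value (in elements, halving down from `max_align`) that a
// sub-matrix view could use, given its start address, its leading
// dimension in elements, and the element size in bytes.
// Assumption: `max_align` is a power of two.
inline std::size_t max_align_elems(std::uintptr_t start_addr,
                                   std::size_t ld_elems,
                                   std::size_t elem_bytes,
                                   std::size_t max_align = 8) {
    assert(elem_bytes > 0 && max_align > 0);
    std::size_t align = max_align;
    while (align > 1) {
        const std::size_t bytes = align * elem_bytes;
        const bool start_ok = (start_addr % bytes) == 0;
        const bool rows_ok = ((ld_elems * elem_bytes) % bytes) == 0;
        if (start_ok && rows_ok) break;  // every row start is aligned
        align /= 2;                      // try the next narrower access
    }
    return align;
}
```

With 2-byte elements and a leading dimension of 1024, a start address of 0x2 yields 1 (only ld.2B is usable), while a start address of 0x1000 yields 8.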
