-
Are you running int4b? k = 1024 is pretty small for int4b; sparse is good only when the problem size is big. As to tile size selection, you could use the CUTLASS profiler to pick one: https://github.com/NVIDIA/cutlass/blob/master/media/docs/profiler.md
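For example, to profile only your problem size (the kernel filter below is just an illustrative substring pattern; see the profiler docs linked above for the full flag list):

```sh
# Restrict profiling to the problem size in question and dump a CSV report.
./tools/profiler/cutlass_profiler \
  --kernels=gemm \
  --m=512 --n=512 --k=1024 \
  --output=report.csv
```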
-
Thanks for your reply. I ran cutlass_profiler for m, n, k = 256, 512, 1024, 2048, and CUTLASS sparse GEMM turns out to be consistently 2x faster than ordinary CUTLASS GEMM on these matrix sizes. So I've applied the tile sizes that the profiler reported as fastest.
(Thanks for the stream-K suggestion; that's certainly also something worth checking.)
-
Correct.

You can choose a different stage count: usually any number >= 3 is fine, as long as there is enough shared memory. We did some profiling to choose 6 for this tile size, but it may not work best for your problem size.

As to the selection of the threadblock and warp tile sizes, the best way is to use the profiler to find the best one. There are many factors that can change the performance, and it is sometimes hard to explain. Usually a big problem size does better with a big threadblock tile size, and a small problem size does better with a small threadblock tile size. A big tile size is more efficient than a small tile size; however, occupancy, wave quantization, workload distribution, and memory/cache efficiency can impact the performance too.

We usually use 4 or 8 warps. When we divide a threadblock into warps, we prefer square-ish warp tiles, and we also want each warp tile to be as big as possible.
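To make the warp-count arithmetic concrete, here is a minimal sketch (the tile shapes are illustrative, not prescriptive): the threadblock tile is covered by warp tiles, and the quotient of the two shapes gives the number of warps per threadblock.

```cpp
#include <cstdio>
#include "cutlass/gemm/gemm.h"  // cutlass::gemm::GemmShape

// Illustrative tile shapes (not necessarily the ones in the example).
using ThreadblockShape = cutlass::gemm::GemmShape<128, 128, 64>;
using WarpShape        = cutlass::gemm::GemmShape<64, 64, 64>;

// A threadblock tile is covered by (M/Mw) x (N/Nw) x (K/Kw) warp tiles.
constexpr int kWarpsM = ThreadblockShape::kM / WarpShape::kM;  // 2
constexpr int kWarpsN = ThreadblockShape::kN / WarpShape::kN;  // 2
constexpr int kWarpsK = ThreadblockShape::kK / WarpShape::kK;  // 1
constexpr int kWarpCount = kWarpsM * kWarpsN * kWarpsK;        // 4 warps

static_assert(kWarpCount == 4 || kWarpCount == 8,
              "4 or 8 warps per threadblock, as recommended above");

int main() {
  std::printf("warps per threadblock: %d (%d x %d x %d)\n",
              kWarpCount, kWarpsM, kWarpsN, kWarpsK);
}
```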
-
Thanks for the clarifications, I have a better understanding now of how to choose these parameters. In my case, it's about using 2:4 sparsity to improve GEMM performance from within a library, so the code cannot know the sizes and layouts of the input matrices upfront. Thus, obviously, it's not possible to use the profiler to select tile sizes ahead of time.
-
CUTLASS does not have any heuristic code, and it is not on our roadmap to build one. We'd like to use stream-K to eliminate this kernel-selection problem in the long run. It is very hard to come up with a good heuristic that always picks a good kernel from 20-ish candidates. If your problem sizes are limited, you could do some profiling and then build a decision tree or a small ranking model to pick.
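A minimal sketch of such a profiling-driven decision tree (the size buckets, thresholds, and tile choices below are hypothetical placeholders you would replace with your own profiling results):

```cpp
#include <cstdint>

// Hypothetical candidate configurations found by offline cutlass_profiler
// runs; these names are placeholders, not real CUTLASS kernel names.
enum class KernelChoice {
  kSmallTile,   // e.g. 64x64x64 threadblock tile
  kMediumTile,  // e.g. 128x64x64
  kLargeTile,   // e.g. 128x128x64
};

// A hand-built decision tree keyed on problem size. In practice the
// thresholds come from profiling your own size distribution.
KernelChoice pick_kernel(int64_t m, int64_t n, int64_t k) {
  int64_t const tiles = ((m + 127) / 128) * ((n + 127) / 128);
  if (tiles < 8)  return KernelChoice::kSmallTile;   // few tiles: favor occupancy
  if (k < 1024)   return KernelChoice::kMediumTile;  // short k: less pipelining
  return KernelChoice::kLargeTile;                   // big problems: big tiles
}
```

Each `KernelChoice` would then map to a pre-instantiated `cutlass::gemm::device::SparseGemm` specialization with the corresponding tile shapes.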
-
Hi,
I'm experimenting with sparse GEMM: namely, I'm benchmarking code taken from the 15_ampere_sparse_tensorop_gemm example against code based on dense matrix multiplication, on an A100, using a pair of matrices with the same dimensions as in the given example (M=512, N=512, K=1024). In my benchmarks, the sparse GEMM turns out to be not faster but about 2x slower than the dense matrix multiplication. So I tried changing values in the following snippet from the example:
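(The values below paraphrase the example's defaults; check 15_ampere_sparse_tensorop_gemm.cu for the exact element types and values in your CUTLASS version.)

```cpp
// Tile shapes from 15_ampere_sparse_tensorop_gemm (paraphrased).
using ShapeMMAThreadBlock =
    cutlass::gemm::GemmShape<128, 128, 256>;  // threadblock tile (M, N, K)
using ShapeMMAWarp =
    cutlass::gemm::GemmShape<64, 64, 256>;    // warp tile
using ShapeMMAOp =
    cutlass::gemm::GemmShape<16, 8, 128>;     // sparse tensor-op instruction shape
constexpr int NumStages = 6;                  // shared-memory pipeline stages
```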
However, I can't get much of an improvement, and oftentimes the code won't even compile for changed values of `ShapeMMAThreadBlock`, `ShapeMMAWarp` and `ShapeMMAOp`. I read the CUTLASS docs, in particular the "Efficient GEMM in CUDA" page, but I still have no clue about a heuristic for selecting these values (except that, obviously, for `ShapeMMAOp` one should select values supported by the hardware for the given data type), nor for the `NumStages` parameter. So - any suggestions here? Thanks.