-
Are you running int4b? k = 1024 is pretty small for int4b; sparse is good only when the problem size is big. As to tile size selection, you could use the CUTLASS profiler to pick one: https://github.com/NVIDIA/cutlass/blob/master/media/docs/profiler.md
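For example, to profile only your problem size (the kernel filter below is just an illustrative substring pattern; see the profiler docs linked above for the full flag list):

```sh
# Restrict profiling to the problem size in question and dump a CSV report.
./tools/profiler/cutlass_profiler \
  --kernels=gemm \
  --m=512 --n=512 --k=1024 \
  --output=report.csv
```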
-
Thanks for your reply. I ran cutlass_profiler for m, n, k = 256, 512, 1024, 2048, and CUTLASS sparse GEMM turns out to be consistently 2x faster than ordinary CUTLASS GEMM on these matrix sizes. So I've applied the tile sizes that the profiler reported as fastest.
(Thanks for the stream-K suggestion; that's certainly also something worth checking.)
-
Correct.

You can choose a different stage count: usually any number >= 3 is fine, as long as there is enough shared memory. We did some profiling to choose 6 for this tile size, but it may not work best for your problem size.

As to the selection of the threadblock and warp tile sizes, the best way is to use the profiler to find the best one. There are many factors that can change the performance, and it is sometimes hard to explain. Usually a big problem size does better with a big threadblock tile size, and a small problem size does better with a small threadblock tile size. A big tile size is more efficient than a small tile size; however, occupancy, wave quantization, workload distribution, and memory/cache efficiency can impact the performance too.

We usually use 4 or 8 warps. When we divide a threadblock into warps, we prefer square-ish warp tiles, and we also want each warp tile to be as big as possible.
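To make the warp-count arithmetic concrete, here is a minimal sketch (the tile shapes are illustrative, not prescriptive): the threadblock tile is covered by warp tiles, and the quotient of the two shapes gives the number of warps per threadblock.

```cpp
#include <cstdio>
#include "cutlass/gemm/gemm.h"  // cutlass::gemm::GemmShape

// Illustrative tile shapes (not necessarily the ones in the example).
using ThreadblockShape = cutlass::gemm::GemmShape<128, 128, 64>;
using WarpShape        = cutlass::gemm::GemmShape<64, 64, 64>;

// A threadblock tile is covered by (M/Mw) x (N/Nw) x (K/Kw) warp tiles.
constexpr int kWarpsM = ThreadblockShape::kM / WarpShape::kM;  // 2
constexpr int kWarpsN = ThreadblockShape::kN / WarpShape::kN;  // 2
constexpr int kWarpsK = ThreadblockShape::kK / WarpShape::kK;  // 1
constexpr int kWarpCount = kWarpsM * kWarpsN * kWarpsK;        // 4 warps

static_assert(kWarpCount == 4 || kWarpCount == 8,
              "4 or 8 warps per threadblock, as recommended above");

int main() {
  std::printf("warps per threadblock: %d (%d x %d x %d)\n",
              kWarpCount, kWarpsM, kWarpsN, kWarpsK);
}
```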
-
Thanks for the clarifications, I have a better understanding now of how to choose these parameters. In my case, it's about using 2:4 sparsity to improve GEMM performance from within a library, so the code cannot know the sizes and layouts of the input matrices upfront. Thus, obviously, it's not possible to use the profiler to select tile sizes ahead of time.
-
CUTLASS does not have any heuristic code, and it is not on our roadmap to build one. We'd like to use stream-K to eliminate this kernel-selection problem in the long run. It is very hard to come up with a good heuristic that always picks a good kernel from 20-ish candidates. If your problem sizes are limited, you could do some profiling and then build a decision tree or a small ranking model to pick.
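A minimal sketch of such a profiling-driven decision tree (the size buckets, thresholds, and tile choices below are hypothetical placeholders you would replace with your own profiling results):

```cpp
#include <cstdint>

// Hypothetical candidate configurations found by offline cutlass_profiler
// runs; these names are placeholders, not real CUTLASS kernel names.
enum class KernelChoice {
  kSmallTile,   // e.g. 64x64x64 threadblock tile
  kMediumTile,  // e.g. 128x64x64
  kLargeTile,   // e.g. 128x128x64
};

// A hand-built decision tree keyed on problem size. In practice the
// thresholds come from profiling your own size distribution.
KernelChoice pick_kernel(int64_t m, int64_t n, int64_t k) {
  int64_t const tiles = ((m + 127) / 128) * ((n + 127) / 128);
  if (tiles < 8)  return KernelChoice::kSmallTile;   // few tiles: favor occupancy
  if (k < 1024)   return KernelChoice::kMediumTile;  // short k: less pipelining
  return KernelChoice::kLargeTile;                   // big problems: big tiles
}
```

Each `KernelChoice` would then map to a pre-instantiated `cutlass::gemm::device::SparseGemm` specialization with the corresponding tile shapes.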
-
Hi,
I'm experimenting with sparse GEMM: namely, I'm benchmarking code taken from the 15_ampere_sparse_tensorop_gemm example against code based on dense matrix multiplication, on an A100, using a pair of matrices with the same dimensions as in the given example (M=512, N=512, K=1024). In my benchmarks, the sparse GEMM turns out to be not faster but about 2x slower than the dense matrix multiplication. So I tried changing values in the following snippet from the example:
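(The values below paraphrase the example's defaults; check 15_ampere_sparse_tensorop_gemm.cu for the exact element types and values in your CUTLASS version.)

```cpp
// Tile shapes from 15_ampere_sparse_tensorop_gemm (paraphrased).
using ShapeMMAThreadBlock =
    cutlass::gemm::GemmShape<128, 128, 256>;  // threadblock tile (M, N, K)
using ShapeMMAWarp =
    cutlass::gemm::GemmShape<64, 64, 256>;    // warp tile
using ShapeMMAOp =
    cutlass::gemm::GemmShape<16, 8, 128>;     // sparse tensor-op instruction shape
constexpr int NumStages = 6;                  // shared-memory pipeline stages
```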
However, I can't get much of an improvement, and oftentimes the code won't even compile for changed values of `ShapeMMAThreadBlock`, `ShapeMMAWarp` and `ShapeMMAOp`. I read the CUTLASS docs, in particular the "Efficient GEMM in CUDA" page, but I still have no clue about a heuristic for selecting these values (except that, obviously, for `ShapeMMAOp` one should select values supported by the hardware for the given data type), nor for the `NumStages` parameter. So - any suggestions here? Thanks.