Connecting fp16xq4 gemm kernels (optimized for A100) to MatMulNBits<fp16> operator #21083
Description
This change hooks up two A100-specialized GEMM kernels to the MatMulNBits.Float16 operator.
It also adds a hacky solution for quantized weight prepacking that works in GPU operators.
Motivation and Context
Using specialized A100 kernels to accelerate the MatMulNBits operator.
Currently we have two kernels:
- blkq4_gemm_sm80 is optimized for throughput. It works well with large GEMM problems and achieves more than 95% of peak device FLOPS. When the GEMM size is small, part of the device remains idle.
- blkq4_small_gemm_sm80 is crafted specifically for very small GEMM problems, spreading the computation across the entire device. This kernel achieves lower latency for small GEMMs.
- blkq4_fp16_gemm_sm80_dispatch selectively dispatches to one of these two kernels based on GEMM size (see the sketch below). Lots of tuning can be done here in the future.

Manually ran the following tests on A100 devices:
Prepacking for GPU operators
Our kernels require weight prepacking. The current prepacking infrastructure works well with CPU operators. Unfortunately, using it in GPU operators causes memory bloat: the old GPU memory buffers holding the not-yet-packed data are not released after prepacking. This is especially problematic for LLMs.
In this change we developed prepacking logic in a new graph optimizer. This solves the memory bloat problem, but it also introduces the following problems:
In the operator, we have to expose a property that records whether the weights were prepacked by the optimizer. This property is visible to the user, but it should never be set by the user.
These problems are significant; we really need to find a better solution for GPU operator prepacking next.
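To make the prepacking idea concrete, below is a minimal, self-contained sketch of the approach under stated assumptions: an offline pass rewrites the weight initializer into the kernel's packed layout and sets a flag on the node so the operator skips runtime prepacking and never keeps a stale unpacked GPU buffer alive. All types and names here are hypothetical stand-ins, not the ONNX Runtime graph-optimizer API.

```cpp
#include <cstdint>
#include <cstdio>
#include <vector>

// Hypothetical stand-in for a MatMulNBits node in the graph.
struct Node {
  std::vector<uint8_t> packed_weight;  // initializer consumed by the op
  bool prepacked = false;              // property set only by the optimizer
};

// Stand-in for the layout transform the A100 kernels expect.
static std::vector<uint8_t> RepackForSm80(const std::vector<uint8_t>& w) {
  return std::vector<uint8_t>(w.rbegin(), w.rend());  // placeholder transform
}

// The "graph optimizer" step: rewrite the initializer in place and flag the
// node, so the operator can skip prepacking at session initialization.
static void PrepackMatMulNBits(Node& node) {
  node.packed_weight = RepackForSm80(node.packed_weight);
  node.prepacked = true;
}

int main() {
  Node node{{1, 2, 3, 4}};
  PrepackMatMulNBits(node);
  std::printf("prepacked=%d, first byte=%u\n",
              node.prepacked, static_cast<unsigned>(node.packed_weight[0]));
  return 0;
}
```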