Connecting fp16xq4 gemm kernels (optimized for A100) to MatMulNBits<fp16> operator #21083
Description
This change hooks up two A100-specialized GEMM kernels to the MatMulNBits.Float16 operator.
It also adds a hacky solution for quantized weight prepacking that works in GPU operators.
Motivation and Context
Using specialized A100 kernels to accelerate the MatMulNBits operator.
Currently we have two kernels:
- blkq4_gemm_sm80 is optimized for throughput. It works well with large GEMM problems and achieves more than 95% of peak device FLOPS. When the GEMM size is small, part of the device remains idle.
- blkq4_small_gemm_sm80 is crafted specifically for very small GEMM problems, spreading the computation across the entire device. This kernel achieves lower latency for small GEMMs.
- blkq4_fp16_gemm_sm80_dispatch selectively dispatches to one of these two kernels based on GEMM size (see the sketch below). Lots of tuning can be done here in the future.

Manually ran the following tests on A100 devices:
Prepacking for GPU operators
Our kernels require weight prepacking. The current prepacking infrastructure works well with CPU operators. Unfortunately, using it in GPU operators causes memory bloat: the old GPU memory buffers holding the not-yet-packed data are not released after prepacking. This is especially problematic for LLMs.
In this change we developed prepacking logic in a new graph optimizer. This solves the memory bloat problem, but it also introduces the following problems:
In the operator, we have to expose a property that records whether the weights were prepacked by the optimizer. This property is visible to the user, but it should never be set by the user.
These problems are significant; we really need to find a better solution for GPU operator prepacking next.
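To make the prepacking idea concrete, below is a minimal, self-contained sketch of the approach under stated assumptions: an offline pass rewrites the weight initializer into the kernel's packed layout and sets a flag on the node so the operator skips runtime prepacking and never keeps a stale unpacked GPU buffer alive. All types and names here are hypothetical stand-ins, not the ONNX Runtime graph-optimizer API.

```cpp
#include <cstdint>
#include <cstdio>
#include <vector>

// Hypothetical stand-in for a MatMulNBits node in the graph.
struct Node {
  std::vector<uint8_t> packed_weight;  // initializer consumed by the op
  bool prepacked = false;              // property set only by the optimizer
};

// Stand-in for the layout transform the A100 kernels expect.
static std::vector<uint8_t> RepackForSm80(const std::vector<uint8_t>& w) {
  return std::vector<uint8_t>(w.rbegin(), w.rend());  // placeholder transform
}

// The "graph optimizer" step: rewrite the initializer in place and flag the
// node, so the operator can skip prepacking at session initialization.
static void PrepackMatMulNBits(Node& node) {
  node.packed_weight = RepackForSm80(node.packed_weight);
  node.prepacked = true;
}

int main() {
  Node node{{1, 2, 3, 4}};
  PrepackMatMulNBits(node);
  std::printf("prepacked=%d, first byte=%u\n",
              node.prepacked, static_cast<unsigned>(node.packed_weight[0]));
  return 0;
}
```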