
Use graph optimizer for gpu tensor prepack #19814

Closed
chenfucn wants to merge 5 commits from the cfu_transform_prepack branch

Conversation

@chenfucn (Contributor) commented on Mar 7, 2024

Description

Use a graph optimizer for CUDA operator prepacking.

Motivation and Context

Our first sm80 quantized GEMM kernel requires prepacking. Using the current prepack infrastructure in a GPU operator causes memory bloat: the old memory buffer is not released.

So we define the prepacking logic in a new graph optimizer instead. This solves the memory bloat problem, but it also introduces the following problems (a hedged sketch of the approach follows the list):

  1. Rewriting of the initializer tensors is restricted by the operator's shape inference rules. For example, MatMulNBits has three initializers; we cannot combine them into a single tensor even if we want to, and we have to make sure the operator's shape inference logic does NOT verify the initializers' shapes.
  2. The rewriting logic is tightly coupled to each GPU operator. It really should be defined together with those operators, instead of in a separate graph optimizer module.
  3. The prepacking logic depends on the underlying GPU hardware. Currently this part is hard-coded for SM80.
  4. The operator has to expose a property recording whether it was prepacked by the optimizer. This property is visible to the user, but it should never be set by the user.
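
A minimal sketch of the approach, assuming ONNX Runtime's GraphTransformer interface. GpuPrePackTransformer, PrepackForSm80, and the "prepacked" attribute name are illustrative stand-ins, not this PR's actual code:

```cpp
#include "core/graph/constants.h"
#include "core/optimizer/graph_transformer.h"

namespace onnxruntime {

// Hypothetical helper that rewrites a tensor into the SM80 tile layout
// (problem 3: the transformation is hardware-specific).
ONNX_NAMESPACE::TensorProto PrepackForSm80(const ONNX_NAMESPACE::TensorProto& src);

class GpuPrePackTransformer : public GraphTransformer {
 public:
  GpuPrePackTransformer() : GraphTransformer("GpuPrePackTransformer") {}

 private:
  Status ApplyImpl(Graph& graph, bool& modified, int /*graph_level*/,
                   const logging::Logger& /*logger*/) const override {
    for (auto& node : graph.Nodes()) {
      if (node.OpType() != "MatMulNBits" ||
          node.GetExecutionProviderType() != kCudaExecutionProvider) {
        continue;
      }
      // MatMulNBits carries several initializers (quantized weight, scales,
      // zero points). Shape inference constrains each of them, so they must
      // be rewritten one by one rather than merged into a single tensor
      // (problem 1). Only the weight is shown here.
      const std::string& weight_name = node.InputDefs()[1]->Name();
      const ONNX_NAMESPACE::TensorProto* weight = nullptr;
      if (!graph.GetInitializedTensor(weight_name, weight)) continue;

      // Replace the initializer in place; reusing the name keeps the node's
      // input edge intact, and the old buffer goes away with the old proto
      // instead of lingering as it does with in-operator PrePack.
      ONNX_NAMESPACE::TensorProto packed = PrepackForSm80(*weight);
      packed.set_name(weight_name);
      graph.RemoveInitializedTensor(weight_name);
      graph.AddInitializedTensor(packed);

      // Problem 4: the kernel must be told its inputs are already packed,
      // via an attribute that users can see but must never set themselves.
      node.AddAttribute("prepacked", int64_t{1});
      modified = true;
    }
    return Status::OK();
  }
};

}  // namespace onnxruntime
```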

@chenfucn requested review from yufenglee and wejoncy on March 7, 2024 05:33
@chenfucn force-pushed the cfu_transform_prepack branch 3 times, most recently from 01a6521 to 4b6097b on March 8, 2024 21:32
Review thread on the `blkq4_gemm_sm80` declaration (diff context):

    bool column_wise_blocking,
    bool small_m,
    bool has_offsets>
    Status blkq4_gemm_sm80(int m, int n, int k, cudaStream_t stream,
Contributor: I assume this can be supported by sm86, sm89?

chenfucn (Contributor Author): It should.
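
For context, a hedged sketch of a compute-capability guard that would admit sm86 and sm89 at runtime; the predicate name is hypothetical, and the comment reflects CUDA's minor-version binary compatibility guarantee rather than anything verified in this PR:

```cpp
#include <cuda_runtime.h>

// Illustrative guard: admit any compute capability 8.x device (sm80, sm86,
// sm89). A binary built for sm_80 is forward-compatible with later minor
// revisions of the same major version, so the sm80 kernel can run there.
bool DeviceSupportsBlkq4Sm80Kernel(int device_id) {
  cudaDeviceProp props{};
  if (cudaGetDeviceProperties(&props, device_id) != cudaSuccess) {
    return false;
  }
  return props.major == 8;
}
```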

Review thread on the block-size switch (diff context):

    }
    break;

    case 64:
Contributor: Would you support case 128, which is widely used?

chenfucn (Contributor Author): That is difficult for this kernel. I am working on another version that will hopefully support it.
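
For readers following the thread, a hedged sketch of the kind of block-size dispatch this comment is attached to; launch_blkq4_gemm and dispatch_blkq4_gemm are hypothetical stand-ins that mirror, not reproduce, the PR's switch:

```cpp
#include <cuda_runtime.h>
#include "core/common/common.h"

using onnxruntime::common::Status;

// Hypothetical launcher: forwards a runtime block size to the kernel's
// compile-time template parameter; defined elsewhere in this sketch.
template <int BlockSize>
Status launch_blkq4_gemm(int m, int n, int k, cudaStream_t stream);

Status dispatch_blkq4_gemm(int block_size, int m, int n, int k,
                           cudaStream_t stream) {
  switch (block_size) {
    case 16:
      return launch_blkq4_gemm<16>(m, n, k, stream);
    case 32:
      return launch_blkq4_gemm<32>(m, n, k, stream);
    case 64:
      return launch_blkq4_gemm<64>(m, n, k, stream);
    // case 128 is absent: per the reply above, the current sm80 tiling
    // cannot express it; a reworked kernel is planned.
    default:
      return ORT_MAKE_STATUS(ONNXRUNTIME, INVALID_ARGUMENT,
                             "Unsupported block size: ", block_size);
  }
}
```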

@chenfucn force-pushed the cfu_transform_prepack branch 3 times, most recently from d7bce4d to 65573be on March 13, 2024 23:17
@chenfucn force-pushed the cfu_transform_prepack branch 2 times, most recently from 63d5836 to 44a95f8 on March 14, 2024 23:17
@chenfucn force-pushed the cfu_transform_prepack branch from 44a95f8 to 38d1851 on March 15, 2024 16:12
@chenfucn (Contributor Author): Restarting in another PR

@chenfucn closed this on Jun 13, 2024