Support INT4 weight-only quantization, including two algorithms: RTN and GPTQ #17390
Conversation
Signed-off-by: yuwenzho <[email protected]>
Hi @yufenglee, please review the PR. Thanks.
Signed-off-by: yuwenzho <[email protected]>
lintrunner found more than 10 potential problems in the proposed changes. Check the Files changed tab for more details.
Signed-off-by: yuwenzho <[email protected]>
We added a new attribute 'accuracy_level' to the MatMulNBits op for another PR of ours, #17669, which targets efficient int4 kernels.
# accuracy_level supports 0, 1, 2, 3, 4; default is 0
# 0 means using the original ORT kernel of MatMulNBits, 1 means optimized fp32 for Intel CPU
# 2 means using fp16 for activation, 3 means using bf16 for activation
# 4 means using int8 for activation
weight_only_config = RTNWeightOnlyQuantConfig(accuracy_level=4)
Note: Please use the latest version of INC to try this attribute.
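For context, here is a minimal end-to-end sketch of how such a config is expected to plug into MatMul4BitsQuantizer. The model path, block size, and the exact keyword arguments are illustrative assumptions (and the placement of accuracy_level may differ between INC and ORT versions); this is not code from this PR.

```python
# Minimal sketch; model path, block_size, and save helper are assumptions.
import onnx
from onnxruntime.quantization import matmul_4bits_quantizer

model = onnx.load("model.onnx")  # hypothetical input model

# accuracy_level=4 requests int8 activations, per the mapping listed above.
weight_only_config = matmul_4bits_quantizer.RTNWeightOnlyQuantConfig(accuracy_level=4)

quant = matmul_4bits_quantizer.MatMul4BitsQuantizer(
    model,
    block_size=32,              # assumed group size
    is_symmetric=False,
    algo_config=weight_only_config,
)
quant.process()  # rewrites eligible MatMul nodes to MatMulNBits with int4 weights
quant.model.save_model_to_file("model_int4.onnx", use_external_data_format=True)
```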
/azp run Windows GPU CI Pipeline,Windows GPU TensorRT CI Pipeline,onnxruntime-binary-size-checks-ci-pipeline,orttraining-linux-ci-pipeline,orttraining-linux-gpu-ci-pipeline,orttraining-ortmodule-distributed,Windows x64 QNN CI Pipeline
/azp run Linux CPU CI Pipeline,Linux CPU Minimal Build E2E CI Pipeline,Linux GPU CI Pipeline,Linux GPU TensorRT CI Pipeline,Linux OpenVINO CI Pipeline,Linux QNN CI Pipeline,MacOS CI Pipeline,Windows ARM64 QNN CI Pipeline,Windows CPU CI Pipeline
/azp run ONNX Runtime Web CI Pipeline
Azure Pipelines successfully started running 1 pipeline(s).
Azure Pipelines successfully started running 7 pipeline(s).
Azure Pipelines successfully started running 9 pipeline(s).
@mengniwang95, does the example here need to be updated: https://github.com/microsoft/onnxruntime-inference-examples/tree/main/quantization/language_model/llama/weight_only_quant?
Hi, this example uses the neural-compressor API and is compatible with our latest version, so it is not necessary to update it. If you want to use the ORT API to do WOQ quantization, we can update it as well.
Could you extend the existing interface
Hi, if we extend
Hi @mengniwang95, quantize_static is not a good place for this. Could you please combine your logic into MatMul4BitsQuantizer by allowing it to take different quant configs?
@mengniwang95 I tried to build your branch from scratch but it encountered an Eigen fetch error on Linux. Could you merge the latest main branch so I can try it locally?
Hi @yihonglyu, I merged the latest main branch. @mengniwang95 took half a day off and will be back soon.
/azp run Windows GPU CI Pipeline,Windows GPU TensorRT CI Pipeline,onnxruntime-binary-size-checks-ci-pipeline,orttraining-linux-ci-pipeline,orttraining-linux-gpu-ci-pipeline,orttraining-ortmodule-distributed,Windows x64 QNN CI Pipeline
/azp run Linux CPU CI Pipeline,Linux CPU Minimal Build E2E CI Pipeline,Linux GPU CI Pipeline,Linux GPU TensorRT CI Pipeline,Linux OpenVINO CI Pipeline,Linux QNN CI Pipeline,MacOS CI Pipeline,Windows ARM64 QNN CI Pipeline,Windows CPU CI Pipeline
Azure Pipelines successfully started running 7 pipeline(s).
/azp run ONNX Runtime Web CI Pipeline
/azp run Windows GPU TensorRT CI Pipeline, Windows x64 QNN CI Pipeline, onnxruntime-binary-size-checks-ci-pipeline, orttraining-linux-ci-pipeline, orttraining-linux-gpu-ci-pipeline, orttraining-ortmodule-distributed
/azp run Linux CPU CI Pipeline, Linux CPU Minimal Build E2E CI Pipeline, Linux GPU CI Pipeline, Linux GPU TensorRT CI Pipeline, Linux OpenVINO CI Pipeline, Linux QNN CI Pipeline, MacOS CI Pipeline, Windows ARM64 QNN CI Pipeline, Windows CPU CI Pipeline, Windows GPU CI Pipeline
Azure Pipelines successfully started running 6 pipeline(s).
Azure Pipelines successfully started running 10 pipeline(s).
Signed-off-by: yuwenzho <[email protected]>
I fixed the code scanning failure. @yufenglee Could you please help run the test again?
/azp run Windows GPU TensorRT CI Pipeline, Windows x64 QNN CI Pipeline, onnxruntime-binary-size-checks-ci-pipeline, orttraining-linux-ci-pipeline, orttraining-linux-gpu-ci-pipeline, orttraining-ortmodule-distributed
/azp run Linux CPU CI Pipeline, Linux CPU Minimal Build E2E CI Pipeline, Linux GPU CI Pipeline, Linux GPU TensorRT CI Pipeline, Linux OpenVINO CI Pipeline, Linux QNN CI Pipeline, MacOS CI Pipeline, Windows ARM64 QNN CI Pipeline, Windows CPU CI Pipeline, Windows GPU CI Pipeline
Azure Pipelines successfully started running 6 pipeline(s).
Azure Pipelines successfully started running 10 pipeline(s).
Signed-off-by: yuwenzho <[email protected]>
@yufenglee Could you please help run the test again? Thanks!
/azp run Windows GPU TensorRT CI Pipeline, Windows x64 QNN CI Pipeline, onnxruntime-binary-size-checks-ci-pipeline, orttraining-linux-ci-pipeline, orttraining-linux-gpu-ci-pipeline, orttraining-ortmodule-distributed
/azp run Linux CPU CI Pipeline, Linux CPU Minimal Build E2E CI Pipeline, Linux GPU CI Pipeline, Linux GPU TensorRT CI Pipeline, Linux OpenVINO CI Pipeline, Linux QNN CI Pipeline, MacOS CI Pipeline, Windows ARM64 QNN CI Pipeline, Windows CPU CI Pipeline, Windows GPU CI Pipeline
Azure Pipelines successfully started running 6 pipeline(s).
Azure Pipelines successfully started running 10 pipeline(s).
Signed-off-by: yuwenzho <[email protected]>
I fixed the import path. Now the usage is:
from onnxruntime.quantization import matmul_4bits_quantizer
config = matmul_4bits_quantizer.RTNWeightOnlyQuantConfig()
# config = matmul_4bits_quantizer.GPTQWeightOnlyQuantConfig(calibration_data_reader=data_reader)
@yufenglee Could you please help run the test again? Thank you!
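For reference, here is a hedged sketch of where the calibration_data_reader for the GPTQ config could come from, assuming the onnxruntime.quantization.CalibrationDataReader interface. The input name, shape, vocabulary size, and sample count are placeholder assumptions; real calibration data should be used for meaningful GPTQ results.

```python
# Hypothetical calibration data reader for GPTQ; input name/shape are assumed.
import numpy as np
from onnxruntime.quantization import CalibrationDataReader, matmul_4bits_quantizer

class RandomDataReader(CalibrationDataReader):
    def __init__(self, num_samples=8):
        # Feeds random token ids; replace with real samples from your dataset.
        self.data = iter(
            [{"input_ids": np.random.randint(0, 32000, (1, 128), dtype=np.int64)}
             for _ in range(num_samples)]
        )

    def get_next(self):
        # Return the next input dict, or None when calibration data is exhausted.
        return next(self.data, None)

data_reader = RandomDataReader()
config = matmul_4bits_quantizer.GPTQWeightOnlyQuantConfig(calibration_data_reader=data_reader)
```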
/azp run Windows GPU TensorRT CI Pipeline, Windows x64 QNN CI Pipeline, onnxruntime-binary-size-checks-ci-pipeline, orttraining-linux-ci-pipeline, orttraining-linux-gpu-ci-pipeline, orttraining-ortmodule-distributed
/azp run Linux CPU CI Pipeline, Linux CPU Minimal Build E2E CI Pipeline, Linux GPU CI Pipeline, Linux GPU TensorRT CI Pipeline, Linux OpenVINO CI Pipeline, Linux QNN CI Pipeline, MacOS CI Pipeline, Windows ARM64 QNN CI Pipeline, Windows CPU CI Pipeline, Windows GPU CI Pipeline
Azure Pipelines successfully started running 6 pipeline(s).
Azure Pipelines successfully started running 10 pipeline(s).
Description
Support INT4 weight-only quantization (WOQ) via Intel Neural Compressor, including two algorithms: RTN and GPTQ.
Note: Please install neural-compressor==2.3 for weight-only quantization.
Motivation and Context
As large language models (LLMs) become more prevalent, there is a growing need for new and improved quantization methods that can meet the computational demands of these modern architectures while maintaining accuracy. Compared to conventional quantization such as W8A8, weight-only quantization is often a better trade-off between performance and accuracy.
RTN is the most straightforward way to quantize weights; a toy sketch of the idea follows below.
The GPTQ algorithm provides more accurate quantization but requires more computational resources.
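To make the RTN naming concrete, below is a toy illustration of group-wise asymmetric 4-bit round-to-nearest quantization, roughly what a label like W4G32Asym denotes. It is an independent sketch of the general idea, not this PR's or Neural Compressor's implementation.

```python
# Toy group-wise asymmetric 4-bit RTN quantize/dequantize; not production code.
import numpy as np

def rtn_quantize_asym(weights, group_size=32, bits=4):
    """Quantize a 1-D float array per group and return the dequantized values."""
    qmax = 2 ** bits - 1  # 15 quantization levels above zero for 4-bit
    out = np.empty_like(weights)
    for start in range(0, len(weights), group_size):
        w = weights[start:start + group_size]
        w_min, w_max = w.min(), w.max()
        scale = (w_max - w_min) / qmax if w_max > w_min else 1.0
        zero_point = np.round(-w_min / scale)
        q = np.clip(np.round(w / scale) + zero_point, 0, qmax)   # int4 codes
        out[start:start + group_size] = (q - zero_point) * scale  # dequantize
    return out

w = np.random.randn(128).astype(np.float32)
print("max abs quantization error:", np.abs(w - rtn_quantize_asym(w)).max())
```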
Evaluation results
The following table shows the accuracy results of Llama-2 models evaluated on the lambada_openai task. GPTQ W4G32Asym in the configuration column means the GPTQ algorithm is used for 4-bit weight-only quantization with group_size=32 and scheme=asym; [WOQ/FP32] is the ratio of the quantized model's accuracy to the FP32 baseline.
[Table: accuracy of Llama-2 models under W4G32Asym, W4G128Asym, and W4G32Sym configurations]