
Support INT4 weight only quantize, including RTN and GPTQ 2 algorithms #17390

Merged
merged 30 commits into microsoft:main on Jan 10, 2024

Conversation

yuwenzho
Contributor

@yuwenzho commented Sep 1, 2023

Description

Support INT4 weight-only quantization (WOQ) via Intel Neural Compressor, including two algorithms: RTN and GPTQ.

Note:
Please install neural-compressor==2.3 for weight-only quantization.

Motivation and Context

As large language models (LLMs) become more prevalent, there is a growing need for new and improved quantization methods that can meet the computational demands of these modern architectures while maintaining accuracy. Compared to normal quantization like W8A8, weight-only quantization is probably a better trade-off between performance and accuracy.
RTN is the most straightforward way to quantize the weights.
The GPTQ algorithm provides more accurate quantization but requires more computational resources.
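
For illustration, here is a minimal sketch of how the two algorithms are selected through the configuration classes added in this PR (using the import path finalized later in this thread; data_reader is a placeholder for a user-provided calibration data reader):

from onnxruntime.quantization import matmul_4bits_quantizer

# RTN: data-free round-to-nearest, the most straightforward option.
rtn_config = matmul_4bits_quantizer.RTNWeightOnlyQuantConfig()

# GPTQ: more accurate, but needs calibration data and more compute.
# data_reader is a placeholder CalibrationDataReader that you implement.
gptq_config = matmul_4bits_quantizer.GPTQWeightOnlyQuantConfig(
    calibration_data_reader=data_reader
)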

Evaluation results

The following table shows the accuracy results of Llama-2 models evaluated on the lambada_openai task. "GPTQ W4G32Asym" in the Configuration column means the GPTQ algorithm is used for 4-bit weight-only quantization with group_size=32 and scheme=asym; a small illustration of this notation follows the table.

| Model name | Configuration | Lambada_openai Accuracy | Lambada_openai Perplexity | Accuracy Ratio [WOQ/FP32] |
|---|---|---|---|---|
| meta-llama/Llama-2-7b-chat-hf | FP32 | 0.7058 | 3.2788 | / |
| meta-llama/Llama-2-7b-chat-hf | GPTQ W4G32Asym | 0.7025 | 3.4489 | 99.53% |
| meta-llama/Llama-2-7b-hf | FP32 | 0.7392 | 3.3950 | / |
| meta-llama/Llama-2-7b-hf | GPTQ W4G32Asym | 0.7326 | 3.5286 | 99.11% |
| meta-llama/Llama-2-13b-chat-hf | FP32 | 0.7312 | 2.9163 | / |
| meta-llama/Llama-2-13b-chat-hf | GPTQ W4G128Asym | 0.7289 | 3.0061 | 99.56% |
| meta-llama/Llama-2-13b-hf | FP32 | 0.7677 | 3.0438 | / |
| meta-llama/Llama-2-13b-hf | GPTQ W4G32Asym | 0.7607 | 3.1562 | 99.09% |
| meta-llama/Llama-2-70b-chat-hf | FP32 | 0.7543 | 2.6181 | / |
| meta-llama/Llama-2-70b-chat-hf | RTN W4G32Sym | 0.7489 | 2.6850 | 99.28% |
| meta-llama/Llama-2-70b-hf | FP32 | 0.7964 | 2.6612 | / |
| meta-llama/Llama-2-70b-hf | RTN W4G32Sym | 0.7896 | 2.7546 | 99.15% |
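
To make the Configuration column concrete, "GPTQ W4G32Asym" maps onto weight-only quantization parameters roughly as sketched below (the dict is only an illustration of the notation, not a specific library signature):

woq_setting = {
    "algorithm": "GPTQ",  # GPTQ or RTN
    "bits": 4,            # "W4": 4-bit weights
    "group_size": 32,     # "G32": one scale/zero-point per group of 32 weights
    "scheme": "asym",     # "Asym"/"Sym": asymmetric or symmetric quantization
}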

@yuwenzho
Contributor Author

Hi @yufenglee, please review the PR. Thanks.

@github-advanced-security bot left a comment

lintrunner found more than 10 potential problems in the proposed changes. Check the Files changed tab for more details.

@mengniwang95
Contributor

mengniwang95 commented Nov 10, 2023

We added a new attribute 'accuracy_level' to the MatMulNBits op for our other PR #17669, which is intended for efficient INT4 kernels.
You can use the code below to try our kernels:

# accuracy_level supports 0, 1, 2, 3, 4; the default is 0.
# 0: use the original ORT kernel of MatMulNBits
# 1: optimized fp32 kernel for Intel CPU
# 2: fp16 activation
# 3: bf16 activation
# 4: int8 activation
weight_only_config = RTNWeightOnlyQuantConfig(accuracy_level=4)

Note

Please use the latest version of INC (Intel Neural Compressor) to try this attribute.
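
As a quick check, a sketch like the following can show which accuracy_level ended up on the quantized nodes (the model path is hypothetical; only the MatMulNBits op type and the accuracy_level attribute come from this thread):

import onnx

model = onnx.load("model_int4.onnx")  # hypothetical path to the quantized model
for node in model.graph.node:
    if node.op_type == "MatMulNBits":
        level = next((a.i for a in node.attribute if a.name == "accuracy_level"), None)
        print(node.name, "accuracy_level =", level)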

@yufenglee
Member

/azp run Windows GPU CI Pipeline,Windows GPU TensorRT CI Pipeline,onnxruntime-binary-size-checks-ci-pipeline,orttraining-linux-ci-pipeline,orttraining-linux-gpu-ci-pipeline,orttraining-ortmodule-distributed,Windows x64 QNN CI Pipeline

@yufenglee
Member

/azp run Linux CPU CI Pipeline,Linux CPU Minimal Build E2E CI Pipeline,Linux GPU CI Pipeline,Linux GPU TensorRT CI Pipeline,Linux OpenVINO CI Pipeline,Linux QNN CI Pipeline,MacOS CI Pipeline,Windows ARM64 QNN CI Pipeline,Windows CPU CI Pipeline

@yufenglee
Member

/azp run ONNX Runtime Web CI Pipeline


Azure Pipelines successfully started running 1 pipeline(s).


Azure Pipelines successfully started running 7 pipeline(s).


Azure Pipelines successfully started running 9 pipeline(s).

@yufenglee
Member

@mengniwang95, does the example here need to be updated: https://github.com/microsoft/onnxruntime-inference-examples/tree/main/quantization/language_model/llama/weight_only_quant?
@mengniwang95
Contributor

mengniwang95 commented Nov 17, 2023

@mengniwang95, does the example here need to be updated: https://github.com/microsoft/onnxruntime-inference-examples/tree/main/quantization/language_model/llama/weight_only_quant?

Hi, this example uses the neural-compressor API and is compatible with our latest version, so it does not need to be updated. If you want it to use the ORT API for weight-only quantization, we can update it as well.

@yihonglyu
Contributor

Could you extend the existing interface quantize_static and allow users to choose the algorithm using options, instead of creating a separate interface like quantize_weight?

@mengniwang95
Contributor

Could you extend the existing interface quantize_static and allow users to choose the algorithm using options, instead of creating a separate interface like quantize_weight?

Hi, if we extend quantize_static there will be some issues with the arguments. For example, activation_type, quant_format, and per_channel of quantize_static do not apply to weight-only quantization, while GPTQ introduces some arguments that do not apply to quantize_static. Is that acceptable?

@yufenglee
Member

Could you extend the existing interface quantize_static and allow users to choose the algorithm using options, instead of creating a separate interface like quantize_weight?

Hi, if we extend quantize_static there will be some issues with the arguments. For example, activation_type, quant_format, and per_channel of quantize_static do not apply to weight-only quantization, while GPTQ introduces some arguments that do not apply to quantize_static. Is that acceptable?

Hi @mengniwang95, quantize_static is not a good place for this. Could you please combine your logic into MatMul4BitsQuantizer by allowing it to take different quant configs?
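
A rough sketch of that suggestion, i.e. a single MatMul4BitsQuantizer entry point that picks the algorithm from the type of the supplied config (the helper below is an assumption for illustration, not code from this PR):

from onnxruntime.quantization.matmul_4bits_quantizer import (
    GPTQWeightOnlyQuantConfig,
    RTNWeightOnlyQuantConfig,
)

def select_algorithm(algo_config):
    # Dispatch purely on the config type; no separate quantize_weight entry point.
    if isinstance(algo_config, GPTQWeightOnlyQuantConfig):
        return "GPTQ"  # calibration-based, more accurate
    if isinstance(algo_config, RTNWeightOnlyQuantConfig):
        return "RTN"   # data-free round-to-nearest
    return "DEFAULT"   # fall back to the existing MatMul4BitsQuantizer behavior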

@yihonglyu
Contributor

@mengniwang95 I tried to build your branch from scratch but it encountered an eigen fetch error on Linux. Could you merge the latest main branch so I can try it locally?

@yuwenzho
Contributor Author

@mengniwang95 I tried to build your branch from scratch but it encountered an eigen fetch error on Linux. Could you merge the latest main branch so I can try it locally?

Hi @yihonglyu , I merged the latest main branch. @mengniwang95 took half a day off and will be back soon.

@yufenglee
Member

/azp run Windows GPU CI Pipeline,Windows GPU TensorRT CI Pipeline,onnxruntime-binary-size-checks-ci-pipeline,orttraining-linux-ci-pipeline,orttraining-linux-gpu-ci-pipeline,orttraining-ortmodule-distributed,Windows x64 QNN CI Pipeline

@yufenglee
Member

/azp run Linux CPU CI Pipeline,Linux CPU Minimal Build E2E CI Pipeline,Linux GPU CI Pipeline,Linux GPU TensorRT CI Pipeline,Linux OpenVINO CI Pipeline,Linux QNN CI Pipeline,MacOS CI Pipeline,Windows ARM64 QNN CI Pipeline,Windows CPU CI Pipeline


Azure Pipelines successfully started running 7 pipeline(s).

@yufenglee
Member

/azp run ONNX Runtime Web CI Pipeline

@yufenglee
Member

/azp run Windows GPU TensorRT CI Pipeline, Windows x64 QNN CI Pipeline, onnxruntime-binary-size-checks-ci-pipeline, orttraining-linux-ci-pipeline, orttraining-linux-gpu-ci-pipeline, orttraining-ortmodule-distributed

@yufenglee
Member

/azp run Linux CPU CI Pipeline, Linux CPU Minimal Build E2E CI Pipeline, Linux GPU CI Pipeline, Linux GPU TensorRT CI Pipeline, Linux OpenVINO CI Pipeline, Linux QNN CI Pipeline, MacOS CI Pipeline, Windows ARM64 QNN CI Pipeline, Windows CPU CI Pipeline, Windows GPU CI Pipeline


Azure Pipelines successfully started running 6 pipeline(s).


Azure Pipelines successfully started running 10 pipeline(s).

@yuwenzho
Contributor Author

yuwenzho commented Jan 9, 2024

I fixed the code scanning failure. @yufenglee Could you please help run the test again?

@yufenglee
Member

/azp run Windows GPU TensorRT CI Pipeline, Windows x64 QNN CI Pipeline, onnxruntime-binary-size-checks-ci-pipeline, orttraining-linux-ci-pipeline, orttraining-linux-gpu-ci-pipeline, orttraining-ortmodule-distributed

@yufenglee
Member

/azp run Linux CPU CI Pipeline, Linux CPU Minimal Build E2E CI Pipeline, Linux GPU CI Pipeline, Linux GPU TensorRT CI Pipeline, Linux OpenVINO CI Pipeline, Linux QNN CI Pipeline, MacOS CI Pipeline, Windows ARM64 QNN CI Pipeline, Windows CPU CI Pipeline, Windows GPU CI Pipeline


Azure Pipelines successfully started running 6 pipeline(s).


Azure Pipelines successfully started running 10 pipeline(s).

@yuwenzho
Contributor Author

@yufenglee Could you please help run the test again? Thanks!

@yufenglee
Member

/azp run Windows GPU TensorRT CI Pipeline, Windows x64 QNN CI Pipeline, onnxruntime-binary-size-checks-ci-pipeline, orttraining-linux-ci-pipeline, orttraining-linux-gpu-ci-pipeline, orttraining-ortmodule-distributed

@yufenglee
Member

/azp run Linux CPU CI Pipeline, Linux CPU Minimal Build E2E CI Pipeline, Linux GPU CI Pipeline, Linux GPU TensorRT CI Pipeline, Linux OpenVINO CI Pipeline, Linux QNN CI Pipeline, MacOS CI Pipeline, Windows ARM64 QNN CI Pipeline, Windows CPU CI Pipeline, Windows GPU CI Pipeline


Azure Pipelines successfully started running 6 pipeline(s).


Azure Pipelines successfully started running 10 pipeline(s).

@yuwenzho
Contributor Author

I fixed the import path. Now the usage is:

from onnxruntime.quantization import matmul_4bits_quantizer
config = matmul_4bits_quantizer.RTNWeightOnlyQuantConfig()
# config = matmul_4bits_quantizer.GPTQWeightOnlyQuantConfig(calibration_data_reader=data_reader)
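
For completeness, a hedged end-to-end sketch of this flow (the MatMul4BitsQuantizer argument and method names are assumptions based on the discussion above, not verbatim from this PR):

import onnx
from onnxruntime.quantization import matmul_4bits_quantizer

model = onnx.load("model_fp32.onnx")  # hypothetical input model path
config = matmul_4bits_quantizer.RTNWeightOnlyQuantConfig()

# Assumed: the quantizer takes the loaded model and the config via algo_config.
quant = matmul_4bits_quantizer.MatMul4BitsQuantizer(model, algo_config=config)
quant.process()  # assumed method that applies INT4 weight-only quantization in place
quant.model.save_model_to_file("model_int4.onnx")  # assumed save helper; path is hypothetical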

@yufenglee Could you please help run the test again? Thank you!

@yufenglee
Member

/azp run Windows GPU TensorRT CI Pipeline, Windows x64 QNN CI Pipeline, onnxruntime-binary-size-checks-ci-pipeline, orttraining-linux-ci-pipeline, orttraining-linux-gpu-ci-pipeline, orttraining-ortmodule-distributed

@yufenglee
Member

/azp run Linux CPU CI Pipeline, Linux CPU Minimal Build E2E CI Pipeline, Linux GPU CI Pipeline, Linux GPU TensorRT CI Pipeline, Linux OpenVINO CI Pipeline, Linux QNN CI Pipeline, MacOS CI Pipeline, Windows ARM64 QNN CI Pipeline, Windows CPU CI Pipeline, Windows GPU CI Pipeline


Azure Pipelines successfully started running 6 pipeline(s).


Azure Pipelines successfully started running 10 pipeline(s).

@yufenglee merged commit 731b50d into microsoft:main Jan 10, 2024
54 checks passed