
Support INT4 weight only quantize, including RTN and GPTQ 2 algorithms #17390

Merged
merged 30 commits into microsoft:main on Jan 10, 2024

Conversation

yuwenzho
Contributor

@yuwenzho commented Sep 1, 2023

Description

Support INT4 weight-only quantization (WOQ) via Intel Neural Compressor, including two algorithms: RTN and GPTQ.

Note:
Please install neural-compressor==2.3 for weight-only quantization.

Motivation and Context

As large language models (LLMs) become more prevalent, there is a growing need for new and improved quantization methods that can meet the computational demands of these modern architectures while maintaining accuracy. Compared to normal quantization like W8A8, weight-only quantization is probably a better trade-off between performance and accuracy.
RTN is the most straightforward way to quantize the weights.
The GPTQ algorithm provides more accurate quantization but requires more computational resources.
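
For illustration, here is a minimal sketch of how the two algorithms are selected through the configuration classes added in this PR (using the import path finalized later in this thread; data_reader is a placeholder for a user-provided calibration data reader):

from onnxruntime.quantization import matmul_4bits_quantizer

# RTN: data-free round-to-nearest, the most straightforward option.
rtn_config = matmul_4bits_quantizer.RTNWeightOnlyQuantConfig()

# GPTQ: more accurate, but needs calibration data and more compute.
# data_reader is a placeholder CalibrationDataReader that you implement.
gptq_config = matmul_4bits_quantizer.GPTQWeightOnlyQuantConfig(
    calibration_data_reader=data_reader
)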

Evaluation results

The following table shows the accuracy results of Llama-2 models evaluated on the lambada_openai task. "GPTQ W4G32Asym" in the Configuration column means the GPTQ algorithm is used for 4-bit weight-only quantization with group_size=32 and scheme=asym; a small illustration of this notation follows the table.

| Model name | Configuration | Lambada_openai Accuracy | Lambada_openai Perplexity | Accuracy Ratio [WOQ/FP32] |
|---|---|---|---|---|
| meta-llama/Llama-2-7b-chat-hf | FP32 | 0.7058 | 3.2788 | / |
| meta-llama/Llama-2-7b-chat-hf | GPTQ W4G32Asym | 0.7025 | 3.4489 | 99.53% |
| meta-llama/Llama-2-7b-hf | FP32 | 0.7392 | 3.3950 | / |
| meta-llama/Llama-2-7b-hf | GPTQ W4G32Asym | 0.7326 | 3.5286 | 99.11% |
| meta-llama/Llama-2-13b-chat-hf | FP32 | 0.7312 | 2.9163 | / |
| meta-llama/Llama-2-13b-chat-hf | GPTQ W4G128Asym | 0.7289 | 3.0061 | 99.56% |
| meta-llama/Llama-2-13b-hf | FP32 | 0.7677 | 3.0438 | / |
| meta-llama/Llama-2-13b-hf | GPTQ W4G32Asym | 0.7607 | 3.1562 | 99.09% |
| meta-llama/Llama-2-70b-chat-hf | FP32 | 0.7543 | 2.6181 | / |
| meta-llama/Llama-2-70b-chat-hf | RTN W4G32Sym | 0.7489 | 2.6850 | 99.28% |
| meta-llama/Llama-2-70b-hf | FP32 | 0.7964 | 2.6612 | / |
| meta-llama/Llama-2-70b-hf | RTN W4G32Sym | 0.7896 | 2.7546 | 99.15% |
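
To make the Configuration column concrete, "GPTQ W4G32Asym" maps onto weight-only quantization parameters roughly as sketched below (the dict is only an illustration of the notation, not a specific library signature):

woq_setting = {
    "algorithm": "GPTQ",  # GPTQ or RTN
    "bits": 4,            # "W4": 4-bit weights
    "group_size": 32,     # "G32": one scale/zero-point per group of 32 weights
    "scheme": "asym",     # "Asym"/"Sym": asymmetric or symmetric quantization
}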

@yuwenzho
Contributor Author

Hi @yufenglee, please review the PR. Thanks.

@github-advanced-security bot left a comment

lintrunner found more than 10 potential problems in the proposed changes. Check the Files changed tab for more details.

@mengniwang95
Contributor

mengniwang95 commented Nov 10, 2023

We added a new attribute 'accuracy_level' to the MatMulNBits op for our other PR #17669, which is intended for efficient INT4 kernels.
You can use the code below to try our kernels:

# accuracy_level supports 0, 1, 2, 3, 4; the default is 0.
# 0: use the original ORT kernel of MatMulNBits
# 1: optimized fp32 kernel for Intel CPU
# 2: fp16 activation
# 3: bf16 activation
# 4: int8 activation
weight_only_config = RTNWeightOnlyQuantConfig(accuracy_level=4)

Note

Please use the latest version of INC (Intel Neural Compressor) to try this attribute.
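
As a quick check, a sketch like the following can show which accuracy_level ended up on the quantized nodes (the model path is hypothetical; only the MatMulNBits op type and the accuracy_level attribute come from this thread):

import onnx

model = onnx.load("model_int4.onnx")  # hypothetical path to the quantized model
for node in model.graph.node:
    if node.op_type == "MatMulNBits":
        level = next((a.i for a in node.attribute if a.name == "accuracy_level"), None)
        print(node.name, "accuracy_level =", level)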

@yufenglee
Member

/azp run Windows GPU CI Pipeline,Windows GPU TensorRT CI Pipeline,onnxruntime-binary-size-checks-ci-pipeline,orttraining-linux-ci-pipeline,orttraining-linux-gpu-ci-pipeline,orttraining-ortmodule-distributed,Windows x64 QNN CI Pipeline

@yufenglee
Member

/azp run Linux CPU CI Pipeline,Linux CPU Minimal Build E2E CI Pipeline,Linux GPU CI Pipeline,Linux GPU TensorRT CI Pipeline,Linux OpenVINO CI Pipeline,Linux QNN CI Pipeline,MacOS CI Pipeline,Windows ARM64 QNN CI Pipeline,Windows CPU CI Pipeline

@yufenglee
Member

/azp run ONNX Runtime Web CI Pipeline


Azure Pipelines successfully started running 1 pipeline(s).


Azure Pipelines successfully started running 7 pipeline(s).


Azure Pipelines successfully started running 9 pipeline(s).

@yufenglee
Member

@mengniwang95, does the example here need to be updated: https://github.com/microsoft/onnxruntime-inference-examples/tree/main/quantization/language_model/llama/weight_only_quant?
@mengniwang95
Contributor

mengniwang95 commented Nov 17, 2023

@mengniwang95, does the example here need to be updated: https://github.com/microsoft/onnxruntime-inference-examples/tree/main/quantization/language_model/llama/weight_only_quant?

Hi, this example uses the neural-compressor API and is compatible with our latest version, so it does not need to be updated. If you want it to use the ORT API for weight-only quantization, we can update it as well.

@yihonglyu
Contributor

Could you extend the existing interface quantize_static and allow users to choose the algorithm using options, instead of creating a separate interface like quantize_weight?

@mengniwang95
Contributor

Could you extend the existing interface quantize_static and allow users to choose the algorithm using options, instead of creating a separate interface like quantize_weight?

Hi, if we extend quantize_static there will be some issues with the arguments. For example, activation_type, quant_format, and per_channel of quantize_static do not apply to weight-only quantization, while GPTQ introduces some arguments that do not apply to quantize_static. Is that acceptable?

@yufenglee
Member

Could you extend the existing interface quantize_static and allow users to choose the algorithm using options, instead of creating a separate interface like quantize_weight?

Hi, if we extend quantize_static there will be some issues with the arguments. For example, activation_type, quant_format, and per_channel of quantize_static do not apply to weight-only quantization, while GPTQ introduces some arguments that do not apply to quantize_static. Is that acceptable?

Hi @mengniwang95, quantize_static is not a good place for this. Could you please combine your logic into MatMul4BitsQuantizer by allowing it to take different quant configs?
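
A rough sketch of that suggestion, i.e. a single MatMul4BitsQuantizer entry point that picks the algorithm from the type of the supplied config (the helper below is an assumption for illustration, not code from this PR):

from onnxruntime.quantization.matmul_4bits_quantizer import (
    GPTQWeightOnlyQuantConfig,
    RTNWeightOnlyQuantConfig,
)

def select_algorithm(algo_config):
    # Dispatch purely on the config type; no separate quantize_weight entry point.
    if isinstance(algo_config, GPTQWeightOnlyQuantConfig):
        return "GPTQ"  # calibration-based, more accurate
    if isinstance(algo_config, RTNWeightOnlyQuantConfig):
        return "RTN"   # data-free round-to-nearest
    return "DEFAULT"   # fall back to the existing MatMul4BitsQuantizer behavior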

@yihonglyu
Contributor

@mengniwang95 I tried to build your branch from scratch but it encountered an eigen fetch error on Linux. Could you merge the latest main branch so I can try it locally?

@yuwenzho
Contributor Author

@mengniwang95 I tried to build your branch from scratch but it encountered an eigen fetch error on Linux. Could you merge the latest main branch so I can try it locally?

Hi @yihonglyu , I merged the latest main branch. @mengniwang95 took half a day off and will be back soon.

@yufenglee
Member

/azp run Windows GPU CI Pipeline,Windows GPU TensorRT CI Pipeline,onnxruntime-binary-size-checks-ci-pipeline,orttraining-linux-ci-pipeline,orttraining-linux-gpu-ci-pipeline,orttraining-ortmodule-distributed,Windows x64 QNN CI Pipeline

@yufenglee
Member

/azp run Linux CPU CI Pipeline,Linux CPU Minimal Build E2E CI Pipeline,Linux GPU CI Pipeline,Linux GPU TensorRT CI Pipeline,Linux OpenVINO CI Pipeline,Linux QNN CI Pipeline,MacOS CI Pipeline,Windows ARM64 QNN CI Pipeline,Windows CPU CI Pipeline


Azure Pipelines successfully started running 7 pipeline(s).

@yufenglee
Member

/azp run ONNX Runtime Web CI Pipeline

@yufenglee
Member

/azp run Windows GPU TensorRT CI Pipeline, Windows x64 QNN CI Pipeline, onnxruntime-binary-size-checks-ci-pipeline, orttraining-linux-ci-pipeline, orttraining-linux-gpu-ci-pipeline, orttraining-ortmodule-distributed

@yufenglee
Member

/azp run Linux CPU CI Pipeline, Linux CPU Minimal Build E2E CI Pipeline, Linux GPU CI Pipeline, Linux GPU TensorRT CI Pipeline, Linux OpenVINO CI Pipeline, Linux QNN CI Pipeline, MacOS CI Pipeline, Windows ARM64 QNN CI Pipeline, Windows CPU CI Pipeline, Windows GPU CI Pipeline


Azure Pipelines successfully started running 6 pipeline(s).


Azure Pipelines successfully started running 10 pipeline(s).

@yuwenzho
Contributor Author

yuwenzho commented Jan 9, 2024

I fixed the code scanning failure. @yufenglee Could you please help run the test again?

@yufenglee
Member

/azp run Windows GPU TensorRT CI Pipeline, Windows x64 QNN CI Pipeline, onnxruntime-binary-size-checks-ci-pipeline, orttraining-linux-ci-pipeline, orttraining-linux-gpu-ci-pipeline, orttraining-ortmodule-distributed

@yufenglee
Member

/azp run Linux CPU CI Pipeline, Linux CPU Minimal Build E2E CI Pipeline, Linux GPU CI Pipeline, Linux GPU TensorRT CI Pipeline, Linux OpenVINO CI Pipeline, Linux QNN CI Pipeline, MacOS CI Pipeline, Windows ARM64 QNN CI Pipeline, Windows CPU CI Pipeline, Windows GPU CI Pipeline


Azure Pipelines successfully started running 6 pipeline(s).


Azure Pipelines successfully started running 10 pipeline(s).

@yuwenzho
Contributor Author

@yufenglee Could you please help run the test again? Thanks!

@yufenglee
Member

/azp run Windows GPU TensorRT CI Pipeline, Windows x64 QNN CI Pipeline, onnxruntime-binary-size-checks-ci-pipeline, orttraining-linux-ci-pipeline, orttraining-linux-gpu-ci-pipeline, orttraining-ortmodule-distributed

@yufenglee
Member

/azp run Linux CPU CI Pipeline, Linux CPU Minimal Build E2E CI Pipeline, Linux GPU CI Pipeline, Linux GPU TensorRT CI Pipeline, Linux OpenVINO CI Pipeline, Linux QNN CI Pipeline, MacOS CI Pipeline, Windows ARM64 QNN CI Pipeline, Windows CPU CI Pipeline, Windows GPU CI Pipeline


Azure Pipelines successfully started running 6 pipeline(s).


Azure Pipelines successfully started running 10 pipeline(s).

@yuwenzho
Contributor Author

I fixed the import path. Now the usage is:

from onnxruntime.quantization import matmul_4bits_quantizer
config = matmul_4bits_quantizer.RTNWeightOnlyQuantConfig()
# config = matmul_4bits_quantizer.GPTQWeightOnlyQuantConfig(calibration_data_reader=data_reader)
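
For completeness, a hedged end-to-end sketch of this flow (the MatMul4BitsQuantizer argument and method names are assumptions based on the discussion above, not verbatim from this PR):

import onnx
from onnxruntime.quantization import matmul_4bits_quantizer

model = onnx.load("model_fp32.onnx")  # hypothetical input model path
config = matmul_4bits_quantizer.RTNWeightOnlyQuantConfig()

# Assumed: the quantizer takes the loaded model and the config via algo_config.
quant = matmul_4bits_quantizer.MatMul4BitsQuantizer(model, algo_config=config)
quant.process()  # assumed method that applies INT4 weight-only quantization in place
quant.model.save_model_to_file("model_int4.onnx")  # assumed save helper; path is hypothetical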

@yufenglee Could you please help run the test again? Thank you!

@yufenglee
Member

/azp run Windows GPU TensorRT CI Pipeline, Windows x64 QNN CI Pipeline, onnxruntime-binary-size-checks-ci-pipeline, orttraining-linux-ci-pipeline, orttraining-linux-gpu-ci-pipeline, orttraining-ortmodule-distributed

@yufenglee
Member

/azp run Linux CPU CI Pipeline, Linux CPU Minimal Build E2E CI Pipeline, Linux GPU CI Pipeline, Linux GPU TensorRT CI Pipeline, Linux OpenVINO CI Pipeline, Linux QNN CI Pipeline, MacOS CI Pipeline, Windows ARM64 QNN CI Pipeline, Windows CPU CI Pipeline, Windows GPU CI Pipeline


Azure Pipelines successfully started running 6 pipeline(s).


Azure Pipelines successfully started running 10 pipeline(s).

@yufenglee merged commit 731b50d into microsoft:main Jan 10, 2024
54 checks passed