[quant] supports act_order inputs in Matmulnbits and new quantization algorithm "hqq" #19106

Merged
6 commits merged into main from jicwen/matmulnbits_gptq on Mar 5, 2024

Conversation

@wejoncy (Contributor) commented on Jan 12, 2024

Description

  1. Support quantized GPTQ weights from Hugging Face, such as [TheBloke/Llama-2-7B-Chat-GPTQ](https://huggingface.co/TheBloke/Llama-2-7B-Chat-GPTQ)
  2. Support act_order for GPTQ (sketched below)
  3. Support the [HQQ](https://mobiusml.github.io/hqq_blog/) algorithm to quantize MatMul weights, and add a quantization script (usage sketched right after this list)
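
A minimal usage sketch for item 3, assuming the `MatMul4BitsQuantizer` and `HQQWeightOnlyQuantConfig` API in `onnxruntime.quantization.matmul_4bits_quantizer` that this PR extends; exact class and parameter names may differ between releases:

```python
# Hedged sketch: quantize MatMul weights with the new "hqq" algorithm.
# The model path is hypothetical; the API names are assumptions based on
# the matmul_4bits_quantizer module, not a definitive reference.
import onnx
from onnxruntime.quantization import matmul_4bits_quantizer

model = onnx.load("llama2-7b-chat.onnx")  # hypothetical input model

# HQQ config: 4-bit weights with per-group scales over blocks of 128 values.
algo_config = matmul_4bits_quantizer.HQQWeightOnlyQuantConfig(block_size=128, bits=4)

quant = matmul_4bits_quantizer.MatMul4BitsQuantizer(
    model, block_size=128, is_symmetric=False, algo_config=algo_config
)
quant.process()  # rewrites eligible MatMul nodes to MatMulNBits
quant.model.save_model_to_file("llama2-7b-chat-hqq.onnx", use_external_data_format=True)
```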

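For item 2, act_order is why MatMulNBits gains a per-channel group-index input (called `g_idx` in GPTQ tooling): activation-ordered GPTQ permutes the input channels, so a channel's quantization group can no longer be derived as `k // group_size`. A rough numpy illustration of dequantization with such an index; all shapes and names are hypothetical:

```python
import numpy as np

# Hypothetical sizes: K input channels, N output channels, 4-bit weights
# already unpacked to one integer per element.
K, N, group_size = 8, 4, 4
n_groups = K // group_size
rng = np.random.default_rng(0)

q = rng.integers(0, 16, size=(K, N))                     # unpacked 4-bit weights
scales = rng.random(size=(n_groups, N), dtype=np.float32)
zero_points = np.full((n_groups, N), 8, dtype=np.int32)  # midpoint zero-point

# Without act_order, channel k belongs to group k // group_size. With
# act_order the mapping is an explicit, generally non-contiguous index.
g_idx = rng.permutation(np.repeat(np.arange(n_groups), group_size))

# Dequantize: look up each channel's group for its scale and zero-point.
w = (q - zero_points[g_idx]) * scales[g_idx]             # shape (K, N)
```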
@wejoncy requested a review from yufenglee on January 12, 2024 07:26
@wejoncy changed the title from “Jicwen/matmulnbits gptq” to “[quant] matmulnbits support 2-8 bits, act_order, gptq/hqq” on Jan 12, 2024
@wejoncy changed the title to “[quant] matmulnbits supports 2-8 bits with act_order and new quantization "hqq"” on Jan 12, 2024
@wejoncy force-pushed the jicwen/matmulnbits_gptq branch from 01bfa24 to 555e34a on January 12, 2024 08:25
@wejoncy changed the title to “[quant] supports 2-8 bits kernel with act_order inputs in Op Matmulnbits and new quantization "hqq"” on Jan 12, 2024
@wejoncy force-pushed the jicwen/matmulnbits_gptq branch 2 times, most recently from ed1cd8c to 1594d7b, on January 20, 2024 11:45
@wejoncy force-pushed the jicwen/matmulnbits_gptq branch from 6117551 to 2e58ea2 on February 7, 2024 05:59
@wejoncy force-pushed the jicwen/matmulnbits_gptq branch from de14fbc to 1a2328e on February 23, 2024 05:53
@wejoncy changed the title to “[quant] supports act_order inputs in Matmulnbits and new quantization algorithm "hqq"” on Feb 23, 2024
@wejoncy marked this pull request as ready for review on February 26, 2024 03:23
@wejoncy force-pushed the jicwen/matmulnbits_gptq branch 2 times, most recently from 8084a75 to 1167ad7, on February 26, 2024 08:59
@wejoncy force-pushed the jicwen/matmulnbits_gptq branch from e011c2b to 012227c on February 27, 2024 03:09
@wejoncy force-pushed the jicwen/matmulnbits_gptq branch from 012227c to fd51543 on February 27, 2024 03:21
temp_models/test_llama.py: code scanning alerts fixed (16 entries)
@wejoncy force-pushed the jicwen/matmulnbits_gptq branch from 0deef36 to 9449f21 on February 27, 2024 10:13
@wejoncy force-pushed the jicwen/matmulnbits_gptq branch from 9449f21 to 6a3caa6 on February 28, 2024 03:06
@yufenglee (Member) left a comment

:shipit:

@wejoncy merged commit 7e613ee into main on Mar 5, 2024
95 checks passed
@wejoncy deleted the jicwen/matmulnbits_gptq branch on March 5, 2024 03:45
zz002 pushed a commit to zz002/onnxruntime that referenced this pull request on Mar 7, 2024
… algorithm "hqq" (microsoft#19106)

adrianlizarraga added a commit that referenced this pull request on Mar 29, 2024
…20146)

### Description
Fixes code that extracts the accuracy level when creating a MatMulNBits
node in the `DefaultWeightOnlyQuantizer` class.


### Motivation and Context
Error from line 443: `AttributeError: 'DefaultWeightOnlyQuantizer'
object has no attribute 'accuracy_level'`. The solution is to access
`self.config.accuracy_level` instead of `self.accuracy_level`.
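
A minimal sketch of the pattern being fixed; only `DefaultWeightOnlyQuantizer`, `config`, and `accuracy_level` come from the report above, the surrounding scaffolding is illustrative:

```python
class WeightOnlyQuantConfig:
    """Illustrative stand-in for the real quantization config."""
    def __init__(self, accuracy_level=None):
        self.accuracy_level = accuracy_level

class DefaultWeightOnlyQuantizer:
    def __init__(self, config: WeightOnlyQuantConfig):
        self.config = config  # accuracy_level lives on the config object

    def matmul_nbits_attrs(self) -> dict:
        # Buggy:  self.accuracy_level        -> AttributeError
        # Fixed:  self.config.accuracy_level
        return {"accuracy_level": self.config.accuracy_level}
```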

Relevant commit: #19106
TedThemistokleous pushed a commit to TedThemistokleous/onnxruntime that referenced this pull request on May 7, 2024
…icrosoft#20146)
