[INT4 Quantization] failed:Node (/query_key_value/MatMul_Q4) Op (MatMulFpQ4) [ShapeInferenceError] 4b quantization not yet supported on this hardware platform!
#17883
Closed
zjc664656505 opened this issue on Oct 11, 2023 · 2 comments
Describe the issue
I recently used intel/neural-compressor and the onnxruntime matmul_weight4_quantizer.py script to apply int4 weight-only quantization to my models. The model is quantized successfully. However, when I create a session and run inference on it, I get the error:

onnxruntime.capi.onnxruntime_pybind11_state.Fail: [ONNXRuntimeError] : 1 : FAIL : Load model from /path/to/Llama-2-7b-hf-onnx-int4/decoder_model.onnx failed:Node (/query_key_value/MatMul_Q4) Op (MatMulFpQ4) [ShapeInferenceError] 4b quantization not yet supported on this hardware platform!

My current onnxruntime version is 1.16.0, which should support int4-quantized models according to the documentation (documentation link1, documentation link2).

I have also tried different providers, including CPUExecutionProvider and CUDAExecutionProvider, but none of them works for loading the quantized int4 model. The CPU and GPU that I'm using are an AMD Ryzen Threadripper 3970X 32-Core Processor and an Nvidia RTX A6000, respectively. Note that I installed onnxruntime using pip install onnxruntime==1.16.0.

A similar issue has also been posted in the intel/neural-compressor repo. Could anyone help me with this issue? @chenfucn @edgchen1 Is there any help?

To reproduce
To reproduce this issue, please follow the instructions in:
- matmul_weight4_quantizer.py
- neural_compressor

Urgency
This is urgent and I hope it can be resolved as soon as possible!

Platform
Ubuntu
OS Version
20.04.6
ONNX Runtime Installation
Released Package
ONNX Runtime Version or Commit ID
1.16.0
ONNX Runtime API
Python
Architecture
X64
Execution Provider
CUDA
Execution Provider Library Version
CUDA 12.1
zjc664656505 changed the title from "failed:Node (/query_key_value/MatMul_Q4) Op (MatMulFpQ4) [ShapeInferenceError] 4b quantization not yet supported on this hardware platform!" to "[INT4 Quantization] failed:Node (/query_key_value/MatMul_Q4) Op (MatMulFpQ4) [ShapeInferenceError] 4b quantization not yet supported on this hardware platform!" on Oct 11, 2023
The issue is directly related to the hardware platform's instruction set; see onnxruntime int4 supported hardware platforms. Currently, the onnxruntime MatMulFpQ4 operator only supports CPUs with the AVX512 instruction set. I believe this is a significant restriction, especially since so many devices, such as PCs with AMD chips or edge devices such as Android phones with ARM chips, do not support AVX512.
Is there any plan to extend the set of compatible hardware platforms for the INT4 operator?
As the error message clearly states, this feature is not supported on all hardware platforms. Expanding support to other platforms requires writing hand-tuned compute kernels for the various instruction sets and testing them on different devices. While we are trying to get resources to do this, we are spread very thin. Currently we are working on arm64 kernels.
Please consider contributing to the onnxruntime code base to support this feature on other edge devices.