
[INT4 Quantization] failed:Node (/query_key_value/MatMul_Q4) Op (MatMulFpQ4) [ShapeInferenceError] 4b quantization not yet supported on this hardware platform! #17883

Closed
zjc664656505 opened this issue Oct 11, 2023 · 2 comments
Labels
ep:CUDA (issues related to the CUDA execution provider), quantization (issues related to quantization)

Comments

@zjc664656505

zjc664656505 commented Oct 11, 2023

Describe the issue

I recently used Intel's neural_compressor and ONNX Runtime's matmul_weight4_quantizer.py to apply INT4 weight-only quantization to my model. The model quantizes successfully. However, when I create a session and run inference on it, I get the error onnxruntime.capi.onnxruntime_pybind11_state.Fail: [ONNXRuntimeError] : 1 : FAIL : Load model from /path/to/Llama-2-7b-hf-onnx-int4/decoder_model.onnx failed:Node (/query_key_value/MatMul_Q4) Op (MatMulFpQ4) [ShapeInferenceError] 4b quantization not yet supported on this hardware platform!

My current ONNX Runtime version is 1.16.0, which should support INT4-quantized models according to the documentation (documentation link1, documentation link2).

I have also tried different providers, including CPUExecutionProvider and CUDAExecutionProvider, but none of them can load the quantized INT4 model. The CPU and GPU I'm using are an AMD Ryzen Threadripper 3970X 32-Core Processor and an NVIDIA RTX A6000, respectively.

Note that I installed ONNX Runtime with pip install onnxruntime==1.16.0.
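
A quick sanity check of the installed build (a minimal sketch; note that the CPU-only onnxruntime wheel from pip does not include the CUDA execution provider, which ships in the onnxruntime-gpu package):

import onnxruntime as ort

print(ort.__version__)                # expect 1.16.0
print(ort.get_available_providers())  # the CPU-only wheel will not list CUDAExecutionProvider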

Also, a similar issue has been posted in the intel/neural_compressor repo. Could anyone help me with this?

@chenfucn @edgchen1

Any help would be appreciated.

To reproduce

To reproduce this issue, please follow these steps:

  1. Quantize the model with matmul_weight4_quantizer.py:
from pathlib import Path
from onnxruntime.quantization.quant_utils import load_model_with_shape_infer  # import paths assume the 1.16 quantization tools layout
from onnxruntime.quantization.matmul_weight4_quantizer import MatMulWeight4Quantizer

om = load_model_with_shape_infer(Path("/path/to/Llama-2-7b-hf-onnx-int4/decoder_model.onnx"))
quant = MatMulWeight4Quantizer(om, 0)
quant.process()
quant.model.save_model_to_file("/path/to/Llama-2-7b-hf-onnx-int4/decoder_model.onnx", True)
  2. Quantize the model with neural_compressor:
from neural_compressor import quantization, PostTrainingQuantConfig
from neural_compressor.utils.constant import FP32

config = PostTrainingQuantConfig(
    approach="weight_only",
    calibration_sampling_size=[8],
    op_type_dict={".*": {"weight": {"bits": 4, 
                                    "algorithm": ["GPTQ"], 
                                    "scheme": ["asym"], 
                                    "group_size": 32}}},
    op_name_dict={"/lm_head/MatMul":FP32}) # disable lm_head

q_model = quantization.fit(
    "/path/to/llama2_7b/decoder_model.onnx",  # FP32 model path
    config,
    calib_dataloader=dataloader)  # dataloader: user-provided calibration dataloader (not shown here)
q_model.save("/path/to/Llama-2-7b-hf-onnx-int4/decoder_model.onnx")  # INT4 model path
  3. Run the model:
import onnxruntime

session = onnxruntime.InferenceSession(
    "/path/to/Llama-2-7b-hf-onnx-int4/decoder_model.onnx",  # INT4 model path (loading this model produces the error above)
    providers=["CUDAExecutionProvider", "CPUExecutionProvider"])

session.run(["lm_head"], input_for_export)  # input_for_export: input feed dict prepared for export (not shown here)

Urgency

This is urgent; I hope it can be resolved as soon as possible!

Platform

Ubuntu

OS Version

20.04.6

ONNX Runtime Installation

Released Package

ONNX Runtime Version or Commit ID

1.16.0

ONNX Runtime API

Python

Architecture

X64

Execution Provider

CUDA

Execution Provider Library Version

CUDA 12.1

@zjc664656505
Author

zjc664656505 commented Oct 16, 2023

The issue is directly related to the hardware platform's instruction set (see: onnxruntime int4 supported hardware platform). Currently, the onnxruntime MatMulFpQ4 operator only supports CPUs with the AVX512 instruction set. I believe this is a significant restriction, especially since so many devices, such as PCs with AMD chips or edge devices such as Android phones with ARM chips, do not support AVX512.
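
As a rough check (a minimal Linux-only sketch that reads /proc/cpuinfo), the CPU's advertised feature flags can be inspected for AVX512 support:

import re

def has_avx512() -> bool:
    # Look for any avx512* feature flag reported by the kernel.
    with open("/proc/cpuinfo") as f:
        flags = " ".join(line for line in f if line.startswith("flags"))
    return re.search(r"\bavx512", flags) is not None

print("AVX512 available:", has_avx512())  # False on the Threadripper 3970X (Zen 2)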

Is there any plan to extend the compatible hardware platforms for the INT4 operator?

@chenfucn

@chenfucn
Contributor

chenfucn commented Oct 16, 2023

As the error message states, this feature is not supported on all hardware platforms. Expanding support to other platforms requires writing hand-tuned compute kernels for the various instruction sets and testing them on different devices. We are trying to get resources to do this, but we are spread very thin. Currently we are working on arm64 kernels.

Please consider contributing to the onnxruntime code base to support this feature on other edge devices.

Thanks!

@chenfucn closed this as not planned Oct 16, 2023