
[INT4 Quantization] failed:Node (/query_key_value/MatMul_Q4) Op (MatMulFpQ4) [ShapeInferenceError] 4b quantization not yet supported on this hardware platform! #17883

Closed
zjc664656505 opened this issue Oct 11, 2023 · 2 comments
Labels
ep:CUDA (issues related to the CUDA execution provider), quantization (issues related to quantization)

Comments

@zjc664656505

zjc664656505 commented Oct 11, 2023

Describe the issue

I recently used Intel's neural_compressor and ONNX Runtime's matmul_weight4_quantizer.py to apply INT4 weight-only quantization to my model. The model quantizes successfully. However, when I create a session and run inference on it, I get the error onnxruntime.capi.onnxruntime_pybind11_state.Fail: [ONNXRuntimeError] : 1 : FAIL : Load model from /path/to/Llama-2-7b-hf-onnx-int4/decoder_model.onnx failed:Node (/query_key_value/MatMul_Q4) Op (MatMulFpQ4) [ShapeInferenceError] 4b quantization not yet supported on this hardware platform!

My current ONNX Runtime version is 1.16.0, which should support INT4-quantized models according to the documentation (documentation link1, documentation link2).

I have also tried different providers, including CPUExecutionProvider and CUDAExecutionProvider, but none of them can load the quantized INT4 model. The CPU and GPU I'm using are an AMD Ryzen Threadripper 3970X 32-Core Processor and an NVIDIA RTX A6000, respectively.

Note that I installed ONNX Runtime with pip install onnxruntime==1.16.0.
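
A quick sanity check of the installed build (a minimal sketch; note that the CPU-only onnxruntime wheel from pip does not include the CUDA execution provider, which ships in the onnxruntime-gpu package):

import onnxruntime as ort

print(ort.__version__)                # expect 1.16.0
print(ort.get_available_providers())  # the CPU-only wheel will not list CUDAExecutionProvider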

Also, a similar issue has been posted in the intel/neural_compressor repo. Could anyone help me with this?

@chenfucn @edgchen1

Any help would be appreciated.

To reproduce

To reproduce this issue, please follow these steps:

  1. Quantize the model with matmul_weight4_quantizer.py:
from pathlib import Path
from onnxruntime.quantization.quant_utils import load_model_with_shape_infer  # import paths assume the 1.16 quantization tools layout
from onnxruntime.quantization.matmul_weight4_quantizer import MatMulWeight4Quantizer

om = load_model_with_shape_infer(Path("/path/to/Llama-2-7b-hf-onnx-int4/decoder_model.onnx"))
quant = MatMulWeight4Quantizer(om, 0)
quant.process()
quant.model.save_model_to_file("/path/to/Llama-2-7b-hf-onnx-int4/decoder_model.onnx", True)
  2. Quantize the model with neural_compressor:
from neural_compressor import quantization, PostTrainingQuantConfig
from neural_compressor.utils.constant import FP32

config = PostTrainingQuantConfig(
    approach="weight_only",
    calibration_sampling_size=[8],
    op_type_dict={".*": {"weight": {"bits": 4, 
                                    "algorithm": ["GPTQ"], 
                                    "scheme": ["asym"], 
                                    "group_size": 32}}},
    op_name_dict={"/lm_head/MatMul":FP32}) # disable lm_head

q_model = quantization.fit(
    "/path/to/llama2_7b/decoder_model.onnx",  # FP32 model path
    config,
    calib_dataloader=dataloader)  # dataloader: user-provided calibration dataloader (not shown here)
q_model.save("/path/to/Llama-2-7b-hf-onnx-int4/decoder_model.onnx")  # INT4 model path
  3. Run the model:
import onnxruntime

session = onnxruntime.InferenceSession(
    "/path/to/Llama-2-7b-hf-onnx-int4/decoder_model.onnx",  # INT4 model path (loading this model produces the error above)
    providers=["CUDAExecutionProvider", "CPUExecutionProvider"])

session.run(["lm_head"], input_for_export)  # input_for_export: input feed dict prepared for export (not shown here)

Urgency

This is urgent; I hope it can be resolved as soon as possible!

Platform

Ubuntu

OS Version

20.04.6

ONNX Runtime Installation

Released Package

ONNX Runtime Version or Commit ID

1.16.0

ONNX Runtime API

Python

Architecture

X64

Execution Provider

CUDA

Execution Provider Library Version

CUDA 12.1

@zjc664656505
Author

zjc664656505 commented Oct 16, 2023

The issue is directly related to the hardware platform's instruction set (see: onnxruntime int4 supported hardware platform). Currently, the onnxruntime MatMulFpQ4 operator only supports CPUs with the AVX512 instruction set. I believe this is a significant restriction, especially since so many devices, such as PCs with AMD chips or edge devices such as Android phones with ARM chips, do not support AVX512.
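
As a rough check (a minimal Linux-only sketch that reads /proc/cpuinfo), the CPU's advertised feature flags can be inspected for AVX512 support:

import re

def has_avx512() -> bool:
    # Look for any avx512* feature flag reported by the kernel.
    with open("/proc/cpuinfo") as f:
        flags = " ".join(line for line in f if line.startswith("flags"))
    return re.search(r"\bavx512", flags) is not None

print("AVX512 available:", has_avx512())  # False on the Threadripper 3970X (Zen 2)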

Is there any plan to extend the compatible hardware platforms for the INT4 operator?

@chenfucn

@chenfucn
Contributor

chenfucn commented Oct 16, 2023

As the error message states, this feature is not supported on all hardware platforms. Expanding support to other platforms requires writing hand-tuned compute kernels for the various instruction sets and testing them on different devices. We are trying to get resources to do this, but we are spread very thin. Currently we are working on arm64 kernels.

Please consider contributing to the onnxruntime code base to support this feature on other edge devices.

Thanks!

@chenfucn closed this as not planned Oct 16, 2023