[QNN EP] Support per-channel quantized weights #20154

Merged
adrianlizarraga merged 32 commits into main from adrianl/qnn-per-channel-quant on Apr 16, 2024

Conversation

@adrianlizarraga (Contributor) commented Mar 30, 2024

Description

  • Adds general support for per-channel quantized weights to QNN EP (HTP backend).
  • Adds QNN EP unit tests for per-channel Conv.
  • Updates the quantization tool to allow selecting which ops are quantized per-channel (and along which axis) via tensor-level overrides. Currently, setting `per_channel=True` assumes that all Conv, MatMul, Gemm, InstanceNormalization, and LayerNormalization ops should be quantized per-channel using a default axis for each op type.

Example: creating a QDQ per-channel Conv model

from onnxruntime.quantization import CalibrationDataReader, QuantType, quantize
from onnxruntime.quantization.execution_providers.qnn import get_qnn_qdq_config, qnn_preprocess_model

class DataReader(CalibrationDataReader):
    # TODO: See ONNX Runtime QNN docs for example of a data reader
    # https://onnxruntime.ai/docs/execution-providers/QNN-ExecutionProvider.html#generating-a-quantized-model-x64
    pass

if __name__ == "__main__":
    input_model_path = "model.onnx"

    # Pre-process the original float32 model.
    preproc_model_path = "model.preproc.onnx"
    model_changed = qnn_preprocess_model(input_model_path, preproc_model_path)
    model_to_quantize = preproc_model_path if model_changed else input_model_path

    # Create the calibration data reader for the model that will be quantized.
    my_data_reader = DataReader(model_to_quantize)

    # RELEVANT TO THIS PR:
    # Make sure Conv's weight input is quantized to int8/symmetric/per-channel with axis == 0.
    # The presence of the 'axis' key indicates that this is a per-channel quantized weight.
    init_overrides = {'weight': [{'axis': 0, 'quant_type': QuantType.QInt8, 'symmetric': True}]}

    qnn_config = get_qnn_qdq_config(model_to_quantize,
                                    my_data_reader,
                                    init_overrides=init_overrides,
                                    activation_type=QuantType.QUInt16, # uint16 activations
                                    weight_type=QuantType.QUInt8)      # uint8 weights by default

    quantize(model_to_quantize, "model.qdq.onnx", qnn_config)
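
As an alternative to explicit tensor-level overrides, the description above notes that the tool can quantize all supported ops per-channel by default. The snippet below is a minimal sketch of that path, reusing `model_to_quantize` and `my_data_reader` from the script above and assuming `get_qnn_qdq_config` exposes the `per_channel` flag described in this PR; the output file name is arbitrary and the snippet is illustrative rather than authoritative.

```python
# Sketch: rely on the tool's default per-channel behavior instead of explicit overrides.
# Assumes get_qnn_qdq_config accepts a per_channel flag (as described in this PR); with it,
# Conv/MatMul/Gemm/InstanceNormalization/LayerNormalization weights are quantized
# per-channel along the tool's default axis for each op type.
qnn_config_per_channel = get_qnn_qdq_config(model_to_quantize,
                                            my_data_reader,
                                            activation_type=QuantType.QUInt16,  # uint16 activations
                                            weight_type=QuantType.QUInt8,       # uint8 weights by default
                                            per_channel=True)                   # per-channel for supported ops

quantize(model_to_quantize, "model.per_channel.qdq.onnx", qnn_config_per_channel)
```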

float32 model:
![float32 model](https://github.com/microsoft/onnxruntime/assets/19691973/ca650e49-1ad0-47d8-8c46-17fbc224ca39)

QDQ model (per-channel Conv weight):
![QDQ model with per-channel Conv weight](https://github.com/microsoft/onnxruntime/assets/19691973/6bd469f2-968b-4d11-9526-09b3e71f98e7)

Motivation and Context

Support more models, especially models with int4 quantized weights.

@adrianlizarraga adrianlizarraga changed the title [DRAFT][QNN EP] Support per-channel quantized weights [QNN EP] Support per-channel quantized weights Apr 8, 2024
@adrianlizarraga adrianlizarraga marked this pull request as ready for review April 8, 2024 17:27
@HectorSVC (Contributor) left a comment

:shipit:

@skottmckay (Contributor) commented Apr 15, 2024

Does the CPU EP implementation support this? e.g. there are comments like this in places.

// TODO(fuchen): need to augment this when we support per row quantization
using ONNX_TENSOR_ELEM_TYPE = ONNX_NAMESPACE::TensorProto::DataType;
Initializer q_zero_point(*q_zp_tensor_proto, graph.ModelPath());
Initializer dq_zero_point(*dq_zp_tensor_proto, graph.ModelPath());
if (q_zero_point.size() != 1 ||
    dq_zero_point.size() != 1 ||

// - DQ node is currently ignored if it uses per-channel quantization
// - supporting per-channel quantization requires modifying the scales and zero point data, which can be done
// if/when there's a use-case to justify the development cost.

@adrianlizarraga (Contributor, Author)

> Does the CPU EP implementation support this? e.g. there are comments like this in places.
>
> // TODO(fuchen): need to augment this when we support per row quantization
> using ONNX_TENSOR_ELEM_TYPE = ONNX_NAMESPACE::TensorProto::DataType;
> Initializer q_zero_point(*q_zp_tensor_proto, graph.ModelPath());
> Initializer dq_zero_point(*dq_zp_tensor_proto, graph.ModelPath());
> if (q_zero_point.size() != 1 ||
>     dq_zero_point.size() != 1 ||
>
> // - DQ node is currently ignored if it uses per-channel quantization
> // - supporting per-channel quantization requires modifying the scales and zero point data, which can be done
> //   if/when there's a use-case to justify the development cost.

Thanks for taking a look.

This PR focuses on Conv with per-channel quantization on constant weight/bias inputs. Although the changes to QNN EP can potentially support any per-channel weights in the future, the focus is still on Conv. Various models with per-channel Conv have been tested on QNN EP with this branch. This PR also has basic operator-level unit tests that compare QDQ per-channel Conv for CPU EP and QNN EP.

Whether all optimizers used by the CPU EP properly support per-channel quantization for dynamic and static inputs is another matter that I would certainly consult with you and others about. Even if optimizers don't fully support per-channel quantization, I would at least expect them to detect the quantization mode and correctly opt out of scenarios that are not handled. In the snippets you linked, this seems to be the case (e.g., if the DQ has multiple zero-point/scale values, don't continue). I would hope that no optimizer assumes per-tensor quantization without checking.
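
For illustration, the kind of per-tensor check the linked snippets perform can be expressed with the ONNX Python API roughly as follows. This is a hypothetical sketch (the helper name and standalone script are mine, not ORT optimizer code), assuming the model's DequantizeLinear scales are stored as constant initializers:

```python
import onnx
from onnx import numpy_helper

def dq_uses_per_tensor_quant(dq_node, initializers):
    """Hypothetical helper: True if a DequantizeLinear node has a single (per-tensor) scale."""
    scale = initializers.get(dq_node.input[1])  # input[1] of DequantizeLinear is the scale
    if scale is None:
        return False  # scale is not a constant initializer; treat conservatively
    return numpy_helper.to_array(scale).size == 1

model = onnx.load("model.qdq.onnx")
inits = {init.name: init for init in model.graph.initializer}
for node in model.graph.node:
    if node.op_type == "DequantizeLinear" and not dq_uses_per_tensor_quant(node, inits):
        # An optimizer that only handles per-tensor quantization should skip this node
        # (or explicitly handle the DQ 'axis' attribute and per-channel scale/zero-point arrays).
        print(f"{node.name or node.output[0]}: per-channel DequantizeLinear")
```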

@skottmckay (Contributor)

> This PR also has basic operator-level unit tests that compare QDQ per-channel Conv for CPU EP and QNN EP.

Should be fine if this is the case.

@adrianlizarraga merged commit f644ff9 into main on Apr 16, 2024
90 of 94 checks passed
@adrianlizarraga deleted the adrianl/qnn-per-channel-quant branch on April 16, 2024 at 15:45
TedThemistokleous pushed a commit to TedThemistokleous/onnxruntime that referenced this pull request May 7, 2024