[QNN EP] Support per-channel quantized weights #20154

Merged
adrianlizarraga merged 32 commits into main from adrianl/qnn-per-channel-quant on Apr 16, 2024

Conversation

@adrianlizarraga (Contributor) commented Mar 30, 2024

Description

  • Adds general support for per-channel quantized weights to QNN EP (HTP backend).
  • Adds QNN EP unit tests for per-channel Conv.
  • Updates the quantization tool to allow selecting which ops are quantized per-channel (and along which axis) via tensor-level overrides. Currently, setting `per_channel=True` assumes that all Conv, MatMul, Gemm, InstanceNormalization, and LayerNormalization ops should be quantized per-channel using a default axis for each op type.

Example: creating a QDQ per-channel Conv model

from onnxruntime.quantization import CalibrationDataReader, QuantType, quantize
from onnxruntime.quantization.execution_providers.qnn import get_qnn_qdq_config, qnn_preprocess_model

class DataReader(CalibrationDataReader):
    # TODO: See ONNX Runtime QNN docs for example of a data reader
    # https://onnxruntime.ai/docs/execution-providers/QNN-ExecutionProvider.html#generating-a-quantized-model-x64
    pass

if __name__ == "__main__":
    input_model_path = "model.onnx"

    # Pre-process the original float32 model.
    preproc_model_path = "model.preproc.onnx"
    model_changed = qnn_preprocess_model(input_model_path, preproc_model_path)
    model_to_quantize = preproc_model_path if model_changed else input_model_path

    # Create the calibration data reader for the model that will be quantized.
    my_data_reader = DataReader(model_to_quantize)

    # RELEVANT TO THIS PR:
    # Make sure Conv's weight input is quantized to int8/symmetric/per-channel with axis == 0.
    # The presence of the 'axis' key indicates that this is a per-channel quantized weight.
    init_overrides = {'weight': [{'axis': 0, 'quant_type': QuantType.QInt8, 'symmetric': True}]}

    qnn_config = get_qnn_qdq_config(model_to_quantize,
                                    my_data_reader,
                                    init_overrides=init_overrides,
                                    activation_type=QuantType.QUInt16, # uint16 activations
                                    weight_type=QuantType.QUInt8)      # uint8 weights by default

    quantize(model_to_quantize, "model.qdq.onnx", qnn_config)
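
As an alternative to explicit tensor-level overrides, the description above notes that the tool can quantize all supported ops per-channel by default. The snippet below is a minimal sketch of that path, reusing `model_to_quantize` and `my_data_reader` from the script above and assuming `get_qnn_qdq_config` exposes the `per_channel` flag described in this PR; the output file name is arbitrary and the snippet is illustrative rather than authoritative.

```python
# Sketch: rely on the tool's default per-channel behavior instead of explicit overrides.
# Assumes get_qnn_qdq_config accepts a per_channel flag (as described in this PR); with it,
# Conv/MatMul/Gemm/InstanceNormalization/LayerNormalization weights are quantized
# per-channel along the tool's default axis for each op type.
qnn_config_per_channel = get_qnn_qdq_config(model_to_quantize,
                                            my_data_reader,
                                            activation_type=QuantType.QUInt16,  # uint16 activations
                                            weight_type=QuantType.QUInt8,       # uint8 weights by default
                                            per_channel=True)                   # per-channel for supported ops

quantize(model_to_quantize, "model.per_channel.qdq.onnx", qnn_config_per_channel)
```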

float32 model:
![float32 model](https://github.com/microsoft/onnxruntime/assets/19691973/ca650e49-1ad0-47d8-8c46-17fbc224ca39)

QDQ model (per-channel Conv weight):
![QDQ model with per-channel Conv weight](https://github.com/microsoft/onnxruntime/assets/19691973/6bd469f2-968b-4d11-9526-09b3e71f98e7)

Motivation and Context

Support more models, especially models with int4 quantized weights.

@adrianlizarraga adrianlizarraga changed the title [DRAFT][QNN EP] Support per-channel quantized weights [QNN EP] Support per-channel quantized weights Apr 8, 2024
@adrianlizarraga adrianlizarraga marked this pull request as ready for review April 8, 2024 17:27
@HectorSVC (Contributor) left a comment

:shipit:

@skottmckay (Contributor) commented Apr 15, 2024

Does the CPU EP implementation support this? e.g. there are comments like this in places.

// TODO(fuchen): need to augment this when we support per row quantization
using ONNX_TENSOR_ELEM_TYPE = ONNX_NAMESPACE::TensorProto::DataType;
Initializer q_zero_point(*q_zp_tensor_proto, graph.ModelPath());
Initializer dq_zero_point(*dq_zp_tensor_proto, graph.ModelPath());
if (q_zero_point.size() != 1 ||
    dq_zero_point.size() != 1 ||

// - DQ node is currently ignored if it uses per-channel quantization
// - supporting per-channel quantization requires modifying the scales and zero point data, which can be done
// if/when there's a use-case to justify the development cost.

@adrianlizarraga (Contributor, Author)

> Does the CPU EP implementation support this? e.g. there are comments like this in places.
>
> // TODO(fuchen): need to augment this when we support per row quantization
> using ONNX_TENSOR_ELEM_TYPE = ONNX_NAMESPACE::TensorProto::DataType;
> Initializer q_zero_point(*q_zp_tensor_proto, graph.ModelPath());
> Initializer dq_zero_point(*dq_zp_tensor_proto, graph.ModelPath());
> if (q_zero_point.size() != 1 ||
>     dq_zero_point.size() != 1 ||
>
> // - DQ node is currently ignored if it uses per-channel quantization
> // - supporting per-channel quantization requires modifying the scales and zero point data, which can be done
> //   if/when there's a use-case to justify the development cost.

Thanks for taking a look.

This PR focuses on Conv with per-channel quantization on constant weight/bias inputs. Although the changes to QNN EP can potentially support any per-channel weights in the future, the focus is still on Conv. Various models with per-channel Conv have been tested on QNN EP with this branch. This PR also has basic operator-level unit tests that compare QDQ per-channel Conv for CPU EP and QNN EP.

Whether all optimizers used by the CPU EP properly support per-channel quantization for dynamic and static inputs is another matter that I would certainly consult with you and others about. Even if optimizers don't fully support per-channel quantization, I would at least expect them to detect the quantization mode and correctly opt out of scenarios that are not handled. In the snippets you linked, this seems to be the case (e.g., if the DQ has multiple zero-point/scale values, don't continue). I would hope that no optimizer assumes per-tensor quantization without checking.
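
For illustration, the kind of per-tensor check the linked snippets perform can be expressed with the ONNX Python API roughly as follows. This is a hypothetical sketch (the helper name and standalone script are mine, not ORT optimizer code), assuming the model's DequantizeLinear scales are stored as constant initializers:

```python
import onnx
from onnx import numpy_helper

def dq_uses_per_tensor_quant(dq_node, initializers):
    """Hypothetical helper: True if a DequantizeLinear node has a single (per-tensor) scale."""
    scale = initializers.get(dq_node.input[1])  # input[1] of DequantizeLinear is the scale
    if scale is None:
        return False  # scale is not a constant initializer; treat conservatively
    return numpy_helper.to_array(scale).size == 1

model = onnx.load("model.qdq.onnx")
inits = {init.name: init for init in model.graph.initializer}
for node in model.graph.node:
    if node.op_type == "DequantizeLinear" and not dq_uses_per_tensor_quant(node, inits):
        # An optimizer that only handles per-tensor quantization should skip this node
        # (or explicitly handle the DQ 'axis' attribute and per-channel scale/zero-point arrays).
        print(f"{node.name or node.output[0]}: per-channel DequantizeLinear")
```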

@skottmckay (Contributor)

> This PR also has basic operator-level unit tests that compare QDQ per-channel Conv for CPU EP and QNN EP.

Should be fine if this is the case.

@adrianlizarraga merged commit f644ff9 into main on Apr 16, 2024
90 of 94 checks passed
@adrianlizarraga deleted the adrianl/qnn-per-channel-quant branch on April 16, 2024 at 15:45
TedThemistokleous pushed a commit to TedThemistokleous/onnxruntime that referenced this pull request May 7, 2024