[QNN EP] Support per-channel quantized weights #20154
Conversation
…el that had axis > scale.shape
…g context cache tensors
onnxruntime/core/providers/qnn/builder/qnn_quant_params_wrapper.cc
…or quant override validation error messages
Does the CPU EP implementation support this? e.g. there are comments like this in places.
onnxruntime/onnxruntime/core/optimizer/transpose_optimization/onnx_transpose_optimization.cc Lines 1103 to 1105 in 287ecea
Thanks for taking a look. This PR focuses on Conv with per-channel quantization on constant weight/bias inputs. Although the changes to QNN EP can potentially support any per-channel weights in the future, the focus is still on Conv. Various models with per-channel Conv have been tested on QNN EP with this branch, and this PR also has basic operator-level unit tests that compare QDQ per-channel Conv between the CPU EP and QNN EP.

Whether all optimizers used by the CPU EP properly support per-channel quantization for dynamic and static inputs is another matter that I would certainly consult with you and others about. Even if optimizers don't fully support per-channel, I would at least expect them to detect the quantization mode and correctly opt out of scenarios that are not handled. In the snippets you linked, this seems to be the case (e.g., if the DQ has multiple zero-point/scale values, don't continue). I would hope that optimizers don't assume per-tensor quantization without checking.
Should be fine if this is the case.
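For illustration only, here is a minimal Python sketch of the kind of opt-out check described in the reply above. This is not code from this PR, and the optimizer referenced in the linked snippet is implemented in C++; the sketch just shows, using the `onnx` package, how a transform that only handles per-tensor quantization could verify that a DequantizeLinear node has a single scale value before proceeding.

```python
# Hypothetical helper, for illustration only: decide whether a graph transform
# that assumes per-tensor quantization may process a DequantizeLinear node.
import onnx
from onnx import numpy_helper


def dq_is_per_tensor(dq_node: onnx.NodeProto, graph: onnx.GraphProto) -> bool:
    """Return True only if the DequantizeLinear node's scale is a single value."""
    initializers = {init.name: init for init in graph.initializer}
    scale_init = initializers.get(dq_node.input[1])  # DQ inputs: x, x_scale, x_zero_point
    if scale_init is None:
        return False  # Scale is not a constant initializer; be conservative and opt out.
    scale = numpy_helper.to_array(scale_init)
    return scale.size == 1  # Per-channel DQ has one scale per channel along some axis.
```

A per-tensor-only optimization would call a check like this and skip the node when it returns False, rather than assuming a scalar scale.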
### Description
- Adds general support for per-channel quantized weights to QNN EP (HTP backend).
- Adds QNN EP unit tests for per-channel Conv.
- Updates the quantization tool to allow selecting which ops are quantized per-channel (and along which axis) via tensor-level overrides. Currently, setting `per_channel=True` assumes all Conv, MatMul, Gemm, InstanceNormalization, and LayerNormalization ops should be quantized per-channel using an assumed default axis.

#### Creating a QDQ per-channel Conv model example
```python
from onnxruntime.quantization import CalibrationDataReader, QuantType, quantize
from onnxruntime.quantization.execution_providers.qnn import get_qnn_qdq_config, qnn_preprocess_model


class DataReader(CalibrationDataReader):
    # TODO: See ONNX Runtime QNN docs for an example of a data reader:
    # https://onnxruntime.ai/docs/execution-providers/QNN-ExecutionProvider.html#generating-a-quantized-model-x64
    pass


if __name__ == "__main__":
    input_model_path = "model.onnx"

    # Pre-process the original float32 model.
    preproc_model_path = "model.preproc.onnx"
    model_changed = qnn_preprocess_model(input_model_path, preproc_model_path)
    model_to_quantize = preproc_model_path if model_changed else input_model_path

    my_data_reader = DataReader(model_to_quantize)

    # RELEVANT TO THIS PR:
    # Make sure Conv's weight input is quantized to int8/symmetric/per-channel with axis == 0.
    # The presence of the 'axis' key indicates that this is a per-channel quantized weight.
    init_overrides = {"weight": [{"axis": 0, "quant_type": QuantType.QInt8, "symmetric": True}]}

    qnn_config = get_qnn_qdq_config(
        model_to_quantize,
        my_data_reader,
        init_overrides=init_overrides,
        activation_type=QuantType.QUInt16,  # uint16 activations
        weight_type=QuantType.QUInt8,       # uint8 weights by default
    )

    quantize(model_to_quantize, "model.qdq.onnx", qnn_config)
```

float32 model:
<img width="683" alt="image" src="https://github.com/microsoft/onnxruntime/assets/19691973/ca650e49-1ad0-47d8-8c46-17fbc224ca39">

QDQ model (per-channel Conv weight):
<img width="748" alt="image" src="https://github.com/microsoft/onnxruntime/assets/19691973/6bd469f2-968b-4d11-9526-09b3e71f98e7">
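For readers unfamiliar with per-channel quantization, the following numpy sketch (illustrative only, not code from this PR; the shapes and symmetric int8 scheme are assumptions) shows what the `'axis': 0` override above means for a Conv weight of shape `[out_channels, in_channels, kH, kW]`: there is one scale per output channel, and DequantizeLinear with `axis=0` applies each scale along the first dimension.

```python
# Illustrative sketch: per-channel (axis 0) symmetric int8 quantization of a Conv weight.
import numpy as np

weight = np.random.randn(8, 3, 3, 3).astype(np.float32)  # example Conv weight

# One scale per output channel (axis 0): scales.shape == (8,)
max_abs = np.abs(weight).max(axis=(1, 2, 3))
scales = max_abs / 127.0

# Quantize each output channel with its own scale.
q_weight = np.clip(np.round(weight / scales[:, None, None, None]), -127, 127).astype(np.int8)

# Dequantize: equivalent to DequantizeLinear with axis=0 and per-channel scales.
deq_weight = q_weight.astype(np.float32) * scales[:, None, None, None]
```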
### Motivation and Context
Support more models, especially models with int4 quantized weights.