diff --git a/docs/execution-providers/QNN-ExecutionProvider.md b/docs/execution-providers/QNN-ExecutionProvider.md
index 8adb87d9cbc51..6ad125c231bef 100644
--- a/docs/execution-providers/QNN-ExecutionProvider.md
+++ b/docs/execution-providers/QNN-ExecutionProvider.md
@@ -30,27 +30,40 @@ ONNX Runtime QNN Execution Provider has been built and tested with QNN 2.18.x an
 ## Build
 For build instructions, please see the [BUILD page](../build/eps.md#qnn).
-[prebuilt NuGet package](https://www.nuget.org/packages/Microsoft.ML.OnnxRuntime.QNN)
+
+## Pre-built Packages
+Alternatively, ONNX Runtime with QNN EP can be installed from:
+- [NuGet package](https://www.nuget.org/packages/Microsoft.ML.OnnxRuntime.QNN)
+- Nightly Python package (Windows ARM64):
+  - Requirements:
+    - Windows ARM64
+    - Python 3.11.x
+    - Numpy 1.25.2 or >= 1.26.4
+  - Install: `python -m pip install -i https://aiinfra.pkgs.visualstudio.com/PublicPackages/_packaging/ORT-Nightly/pypi/simple/ ort-nightly-qnn`

 ## Configuration Options
-The QNN Execution Provider supports a number of configuration options. The `provider_option_keys`, `provider_options_values` enable different options for the application. Each `provider_options_keys` accepts values as shown below:
+The QNN Execution Provider supports a number of configuration options. These provider options are specified as key-value string pairs.

-|`provider_options_values` for `provider_options_keys = "backend_path"`|Description|
+|`"backend_path"`|Description|
 |---|-----|
 |'libQnnCpu.so' or 'QnnCpu.dll'|Enable CPU backend. Useful for integration testing. CPU backend is a reference implementation of QNN operators|
-|'libQnnHtp.so' or 'QnnHtp.dll'|Enable Htp backend. Offloads compute to NPU.|
+|'libQnnHtp.so' or 'QnnHtp.dll'|Enable HTP backend. Offloads compute to NPU.|

-|`provider_options_values` for `provider_options_keys = "profiling_level"`|Description|
+|`"profiling_level"`|Description|
 |---|---|
 |'off'||
 |'basic'||
 |'detailed'||

-|`provider_options_values` for `provider_options_keys = "rpc_control_latency"`|Description|
+|`"rpc_control_latency"`|Description|
 |---|---|
 |microseconds (string)|allows client to set up RPC control latency in microseconds|

-|`provider_options_values` for `provider_options_keys = "htp_performance_mode"`|Description|
+|`"vtcm_mb"`|Description|
+|---|---|
+|size in MB (string)|QNN VTCM size in MB, defaults to 0 (not set)|
+
+|`"htp_performance_mode"`|Description|
 |---|---|
 |'burst'||
 |'balanced'||
@@ -62,8 +75,12 @@ The QNN Execution Provider supports a number of configuration options. The `prov
 |'power_saver'||
 |'sustained_high_performance'||

+|`"qnn_saver_path"`|Description|
+|---|---|
+|Filepath to 'QnnSaver.dll' or 'libQnnSaver.so'|File path to the QNN Saver backend library. Dumps QNN API calls to disk for replay/debugging.|
+

-|`provider_options_values` for `provider_options_keys = "qnn_context_priority"`|[Description](https://docs.qualcomm.com/bundle/publicresource/topics/80-63442-50/htp_yielding.html)|
+|`"qnn_context_priority"`|[Description](https://docs.qualcomm.com/bundle/publicresource/topics/80-63442-50/htp_yielding.html)|
 |---|---|
 |'low'||
 |'normal'|default.|
@@ -71,13 +88,286 @@ The QNN Execution Provider supports a number of configuration options. The `prov
 |'high'||

-|`provider_options_values` for `provider_options_keys = "htp_graph_finalization_optimization_mode"`|Description|
+|`"htp_graph_finalization_optimization_mode"`|Description|
 |---|---|
 |'0'|default.|
 |'1'|faster preparation time, less optimal graph.|
 |'2'|longer preparation time, more optimal graph.|
 |'3'|longest preparation time, most likely even more optimal graph.|

+|`"soc_model"`|Description|
+|---|---|
+|Model number (string)|The SoC model number. Refer to the QNN SDK documentation for valid values. Defaults to "0" (unknown).|
+
+|`"htp_arch"`|Description|
+|---|---|
+|"0"|Default (none)|
+|"68"||
+|"69"||
+|"73"||
+|"75"||
+
+|`"device_id"`|Description|
+|---|---|
+|Device ID (string)|The ID of the device to use when setting `htp_arch`. Defaults to "0" (for a single device).|
+
+## Supported ONNX operators
+
+|Operator|Notes|
+|---|---|
+|ai.onnx:Abs||
+|ai.onnx:Add||
+|ai.onnx:And||
+|ai.onnx:ArgMax||
+|ai.onnx:ArgMin||
+|ai.onnx:Asin||
+|ai.onnx:Atan||
+|ai.onnx:AveragePool||
+|ai.onnx:BatchNormalization||
+|ai.onnx:Cast||
+|ai.onnx:Clip||
+|ai.onnx:Concat||
+|ai.onnx:Conv||
+|ai.onnx:ConvTranspose||
+|ai.onnx:Cos||
+|ai.onnx:DepthToSpace||
+|ai.onnx:DequantizeLinear||
+|ai.onnx:Div||
+|ai.onnx:Elu||
+|ai.onnx:Equal||
+|ai.onnx:Exp||
+|ai.onnx:Expand||
+|ai.onnx:Flatten||
+|ai.onnx:Floor||
+|ai.onnx:Gather|Only supports positive indices|
+|ai.onnx:Gelu||
+|ai.onnx:Gemm||
+|ai.onnx:GlobalAveragePool||
+|ai.onnx:Greater||
+|ai.onnx:GreaterOrEqual||
+|ai.onnx:GridSample||
+|ai.onnx:HardSwish||
+|ai.onnx:InstanceNormalization||
+|ai.onnx:LRN||
+|ai.onnx:LayerNormalization||
+|ai.onnx:LeakyRelu||
+|ai.onnx:Less||
+|ai.onnx:LessOrEqual||
+|ai.onnx:Log||
+|ai.onnx:LogSoftmax||
+|ai.onnx:LpNormalization|p == 2|
+|ai.onnx:MatMul|Supported input data types on HTP backend: (uint8, uint8), (uint8, uint16), (uint16, uint8)|
+|ai.onnx:Max||
+|ai.onnx:MaxPool||
+|ai.onnx:Min||
+|ai.onnx:Mul||
+|ai.onnx:Neg||
+|ai.onnx:Not||
+|ai.onnx:Or||
+|ai.onnx:PRelu||
+|ai.onnx:Pad||
+|ai.onnx:Pow||
+|ai.onnx:QuantizeLinear||
+|ai.onnx:ReduceMax||
+|ai.onnx:ReduceMean||
+|ai.onnx:ReduceMin||
+|ai.onnx:ReduceProd||
+|ai.onnx:ReduceSum||
+|ai.onnx:Relu||
+|ai.onnx:Resize||
+|ai.onnx:Round||
+|ai.onnx:Sigmoid||
+|ai.onnx:Sign||
+|ai.onnx:Sin||
+|ai.onnx:Slice||
+|ai.onnx:Softmax||
+|ai.onnx:SpaceToDepth||
+|ai.onnx:Split||
+|ai.onnx:Sqrt||
+|ai.onnx:Squeeze||
+|ai.onnx:Sub||
+|ai.onnx:Tanh||
+|ai.onnx:Tile||
+|ai.onnx:TopK||
+|ai.onnx:Transpose||
+|ai.onnx:Unsqueeze||
+|ai.onnx:Where||
+|com.microsoft:DequantizeLinear|Provides 16-bit integer dequantization support|
+|com.microsoft:Gelu||
+|com.microsoft:QuantizeLinear|Provides 16-bit integer quantization support|
+
+Supported data types vary by operator and QNN backend. Refer to the [QNN SDK documentation](https://docs.qualcomm.com/bundle/publicresource/topics/80-63442-50/operations.html) for more information.
+
+## Running a model with QNN EP's HTP backend (Python)
+
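+At a high level, running a model on the HTP backend amounts to creating an `InferenceSession` with QNN EP enabled and the desired provider options from the tables above. A minimal sketch (here `model.qdq.onnx` stands for the quantized model generated later in this section, and the option values are illustrative):
+
+```python
+import onnxruntime
+
+# Provider options are plain key/value strings that mirror the Configuration Options tables.
+# "model.qdq.onnx" is a placeholder for the quantized model produced by the steps below.
+session = onnxruntime.InferenceSession(
+    "model.qdq.onnx",
+    providers=["QNNExecutionProvider"],
+    provider_options=[{
+        "backend_path": "QnnHtp.dll",     # use 'libQnnHtp.so' on Linux/Android
+        "htp_performance_mode": "burst",
+    }],
+)
+
+# Lists the providers registered for this session (the CPU EP is kept as a fallback).
+print(session.get_providers())
+```
+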
+The QNN HTP backend only supports quantized models. Models with 32-bit floating-point activations and weights must first be quantized to use a lower integer precision (e.g., 8-bit or 16-bit integers).
+
+This section provides instructions for quantizing a model and then running the quantized model on QNN EP's HTP backend using Python APIs. Please refer to the [quantization page](../performance/model-optimizations/quantization.md) for a broader overview of quantization concepts.
+
+### Model requirements
+QNN EP does not support models with dynamic shapes (e.g., a dynamic batch size). Dynamic shapes must be fixed to a specific value. Refer to the documentation for [making dynamic input shapes fixed](../tutorials/mobile/helpers/make-dynamic-shape-fixed.md) for more information.
+
+Additionally, QNN EP supports a subset of ONNX operators (e.g., Loops and Ifs are not supported). Refer to the [list of supported ONNX operators](./QNN-ExecutionProvider.md#supported-onnx-operators).
+
+### Generating a quantized model (x64)
+The ONNX Runtime Python package provides utilities for quantizing ONNX models via its `onnxruntime.quantization` module. The quantization utilities are currently only supported on x86_64 due to issues installing the `onnx` package on ARM64.
+Therefore, it is recommended to either use an x64 machine to quantize models or, alternatively, use a separate x64 Python installation on Windows ARM64 machines.
+
+Install the nightly ONNX Runtime x64 Python package.
+```shell
+python -m pip install -i https://aiinfra.pkgs.visualstudio.com/PublicPackages/_packaging/ORT-Nightly/pypi/simple/ ort-nightly
+```
+
+Quantization for QNN EP requires the use of calibration input data. Using a calibration dataset that is representative of typical model inputs is crucial in generating an accurate quantized model.
+
+The following snippet defines a sample `DataReader` class that generates random float32 input data. Note that using random input data will most likely produce an inaccurate quantized model.
+Refer to the [implementation of a ResNet data reader](https://github.com/microsoft/onnxruntime-inference-examples/blob/main/quantization/image_classification/cpu/resnet50_data_reader.py) for one example of how to create a `CalibrationDataReader` that provides input from image files on disk.
+
+```python
+# data_reader.py
+
+import numpy as np
+import onnxruntime
+from onnxruntime.quantization import CalibrationDataReader
+
+
+class DataReader(CalibrationDataReader):
+    def __init__(self, model_path: str):
+        self.enum_data = None
+
+        # Use inference session to get input shape.
+        session = onnxruntime.InferenceSession(model_path, providers=['CPUExecutionProvider'])
+
+        inputs = session.get_inputs()
+
+        self.data_list = []
+
+        # Generate 10 random float32 inputs
+        # TODO: Load valid calibration input data for your model
+        for _ in range(10):
+            input_data = {inp.name: np.random.random(inp.shape).astype(np.float32) for inp in inputs}
+            self.data_list.append(input_data)
+
+        self.datasize = len(self.data_list)
+
+    def get_next(self):
+        if self.enum_data is None:
+            self.enum_data = iter(
+                self.data_list
+            )
+        return next(self.enum_data, None)
+
+    def rewind(self):
+        self.enum_data = None
+
+```
+
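+Optionally, the data reader can be sanity-checked before quantization. A minimal sketch, assuming the class above is saved as `data_reader.py` and `model.onnx` is the model to be quantized:
+
+```python
+# Optional: sanity-check the data reader before running quantization.
+import data_reader
+
+reader = data_reader.DataReader("model.onnx")  # "model.onnx" is the model you intend to quantize
+
+# get_next() returns one calibration sample (a dict of input name -> numpy array),
+# or None once all samples have been consumed.
+sample = reader.get_next()
+print({name: array.shape for name, array in sample.items()})
+
+# rewind() resets the reader so the quantization tooling can iterate over the samples again.
+reader.rewind()
+```
+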
+The following snippet pre-processes the original model and then quantizes the pre-processed model to use `uint16` activations and `uint8` weights.
+Although the quantization utilities expose the `uint8`, `int8`, `uint16`, and `int16` quantization data types, QNN operators typically support the `uint8` and `uint16` data types.
+Refer to the [QNN SDK operator documentation](https://docs.qualcomm.com/bundle/publicresource/topics/80-63442-50/HtpOpDefSupplement.html) for the data type requirements of each QNN operator.
+
+```python
+# quantize_model.py
+
+import data_reader
+from onnxruntime.quantization import QuantType, quantize
+from onnxruntime.quantization.execution_providers.qnn import get_qnn_qdq_config, qnn_preprocess_model
+
+if __name__ == "__main__":
+    input_model_path = "model.onnx"  # TODO: Replace with your actual model
+    output_model_path = "model.qdq.onnx"  # Name of final quantized model
+    my_data_reader = data_reader.DataReader(input_model_path)
+
+    # Pre-process the original float32 model.
+    preproc_model_path = "model.preproc.onnx"
+    model_changed = qnn_preprocess_model(input_model_path, preproc_model_path)
+    model_to_quantize = preproc_model_path if model_changed else input_model_path
+
+    # Generate a suitable quantization configuration for this model.
+    # Note that we're choosing to use uint16 activations and uint8 weights.
+    qnn_config = get_qnn_qdq_config(model_to_quantize,
+                                    my_data_reader,
+                                    activation_type=QuantType.QUInt16,  # uint16 activations
+                                    weight_type=QuantType.QUInt8)       # uint8 weights
+
+    # Quantize the model.
+    quantize(model_to_quantize, output_model_path, qnn_config)
+```
+
+Running `python quantize_model.py` will generate a quantized model called `model.qdq.onnx` that can be run on Windows ARM64 devices via ONNX Runtime's QNN EP.
+
+Refer to the following pages for more information on usage of the quantization utilities:
+- [Quantization example for mobilenet on CPU EP](https://github.com/microsoft/onnxruntime-inference-examples/tree/main/quantization/image_classification/cpu)
+- [quantization/execution_providers/qnn/preprocess.py](https://github.com/microsoft/onnxruntime/blob/23996bbbbe0406a5c8edbf6b7dbd71e5780d3f4b/onnxruntime/python/tools/quantization/execution_providers/qnn/preprocess.py#L16)
+- [quantization/execution_providers/qnn/quant_config.py](https://github.com/microsoft/onnxruntime/blob/23996bbbbe0406a5c8edbf6b7dbd71e5780d3f4b/onnxruntime/python/tools/quantization/execution_providers/qnn/quant_config.py#L20-L27)
+
+### Running a quantized model on Windows ARM64
+The following assumes that the [Qualcomm AI Engine SDK (QNN SDK)](https://qpm.qualcomm.com/main/tools/details/qualcomm_ai_engine_direct) has already been downloaded and installed to a location such as `C:\Qualcomm\AIStack\QNN\2.18.0.240101`, hereafter referred to as `QNN_SDK`.
+
+First, determine the HTP architecture version for your device by referring to the QNN SDK documentation:
+- QNN_SDK\docs\QNN\general\htp\htp_backend.html#qnn-htp-backend-api
+- QNN_SDK\docs\QNN\general\overview.html#supported-snapdragon-devices
+
+For example, Snapdragon 8cx Gen 3 (SC8280X) devices have an HTP architecture value of 68, and Snapdragon 8cx Gen 4 (SC8380XP) devices have an HTP architecture value of 73. In the following, replace `