[Performance] NnapiExecutionProvider defers all nodes to CPUExecutionProvider #18571
Comments
QuantFormat.QOperator uses custom operators that are not implemented by all execution providers. QDQ format is more generic, as it uses official ONNX operators wrapped in DQ/Q nodes, which allows an EP like the NNAPI EP to convert those to the quantized equivalent. Is there a reason not to use QDQ format? I believe the CPU EP should be able to convert the QDQ node units into the equivalent QOperator at runtime.
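For comparison, a QDQ-format call to quantize_static only changes the quant_format argument. A minimal sketch, assuming an existing CalibrationDataReader instance dr and a hypothetical output file name:

```python
# Sketch: static quantization in QDQ format. The output file name is a
# placeholder, and `dr` is assumed to be an existing CalibrationDataReader.
from onnxruntime.quantization import QuantFormat, quantize_static

quantize_static(
    model_input="mobilenet_v2_infer_float.onnx",
    model_output="mobilenet_v2_infer_uint8_qdq.onnx",  # hypothetical output name
    calibration_data_reader=dr,
    quant_format=QuantFormat.QDQ,  # official ONNX ops wrapped in Q/DQ nodes
)
```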
Since the reference model from the web seems to actually experience NPU acceleration, I tried to align our model with the reference, and the reference model happens to be in QOperator format. So really I am just trying to reduce the number of differences between our model and the reference, in hopes of getting our model to run efficiently on the NPU. From my understanding, the custom operators introduced by the QOperator format should be implemented for the

I have also tried converting our .onnx model into an .ort model. Unfortunately the model did not show any improvement, neither with the

I also stumbled across a PyTorch tutorial detailing how to prepare a PyTorch model for NNAPI execution. Could something like this be necessary?
I was able to reproduce this with the

FYI, there is more log output from the NNAPI EP that goes to the default logger. From Python, you can enable this with

With the default logger's verbose output, we can see some info about why the ops are not supported:
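A minimal sketch of enabling that verbose output from Python, using the standard onnxruntime logging settings (severity 0 = VERBOSE):

```python
# Sketch: verbose logging so the NNAPI EP reports why ops are not supported.
import onnxruntime as ort

ort.set_default_logger_severity(0)  # default logger, 0 = VERBOSE

so = ort.SessionOptions()
so.log_severity_level = 0  # session logger: shows node placement per EP

sess = ort.InferenceSession(
    "mobilenet_v2_infer_uint8_op_oriented.onnx",
    sess_options=so,
    providers=["NnapiExecutionProvider", "CPUExecutionProvider"],
)
```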
Input type 3 corresponds to int8. It looks like uint8 would be supported (see onnxruntime/onnxruntime/core/providers/nnapi/nnapi_builtin/builders/helper.cc, lines 166 to 182 at commit b9fd9c5).
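One way to check which element types the quantized model actually uses is to inspect it with the onnx Python package; a small sketch (TensorProto.INT8 is the enum value 3 mentioned above):

```python
# Sketch: list int8/uint8 initializers in the quantized model.
import onnx
from onnx import TensorProto

model = onnx.load("mobilenet_v2_infer_uint8_op_oriented.onnx")
for init in model.graph.initializer:
    if init.data_type in (TensorProto.INT8, TensorProto.UINT8):
        print(init.name, TensorProto.DataType.Name(init.data_type))
```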
You struck gold with that observation! First of all, thanks for the hint with

You were right in assuming UInt8 might solve the issue. I discovered that the main function used for quantization does allow setting the activation and weight type:

```python
def quantize_static(
    model_input,
    model_output,
    calibration_data_reader: CalibrationDataReader,
    quant_format=QuantFormat.QDQ,
    op_types_to_quantize=None,
    per_channel=False,
    reduce_range=False,
    activation_type=QuantType.QInt8,
    weight_type=QuantType.QInt8,
    nodes_to_quantize=None,
    nodes_to_exclude=None,
    optimize_model=True,
    use_external_data_format=False,
    calibrate_method=CalibrationMethod.MinMax,
    extra_options=None,
):
```

```python
class QuantType(Enum):
    QInt8 = 0
    QUInt8 = 1
```

Setting those two parameters to QuantType.QUInt8:

```python
quantize_static(model_input='mobilenet_v2_infer_float.onnx',
                model_output='mobilenet_v2_infer_uint8_op_oriented.onnx',
                calibration_data_reader=dr,
                quant_format=QuantFormat.QOperator,
                activation_type=QuantType.QUInt8,
                weight_type=QuantType.QUInt8)
```

With that true UInt8 model I achieve CPU inference in 62 ms and NPU inference in 23 ms.
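For completeness, the dr passed to quantize_static above is the calibration data reader, presumably defined in the attached quantize_float_to_uint8.py. A minimal stand-in might look like this (input name, shape, and random calibration data are assumptions, not the script's actual contents):

```python
# Sketch of a calibration data reader; real calibration should feed
# representative preprocessed images instead of random tensors.
import numpy as np
from onnxruntime.quantization import CalibrationDataReader

class MobilenetDataReader(CalibrationDataReader):
    def __init__(self, input_name="input", num_samples=32):
        self.samples = iter(
            {input_name: np.random.rand(1, 3, 224, 224).astype(np.float32)}
            for _ in range(num_samples)
        )

    def get_next(self):
        # One input dict per call; None signals that calibration data is exhausted.
        return next(self.samples, None)

dr = MobilenetDataReader()
```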
I should now have a working pipeline to convert our existing PyTorch-based .onnx models into UInt8 models that are ready to be run on the NnapiExecutionProvider.

Thank you for pointing me in the right direction to solve this issue, and thanks to everyone for your helpful insights!
Great! I'll go ahead and close this issue. Feel free to open a new one if you have other questions. I didn't find docs for
And yes, from a quick look at the code I believe |
Describe the issue
Goal
We are trying to utilize the NPU of our NXP i.MX8MP SoC. Our goal is to accelerate several ONNX models that previously ran on the CPU. For that we want to establish a pipeline to convert our non-quantized, FP32, PyTorch-based ONNX models to models we can run on the NPU of our SoC using the NnapiExecutionProvider of onnxruntime.

Setup
Issue
Development of the pipeline is based on an example from Microsoft. This example first exports a PyTorch-trained MobileNet V2 model into a non-quantized, FP32 ONNX model. Afterwards it quantizes the model to UInt8.

I tried to follow this example with the exact same MobileNet V2 model to get a UInt8-quantized model that can run efficiently on the NPU. However, the resulting model is exactly as fast as on the CPU. When I use a MobileNet V2 model from the internet that was already UInt8-quantized, I do observe a significant performance improvement over the CPU.
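For context, the export step of that pipeline presumably boils down to something like the following (a sketch only; weights, input names, and shapes are assumptions, not taken from the attached scripts):

```python
# Sketch: export a pretrained MobileNet V2 to a non-quantized FP32 ONNX model.
import torch
import torchvision

model = torchvision.models.mobilenet_v2(weights="DEFAULT").eval()
dummy_input = torch.randn(1, 3, 224, 224)

torch.onnx.export(
    model,
    dummy_input,
    "mobilenet_v2_infer_float.onnx",
    opset_version=17,  # matches the opset change listed in the comparison below
    input_names=["input"],
    output_names=["output"],
)
```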
My question therefore is: Am I doing something wrong in my quantization/conversion pipeline?
Comparison of the two models
- Reference: mobilenet_v2_uint8.v5.ort
  - Almost all nodes get placed on the NnapiExecutionProvider. Only one node (QLinearGlobalAveragePool) gets placed on the CPUExecutionProvider.
- Our model: mobilenet_v2_infer_uint8_op_oriented.onnx
  - Differences to the Microsoft example: opset_version of torch.onnx.export changed from 12 to 17, quant_format of quantize_static changed from QuantFormat.QDQ to QuantFormat.QOperator.
  - All nodes get placed on the CPUExecutionProvider.
Observations
- For our model, the NnapiExecutionProvider defers all nodes to the CPUExecutionProvider.
- For the reference model, Nnapi_... nodes are shown to be placed on the NnapiExecutionProvider. As far as I know the NnapiExecutionProvider compiles the model for execution on the NPU. I therefore assume the two Nnapi_... nodes are the two compiled graph partitions.
- zero_point in our model is often very different (sometimes even negative).
- Flatten instead of Reshape.
- QGemm instead of QLinearMatMul and QLinearAdd (op types for both models can be compared with the sketch after this list).
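The operator-level differences above can be surfaced quickly by counting op types in both graphs; a sketch using the onnx package (the reference file name is hypothetical, since the attached reference is an .ort file that would first need an .onnx counterpart):

```python
# Sketch: compare operator counts between our model and the reference model.
from collections import Counter
import onnx

def op_counts(path):
    return Counter(node.op_type for node in onnx.load(path).graph.node)

ours = op_counts("mobilenet_v2_infer_uint8_op_oriented.onnx")
reference = op_counts("mobilenet_v2_uint8.v5.onnx")  # hypothetical .onnx version of the reference
print("only in ours:     ", ours - reference)
print("only in reference:", reference - ours)
```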
Attachments
Models
Contains mobilenet_v2_uint8.v5.ort and mobilenet_v2_infer_uint8_op_oriented.onnx.
mobilenet_v2_models.zip
mobilenet_v2_uint8.v5.ort.png
mobilenet_v2_infer_uint8_op_oriented.onnx.png
Execution logs
exec_log_our_model_cpu.txt
exec_log_our_model_npu.txt
exec_log_reference_model_cpu.txt
exec_log_reference_model_npu.txt
Python scripts
Contains quantize_float_to_uint8.py and run_onnx_sample.py.
python_scripts.zip
To reproduce
1. python3 quantize_float_to_uint8.py
2. python3 run_onnx_sample.py --model mobilenet_v2_infer_uint8_op_oriented.onnx --image cat.jpg --pu npu (see the sketch below)

Necessary scripts and files are in reproduce.zip.
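The core of run_onnx_sample.py is presumably along these lines (a guess at the essentials; preprocessing and the input name are assumptions, not the actual script):

```python
# Sketch: run the quantized model with the NNAPI EP, falling back to the CPU EP.
import numpy as np
import onnxruntime as ort

sess = ort.InferenceSession(
    "mobilenet_v2_infer_uint8_op_oriented.onnx",
    providers=["NnapiExecutionProvider", "CPUExecutionProvider"],
)

image = np.random.rand(1, 3, 224, 224).astype(np.float32)  # stand-in for a preprocessed cat.jpg
outputs = sess.run(None, {sess.get_inputs()[0].name: image})
print("predicted class:", int(np.argmax(outputs[0])))
```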
Urgency
No response
Platform
Linux
OS Version
Linux Yocto Mickledore 4.2
ONNX Runtime Installation
Built from Source
ONNX Runtime Version or Commit ID
1.13.1
ONNX Runtime API
Python
Architecture
ARM64
Execution Provider
NNAPI
Execution Provider Library Version
No response
Model File
mobilenet_v2_models.zip
Is this a quantized model?
Yes