[Performance] The 16-bit quantization QDQ model cannot be accelerated by CUDA #21478

duanshengliu · 2024-07-24T14:59:19Z

Describe the issue

My GPU is V100 CUDA Version: 12.0 or 11.8
CPU is Intel(R) Xeon(R) Gold 6271C CPU @ 2.60GHz

I tested the performance of A8W8 and A16W16 quantization models on CPU and CUDA respectively. The performance of A16W16 quantization model on CUDA is even worse than that of CPU.

Summary:

Total Inference Time(s)（repeat=100）	A8W8	A16W16
CPUExecutionProvider	6.698 s ✔️	30.961 s ✔️
CUDAExecutionProvider	3.870 s ✔️	42.365 s ❓

Moreover, The A16W8 or A8W16 quantization models also have the similar issues.

To reproduce

This issue can be reproduced by using the relevant files in performance.zip. The reproduction commands and results are as follows,

cd path/to/performance
python run.py

then you will receive the following results:

mobilenetv2_a8w8.onnx ['CPUExecutionProvider'] Total Inference Time: 6.698 seconds
mobilenetv2_a8w8.onnx ['CUDAExecutionProvider'] Total Inference Time: 3.870 seconds
================================================================================
mobilenetv2_a16w16.onnx ['CPUExecutionProvider'] Total Inference Time: 30.961 seconds
mobilenetv2_a16w16.onnx ['CUDAExecutionProvider'] Total Inference Time: 42.365 seconds
================================================================================

Urgency

Urgent

Platform

Linux

OS Version

Ubuntu 22.04

ONNX Runtime Installation

Released Package

ONNX Runtime Version or Commit ID

1.18.1

ONNX Runtime API

Python

Architecture

X64

Execution Provider

CUDA

Execution Provider Library Version

CUDA12/CUDA11.8

Model File

No response

Is this a quantized model?

Yes

The text was updated successfully, but these errors were encountered:

duanshengliu · 2024-07-27T03:06:35Z

@snnn @skottmckay @tianleiwu Could you take a look at this issue

github-actions · 2024-08-26T15:00:59Z

This issue has been automatically marked as stale due to inactivity and will be closed in 30 days if no further activity occurs. If further support is needed, please provide an update and/or more details.

github-actions bot added ep:CUDA issues related to the CUDA execution provider quantization issues related to quantization labels Jul 24, 2024

sophies927 added the performance issues related to performance regressions label Jul 25, 2024

github-actions bot added the stale issues that have not been addressed in a while; categorized by a bot label Aug 26, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[Performance] The 16-bit quantization QDQ model cannot be accelerated by CUDA #21478

[Performance] The 16-bit quantization QDQ model cannot be accelerated by CUDA #21478

duanshengliu commented Jul 24, 2024 •

edited

Loading

duanshengliu commented Jul 27, 2024

github-actions bot commented Aug 26, 2024

[Performance] The 16-bit quantization QDQ model cannot be accelerated by CUDA #21478

[Performance] The 16-bit quantization QDQ model cannot be accelerated by CUDA #21478

Comments

duanshengliu commented Jul 24, 2024 • edited Loading

Describe the issue

To reproduce

Urgency

Platform

OS Version

ONNX Runtime Installation

ONNX Runtime Version or Commit ID

ONNX Runtime API

Architecture

Execution Provider

Execution Provider Library Version

Model File

Is this a quantized model?

duanshengliu commented Jul 27, 2024

github-actions bot commented Aug 26, 2024

duanshengliu commented Jul 24, 2024 •

edited

Loading