[Performance] fp16 support and performance #22242

Open
cbingdu opened this issue Sep 27, 2024 · 6 comments
Labels
performance issues related to performance regressions platform:mobile issues related to ONNX Runtime mobile; typically submitted using template

Comments

@cbingdu

cbingdu commented Sep 27, 2024

Describe the issue

FP16 model inference is slower than FP32 inference. Does FP16 inference require additional configuration, or is converting the model to FP16 enough?

To reproduce

Convert the ONNX model from FP32 to FP16 using onnxmltools.
Run inference with the ONNX Runtime C++ library (converting the input and output data from FP32 to FP16).
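
For readers reproducing the second step, the FP32-to-FP16 conversion of input buffers can be sketched as follows. This is a standard-library-only illustration and not the reporter's code: the function names `float_to_half_bits`, `half_bits_to_float`, and `to_half_buffer` are hypothetical, and the sketch assumes IEEE 754 binary16 with round-to-nearest-even. ONNX Runtime's C++ header also provides an `Ort::Float16_t` type that can be used instead of hand-rolled bit manipulation.

```cpp
#include <cstdint>
#include <cstring>
#include <vector>

// Convert one FP32 value to FP16 bits (round-to-nearest-even).
std::uint16_t float_to_half_bits(float value) {
  std::uint32_t f;
  std::memcpy(&f, &value, sizeof f);
  const std::uint32_t sign = (f >> 16) & 0x8000u;
  const std::uint32_t exp = (f >> 23) & 0xFFu;
  const std::uint32_t mant = f & 0x7FFFFFu;

  std::uint32_t half = 0u;
  if (exp == 0xFFu) {
    half = mant ? 0x7E00u : 0x7C00u;  // NaN stays a (quiet) NaN, Inf stays Inf
  } else if (exp > 142u) {
    half = 0x7C00u;  // magnitude too large for FP16: overflow to Inf
  } else if (exp >= 102u) {
    std::uint32_t rem, halfway;
    if (exp >= 113u) {  // result is a normal FP16 number
      half = ((exp - 112u) << 10) | (mant >> 13);
      rem = mant & 0x1FFFu;
      halfway = 0x1000u;
    } else {  // result is an FP16 subnormal
      const std::uint32_t sig = mant | 0x800000u;  // restore the implicit 1
      const std::uint32_t shift = 126u - exp;      // 14..24
      half = sig >> shift;
      rem = sig & ((1u << shift) - 1u);
      halfway = 1u << (shift - 1u);
    }
    // Round to nearest, ties to even; a carry propagates into the exponent.
    if (rem > halfway || (rem == halfway && (half & 1u))) ++half;
  }  // else: below half the smallest FP16 subnormal, rounds to zero
  return static_cast<std::uint16_t>(sign | half);
}

// Convert FP16 bits back to FP32 (exact, no rounding needed).
float half_bits_to_float(std::uint16_t h) {
  const std::uint32_t sign = (h & 0x8000u) << 16;
  std::uint32_t exp = (h >> 10) & 0x1Fu;
  std::uint32_t mant = h & 0x3FFu;
  std::uint32_t f;
  if (exp == 0x1Fu) {
    f = sign | 0x7F800000u | (mant << 13);  // Inf or NaN
  } else if (exp != 0u) {
    f = sign | ((exp + 112u) << 23) | (mant << 13);  // normal
  } else if (mant == 0u) {
    f = sign;  // signed zero
  } else {
    exp = 113u;  // subnormal: shift until the implicit 1 appears
    while ((mant & 0x400u) == 0u) { mant <<= 1; --exp; }
    f = sign | (exp << 23) | ((mant & 0x3FFu) << 13);
  }
  float out;
  std::memcpy(&out, &f, sizeof out);
  return out;
}

// Convert a whole FP32 input buffer to FP16 bits before building a tensor.
std::vector<std::uint16_t> to_half_buffer(const std::vector<float>& src) {
  std::vector<std::uint16_t> dst;
  dst.reserve(src.size());
  for (float v : src) dst.push_back(float_to_half_bits(v));
  return dst;
}
```

Note that a per-element conversion like this runs on every inference call, so when comparing FP16 and FP32 timings it may be worth measuring the conversion separately from the session run.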

Urgency

No response

Platform

Android

OS Version

34

ONNX Runtime Installation

Released Package

ONNX Runtime Version or Commit ID

1.18.0

ONNX Runtime API

C++

Architecture

ARM64

Execution Provider

Default CPU

Execution Provider Library Version

No response

Model File

No response

Is this a quantized model?

Yes

@cbingdu cbingdu added the performance issues related to performance regressions label Sep 27, 2024
@github-actions github-actions bot added the platform:mobile issues related to ONNX Runtime mobile; typically submitted using template label Sep 27, 2024
@wejoncy
Contributor

wejoncy commented Sep 27, 2024

Please share more information about your question, such as which platform and which EP you use.
What does the model look like?

@DakeQQ

DakeQQ commented Sep 30, 2024

I've encountered the same issue: when running inference on *.onnx Float16 models (such as LLM, YOLO, VAE, UNet, BERT...) directly on the CPU, there is no noticeable speedup. This is a significant problem because one would expect a performance gain from using Float16.

Moreover, if you convert the model to the *.ort format, the conversion tool automatically inserts a Cast operator that converts FP16 back to FP32. This automatic conversion completely negates any potential acceleration benefits we might have gained by using the NNAPI runtime with FP16 after converting to *.ort.

Given that CPUs supporting Arm64-v8.2 and later versions do indeed support FP16 computations, I would greatly appreciate it if ONNX Runtime could prioritize the implementation of ARM-CPU-FP16 support. This feature would be highly beneficial for many users and would significantly improve the efficiency of mobile models.
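
Half-precision arithmetic is an optional extension in Armv8.2-A, so it can be worth confirming at runtime that a given device actually reports it. The following is a minimal sketch, assuming a Linux or Android arm64 target: the helper names `hwcap_has_fp16` and `cpu_has_fp16_arithmetic` are hypothetical, and the bit positions are copied from the arm64 Linux `HWCAP_FPHP` and `HWCAP_ASIMDHP` definitions. It only reports what the kernel advertises and says nothing about whether ONNX Runtime's CPU kernels use those instructions.

```cpp
#if defined(__linux__) && defined(__aarch64__)
#include <sys/auxv.h>  // getauxval, AT_HWCAP
#endif

// Bit positions of HWCAP_FPHP (scalar FP16) and HWCAP_ASIMDHP (SIMD FP16)
// as defined for arm64 Linux; declared locally so the sketch compiles anywhere.
constexpr unsigned long kHwcapFphp = 1UL << 9;
constexpr unsigned long kHwcapAsimdhp = 1UL << 10;

// True when both scalar and SIMD half-precision arithmetic bits are set.
bool hwcap_has_fp16(unsigned long hwcap) {
  return (hwcap & kHwcapFphp) != 0 && (hwcap & kHwcapAsimdhp) != 0;
}

// Query the running CPU; returns false on targets other than arm64 Linux/Android.
bool cpu_has_fp16_arithmetic() {
#if defined(__linux__) && defined(__aarch64__)
  return hwcap_has_fp16(getauxval(AT_HWCAP));
#else
  return false;
#endif
}
```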

@skottmckay
Contributor

We're working on adding more FP16 support on arm64, as well as GPU support (which would also handle FP16 models).

Contributor

This issue has been automatically marked as stale due to inactivity and will be closed in 30 days if no further activity occurs. If further support is needed, please provide an update and/or more details.

@github-actions github-actions bot added the stale issues that have not been addressed in a while; categorized by a bot label Oct 30, 2024
@leigaol

leigaol commented Oct 30, 2024

+1 to this

@github-actions github-actions bot removed the stale issues that have not been addressed in a while; categorized by a bot label Oct 31, 2024
@devYonz

devYonz commented Nov 25, 2024

+1
