
Inference speed problem even when using high-end hardware #19865

Open
Deckard-2049 opened this issue Mar 12, 2024 · 3 comments

Comments

@Deckard-2049

Describe the issue

We trained an Ultralytics YOLOv8 model on 1024×1024, 3-channel images, converted it to ONNX, and ran the ONNX model from Visual Studio 2022 (C#, .NET Framework 4.8) with onnxruntime-gpu v1.16.3. Inference takes around 90 ms on an A5000 GPU.
We also tried different ONNX Runtime session options: graph optimization level, inter_op_num_threads, intra_op_num_threads, execution mode (ORT_PARALLEL and ORT_SEQUENTIAL), and memory-pattern optimization (enable_mem_pattern), but none of them made any difference to the inference time. A minimal sketch of the kind of session configuration we tried is shown below.
So can anyone suggest what we might be missing, or how we can reduce the time further, even a little?
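
For reference, a minimal sketch of the session configuration we experimented with (shown with the Python API for brevity; the C# SessionOptions properties we set are the equivalents). The model file name and thread counts are placeholders, not our production values:

```python
import onnxruntime as ort

# Session options we experimented with; none of them changed the latency.
so = ort.SessionOptions()
so.graph_optimization_level = ort.GraphOptimizationLevel.ORT_ENABLE_ALL
so.execution_mode = ort.ExecutionMode.ORT_SEQUENTIAL  # also tried ORT_PARALLEL
so.intra_op_num_threads = 4   # illustrative value
so.inter_op_num_threads = 1   # illustrative value
so.enable_mem_pattern = True

# CUDA execution provider (A5000, device 0), falling back to CPU.
session = ort.InferenceSession(
    "yolov8_1024.onnx",  # placeholder file name
    sess_options=so,
    providers=["CUDAExecutionProvider", "CPUExecutionProvider"],
)
```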

To reproduce

Nothing to mention.

Urgency

Yes, it's urgent. Please help.

Platform

Windows

OS Version

10

ONNX Runtime Installation

Built from Source

ONNX Runtime Version or Commit ID

1.16.3

ONNX Runtime API

C#

Architecture

X64

Execution Provider

CUDA

Execution Provider Library Version

No response

Model File

No response

Is this a quantized model?

No

@github-actions github-actions bot added ep:CUDA issues related to the CUDA execution provider platform:windows issues related to the Windows platform labels Mar 12, 2024
@yuslepukhin
Member

Can you share the converted model?

Also, anyone who uses the C# API would benefit from reading this.

@Deckard-2049
Author

Deckard-2049 commented Mar 13, 2024

Actually, we ran this converted model on an RTX 4090 and the inference time was 35 ms, but when we run it on an RTX A5000 we get somewhere around 90 ms. We are using the same CUDA version (11.2) on both devices. We want to deploy this model on the RTX A5000 with an acceptable inference time of 35-40 ms. What could be causing this difference in inference time?

Also, the ONNX model has NMS embedded in it. We wrote a script to embed the NMS inside the ONNX model; I have attached the script file here for reference, and a rough sketch of the approach follows below.
adding_nms.py-20240313T133344Z-001.zip
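
The attached script is the actual version we use; the following is only a simplified, self-contained sketch of the idea, assuming the exported graph already exposes "boxes" and "scores" tensors in the layout NonMaxSuppression expects (the real script also splits/transposes the raw YOLOv8 output into that form):

```python
import onnx
from onnx import TensorProto, helper

model = onnx.load("yolov8_1024.onnx")  # placeholder file name
graph = model.graph

# NMS thresholds as initializers; the values here are illustrative.
graph.initializer.extend([
    helper.make_tensor("max_output_boxes_per_class", TensorProto.INT64, [1], [300]),
    helper.make_tensor("iou_threshold", TensorProto.FLOAT, [1], [0.7]),
    helper.make_tensor("score_threshold", TensorProto.FLOAT, [1], [0.25]),
])

# ONNX NonMaxSuppression expects boxes of shape [batch, num_boxes, 4] and
# scores of shape [batch, num_classes, num_boxes]; "boxes" and "scores" are
# assumed to already exist in the graph in that layout.
graph.node.append(helper.make_node(
    "NonMaxSuppression",
    inputs=["boxes", "scores",
            "max_output_boxes_per_class", "iou_threshold", "score_threshold"],
    outputs=["selected_indices"],
))
graph.output.append(
    helper.make_tensor_value_info("selected_indices", TensorProto.INT64, [None, 3])
)

onnx.checker.check_model(model)
onnx.save(model, "yolov8_with_nms.onnx")
```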

Unfortunately, I cannot share the model; our company policy restricts the sharing of proprietary models. So any suggestions or pointers on the questions above would be of great help.

@yuslepukhin
Member

https://onnxruntime.ai/docs/performance/tune-performance/

@sophies927 sophies927 removed the ep:CUDA issues related to the CUDA execution provider label Mar 14, 2024
@sophies927 sophies927 removed the platform:windows issues related to the Windows platform label Mar 21, 2024