[Performance] In ONNX Runtime, the CPU consumption does not scale linearly with the number of threads #19384

Open
bluishwhite opened this issue Feb 2, 2024 · 9 comments
Labels
core runtime issues related to core runtime performance issues related to performance regressions

Comments

@bluishwhite

bluishwhite commented Feb 2, 2024

Hello, I have run into a problem with ONNX Runtime in C++.

The program has only one ONNX model. Each new thread issues its own session->Run() call. With 4 threads handling 4 requests, the program uses 1 CPU and the RTF is 1.0.
When I limit the CPU cores to 4 and use 16 threads to handle 16 requests, the RTF ranges from 2.19 to 3.7, and the average RTF is around 3.2.
The session options are:
session_options_.SetIntraOpNumThreads(1);

Following the issue "OnnxRuntime multithreading efficiency is poor", I changed the session options to:
session_options_.SetIntraOpNumThreads(1);
session_options_.SetInterOpNumThreads(1);
session_options_.DisableMemPattern();
session_options_.SetExecutionMode(ORT_SEQUENTIAL);
With these options, the average RTF is around 2.4.

The deadline is looming, and time is running out for me 😢
How can I optimize further so that CPU consumption scales more linearly with concurrency? The ideal RTF is around 1.0 (16 threads handling 16 requests on 4 CPUs).

Platform

Linux

OS Version

Ubuntu 22.04

ONNX Runtime Version or Commit ID

1.12.0

ONNX Runtime API

C++

Architecture

X64

Execution Provider

Default CPU

@yufenglee yufenglee added the core runtime issues related to core runtime label Feb 2, 2024
@yufenglee
Member

As you only have 4 cores, why do you create 16 threads?

@pranavsharma
Contributor

First, you're using a version of ORT that is 4 releases old. Second, as Yufeng said above, it's not clear why you've 16 threads on a 4 core machine. What is rtf?

@bluishwhite
Author

Thanks for your reply.
@yufenglee Constrained by resources, I aim to utilize as few CPUs as possible to support as many concurrent threads as feasible. Upon testing my program, one CPU can accommodate 4 threads. If the relationship between CPUs and threads is proportional, then 4 CPUs can sustain 16 threads.
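
One way to test whether the oversubscription itself matters, following the question about running 16 threads on 4 cores, is to keep 16 requests but cap the number of worker threads. The sketch below uses only the C++ standard library; `run_bounded` and `handle_request` are hypothetical names, and `handle_request` stands in for whatever per-request work the program does (for example, one inference call). It is an illustration under those assumptions, not the approach used in the original program.

```cpp
#include <algorithm>
#include <atomic>
#include <cassert>
#include <cstddef>
#include <functional>
#include <thread>
#include <vector>

// Process `num_requests` requests with at most `max_workers` threads.
// `handle_request` is a hypothetical stand-in for the per-request work;
// it must be safe to call concurrently from several threads.
// Returns the number of worker threads actually started.
std::size_t run_bounded(std::size_t num_requests,
                        std::size_t max_workers,
                        const std::function<void(std::size_t)>& handle_request) {
    assert(static_cast<bool>(handle_request));
    // At least one worker, but never more workers than requests.
    const std::size_t workers =
        std::max<std::size_t>(1, std::min(num_requests, max_workers));
    std::atomic<std::size_t> next{0};
    std::vector<std::thread> pool;
    pool.reserve(workers);
    for (std::size_t w = 0; w < workers; ++w) {
        pool.emplace_back([&] {
            // Each worker repeatedly claims the next unprocessed index.
            for (std::size_t i = next.fetch_add(1); i < num_requests;
                 i = next.fetch_add(1)) {
                handle_request(i);
            }
        });
    }
    for (auto& t : pool) t.join();
    return workers;
}
```

For the setup described here, this would be called as `run_bounded(16, 4, handle_request)`. Using `std::thread::hardware_concurrency()` for `max_workers` is possible, but it may return 0 and may not reflect a container's CPU quota, so an explicit value is safer.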

Besides, I found that when I use Docker to create several containers, each running my onnxruntime program on a different processor with different CPU core IDs, the CPU loads of the containers affect each other as the number of containers increases. With two containers, the CPU usage of each container is around 80%; with three containers, it is 90%; with four, it is around 100%.

@pranavsharma RTF (Real Time Factor) = total_time_taken / total_audio, which serves as a performance evaluation metric. The lower the RTF, the better the performance.
Yes, this version of onnxruntime is too old. I will switch to a newer version.
Thank you.
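
As a minimal sketch of the metric, assuming the convention in which lower is better (processing time divided by audio duration), the RTF can be computed as follows. The function name `real_time_factor` is hypothetical and is not part of ONNX Runtime.

```cpp
#include <cassert>
#include <chrono>

// Hypothetical helper: RTF = processing time / audio duration.
// Values below 1.0 mean faster than real time; lower is better.
double real_time_factor(std::chrono::duration<double> processing_time,
                        std::chrono::duration<double> audio_duration) {
    assert(audio_duration.count() > 0.0);
    return processing_time.count() / audio_duration.count();
}
```

Under this convention, 10 seconds of audio processed in 20 seconds gives an RTF of 2.0.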

@poor1017

poor1017 commented Feb 4, 2024

@yufenglee Hi,
We encountered a similar problem. We bound container A, which runs an onnxruntime program, to one CPU processor, and container B, which runs the same onnxruntime program, to another CPU processor. If container A or container B runs alone, its CPU load stays at about 50%, but if they run at the same time, the CPU load of each rises to 80%.

We measured the CPU cycles of session.Run() and found that it was the main cause of increased CPU load.

Are there any Ort configuration options that can eliminate this impact between containers?

@sophies927 sophies927 added the performance issues related to performance regressions label Feb 8, 2024
Contributor

This issue has been automatically marked as stale due to inactivity and will be closed in 30 days if no further activity occurs. If further support is needed, please provide an update and/or more details.

@github-actions github-actions bot added the stale issues that have not been addressed in a while; categorized by a bot label Mar 10, 2024
@radikalliberal
Contributor

Hi, before this issue gets closed:
I have the same problem. When many threads run at the same time, session.run is slow.
I thought it might be related to memory allocation for the input tensors, but I could rule that out.
There seems to be some kind of synchronization in session.run.
Can somebody from the dev team tell us why this is necessary?

@github-actions github-actions bot removed the stale issues that have not been addressed in a while; categorized by a bot label Mar 16, 2024
@poor1017

> Hi, before this issue gets closed. I also have the same problem. When running many threads at the same time session.run is slow. I thought it might have something to do with memory allocation for the input tensors but I could rule that out. There is some kind of synchronization in session.run . Can somebody of the dev team tell us why this is necessary?

In my case, it was due to the NUMA architecture. A session option such as enable_spinning_lock may help.
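
If NUMA placement is the suspect, one general Linux mechanism for experimenting is to pin each worker thread to a specific CPU so that it stays on one node. The sketch below uses the glibc `sched_setaffinity` and `sched_getaffinity` calls; the helper names `pin_current_thread_to_cpu` and `first_allowed_cpu` are hypothetical. This is not an ONNX Runtime option, and whether it helps in the cases described in this thread is untested.

```cpp
#include <sched.h>  // sched_setaffinity, sched_getaffinity, CPU_* macros (Linux, glibc)

// Pin the calling thread to a single CPU. Returns true on success.
bool pin_current_thread_to_cpu(int cpu) {
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    // A pid of 0 means "the calling thread".
    return sched_setaffinity(0, sizeof(set), &set) == 0;
}

// Return the lowest-numbered CPU the calling thread may run on,
// or -1 if the affinity mask cannot be read.
int first_allowed_cpu() {
    cpu_set_t set;
    CPU_ZERO(&set);
    if (sched_getaffinity(0, sizeof(set), &set) != 0) return -1;
    for (int cpu = 0; cpu < CPU_SETSIZE; ++cpu) {
        if (CPU_ISSET(cpu, &set)) return cpu;
    }
    return -1;
}
```

Each worker thread would call `pin_current_thread_to_cpu` once at startup, with CPU IDs chosen so that threads sharing data sit on the same NUMA node. Which CPU IDs belong to which node depends on the machine and has to be looked up separately.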

@radikalliberal
Contributor

Thanks @poor1017, that was a great hint.
I think you are right that this is NUMA.
I'm running ORT from C++, and when I create the session in the thread it is executed in, the forward times drop significantly and are almost on par with single-threaded performance.
My suggestion is to create each session in its own dedicated thread and to run it only in that thread.

@radikalliberal
Contributor

Hi,
I have to revise my answer. Measuring multithreaded performance was not as straightforward as I had expected. The higher forward times may have occurred because the CPUs were throttled after being idle beforehand. When we instantiate the session in the thread, throttling has already stopped. So this does not seem to be a memory issue.
I was not able to measure significant differences when allocating in the main thread and then forwarding in another.
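
To keep idle-CPU throttling out of such measurements, a common approach is to run the workload a few times untimed before starting the clock. The following is a minimal sketch using only the C++ standard library; `mean_seconds_after_warmup` is a hypothetical name, and `work` stands in for the call being measured (for example, one forward pass). The number of warm-up runs needed is an assumption that has to be tuned per machine.

```cpp
#include <cassert>
#include <chrono>
#include <functional>

// Run `work` `warmup_runs` times untimed, so that CPU frequency ramp-up
// and first-call costs are not counted, then time `timed_runs` runs.
// Returns the mean wall-clock seconds per timed run.
double mean_seconds_after_warmup(const std::function<void()>& work,
                                 int warmup_runs, int timed_runs) {
    assert(timed_runs > 0);
    for (int i = 0; i < warmup_runs; ++i) work();
    const auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < timed_runs; ++i) work();
    const std::chrono::duration<double> elapsed =
        std::chrono::steady_clock::now() - start;
    return elapsed.count() / timed_runs;
}
```

Comparing the result with and without warm-up runs gives a rough indication of how much of a measured slowdown comes from throttling rather than from the workload itself.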


6 participants