[Performance] High thread contention in BFCArena #21916

Open
gootorov opened this issue Aug 29, 2024 · 2 comments
Labels
core runtime (issues related to core runtime), performance (issues related to performance regressions), stale (issues that have not been addressed in a while; categorized by a bot)

Comments

@gootorov

Describe the issue

Hi,

I've noticed that a significant chunk of time is spent on locks inside onnxruntime, specifically on this line in BFCArena::AllocateRawInternal:

std::lock_guard<OrtMutex> lock(lock_);

The conditions are as follows:

  • Single Session object in the whole application
  • CPU execution provider
  • Many threads (at least 40) call Session.Run at the same time
  • The session has intra_threads and inter_threads set to 1, execution_mode set to SEQUENTIAL, the arena allocator enabled, and memory pattern optimization enabled
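
To make the pattern concrete, here is a minimal standard-library sketch of what these conditions amount to: many threads funnelling every allocation through one mutex. It is not onnxruntime code. `ToySharedArena`, `RunWorkers`, and the 64-byte request size are hypothetical names and values chosen for illustration, and the contention counter is only approximate because `try_lock` is allowed to fail spuriously.

```cpp
#include <cassert>
#include <cstddef>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical stand-in for an arena whose allocation path takes one lock.
// NOT onnxruntime code; it only models the locking pattern.
class ToySharedArena {
 public:
  // Hands out a fake offset. Every caller serializes on mu_.
  std::size_t Allocate(std::size_t bytes) {
    assert(bytes > 0);
    bool waited = false;
    if (!mu_.try_lock()) {  // lock is (probably) held by another thread
      mu_.lock();           // block until it becomes free
      waited = true;
    }
    // Adopt the lock we already hold so it is released on scope exit.
    std::lock_guard<std::mutex> guard(mu_, std::adopt_lock);
    if (waited) ++contended_;
    const std::size_t offset = used_;
    used_ += bytes;
    ++calls_;
    return offset;
  }

  // Read these only after all worker threads have been joined.
  std::size_t calls() const { return calls_; }
  std::size_t contended() const { return contended_; }
  std::size_t used() const { return used_; }

 private:
  std::mutex mu_;
  std::size_t used_ = 0;
  std::size_t calls_ = 0;
  std::size_t contended_ = 0;
};

// Starts `threads` workers that each allocate `per_thread` times from the
// same arena, then returns the total number of allocations performed.
std::size_t RunWorkers(ToySharedArena& arena, int threads, int per_thread) {
  std::vector<std::thread> pool;
  for (int t = 0; t < threads; ++t) {
    pool.emplace_back([&arena, per_thread] {
      for (int i = 0; i < per_thread; ++i) arena.Allocate(64);
    });
  }
  for (auto& th : pool) th.join();
  return arena.calls();
}
```

Calling `RunWorkers(arena, 40, 100000)` and comparing `arena.contended()` with `arena.calls()` gives a rough idea of how often a thread had to wait; the exact ratio depends on the machine and scheduler.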

See flamegraph screenshots below:
[three flamegraph screenshots, not reproduced here]

strace shows that 92% of the time the application spends in system calls goes to futex calls:

% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 92.30 7155.519048        1521   4701931    311073 futex
  4.54  351.586941       52798      6659           clock_nanosleep
  2.03  157.700755         874    180328           epoll_wait
  0.75   58.015816          17   3245709           write
  0.23   17.686325          19    884903           sched_yield

Is this an expected BFCArena limitation, or is it something misconfigured on my side?

I'm expecting that having a Session object per worker thread should eliminate contention. However, I've seen developers here discourage people from setups like this. Why? What are the drawbacks? I'm assuming increased memory consumption (this is fine for me); is there anything else?

And if that is indeed an expected limitation, then I'd say this needs some improvement. For example, a caller could pass their own BFCArena instance to Session.Run(), or BFCArena could track each thread_id and keep a separate arena for each thread.
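
The second idea (one arena per thread) can be sketched with the standard library alone. This is an illustration of the general technique, not a proposal for how BFCArena would implement it. `ToyThreadArena` and `RunWithPerThreadArenas` are hypothetical names, and the sketch assumes each buffer is used only by the thread that allocated it.

```cpp
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

// Hypothetical bump allocator that is owned by exactly one thread, so its
// allocation path needs no mutex. NOT onnxruntime code.
struct ToyThreadArena {
  std::size_t used = 0;
  std::size_t calls = 0;

  // Hands out a fake offset without taking any lock.
  std::size_t Allocate(std::size_t bytes) {
    assert(bytes > 0);
    const std::size_t offset = used;
    used += bytes;
    ++calls;
    return offset;
  }
};

// Gives each worker its own slot in an array of arenas (indexed by worker
// number) and returns the arenas once all workers have finished.
std::vector<ToyThreadArena> RunWithPerThreadArenas(int threads, int per_thread) {
  std::vector<ToyThreadArena> arenas(static_cast<std::size_t>(threads));
  std::vector<std::thread> pool;
  for (int t = 0; t < threads; ++t) {
    pool.emplace_back([&arenas, t, per_thread] {
      // Only this worker ever touches arenas[t].
      ToyThreadArena& mine = arenas[static_cast<std::size_t>(t)];
      for (int i = 0; i < per_thread; ++i) mine.Allocate(64);
    });
  }
  for (auto& th : pool) th.join();
  return arenas;
}
```

The trade-off is the one already mentioned above: every arena reserves its own memory, so total consumption grows with the number of threads. A real design would also have to handle buffers that are freed on a different thread than the one that allocated them, which this sketch ignores.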

To reproduce

Initialize a single Session with the following settings:

  • CPU execution provider
  • intra_threads set to 1
  • inter_threads set to 1
  • execution_mode set to SEQUENTIAL
  • arena allocator enabled
  • memory pattern optimization enabled

Then, call Session.Run from many threads concurrently.

Urgency

No response

Platform

Linux

OS Version

NixOS, Gentoo

ONNX Runtime Installation

Released Package

ONNX Runtime Version or Commit ID

1.19.0

ONNX Runtime API

C++

Architecture

X64

Execution Provider

Default CPU

Execution Provider Library Version

No response

Model File

No response

Is this a quantized model?

No

@gootorov added the performance label Aug 29, 2024
@pranavsharma
Contributor

pranavsharma commented Aug 29, 2024

Lock contention in the BFC arena is a known issue. You have a few options:

  1. Disable BFC arena altogether
  2. Use mimalloc instead of BFC arena (requires building ORT with mimalloc)
  3. Plug in your own allocator
  4. Disable BFC arena altogether and link with your own allocator

@pranavsharma added the core runtime label Aug 29, 2024

This issue has been automatically marked as stale due to inactivity and will be closed in 30 days if no further activity occurs. If further support is needed, please provide an update and/or more details.

@github-actions bot added the stale label Sep 29, 2024