ReentrantLock: wakeup a single task on unlock and add a short spin #56814

Open
wants to merge 2 commits into
base: master

Conversation

Contributor

@andrebsguedes andrebsguedes commented Dec 12, 2024

I propose a change in the implementation of the ReentrantLock to improve its overall throughput for short critical sections and fix the quadratic wake-up behavior where each unlock schedules all waiting tasks on the lock's wait queue.

This implementation follows the same principles as the Mutex in the parking_lot Rust crate, which is based on the WebKit WTF::ParkingLot class. Only the basic working principle is implemented here; further improvements such as eventual fairness will be proposed separately.

The gist of the change is that we add one extra state to the lock, essentially going from:

0x0 => The lock is not locked
0x1 => The lock is locked by exactly one task. No other task is waiting for it.
0x2 => The lock is locked and some other task tried to lock but failed (conflict)

To:

# PARKED_BIT | LOCKED_BIT | Description
#     0      |     0      | The lock is not locked, nor is anyone waiting for it.
# -----------+------------+------------------------------------------------------------------
#     0      |     1      | The lock is locked by exactly one task. No other task is
#            |            | waiting for it.
# -----------+------------+------------------------------------------------------------------
#     1      |     0      | The lock is not locked. One or more tasks are parked.
# -----------+------------+------------------------------------------------------------------
#     1      |     1      | The lock is locked by exactly one task. One or more tasks are
#            |            | parked waiting for the lock to become available.
#            |            | In this state, PARKED_BIT is only ever cleared when the cond_wait lock
#            |            | is held (i.e. on unlock). This ensures that
#            |            | we never end up in a situation where there are parked tasks but
#            |            | PARKED_BIT is not set (which would result in those tasks
#            |            | potentially never getting woken up).
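
The table can be read as two independent bit flags. A minimal sketch of decoding them follows; `PARKED_BIT = 0b10` appears in the diff, while the value `LOCKED_BIT = 0b01` and the helper names are assumptions made for illustration only:

```julia
const LOCKED_BIT = 0b01  # assumed value; only PARKED_BIT is shown in the diff
const PARKED_BIT = 0b10

# Hypothetical helpers that decode the two flags from a state byte.
is_locked(state::UInt8) = (state & LOCKED_BIT) != 0
has_parked(state::UInt8) = (state & PARKED_BIT) != 0

# Unlock only has to notify a waiter when PARKED_BIT is set.
unlock_must_notify(state::UInt8) = has_parked(state)

for state in 0x00:0x03
    println(string(state, base = 2, pad = 2),
            " locked=", is_locked(state),
            " parked=", has_parked(state),
            " notify_on_unlock=", unlock_must_notify(state))
end
```

The point of the second bit is visible in `unlock_must_notify`: the fast path (no waiters) and the slow path (wake one waiter) can be told apart from the state alone.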

In the current implementation every unlock must schedule all waiting tasks so that they re-create the conflict state (0x2), because unlock only notifies waiters when the lock is in that state. Under high contention with a short critical section, the tasks are therefore effectively spinning in the scheduler queue.

With the extra state the proposed implementation has enough information to know if there are other tasks to be notified or not, which means we can always notify one task at a time while preserving the optimized path of not notifying if there are no tasks waiting. To improve throughput for short critical sections we also introduce a bounded amount of spinning before attempting to park.
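
To make the wake-one idea concrete, here is a minimal sketch of a toy lock built on the two bits. It is not the code in this PR: `ToyLock`, `toy_trylock`, `toy_lock`, and `toy_unlock` are hypothetical names, the lock is not reentrant, it does not spin, and it conservatively leaves `PARKED_BIT` set after waking a task, so the next unlock may take the slow path once more than strictly needed.

```julia
const LOCKED_BIT = 0x01  # assumed value
const PARKED_BIT = 0x02  # same value as `PARKED_BIT = 0b10` in the diff

# Hypothetical, non-reentrant toy lock used only for illustration.
mutable struct ToyLock
    @atomic state::UInt8
    cond::Threads.Condition
end
ToyLock() = ToyLock(0x00, Threads.Condition())

function toy_trylock(l::ToyLock)
    state = @atomic l.state
    while (state & LOCKED_BIT) == 0
        # Set LOCKED_BIT while preserving PARKED_BIT if it is present.
        (state, ok) = @atomicreplace l.state state => (state | LOCKED_BIT)
        ok && return true
    end
    return false
end

function toy_lock(l::ToyLock)
    while true
        toy_trylock(l) && return
        lock(l.cond)
        try
            state = @atomic l.state
            if (state & LOCKED_BIT) != 0
                # Announce a parked task, then sleep until an unlock wakes us.
                (_, ok) = @atomicreplace l.state state => (state | PARKED_BIT)
                ok && wait(l.cond)
            end
        finally
            unlock(l.cond)
        end
    end
end

function toy_unlock(l::ToyLock)
    # Fast path: nobody is parked, so there is nobody to notify.
    (_, ok) = @atomicreplace l.state LOCKED_BIT => 0x00
    ok && return
    lock(l.cond)
    try
        # Slow path: wake exactly one task instead of all of them.
        woken = notify(l.cond; all = false)
        # PARKED_BIT is cleared only while the condition lock is held.
        @atomic l.state = (woken == 0 ? 0x00 : PARKED_BIT)
    finally
        unlock(l.cond)
    end
    return
end
```

Because `PARKED_BIT` is only set and cleared while `l.cond` is locked, a task cannot park between the unlocker's `notify` and its store to `state`, which is the invariant the state table describes.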

Results

Not spinning on the scheduler queue greatly reduces the CPU utilization of the following example:

function example()
    lock = ReentrantLock()
    @sync begin
        for i in 1:10000
            Threads.@spawn begin
                @lock lock begin
                    sleep(0.001)
                end
            end
        end
    end
end


@time example()

Current:

28.890623 seconds (101.65 k allocations: 7.646 MiB, 0.25% compilation time)

image

Proposed:

22.806669 seconds (101.65 k allocations: 7.814 MiB, 0.35% compilation time)

image

In a micro-benchmark where 8 threads contend for a single lock with a very short critical section we see a ~2x improvement.

Current:

8-element Vector{Int64}:
 6258688
 5373952
 6651904
 6389760
 6586368
 3899392
 5177344
 5505024
Total iterations: 45842432

Proposed:

8-element Vector{Int64}:
 12320768
 12976128
 10354688
 12845056
  7503872
 13598720
 13860864
 11993088
Total iterations: 95453184
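
The benchmark script itself is not included in this description. A rough sketch of how such per-task iteration counts could be collected is shown below; the function name `contended_iterations`, the fixed-duration loop, and the shared counter are all assumptions, not the harness that produced the numbers above.

```julia
# Hypothetical harness: each task repeatedly takes the lock for a very short
# critical section until the deadline passes, then reports its iteration count.
function contended_iterations(ntasks::Int, seconds::Real)
    lk = ReentrantLock()
    shared = Ref(0)               # protected by `lk`
    counts = zeros(Int, ntasks)   # one slot per task, no sharing between tasks
    deadline = time() + seconds
    @sync for t in 1:ntasks
        Threads.@spawn begin
            n = 0
            while time() < deadline
                @lock lk shared[] += 1
                n += 1
            end
            counts[t] = n
        end
    end
    return counts, shared[]
end

counts, total = contended_iterations(8, 1.0)
display(counts)
println("Total iterations: ", total)
```

Running this with `julia -t 8` gives each task its own thread; with a single thread the tasks run one after another and the lock is never contended.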

In the uncontended scenario the extra bookkeeping initially caused a 10% throughput reduction.
EDIT: I reverted _trylock to the simple case to recover the uncontended throughput, and both implementations are now in the same ballpark (without hurting the numbers above).

In the uncontended scenario:

Current:

Total iterations: 236748800

Proposed:

Total iterations: 237699072

Closes #56182

@oscardssmith oscardssmith added performance Must go faster multithreading Base.Threads and related functionality labels Dec 12, 2024
@kpamnany kpamnany requested a review from vtjnash December 12, 2024 17:13
# Instead, the backoff is simply capped at a maximum value. This can be
# used to improve throughput in `compare_exchange` loops that have high
# contention.
@inline function spin(iteration::Int)
Member

is this better than just calling yield?

Contributor Author

In this case we do not want to leave the core. Instead, we busy-wait in-core for a small number of iterations before attempting the compare_exchange again. This way, if the critical section of the lock is small enough, we have a chance to acquire the lock without paying for an OS thread context switch (or a Julia scheduler task switch, if you mean Base.yield).
This is the same strategy employed by Rust here.

Member

I meant Base.yield. I think we're in a different situation than Rust since we have M:N threading and a user mode scheduler which Rust doesn't.

Contributor

adienes commented Dec 12, 2024

(this PR is fantastically written! professional, comprehensive, and easy to follow 👏👏)

Member

@vchuravy vchuravy left a comment

Thanks! Very well written PR.

# parked if it wants to lock the lock, but it is currently being held by some other task.
const PARKED_BIT = 0b10

const MAX_SPINS = 10
Member

This is in log2? The optimal number of spins is chip dependent if I recall the WTF code correctly.

Contributor Author

Yes, will change to MAX_SPINS_LOG2.

WTF definitely goes the extra mile of optimizing the spin limit, but parking_lot seems to perform fine with a hardcoded 10, so I just used what they settled on.

Contributor Author

@andrebsguedes andrebsguedes Dec 12, 2024

Actually, the number of spins is 10 (cmpxchg attempts); it's just the amount of work per spin that increases with each iteration to introduce some backoff.

# contention.
@inline function spin(iteration::Int)
    next = iteration >= MAX_SPINS ? MAX_SPINS : iteration + 1
    for _ in 1:(1 << next)
Member

This code feels a bit "unJulian" I would use max and 2^n

Contributor Author

Agreed! Will change.
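
For reference, a sketch of what the capped backoff could look like in that style. The cap is written with `min`, which is equivalent to the ternary in the hunk above for integer inputs. The helper names `backoff_exponent` and `pause_count`, the return value of `spin`, and the use of `GC.safepoint()` as a stand-in for a CPU pause hint are assumptions, not the final code of this PR.

```julia
const MAX_SPINS = 10

# Exponent used for retry number `iteration` (0-based), capped at MAX_SPINS.
backoff_exponent(iteration::Int) = min(iteration + 1, MAX_SPINS)

# Number of busy-wait steps performed before that retry.
pause_count(iteration::Int) = 2^backoff_exponent(iteration)

@inline function spin(iteration::Int)
    next = backoff_exponent(iteration)
    for _ in 1:2^next
        # Placeholder body: a real implementation would issue a CPU
        # pause/relax hint here rather than a GC safepoint.
        GC.safepoint()
    end
    return next  # assumed: the caller feeds this into the next call
end

# Work per attempt doubles from 2 steps up to the cap of 2^MAX_SPINS steps.
println([pause_count(i) for i in 0:(MAX_SPINS - 1)])
```

With ten attempts the busy-wait work grows geometrically, so most of the total spinning budget is spent in the last few attempts before the task gives up and parks.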

@gbaraldi
Member

Another point of comparison could be https://github.com/kprotty/usync which is just a "normal" lock

Labels
multithreading Base.Threads and related functionality performance Must go faster
Projects
None yet
Development

Successfully merging this pull request may close these issues.

unlock notifies _all_ waiting tasks
5 participants