Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix deadlock caused by race while signal receivers are stopping #1220

Merged
merged 4 commits into from
Jul 2, 2024

Conversation

JacobOaks
Copy link
Contributor

@JacobOaks JacobOaks commented Jun 26, 2024

A user reported a possible deadlock within the signal receivers (#1219).

This happens by:

Luckily, this is not a hard deadlock, as Stop will return if the context times out, but we should still fix it.

This PR fixes this deadlock. The idea behind how it does it is based on the observation that the broadcasting logic does not necessarily seem to need to share a mutex with the rest of signalReceivers. Specifically, it seems like we can separate protection around the registered wait and done channels, last, and the rest of the fields, since the references to those fields are easily isolated. To avoid overcomplicating signalReceivers with multiple locks for different uses, this PR creates a separate broadcaster type in charge of keeping track of and broadcasting to Wait and Done channels. Most of the implementation of broadcaster is simply moved over from signalReceivers.

Having a separate broadcaster type seems actually quite natural, so I opted for this to fix the deadlock. Absolutely open to feedback or taking other routes if folks have thoughts.

Since broadcasting is protected separately, this deadlock no longer happens since relayer() is free to finish its broadcast and then exit.

In addition to running the example provided in the original post to verify, I added a test and ran it before/after this change.

Before:

$ go test -v -count=10 -run "TestSignal/stop_deadlock" .
=== RUN   TestSignal/stop_deadlock
    signal_test.go:141:
                Error Trace:
/home/user/go/src/github.com/uber-go/fx/signal_test.go:141
                Error:          Received unexpected error:
                                context deadline exceeded
                Test:           TestSignal/stop_deadlock

(the failure appeared roughly 1/3 of the time)

After:

$ go test -v -count=100 -run "TestSignal/stop_deadlock" .
--- PASS: TestSignal (0.00s)
    --- PASS: TestSignal/stop_deadlock (0.00s)

(no failures appeared)

A user reported a possible deadlock within the signal receivers (uber-go#1219).

This happens by:
* `(*signalReceivers).Stop()` is called, by Shutdowner for instance.
* `(*signalReceivers).Stop()` acquires the lock.
* Separately, an OS signal is sent to the program.
* There is a chance that `relayer()` is still running at this point
  if `(*signalReceivers).Stop()` has not yet sent along the `shutdown` channel.
* The relayer attempts to broadcast the signal received via the `signals` channel.
* Broadcast()` blocks on trying to acquire the lock.
* `(*signalReceivers).Stop()` blocks on waiting for the `relayer()` to finish
  by blocking on the `finished` channel.
* Deadlock.

Luckily, this is not a hard deadlock, as `Stop` will return if the context times out.

This PR fixes this deadlock. The idea behind how it does it
is based on the observation that the broadcasting logic does not
necessarily seem to need to share a mutex with the rest of `signalReceivers`.
Specifically, it seems like we can separate protection around
the registered `wait` and `done` channels, `last`, and the rest of the fields.
To avoid overcomplicating `signalReceivers` with multiple locks for different uses,
this PR creates a separate `broadcaster` type in charge of keeping track of
and broadcasting to `Wait` and `Done` channels.

Having a separate broadcaster type seems actually quite natural,
so I opted for this to fix the deadlock. Absolutely open to feedback
or taking other routes if folks have thoughts.

Since broadcasting is protected separately, this deadlock no longer happens
since `relayer()` is free to finish its broadcast and then exit.

In addition to running the example provided in the original post to verify,
I added a test and ran it before/after this change.

Before:
```
$ go test -v -count=10 -run "TestSignal/stop_deadlock" .
=== RUN   TestSignal/stop_deadlock
    signal_test.go:141:
                Error Trace:
/home/user/go/src/github.com/uber-go/fx/signal_test.go:141
                Error:          Received unexpected error:
                                context deadline exceeded
                Test:           TestSignal/stop_deadlock
```
(the failure appeared roughly 1/3 of the time)

After:
```
$ go test -v -count=100 -run "TestSignal/stop_deadlock" .
--- PASS: TestSignal (0.00s)
    --- PASS: TestSignal/stop_deadlock (0.00s)
```
(no failures appeared)
@JacobOaks JacobOaks marked this pull request as ready for review June 26, 2024 21:17
Copy link

codecov bot commented Jun 26, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 98.42%. Comparing base (74d9643) to head (d2c9e53).

Additional details and impacted files
@@           Coverage Diff           @@
##           master    #1220   +/-   ##
=======================================
  Coverage   98.41%   98.42%           
=======================================
  Files          34       35    +1     
  Lines        2909     2918    +9     
=======================================
+ Hits         2863     2872    +9     
  Misses         38       38           
  Partials        8        8           

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

broadcast.go Outdated Show resolved Hide resolved
broadcast.go Show resolved Hide resolved
@lverma14
Copy link
Contributor

On behalf of group code review by @sywhang & @r-hang

@JacobOaks JacobOaks merged commit 6fde730 into uber-go:master Jul 2, 2024
12 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

Successfully merging this pull request may close these issues.

4 participants