Fix deadlock caused by race while signal receivers are stopping #1220

JacobOaks · 2024-06-26T21:14:12Z

A user reported a possible deadlock within the signal receivers (#1219).

The sequence is:

1. (*signalReceivers).Stop() is called, by Shutdowner for instance.
2. (*signalReceivers).Stop() acquires the lock.
3. Separately, an OS signal is sent to the program.
4. relayer() may still be running at this point if (*signalReceivers).Stop() has not yet sent on the shutdown channel.
5. The relayer attempts to broadcast the signal it received via the signals channel.
6. Broadcast() blocks trying to acquire the lock, which Stop() still holds.
7. (*signalReceivers).Stop() blocks waiting on the finished channel for relayer() to finish.
8. Deadlock.

Luckily, this is not a hard deadlock, as Stop will return if the context times out, but we should still fix it.
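For illustration, here is a minimal sketch of the interleaving described above. It is not the fx code: sharedLockReceivers, stop, and reproduceDeadlock are hypothetical names, shutdown is requested by closing a channel for simplicity, and the sendSignal hook exists only to force the signal to arrive after the lock is taken but before shutdown is requested, so the race is hit on every run.

```go
package main

import (
	"context"
	"fmt"
	"sync"
	"time"
)

// sharedLockReceivers is a hypothetical, simplified stand-in for
// signalReceivers before the fix: one mutex guards everything.
type sharedLockReceivers struct {
	mu       sync.Mutex
	signals  chan string   // stands in for the OS signal channel
	shutdown chan struct{} // closed by stop to ask relayer to exit
	finished chan struct{} // closed by relayer when it exits
}

// relayer exits on shutdown, or broadcasts one signal and then exits.
func (r *sharedLockReceivers) relayer() {
	defer close(r.finished)
	select {
	case <-r.shutdown:
	case <-r.signals:
		r.broadcast()
	}
}

// broadcast needs the same mutex that stop holds.
func (r *sharedLockReceivers) broadcast() {
	r.mu.Lock() // blocks for as long as stop holds the lock
	defer r.mu.Unlock()
}

// stop mirrors the problematic ordering. sendSignal is a demo-only hook
// that delivers a signal while the lock is held.
func (r *sharedLockReceivers) stop(ctx context.Context, sendSignal func()) error {
	r.mu.Lock()
	defer r.mu.Unlock()
	sendSignal()      // relayer receives the signal and heads into broadcast
	close(r.shutdown) // too late: relayer is no longer selecting on shutdown
	select {
	case <-r.finished:
		return nil
	case <-ctx.Done():
		return ctx.Err() // the only way out while relayer is stuck
	}
}

// reproduceDeadlock returns the error from stop under the forced interleaving.
func reproduceDeadlock() error {
	r := &sharedLockReceivers{
		signals:  make(chan string), // unbuffered: send returns once relayer has received
		shutdown: make(chan struct{}),
		finished: make(chan struct{}),
	}
	go r.relayer()
	ctx, cancel := context.WithTimeout(context.Background(), 50*time.Millisecond)
	defer cancel()
	return r.stop(ctx, func() { r.signals <- "SIGINT" })
}

func main() {
	fmt.Println(reproduceDeadlock()) // prints "context deadline exceeded"
}
```

In this sketch stop can only return through the context timeout, which matches the "not a hard deadlock" behavior: once stop returns and releases the lock, the relayer goroutine is able to finish.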

This PR fixes the deadlock. The fix rests on the observation that the broadcasting logic does not appear to need to share a mutex with the rest of signalReceivers. Specifically, it seems we can protect the registered wait and done channels and last separately from the rest of the fields, since the references to those fields are easily isolated. To avoid overcomplicating signalReceivers with multiple locks for different uses, this PR creates a separate broadcaster type in charge of keeping track of and broadcasting to Wait and Done channels. Most of the implementation of broadcaster is simply moved over from signalReceivers.

Having a separate broadcaster type seems actually quite natural, so I opted for this to fix the deadlock. Absolutely open to feedback or taking other routes if folks have thoughts.

Because broadcasting is protected separately, the deadlock no longer occurs: relayer() is free to finish its broadcast and then exit.
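As a rough sketch of the idea (again, not the actual fx implementation), the broadcasting state can live in its own type with its own mutex. The names broadcaster, receivers, lastSignal, and stopWithSeparateLocks are illustrative assumptions, and the sendSignal hook is a demo-only device that delivers a signal while stop holds the lifecycle lock.

```go
package main

import (
	"context"
	"fmt"
	"sync"
	"time"
)

// broadcaster is a hypothetical, simplified type that owns the broadcast
// state and its own mutex, independent of the lifecycle lock.
type broadcaster struct {
	mu   sync.Mutex
	last string
}

func (b *broadcaster) broadcast(sig string) {
	b.mu.Lock()
	defer b.mu.Unlock()
	b.last = sig
}

func (b *broadcaster) lastSignal() string {
	b.mu.Lock()
	defer b.mu.Unlock()
	return b.last
}

// receivers is a simplified stand-in for signalReceivers after the fix.
type receivers struct {
	mu       sync.Mutex // guards lifecycle state only
	b        broadcaster
	signals  chan string
	shutdown chan struct{}
	finished chan struct{}
}

// relayer exits on shutdown, or broadcasts one signal and then exits.
func (r *receivers) relayer() {
	defer close(r.finished)
	select {
	case <-r.shutdown:
	case sig := <-r.signals:
		r.b.broadcast(sig) // takes b.mu, not r.mu, so stop cannot block it
	}
}

// stop holds only the lifecycle lock while it waits for relayer to exit.
func (r *receivers) stop(ctx context.Context, sendSignal func()) error {
	r.mu.Lock()
	defer r.mu.Unlock()
	sendSignal() // demo-only: signal arrives while the lock is held
	close(r.shutdown)
	select {
	case <-r.finished:
		return nil
	case <-ctx.Done():
		return ctx.Err()
	}
}

// stopWithSeparateLocks runs the same interleaving that deadlocks when
// broadcasting shares the lifecycle mutex.
func stopWithSeparateLocks() (string, error) {
	r := &receivers{
		signals:  make(chan string),
		shutdown: make(chan struct{}),
		finished: make(chan struct{}),
	}
	go r.relayer()
	ctx, cancel := context.WithTimeout(context.Background(), time.Second)
	defer cancel()
	err := r.stop(ctx, func() { r.signals <- "SIGINT" })
	return r.b.lastSignal(), err
}

func main() {
	last, err := stopWithSeparateLocks()
	fmt.Println(last, err)
}
```

With the broadcast state behind its own lock, the relayer completes the broadcast and closes finished even though stop still holds the lifecycle lock, so stop returns without waiting for the timeout.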

In addition to running the example provided in the original post to verify, I added a test and ran it before/after this change.

Before:

$ go test -v -count=10 -run "TestSignal/stop_deadlock" .
=== RUN   TestSignal/stop_deadlock
    signal_test.go:141:
                Error Trace:    /home/user/go/src/github.com/uber-go/fx/signal_test.go:141
                Error:          Received unexpected error:
                                context deadline exceeded
                Test:           TestSignal/stop_deadlock

(the failure appeared roughly 1/3 of the time)

After:

$ go test -v -count=100 -run "TestSignal/stop_deadlock" .
--- PASS: TestSignal (0.00s)
    --- PASS: TestSignal/stop_deadlock (0.00s)

(no failures appeared)


codecov · 2024-06-26T21:24:38Z

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 98.42%. Comparing base (74d9643) to head (d2c9e53).

Additional details and impacted files

@@           Coverage Diff           @@
##           master    #1220   +/-   ##
=======================================
  Coverage   98.41%   98.42%           
=======================================
  Files          34       35    +1     
  Lines        2909     2918    +9     
=======================================
+ Hits         2863     2872    +9     
  Misses         38       38           
  Partials        8        8


broadcast.go

lverma14 · 2024-06-28T21:03:37Z

On behalf of group code review by @sywhang & @r-hang

JacobOaks marked this pull request as ready for review June 26, 2024 21:17

JacobOaks commented Jun 28, 2024

Update broadcast.go

28ab464

lverma14 reviewed Jun 28, 2024

JacobOaks added 2 commits July 1, 2024 15:54

elaborate on lock usage

0ab55ae

Fix lint

d2c9e53

sywhang approved these changes Jul 1, 2024

r-hang approved these changes Jul 1, 2024

JacobOaks merged commit 6fde730 into uber-go:master Jul 2, 2024
12 checks passed

abhinav mentioned this pull request Sep 3, 2024

Deadlock in signalReceivers.Stop #1219

Closed

tchung1118 mentioned this pull request Oct 11, 2024

Preparing release v1.23.0 #1241

Merged
