Skip to content

Commit

Permalink
Fix deadlock caused by race while signal receivers are stopping
Browse files Browse the repository at this point in the history
A user reported a possible deadlock within the signal receivers (#1219).

This happens by:
* `(*signalReceivers).Stop()` is called, by Shutdowner for instance.
* `(*signalReceivers).Stop()` acquires the lock.
* Separately, an OS signal is sent to the program.
* There is a chance that `relayer()` is still running at this point
  if `(*signalReceivers).Stop()` has not yet sent along the `shutdown` channel.
* The relayer attempts to broadcast the signal received via the `signals` channel.
* Broadcast()` blocks on trying to acquire the lock.
* `(*signalReceivers).Stop()` blocks on waiting for the `relayer()` to finish
  by blocking on the `finished` channel.
* Deadlock.

Luckily, this is not a hard deadlock, as `Stop` will return if the context times out.

This PR fixes this deadlock. The idea behind how it does it
is based on the observation that the broadcasting logic does not
necessarily seem to need to share a mutex with the rest of `signalReceivers`.
Specifically, it seems like we can separate protection around
the registered `wait` and `done` channels, `last`, and the rest of the fields.
To avoid overcomplicating `signalReceivers` with multiple locks for different uses,
this PR creates a separate `broadcaster` type in charge of keeping track of
and broadcasting to `Wait` and `Done` channels.

Having a separate broadcaster type seems actually quite natural,
so I opted for this to fix the deadlock. Absolutely open to feedback
or taking other routes if folks have thoughts.

Since broadcasting is protected separately, this deadlock no longer happens
since `relayer()` is free to finish its broadcast and then exit.

In addition to running the example provided in the original post to verify,
I added a test and ran it before/after this change.

Before:
```
$ go test -v -count=10 -run "TestSignal/stop_deadlock" .
=== RUN   TestSignal/stop_deadlock
    signal_test.go:141:
                Error Trace:
/home/user/go/src/github.com/uber-go/fx/signal_test.go:141
                Error:          Received unexpected error:
                                context deadline exceeded
                Test:           TestSignal/stop_deadlock
```
(the failure appeared roughly 1/3 of the time)

After:
```
$ go test -v -count=100 -run "TestSignal/stop_deadlock" .
--- PASS: TestSignal (0.00s)
    --- PASS: TestSignal/stop_deadlock (0.00s)
```
(no failures appeared)
  • Loading branch information
JacobOaks committed Jun 26, 2024
1 parent 74d9643 commit fc310bd
Show file tree
Hide file tree
Showing 4 changed files with 212 additions and 123 deletions.
176 changes: 176 additions & 0 deletions broadcast.go
Original file line number Diff line number Diff line change
@@ -0,0 +1,176 @@
// Copyright (c) 2024 Uber Technologies, Inc.
//
// Permission is hereby granted, free of charge, to any person obtaining a copy
// of this software and associated documentation files (the "Software"), to deal
// in the Software without restriction, including without limitation the rights
// to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
// copies of the Software, and to permit persons to whom the Software is
// furnished to do so, subject to the following conditions:
//
// The above copyright notice and this permission notice shall be included in
// all copies or substantial portions of the Software.
//
// THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
// IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
// FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
// AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
// LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
// OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
// THE SOFTWARE.

package fx

import (
"fmt"
"os"
"sync"
)

// broadcaster broadcasts signals to registered signal listeners.
// All methods on the broadcaster are concurrency-safe.
type broadcaster struct {
// this protects reads/writes to any of the fields in the struct
m sync.Mutex

// last will contain a pointer to the last ShutdownSignal received, or
// nil if none, if a new channel is created by Wait or Done, this last
// signal will be immediately written to, this allows Wait or Done state
// to be read after application stop
last *ShutdownSignal

// contains channels created by Done
done []chan os.Signal

// contains channels created by Wait
wait []chan ShutdownSignal
}

func (b *broadcaster) reset() {
b.m.Lock()
defer b.m.Unlock()
b.last = nil
}

// Wait creates a new channel that will receive signals being broadcast
// via the broadcaster.
//
// If a signal has been received prior to the call of Done,
// the signal will be sent to the new channel.
func (b *broadcaster) Done() <-chan os.Signal {
b.m.Lock()
defer b.m.Unlock()

ch := make(chan os.Signal, 1)
// If we had received a signal prior to the call of done, send it's
// os.Signal to the new channel.
// However we still want to have the operating system notify signals to this
// channel should the application receive another.
if b.last != nil {
ch <- b.last.Signal
}
b.done = append(b.done, ch)
return ch
}

// Wait creates a new channel that will receive signals being broadcast
// via the broadcaster.
//
// If a signal has been received prior to the call of Wait,
// the signal will be sent to the new channel.
func (b *broadcaster) Wait() <-chan ShutdownSignal {
b.m.Lock()
defer b.m.Unlock()

ch := make(chan ShutdownSignal, 1)

if b.last != nil {
ch <- *b.last
}

b.wait = append(b.wait, ch)
return ch
}

// Broadcast sends the given signal to all channels that have been created
// via Done or Wait. It does not block on sending, and returns an unsentSignalError
// if any send did not go through.
func (b *broadcaster) Broadcast(signal ShutdownSignal) error {
b.m.Lock()
defer b.m.Unlock()

b.last = &signal

channels, unsent := b.broadcast(
signal,
b.broadcastDone,
b.broadcastWait,
)

if unsent != 0 {
return &unsentSignalError{
Signal: signal,
Total: channels,
Unsent: unsent,
}
}

return nil
}

func (b *broadcaster) broadcast(
signal ShutdownSignal,
anchors ...func(ShutdownSignal) (int, int),
) (int, int) {
var channels, unsent int

for _, anchor := range anchors {
c, u := anchor(signal)
channels += c
unsent += u
}

return channels, unsent
}

func (b *broadcaster) broadcastDone(signal ShutdownSignal) (int, int) {
var unsent int

for _, reader := range b.done {
select {
case reader <- signal.Signal:
default:
unsent++
}
}

return len(b.done), unsent
}

func (b *broadcaster) broadcastWait(signal ShutdownSignal) (int, int) {
var unsent int

for _, reader := range b.wait {
select {
case reader <- signal:
default:
unsent++
}
}

return len(b.wait), unsent
}

type unsentSignalError struct {
Signal ShutdownSignal
Unsent int
Total int
}

func (err *unsentSignalError) Error() string {
return fmt.Sprintf(
"send %v signal: %v/%v channels are blocked",
err.Signal,
err.Unsent,
err.Total,
)
}
2 changes: 1 addition & 1 deletion shutdown.go
Original file line number Diff line number Diff line change
Expand Up @@ -83,7 +83,7 @@ func (s *shutdowner) Shutdown(opts ...ShutdownOption) error {
opt.apply(s)
}

return s.app.receivers.Broadcast(ShutdownSignal{
return s.app.receivers.b.Broadcast(ShutdownSignal{
Signal: _sigTERM,
ExitCode: s.exitCode,
})
Expand Down
130 changes: 10 additions & 120 deletions signal.go
Original file line number Diff line number Diff line change
Expand Up @@ -49,9 +49,12 @@ func newSignalReceivers() signalReceivers {
notify: signal.Notify,
stopNotify: signal.Stop,
signals: make(chan os.Signal, 1),
b: &broadcaster{},
}
}

// signalReceivers listens to OS signals and shutdown signals,
// and relays them to registered listeners when started.
type signalReceivers struct {
// this mutex protects writes and reads of this struct to prevent
// race conditions in a parallel execution pattern
Expand All @@ -68,17 +71,9 @@ type signalReceivers struct {
notify func(c chan<- os.Signal, sig ...os.Signal)
stopNotify func(c chan<- os.Signal)

// last will contain a pointer to the last ShutdownSignal received, or
// nil if none, if a new channel is created by Wait or Done, this last
// signal will be immediately written to, this allows Wait or Done state
// to be read after application stop
last *ShutdownSignal

// contains channels created by Done
done []chan os.Signal

// contains channels created by Wait
wait []chan ShutdownSignal
// used to register and broadcast to signal listeners
// created via Done and Wait
b *broadcaster
}

func (recv *signalReceivers) relayer() {
Expand All @@ -90,7 +85,7 @@ func (recv *signalReceivers) relayer() {
case <-recv.shutdown:
return
case signal := <-recv.signals:
recv.Broadcast(ShutdownSignal{
recv.b.Broadcast(ShutdownSignal{
Signal: signal,
})
}
Expand Down Expand Up @@ -137,120 +132,15 @@ func (recv *signalReceivers) Stop(ctx context.Context) error {
close(recv.finished)
recv.shutdown = nil
recv.finished = nil
recv.last = nil
recv.b.reset()
return nil
}
}

func (recv *signalReceivers) Done() <-chan os.Signal {
recv.m.Lock()
defer recv.m.Unlock()

ch := make(chan os.Signal, 1)

// If we had received a signal prior to the call of done, send it's
// os.Signal to the new channel.
// However we still want to have the operating system notify signals to this
// channel should the application receive another.
if recv.last != nil {
ch <- recv.last.Signal
}

recv.done = append(recv.done, ch)
return ch
return recv.b.Done()
}

func (recv *signalReceivers) Wait() <-chan ShutdownSignal {
recv.m.Lock()
defer recv.m.Unlock()

ch := make(chan ShutdownSignal, 1)

if recv.last != nil {
ch <- *recv.last
}

recv.wait = append(recv.wait, ch)
return ch
}

func (recv *signalReceivers) Broadcast(signal ShutdownSignal) error {
recv.m.Lock()
defer recv.m.Unlock()

recv.last = &signal

channels, unsent := recv.broadcast(
signal,
recv.broadcastDone,
recv.broadcastWait,
)

if unsent != 0 {
return &unsentSignalError{
Signal: signal,
Total: channels,
Unsent: unsent,
}
}

return nil
}

func (recv *signalReceivers) broadcast(
signal ShutdownSignal,
anchors ...func(ShutdownSignal) (int, int),
) (int, int) {
var channels, unsent int

for _, anchor := range anchors {
c, u := anchor(signal)
channels += c
unsent += u
}

return channels, unsent
}

func (recv *signalReceivers) broadcastDone(signal ShutdownSignal) (int, int) {
var unsent int

for _, reader := range recv.done {
select {
case reader <- signal.Signal:
default:
unsent++
}
}

return len(recv.done), unsent
}

func (recv *signalReceivers) broadcastWait(signal ShutdownSignal) (int, int) {
var unsent int

for _, reader := range recv.wait {
select {
case reader <- signal:
default:
unsent++
}
}

return len(recv.wait), unsent
}

type unsentSignalError struct {
Signal ShutdownSignal
Unsent int
Total int
}

func (err *unsentSignalError) Error() string {
return fmt.Sprintf(
"send %v signal: %v/%v channels are blocked",
err.Signal,
err.Unsent,
err.Total,
)
return recv.b.Wait()
}
27 changes: 25 additions & 2 deletions signal_test.go
Original file line number Diff line number Diff line change
Expand Up @@ -25,6 +25,7 @@ import (
"os"
"syscall"
"testing"
"time"

"github.com/stretchr/testify/assert"
"github.com/stretchr/testify/require"
Expand Down Expand Up @@ -56,9 +57,9 @@ func TestSignal(t *testing.T) {
Signal: syscall.SIGTERM,
}

require.NoError(t, recv.Broadcast(expected), "first broadcast should succeed")
require.NoError(t, recv.b.Broadcast(expected), "first broadcast should succeed")

assertUnsentSignalError(t, recv.Broadcast(expected), &unsentSignalError{
assertUnsentSignalError(t, recv.b.Broadcast(expected), &unsentSignalError{
Signal: expected,
Total: 2,
Unsent: 2,
Expand Down Expand Up @@ -117,4 +118,26 @@ func TestSignal(t *testing.T) {
})
})
})

t.Run("stop deadlock", func(t *testing.T) {
recv := newSignalReceivers()

var notify chan<- os.Signal
recv.notify = func(ch chan<- os.Signal, _ ...os.Signal) {
notify = ch
}
recv.Start()

// Artificially create a race where the relayer receives an OS signal
// while Stop() holds the lock. If this leads to deadlock,
// we will receive a context timeout error.
gotErr := make(chan error, 1)
notify <- syscall.SIGTERM
go func() {
stopCtx, cancel := context.WithTimeout(context.Background(), time.Second)
defer cancel()
gotErr <- recv.Stop(stopCtx)
}()
assert.NoError(t, <-gotErr)
})
}

0 comments on commit fc310bd

Please sign in to comment.