Deadlock in signalReceivers.Stop
#1219
Comments
Hey @ogaca-dd thanks for the report, let me look into this. Can I ask what kinds of situations you're experiencing where a shutdown signal and an OS signal are being sent at roughly the same time?
@JacobOaks In the situation described above, I am experiencing a deadlock, meaning the application tries to stop but freezes forever.
Hey @ogaca-dd - I understand. My question is why would an application call
The following code is the simplest code I found to reproduce the issue:

```go
package main

import (
	"syscall"

	"go.uber.org/fx"
)

func main() {
	fx.New(fx.Invoke(func(shutdowner fx.Shutdowner) {
		go func() {
			// Send an OS signal and trigger Shutdown at roughly the same time.
			_ = syscall.Kill(syscall.Getpid(), syscall.SIGINT)
			_ = shutdowner.Shutdown()
		}()
	})).Run()
}
```

Our real code looks like:

```go
sigChan := make(chan os.Signal, 1)
signal.Notify(sigChan, syscall.SIGINT, syscall.SIGTERM, syscall.SIGPIPE)
for signo := range sigChan {
	switch signo {
	case syscall.SIGINT, syscall.SIGTERM:
		log.Infof("Received signal %d (%v)", signo, signo)
		shutdowner.Shutdown()
		return
	}
	// Some code here
}
```

I know that fx already handles signals and this code should be written differently (this code was not updated during our migration to fx). Anyway, the code I pointed to contains a deadlock which was tricky to find. In our case, this freeze happens randomly.

I understand it may not be a correct usage of fx, but as a user I don't expect my application to freeze randomly in such a case.
Gotcha. Yeah I was able to reproduce this locally and I will try to figure out the best way to update the signal handlers to prevent this, but yes - as you mentioned - using
Thanks for your quick reply and for the workarounds.
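For reference, one pattern that avoids installing a separate `signal.Notify` loop alongside Fx is to let the app own signal handling and block on `app.Done()`. This is a minimal sketch of that pattern, not necessarily the exact workaround discussed above:

```go
package main

import (
	"context"
	"log"

	"go.uber.org/fx"
)

func main() {
	app := fx.New(
		// ... providers and invocations go here ...
	)

	if err := app.Start(context.Background()); err != nil {
		log.Fatal(err)
	}

	// Done() delivers SIGINT/SIGTERM as well as Shutdowner-triggered
	// shutdowns, so no separate signal.Notify loop is needed.
	sig := <-app.Done()
	log.Printf("received signal %v, stopping", sig)

	if err := app.Stop(context.Background()); err != nil {
		log.Fatal(err)
	}
}
```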
A user reported a possible deadlock within the signal receivers (#1219). This happens by:

* `(*signalReceivers).Stop()` is called, by Shutdowner for instance.
* `(*signalReceivers).Stop()` [acquires the lock](https://github.com/uber-go/fx/blob/master/signal.go#L121).
* Separately, an OS signal is sent to the program.
* There is a chance that `relayer()` is still running at this point if `(*signalReceivers).Stop()` has not yet sent along the `shutdown` channel.
* The relayer [attempts to broadcast the signal](https://github.com/uber-go/fx/blob/master/signal.go#L93) received via the `signals` channel.
* `Broadcast()` blocks on [trying to acquire the lock](https://github.com/uber-go/fx/blob/master/signal.go#L178).
* `(*signalReceivers).Stop()` blocks on [waiting for the `relayer()` to finish](https://github.com/uber-go/fx/blob/master/signal.go#L132) by blocking on the `finished` channel.
* Deadlock.

Luckily, this is not a hard deadlock, as `Stop` will return if the context times out, but we should still fix it.

This PR fixes this deadlock. The idea is based on the observation that the broadcasting logic does not need to share a mutex with the rest of `signalReceivers`. Specifically, we can separate protection around the registered `wait` and `done` channels and `last` from protection around the rest of the fields, since the references to those fields are easily isolated.

To avoid overcomplicating `signalReceivers` with multiple locks for different uses, this PR creates a separate `broadcaster` type in charge of keeping track of and broadcasting to `Wait` and `Done` channels. Most of the implementation of `broadcaster` is simply moved over from `signalReceivers`. Having a separate broadcaster type seems quite natural, so I opted for this to fix the deadlock. Absolutely open to feedback or taking other routes if folks have thoughts.

Since broadcasting is protected separately, the deadlock no longer happens: `relayer()` is free to finish its broadcast and then exit.

In addition to running the example provided in the original post to verify, I added a test and ran it before/after this change.

Before:

```
$ go test -v -count=10 -run "TestSignal/stop_deadlock" .
=== RUN   TestSignal/stop_deadlock
    signal_test.go:141:
            Error Trace:    /home/user/go/src/github.com/uber-go/fx/signal_test.go:141
            Error:          Received unexpected error:
                            context deadline exceeded
            Test:           TestSignal/stop_deadlock
```

(the failure appeared roughly 1/3 of the time)

After:

```
$ go test -v -count=100 -run "TestSignal/stop_deadlock" .
--- PASS: TestSignal (0.00s)
    --- PASS: TestSignal/stop_deadlock (0.00s)
```

(no failures appeared)
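To make the shape of the fix concrete, here is an illustrative sketch, not the actual fx implementation (type, field, and method names are hypothetical): a broadcaster that owns its own mutex, so `Stop()` can wait for the relayer without ever contending with a broadcast in flight.

```go
package main

import (
	"fmt"
	"os"
	"sync"
	"syscall"
)

// broadcaster keeps track of registered Done channels behind its own mutex,
// separate from any lock held by Stop(). (Hypothetical sketch.)
type broadcaster struct {
	mu   sync.Mutex
	last *os.Signal
	done []chan os.Signal
}

// Done registers a channel that will receive broadcast signals. If a signal
// was already broadcast, it is replayed to the new subscriber.
func (b *broadcaster) Done() <-chan os.Signal {
	b.mu.Lock()
	defer b.mu.Unlock()

	ch := make(chan os.Signal, 1)
	if b.last != nil {
		ch <- *b.last
	}
	b.done = append(b.done, ch)
	return ch
}

// Broadcast fans a signal out to all subscribers. Because it only takes the
// broadcaster's own mutex, a relayer goroutine calling Broadcast can never
// block on a lock held by Stop() while Stop() waits for the relayer to exit.
func (b *broadcaster) Broadcast(sig os.Signal) {
	b.mu.Lock()
	defer b.mu.Unlock()

	b.last = &sig
	for _, ch := range b.done {
		select {
		case ch <- sig:
		default: // don't block on subscribers that aren't reading
		}
	}
}

func main() {
	var b broadcaster
	done := b.Done()
	b.Broadcast(syscall.SIGINT)
	fmt.Println(<-done) // prints: interrupt
}
```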
@JacobOaks is this resolved by #1220?

Fixed by #1220.
Describe the bug
There is a deadlock in `signalReceivers.Stop`:

* `syscall.SIGINT` is sent to the process
* `Shutdowner.Shutdown()` is called to stop the application at the same time as `signalReceivers.Stop`

There is a deadlock.
To Reproduce
Run the following code:
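Presumably the same minimal snippet shared in the comments above:

```go
package main

import (
	"syscall"

	"go.uber.org/fx"
)

func main() {
	fx.New(fx.Invoke(func(shutdowner fx.Shutdowner) {
		go func() {
			// Send an OS signal and trigger Shutdown at roughly the same time.
			_ = syscall.Kill(syscall.Getpid(), syscall.SIGINT)
			_ = shutdowner.Shutdown()
		}()
	})).Run()
}
```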
Expected behavior
No deadlock