Deadlock in signalReceivers.Stop #1219

Closed
ogaca-dd opened this issue Jun 26, 2024 · 8 comments

@ogaca-dd

Describe the bug
There is a deadlock in signalReceivers.Stop.

To Reproduce
Run the following code:

package main

import (
	"syscall"

	"go.uber.org/fx"
)

func main() {
	fx.New(fx.Invoke(func(shutdowner fx.Shutdowner) {
		go func() {
			_ = syscall.Kill(syscall.Getpid(), syscall.SIGINT)
			_ = shutdowner.Shutdown()
		}()
	})).Run()
}

Expected behavior
No deadlock

@JacobOaks
Contributor

Hey @ogaca-dd thanks for the report, let me look into this.

Can I ask what kinds of situations you're experiencing where a shutdown signal and an OS signal are being sent at roughly the same time?

@ogaca-dd
Author

@JacobOaks
Thanks for your quick answer.

In the situation described above, I am experiencing a deadlock, meaning the application tries to stop but freezes forever.

@JacobOaks
Contributor

JacobOaks commented Jun 26, 2024

Hey @ogaca-dd - I understand. My question is why would an application call shutdowner.Shutdown() after it gets killed? The example you provide seems pretty contrived, so I'm just trying to better understand the real-world impact here.

@ogaca-dd
Author

ogaca-dd commented Jun 26, 2024

The following code is the simplest code I found to reproduce the issue.

package main

import (
	"syscall"

	"go.uber.org/fx"
)

func main() {
	fx.New(fx.Invoke(func(shutdowner fx.Shutdowner) {
		go func() {
			_ = syscall.Kill(syscall.Getpid(), syscall.SIGINT)
			_ = shutdowner.Shutdown()
		}()
	})).Run()
}

Our real code looks like this:

	sigChan := make(chan os.Signal, 1)
	signal.Notify(sigChan, syscall.SIGINT, syscall.SIGTERM, syscall.SIGPIPE)
	for signo := range sigChan {
		switch signo {
		case syscall.SIGINT, syscall.SIGTERM:
			log.Infof("Received signal %d (%v)", signo, signo)
			shutdowner.Shutdown()
			return
		}
		// Some code here
	}
}

I know that fx already handles signals and this code should be written differently (it was not updated during our migration to fx).

Anyway, the code I pointed to contains a deadlock that was tricky to find. In our case, the freeze happens randomly:

  • The frequency of the issue depends on the OS.
  • Adding debug logs significantly decreases how often the issue occurs, which makes it harder to catch.

I understand this may not be correct usage of fx, but as a user I don't expect my application to freeze randomly in such a case.

@JacobOaks
Contributor

Gotcha. Yeah, I was able to reproduce this locally and I will try to figure out the best way to update the signal handlers to prevent this, but yes - as you mentioned - using app.Run will cause Fx to respond to signals automatically. So if you want to work around this for now, you can either:

  • Not manually call shutdown when you receive a signal, or
  • Use app.Start and then call app.Stop when a signal comes into sigChan (see the sketch below). Using Start/Stop instead of Run in this way will no longer cause Fx to register its own signal handlers as of v1.22.1.
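
For reference, here is a rough sketch of that second workaround. This is not from the Fx docs: the empty fx.New options, the use of log.Fatal, and the choice of signals are placeholders you would adapt to your app.

```
package main

import (
	"context"
	"log"
	"os"
	"os/signal"
	"syscall"

	"go.uber.org/fx"
)

func main() {
	app := fx.New(
		// ... your providers and invokes ...
	)

	// Start the app ourselves instead of calling Run, so Fx does not
	// register its own signal handlers (as of v1.22.1).
	startCtx, startCancel := context.WithTimeout(context.Background(), app.StartTimeout())
	defer startCancel()
	if err := app.Start(startCtx); err != nil {
		log.Fatal(err)
	}

	// Own the signal handling: block until a signal arrives...
	sigChan := make(chan os.Signal, 1)
	signal.Notify(sigChan, syscall.SIGINT, syscall.SIGTERM)
	<-sigChan

	// ...then stop the app explicitly instead of calling shutdowner.Shutdown().
	stopCtx, stopCancel := context.WithTimeout(context.Background(), app.StopTimeout())
	defer stopCancel()
	if err := app.Stop(stopCtx); err != nil {
		log.Fatal(err)
	}
}
```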

@ogaca-dd
Author

Thanks for your quick reply and for the workarounds.

JacobOaks added a commit to JacobOaks/fx that referenced this issue Jun 26, 2024
JacobOaks added a commit that referenced this issue Jul 2, 2024
A user reported a possible deadlock within the signal receivers (#1219).

This happens as follows:
* `(*signalReceivers).Stop()` is called, by Shutdowner for instance.
* `(*signalReceivers).Stop()` [acquires the lock](https://github.com/uber-go/fx/blob/master/signal.go#L121).
* Separately, an OS signal is sent to the program.
* There is a chance that `relayer()` is still running at this point if `(*signalReceivers).Stop()` has not yet sent along the `shutdown` channel.
* The relayer [attempts to broadcast the signal](https://github.com/uber-go/fx/blob/master/signal.go#L93) received via the `signals` channel.
* `Broadcast()` blocks on [trying to acquire the lock](https://github.com/uber-go/fx/blob/master/signal.go#L178).
* `(*signalReceivers).Stop()` blocks on [waiting for the `relayer()` to finish](https://github.com/uber-go/fx/blob/master/signal.go#L132) by blocking on the `finished` channel.
* Deadlock.

Luckily, this is not a hard deadlock, as `Stop` will return if the
context times out, but we should still fix it.
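
To make the cycle concrete, here is a heavily simplified, hypothetical sketch of the pattern. The names (`receivers`, `relayer`, `Broadcast`, `Stop`) mirror the description above but this is not Fx's actual code, and like the real bug it only deadlocks on the unlucky interleaving, so it may take a few runs to hit.

```
package main

import (
	"errors"
	"fmt"
	"sync"
	"time"
)

// receivers is a stand-in for signalReceivers: one mutex guards everything,
// including the state Broadcast needs.
type receivers struct {
	mu       sync.Mutex
	signals  chan string
	shutdown chan struct{}
	finished chan struct{}
}

func (r *receivers) relayer() {
	defer func() { r.finished <- struct{}{} }()
	select {
	case <-r.shutdown:
		return
	case sig := <-r.signals:
		r.Broadcast(sig) // blocks here if Stop already holds r.mu
	}
}

func (r *receivers) Broadcast(sig string) {
	r.mu.Lock()
	defer r.mu.Unlock()
	fmt.Println("broadcast:", sig)
}

func (r *receivers) Stop(timeout time.Duration) error {
	r.mu.Lock() // held for the rest of Stop
	defer r.mu.Unlock()

	r.shutdown <- struct{}{} // relayer's select may pick the signal branch instead

	select {
	case <-r.finished:
		return nil
	case <-time.After(timeout): // the "soft" deadlock: only the timeout saves us
		return errors.New("timed out waiting for relayer")
	}
}

func main() {
	r := &receivers{
		signals:  make(chan string, 1),
		shutdown: make(chan struct{}, 1),
		finished: make(chan struct{}, 1),
	}
	go r.relayer()

	r.signals <- "SIGINT"                // an OS signal arrives...
	fmt.Println(r.Stop(1 * time.Second)) // ...just as Shutdowner calls Stop
}
```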

This PR fixes this deadlock. The idea behind how it does it is based on
the observation that the broadcasting logic does not necessarily seem to
need to share a mutex with the rest of `signalReceivers`. Specifically,
it seems like we can separate protection around the registered `wait`
and `done` channels, `last`, and the rest of the fields, since the
references to those fields are easily isolated. To avoid
overcomplicating `signalReceivers` with multiple locks for different
uses, this PR creates a separate `broadcaster` type in charge of keeping
track of and broadcasting to `Wait` and `Done` channels. Most of the
implementation of `broadcaster` is simply moved over from
`signalReceivers`.

Having a separate broadcaster type seems actually quite natural, so I
opted for this to fix the deadlock. Absolutely open to feedback or
taking other routes if folks have thoughts.

Since broadcasting is protected separately, this deadlock no longer
happens: `relayer()` is free to finish its broadcast and then exit.
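
As a hypothetical, heavily simplified sketch of that shape (not the actual code from this PR, which also manages Wait channels and more), the broadcast state gets its own mutex so a broadcast never contends with the lock Stop holds while waiting for the relayer:

```
package main

import (
	"fmt"
	"os"
	"sync"
	"syscall"
)

// broadcaster owns only the broadcast state, behind its own mutex.
type broadcaster struct {
	mu   sync.Mutex
	done []chan os.Signal
	last *os.Signal
}

// Done registers a channel that receives future signals; the last signal
// (if any) is replayed so late registrations don't miss it.
func (b *broadcaster) Done() chan os.Signal {
	b.mu.Lock()
	defer b.mu.Unlock()
	ch := make(chan os.Signal, 1)
	if b.last != nil {
		ch <- *b.last
	}
	b.done = append(b.done, ch)
	return ch
}

// Broadcast fans a signal out to every registered channel without blocking.
func (b *broadcaster) Broadcast(sig os.Signal) {
	b.mu.Lock()
	defer b.mu.Unlock()
	b.last = &sig
	for _, ch := range b.done {
		select {
		case ch <- sig:
		default: // never block on a full or abandoned channel
		}
	}
}

func main() {
	b := &broadcaster{}
	done := b.Done()
	b.Broadcast(syscall.SIGINT)
	fmt.Println("received:", <-done)
}
```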

In addition to running the example provided in the original post to
verify, I added a test and ran it before/after this change.

Before:
```
$ go test -v -count=10 -run "TestSignal/stop_deadlock" .
=== RUN   TestSignal/stop_deadlock
    signal_test.go:141:
                Error Trace:
/home/user/go/src/github.com/uber-go/fx/signal_test.go:141
                Error:          Received unexpected error:
                                context deadline exceeded
                Test:           TestSignal/stop_deadlock
```
(the failure appeared roughly 1/3 of the time)

After:
```
$ go test -v -count=100 -run "TestSignal/stop_deadlock" .
--- PASS: TestSignal (0.00s)
    --- PASS: TestSignal/stop_deadlock (0.00s)
```
(no failures appeared)
@abhinav
Collaborator

abhinav commented Sep 3, 2024

@JacobOaks is this resolved by #1220?

@JacobOaks
Copy link
Contributor

Fixed by #1220.
