Deadlock in signalReceivers.Stop #1219

Closed
ogaca-dd opened this issue Jun 26, 2024 · 8 comments

@ogaca-dd

Describe the bug
There is a deadlock in signalReceivers.Stop.

To Reproduce
Run the following code:

package main

import (
	"syscall"

	"go.uber.org/fx"
)

func main() {
	fx.New(fx.Invoke(func(shutdowner fx.Shutdowner) {
		go func() {
			_ = syscall.Kill(syscall.Getpid(), syscall.SIGINT)
			_ = shutdowner.Shutdown()
		}()
	})).Run()
}

Expected behavior
No deadlock

@JacobOaks
Contributor

Hey @ogaca-dd thanks for the report, let me look into this.

Can I ask what kinds of situations you're experiencing where a shutdown signal and an OS signal are being sent at roughly the same time?

@ogaca-dd
Author

@JacobOaks
Thanks for your quick answer.

In the situation described above, I am experiencing a deadlock, meaning the application tries to stop but freezes forever.

@JacobOaks
Contributor

JacobOaks commented Jun 26, 2024

Hey @ogaca-dd - I understand. My question is why would an application call shutdowner.Shutdown() after it gets killed? The example you provide seems pretty contrived, so I'm just trying to better understand the real-world impact here.

@ogaca-dd
Author

ogaca-dd commented Jun 26, 2024

The following code is the simplest code I found to reproduce the issue.

package main

import (
	"syscall"

	"go.uber.org/fx"
)

func main() {
	fx.New(fx.Invoke(func(shutdowner fx.Shutdowner) {
		go func() {
			_ = syscall.Kill(syscall.Getpid(), syscall.SIGINT)
			_ = shutdowner.Shutdown()
		}()
	})).Run()
}

Our real code looks like this:

	sigChan := make(chan os.Signal, 1)
	signal.Notify(sigChan, syscall.SIGINT, syscall.SIGTERM, syscall.SIGPIPE)
	for signo := range sigChan {
		switch signo {
		case syscall.SIGINT, syscall.SIGTERM:
			log.Infof("Received signal %d (%v)", signo, signo)
			shutdowner.Shutdown()
			return
		}
		// Some code here
	}
}

I know that fx already handles signals and this code should be written differently (it was not updated during our migration to fx).

Anyway, the code I pointed to contains a deadlock that was tricky to find. In our case, the freeze happens randomly:

  • The frequency of the issue depends on the OS.
  • Adding debug logs significantly decreases how often the issue occurs, which makes it harder to catch.

I understand this may not be correct usage of fx, but as a user I don't expect my application to freeze randomly in such a case.

@JacobOaks
Contributor

Gotcha. Yeah, I was able to reproduce this locally and I will try to figure out the best way to update the signal handlers to prevent this, but yes - as you mentioned - using app.Run will cause Fx to respond to signals automatically. So if you want to work around this for now, you can either:

  • Not manually call shutdown when you receive a signal, or
  • Use app.Start and then call app.Stop when a signal comes into sigChan (see the sketch below). Using Start/Stop instead of Run in this way will no longer cause Fx to register its own signal handlers as of v1.22.1.
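
For reference, here is a rough sketch of that second workaround. This is not from the Fx docs: the empty fx.New options, the use of log.Fatal, and the choice of signals are placeholders you would adapt to your app.

```
package main

import (
	"context"
	"log"
	"os"
	"os/signal"
	"syscall"

	"go.uber.org/fx"
)

func main() {
	app := fx.New(
		// ... your providers and invokes ...
	)

	// Start the app ourselves instead of calling Run, so Fx does not
	// register its own signal handlers (as of v1.22.1).
	startCtx, startCancel := context.WithTimeout(context.Background(), app.StartTimeout())
	defer startCancel()
	if err := app.Start(startCtx); err != nil {
		log.Fatal(err)
	}

	// Own the signal handling: block until a signal arrives...
	sigChan := make(chan os.Signal, 1)
	signal.Notify(sigChan, syscall.SIGINT, syscall.SIGTERM)
	<-sigChan

	// ...then stop the app explicitly instead of calling shutdowner.Shutdown().
	stopCtx, stopCancel := context.WithTimeout(context.Background(), app.StopTimeout())
	defer stopCancel()
	if err := app.Stop(stopCtx); err != nil {
		log.Fatal(err)
	}
}
```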

@ogaca-dd
Author

Thanks for your quick reply and for the workarounds.

JacobOaks added a commit to JacobOaks/fx that referenced this issue Jun 26, 2024
JacobOaks added a commit that referenced this issue Jul 2, 2024
A user reported a possible deadlock within the signal receivers (#1219).

This happens as follows:
* `(*signalReceivers).Stop()` is called, by Shutdowner for instance.
* `(*signalReceivers).Stop()` [acquires the lock](https://github.com/uber-go/fx/blob/master/signal.go#L121).
* Separately, an OS signal is sent to the program.
* There is a chance that `relayer()` is still running at this point if `(*signalReceivers).Stop()` has not yet sent along the `shutdown` channel.
* The relayer [attempts to broadcast the signal](https://github.com/uber-go/fx/blob/master/signal.go#L93) received via the `signals` channel.
* `Broadcast()` blocks on [trying to acquire the lock](https://github.com/uber-go/fx/blob/master/signal.go#L178).
* `(*signalReceivers).Stop()` blocks on [waiting for the `relayer()` to finish](https://github.com/uber-go/fx/blob/master/signal.go#L132) by blocking on the `finished` channel.
* Deadlock.

Luckily, this is not a hard deadlock, as `Stop` will return if the
context times out, but we should still fix it.
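
To make the cycle concrete, here is a heavily simplified, hypothetical sketch of the pattern. The names (`receivers`, `relayer`, `Broadcast`, `Stop`) mirror the description above but this is not Fx's actual code, and like the real bug it only deadlocks on the unlucky interleaving, so it may take a few runs to hit.

```
package main

import (
	"errors"
	"fmt"
	"sync"
	"time"
)

// receivers is a stand-in for signalReceivers: one mutex guards everything,
// including the state Broadcast needs.
type receivers struct {
	mu       sync.Mutex
	signals  chan string
	shutdown chan struct{}
	finished chan struct{}
}

func (r *receivers) relayer() {
	defer func() { r.finished <- struct{}{} }()
	select {
	case <-r.shutdown:
		return
	case sig := <-r.signals:
		r.Broadcast(sig) // blocks here if Stop already holds r.mu
	}
}

func (r *receivers) Broadcast(sig string) {
	r.mu.Lock()
	defer r.mu.Unlock()
	fmt.Println("broadcast:", sig)
}

func (r *receivers) Stop(timeout time.Duration) error {
	r.mu.Lock() // held for the rest of Stop
	defer r.mu.Unlock()

	r.shutdown <- struct{}{} // relayer's select may pick the signal branch instead

	select {
	case <-r.finished:
		return nil
	case <-time.After(timeout): // the "soft" deadlock: only the timeout saves us
		return errors.New("timed out waiting for relayer")
	}
}

func main() {
	r := &receivers{
		signals:  make(chan string, 1),
		shutdown: make(chan struct{}, 1),
		finished: make(chan struct{}, 1),
	}
	go r.relayer()

	r.signals <- "SIGINT"                // an OS signal arrives...
	fmt.Println(r.Stop(1 * time.Second)) // ...just as Shutdowner calls Stop
}
```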

This PR fixes this deadlock. The idea behind how it does it is based on
the observation that the broadcasting logic does not necessarily seem to
need to share a mutex with the rest of `signalReceivers`. Specifically,
it seems like we can separate protection around the registered `wait`
and `done` channels, `last`, and the rest of the fields, since the
references to those fields are easily isolated. To avoid
overcomplicating `signalReceivers` with multiple locks for different
uses, this PR creates a separate `broadcaster` type in charge of keeping
track of and broadcasting to `Wait` and `Done` channels. Most of the
implementation of `broadcaster` is simply moved over from
`signalReceivers`.

Having a separate broadcaster type seems actually quite natural, so I
opted for this to fix the deadlock. Absolutely open to feedback or
taking other routes if folks have thoughts.

Since broadcasting is protected separately, this deadlock no longer
happens: `relayer()` is free to finish its broadcast and then exit.
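
As a hypothetical, heavily simplified sketch of that shape (not the actual code from this PR, which also manages Wait channels and more), the broadcast state gets its own mutex so a broadcast never contends with the lock Stop holds while waiting for the relayer:

```
package main

import (
	"fmt"
	"os"
	"sync"
	"syscall"
)

// broadcaster owns only the broadcast state, behind its own mutex.
type broadcaster struct {
	mu   sync.Mutex
	done []chan os.Signal
	last *os.Signal
}

// Done registers a channel that receives future signals; the last signal
// (if any) is replayed so late registrations don't miss it.
func (b *broadcaster) Done() chan os.Signal {
	b.mu.Lock()
	defer b.mu.Unlock()
	ch := make(chan os.Signal, 1)
	if b.last != nil {
		ch <- *b.last
	}
	b.done = append(b.done, ch)
	return ch
}

// Broadcast fans a signal out to every registered channel without blocking.
func (b *broadcaster) Broadcast(sig os.Signal) {
	b.mu.Lock()
	defer b.mu.Unlock()
	b.last = &sig
	for _, ch := range b.done {
		select {
		case ch <- sig:
		default: // never block on a full or abandoned channel
		}
	}
}

func main() {
	b := &broadcaster{}
	done := b.Done()
	b.Broadcast(syscall.SIGINT)
	fmt.Println("received:", <-done)
}
```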

In addition to running the example provided in the original post to
verify, I added a test and ran it before/after this change.

Before:
```
$ go test -v -count=10 -run "TestSignal/stop_deadlock" .
=== RUN   TestSignal/stop_deadlock
    signal_test.go:141:
                Error Trace:
/home/user/go/src/github.com/uber-go/fx/signal_test.go:141
                Error:          Received unexpected error:
                                context deadline exceeded
                Test:           TestSignal/stop_deadlock
```
(the failure appeared roughly 1/3 of the time)

After:
```
$ go test -v -count=100 -run "TestSignal/stop_deadlock" .
--- PASS: TestSignal (0.00s)
    --- PASS: TestSignal/stop_deadlock (0.00s)
```
(no failures appeared)
@abhinav
Collaborator

abhinav commented Sep 3, 2024

@JacobOaks is this resolved by #1220?

@JacobOaks
Copy link
Contributor

Fixed by #1220.
