-
Notifications
You must be signed in to change notification settings - Fork 144
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix #666 handle SIGTERM #712
base: rolling
Are you sure you want to change the base?
Conversation
3ee156e
to
708f098
Compare
I had to modify Maybe some sort of global scope I'd argue however that the changes to the tests would be justified even without the rest of this Pull Request. The current upstream behaviour is that |
@fujitatomoya has been a long time since I worked on this, some other regressions were introduced during the rebase also. Agree with your suggestions, but might rework the patches more in general over the next week, take a fresh look at it. |
@ciandonovan i believe your suggestion posted on here makes sense. |
@fujitatomoya @wjwwood @clalancette is there a case for catching and handling any signals other than SIGINT and SIGTERM? Currently only SIGINT is handled as no signal handler on the process level is ever installed for anything else, despite the current code assuming so. My pull request as it stands catches both SIGINT and SIGTERM, and forwards SIGINT to the child processes in both cases. I wrote it to handle any signals installed to the AsyncSafeSignalManager.handle(), but this flexibility might not be required? Also, it doesn't forward any signal if launch is in "noninteractive" mode, which shouldn't be the case for SIGTERM at least as it has different semantics to SIGINT (which is commonly sent to the entire foreground process group on Ctrl-C from the terminal). If it is the case that only those two signals are of interest, I'll adjust to handle both accordingly - both forwarded when nonintereactive, and only SIGTERM forwarded when interactive. The ROS 2 Client Libraries already handle both signals gracefully so shouldn't introduce any unexpected issues by forwarding SIGTERM. |
By default, and only in certain circumstances, SIGINT is the only signal with a handler installed. All others will perform their function defined by SIG_DFL, which is to ignore is some cases, and to terminate in others. In either case, Python will never see them and be unable to pass them to the async handler. Signed-off-by: <[email protected]> Signed-off-by: Cian Donovan <[email protected]>
SIGQUIT signals are often generated by the user hitting 'CTRL+\' in an interactive shell. However, when the terminal driver receives those characters, it doesn't just send a SIGQUIT to the running process, but to the entire foreground process group. By default, the ROS2 Client Libraries don't register a handler for SIGQUIT, as they do for SIGINT and SIGTERM, so the ROS2 node will be terminated immediately and its core dumpted on receipt. So there is little reason for ROS2 Launch to catch this signal for the purposes of forwarding it on to its children. Also, not catching SIGQUIT means that if there is a problem with the Python signal handling, then the user still has available a shortcut on their keyboards to unilaterally kill the process. Signed-off-by: <[email protected]> Signed-off-by: Cian Donovan <[email protected]>
Allows arbitrary signal handlers to easily be added in future without writing new functions, DRY. Also allows at least one signal of each type, printing a warning when more than one is received. This is to allow in future each signal type to be passed to the children of launch, so a user can manually escalate the kill signals. Signed-off-by: Cian Donovan <[email protected]>
Signed-off-by: Cian Donovan <[email protected]>
…t_int_handler to match how signals are now handled in Launch Signed-off-by: Cian Donovan <[email protected]>
@ciandonovan i am not sure what signals |
send_sigint? True : False +----------------+--------+---------+ | | SIGINT | SIGTERM | +----------------+--------+---------+ | interactive | False | True | | noninteractive | True | True | +----------------+--------+---------+ Launch either detects, through checking if there's a controlling terminal, or the value of the flag `--noninteractive`, to determine if launch was launched interactively or not. If it was launched interactively, then the receipt of SIGINT is implicitly assumed to have come from the user pressing Ctrl-C and their terminal sending a SIGINT to the entire foreground process group. In such case, it is redundant to also send SIGINT from launch to the child ROS nodes, however it should be noted that even if another signal was sent there would be no ill-effects. In all other cases, assume the child nodes were not themselves signalled, so manually send SIGINT, and later escalate to SIGTERM after timeout, to the nodes. Signed-off-by: Cian Donovan <[email protected]>
Launch either detects, through checking if there's a controlling terminal, or the value of the flag If it was launched interactively, then the receipt of SIGINT is implicitly assumed to have come from the user pressing Ctrl-C In all other cases, assume the child nodes were not themselves signalled, so manually send SIGINT, and later escalate to SIGTERM after timeout, to the nodes. |
@fujitatomoya from reading the Linux signal man page it's hard to make a case for handling any signals other than SIGINT and SIGTERM. The signals SIGABRT, SIGALRM, SIGCHLD, SIGIO, SIGLOST, SIGPIPE, SIGPOLL, SIGURG, SIGVTALRM, are raised in response to actions taken by the program itself by design. I had considered SIGHUP, but then again if a user explicitly wants their remote processes to continue after a terminal disconnect, then that's already possible with the well-known Further, only SIGINT and SIGTERM are currently handled by the ROS 2 Client Libraries used by the nodes, so forwarding arbitrary other signals to them would just cause them to immediately terminate anyway, which is not desired. I think there could be a bigger conversation around changing some of the existing semantics of Launch, in particular I don't think the current behavior of sending a SIGINT, then a SIGTERM after timeout makes much sense. The POSIX standard doesn't specify any order of precedence between them, and rcl handles them both identically. There is no escalation in practice, nor do the signals' semantics indicate so either. There's also code for handling SIGKILL which just isn't possible as that's an uncatchable signal. However, that should probably be addressed in a separate pull request. As it stands, this pull request is essentially a bugfix. Its behavior is now as it seems it was designed for from reading the docs, the code's comments, and its structure. SIGTERM was never actually handled before at all, as no handler was installed, and now it works quite nicely both being run as a systemd service and in a Podman/Docker container. Also works nice as a Podman container running inside a systemd service ;) |
Set is unordered and nonduplicative, better fits storing whether at least one of each unique signal has been received. Resolve `signum` into the signal's name as a string once instead of twice. Cleanup `due_to_sigint` ternary assignment. Signed-off-by: Cian Donovan <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@ciandonovan thanks for cleaning after me :)
About signal escalation vs. signal forwarding. launch
was designed to work with non-ROS aware executables too. SIGINT and SIGTERM handling isn't equivalent in general. It does make sense, however, to bypass some of the signal escalation stages based on the signal that triggered it. Essentially skip the SIGINT stage on SIGTERM induced shutdowns, as you suggest somewhere above. HTH.
@@ -235,6 +235,7 @@ def handle( | |||
:return: previous handler if any, otherwise None | |||
""" | |||
signum = signal.Signals(signum) | |||
signal.signal(signum, signal.default_int_handler) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@ciandonovan thanks for catching this! I dropped the ball here, my bad.
That said, I don't think this is the complete fix. We should only override default signal handlers (chaining non-default ones), and we should restore whatever was in place on AsyncSafeSignalManager
exit.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In this case I'm not overriding any default handlers, as SIGINT is already handled by signal.default_int_handler
, and SIGTERM isn't handled at all, so I'm just setting it to be handled by the same function.
Ideally both would be handled by some global _noop function that just passes, letting signal.set_wakeup_fd
pick it up and do its job. Not sure the best way to implement that though. There was a _noop function in test_signal_management.py
previously, but the assert fails if a signal handler was installed for SIGTERM as it's pointed to a different handler address.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In this case I'm not overriding any default handlers, as SIGINT is already handled by
signal.default_int_handler
, and SIGTERM isn't handled at all, so I'm just setting it to be handled by the same function.
Right, but we don't really know which signal number we will be required to handle, and even if we did (or assumed so or constrained it to be such and such), there is also the possibility that some other library, used by launch
or using launch
, may have set their own as well.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So what should we do here? Is it ok to just override any existing signal handler with when this is called? How does this interact with signal.set_wakeup_fd
? What if someone sets their own signal handler after initializing launch, will our handling still function?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The only existing signal handler at this point is the one the Python binary sets itself for SIGINT to generate the KeyboardInterrupt
exception, which I don't change as I point all handlers towards that.
The only other possibility is that someone could have created a handler within their launchfile? Not sure if that's even possible though, and would be terrible practice anyway.
Signal handlers they install in their nodes are of course separate, as signals are a per-process property, and launch is but the parent of the node children.
In terms of signal.set_wakeup_fd
, nothing extra needs to be done. Installing the signal handlers here simply prevents the kernel from immediately terminating the process as is the common default disposition, and allows Python to then forward the signal through the fd.
return func(*args, **kwargs) | ||
except KeyboardInterrupt: | ||
pass |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@ciandonovan hmm, this will skip the entire test with a green check. The asyncio
loop won't deliver any handlers if it is interrupted.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you clarify why, or what it's interrupting? My understanding is that the assert
in the finally
section will always get run no matter what exceptions occur so long as they're handled.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right, those assert
will be executed (and likely succeed), but all the assert
in the test_async_safe_signal_manager
body after the first signal is raised won't because a KeyboardInterrupt
exception will have been raised during asyncio.run_until_complete
execution.
self.__logger.warning(base_msg) | ||
ret = self._shutdown( | ||
reason='ctrl-c (SIGINT)', due_to_sigint=True, force_sync=True | ||
reason=base_msg, due_to_sigint=due_to_sigint, force_sync=True |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@ciandonovan meta: sending a shutdown event is not the same as pulling the brakes on the event loop, but I can see how the previous implementation can cause problems when managing launch
itself. Most process managers use SIGTERM. I guess you could argue that SIGKILL escalation is appropriate if a shutdown isn't enough.
We've been discussing this pr and the ros2run signal handling pr (ros2/ros2cli#899) in our Jazzy release channels and I think the plan is to make this a bugfix shortly after the release, and after it has some time to soak in Rolling. So we'll try to get these merged, and then backport after the release as a bugfix. |
Fixes #666
SIGTERM
Instead of installing signal handlers with the OS directly, ROS2 Launch uses Python's
signal.set_wakeup_fd
to queue and forward signals to handlers registered from the async event loop.Although one of these handlers has been set for SIGTERM, it never gets called since when that signal gets delivered, the kernel terminates the process immediately. The process never changes the default disposition of that signal. Delivery of SIGINT however, doesn't cause the kernel to terminate the process, and instead that signal get dutifully forwarded to the appropriate handler. This is despite there being no difference in the ROS2 Launch code in setting them both up.
After running an strace on a one-line Python program and seeing that a handler was still automatically installed for SIGINT, I came across this . Depending on whether a Python script is run in some combination of foreground process and in an interactive shell, a handler will be installed for SIGINT for use in the
KeyboardInterrupt
exception. It just so happens that it also gets forwarded bysignal.set_wakeup_fd
, and since the handler prevents the program from being automatically terminated, it can actually do so.This is why the handler for SIGINT by chance works, but the one for SIGTERM does not.
The fix for this was to simply install the default interrupt signal handler on any signal the async event loop requested handled, so that they would no longer cause the program to terminate and instead be forwarded to
signal.set_wakeup_fd
.SIGQUIT
SIGQUIT signals are often generated by the user hitting 'CTRL+' in an interactive shell. However, when the terminal driver receives those characters, it doesn't just send a SIGQUIT to the running process, but to the entire foreground process group.
By default, the ROS2 Client Libraries don't register a handler for SIGQUIT, as they do for SIGINT and SIGTERM, so the ROS2 node will be terminated immediately and its core dumpted on receipt. So there is little reason for ROS2 Launch to catch this signal or the purposes of forwarding it on to its children.
Also, not catching SIGQUIT means that if there is a problem with the Python signal handling, then the user still has available a shortcut on their keyboards to unilaterally kill the process.
The signal handler for SIGQUIT was removed.
Future work
This Pull Request allows ROS2 Launch to handle SIGTERMs and initiate shutdown, but sends SIGINTs instead of SIGTERMs to the child processes. This will usually be fine since the ROS2 Client Libraries install handlers for both SIGINT and SIGTERM by default, and is no worse than the current situation, but it would be more flexible to allow users to send whatever signals they want to their nodes.
To do this would require removing the
due_to_sigint
andsend_sigint
arguments of many functions and replacing them with an argument passing the name of the signal to be forwarded. Care would need to be taken of other signals besides SIGINT such as SIGQUIT and SIGTSTP that also tend to get sent to the entire process group to avoid duplicates. Currently if the child node doesn't shutdown within 5 seconds an "escalated" SIGTERM is sent - but if SIGTERM was the initial signal then should it progress straight to SIGKILL? Or wait 10 seconds?If others feel this is worth developing let me know and I'll work on another PR, but I feel this is in good shape now and will already be of benefit to others upstream.
@wjwwood - mentioned in the relevant ToDo for this feature