Bug 1965992: Gracefully shutdown taking around 6-7 mins (libvirt provider) #2631
@jkyros: This pull request references Bugzilla bug 1965992, which is invalid:
/bugzilla refresh
@jkyros: This pull request references Bugzilla bug 1965992, which is valid. The bug has been moved to the POST state. The bug has been updated to refer to the pull request using the external bug tracker. 3 validation(s) were run on this bug
/retest
pkg/daemon/update.go (outdated)
@@ -155,6 +155,7 @@ func (dn *Daemon) performPostConfigChangeAction(postConfigChangeActions []string
	}

	// currentConfig != desiredConfig, kick off an update
	// TODO: this is recursive back to update(), and now that we do rebootless updates we need to think about this
This function usually gets called when something changed in the meantime and led to a new config being generated, for example a new MachineConfig was applied while an update was already in progress. This makes sure that we always update the node to the desired config.
In the regular flow, the following will be called:
if inDesiredConfig {
	return nil
}
Not sure if we really need this comment being added.
would prefer card over in-code to-dos, agree with sinny to remove
Thanks, todo has been removed.
Overall LGTM
The machine-config-daemon gets stuck blocking SIGTERM on rebootless updates because it only removes its SIGTERM protection when it reboots or when it encounters an error in the triggerUpdateWithmachineConfig -> update -> performPostConfigChangeAction cycle. This changes the behavior so that it removes the protection on a successful rebootless update, and adds some logging messages so it's clearer when it starts and stops protecting itself.
2f00d1e to 7b720c3
/lgtm
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: jkyros, sinnykumari, yuqi-zhang.
/retest Please review the full test history for this PR and help us cut down flakes.
@jkyros: The following tests failed, say
@jkyros: All pull requests linked via external trackers have merged. Bugzilla bug 1965992 has been moved to the MODIFIED state.
/cherry-pick release-4.8
@praveenkumar: new pull request created: #2636
/cherry-pick release-4.7
@jkyros: new pull request created: #2727
The machine config daemon's SIGTERM protection was not being removed on
rebootless updates. The existing logic made sense before rebootless updates
were a thing, but now if an update happens and we don't reboot, the MCD
protects itself from SIGTERM forever.
Also, the sequence of functions triggerUpdateWithmachineConfig ->
update -> performPostConfigChangeAction is recursive, so if we put the
work from #2395 back in to solve this, or try to use the mutex "properly" we'll
potentially deadlock on ourselves under the right conditions.
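
To illustrate the deadlock concern, here is a minimal sketch, not the MCD's actual code. The `updater` type, its `pending` counter, and the `update` method are hypothetical names invented for this example. Go's `sync.Mutex` is not reentrant, so an update that holds the lock and then recurses into itself cannot take the lock again. The sketch uses `TryLock` (Go 1.18+) in place of `Lock` so that it reports the conflict rather than hanging.

```go
package main

import (
	"fmt"
	"sync"
)

// updater is a hypothetical stand-in for the daemon. None of these names
// come from the machine-config-operator source.
type updater struct {
	mu      sync.Mutex
	pending int // number of newer configs that show up mid-update
}

// update holds mu for the whole update and re-enters itself when another
// config is pending, mirroring the recursive call chain described above.
func (u *updater) update(depth int) error {
	// TryLock stands in for Lock so the sketch can report the problem.
	// With a plain Lock, the nested call would block forever.
	if !u.mu.TryLock() {
		return fmt.Errorf("depth %d: lock already held by an outer update", depth)
	}
	defer u.mu.Unlock()

	if u.pending > 0 {
		u.pending--
		return u.update(depth + 1) // recursion while still holding mu
	}
	return nil
}

func main() {
	quiet := &updater{}
	fmt.Println("no pending config:", quiet.update(0))

	busy := &updater{pending: 1}
	fmt.Println("config changed mid-update:", busy.update(0))
}
```

The first call succeeds because nothing recurses; the second fails at depth 1 because the outer call still owns the mutex.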
This PR removes an if condition so the SIGTERM protection is removed
on a successful rebootless update and also adds some logging messages to make
it more apparent when the protection is being added/removed.
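
The change can be modeled roughly as follows. This is a simplified sketch under stated assumptions, not the code in pkg/daemon/update.go: `termGuard`, `runUpdate`, the `releaseOnRebootless` switch, and the log strings are all hypothetical, and no real signal handling is wired up. The switch exists only so the old and new behavior can be compared side by side.

```go
package main

import (
	"fmt"
	"sync"
)

// termGuard is a hypothetical model of the daemon's SIGTERM protection flag.
type termGuard struct {
	mu     sync.Mutex
	active bool
}

func (g *termGuard) protect() {
	g.mu.Lock()
	defer g.mu.Unlock()
	g.active = true
	fmt.Println("adding SIGTERM protection")
}

func (g *termGuard) release() {
	g.mu.Lock()
	defer g.mu.Unlock()
	g.active = false
	fmt.Println("removing SIGTERM protection")
}

// honorsTerm reports whether a SIGTERM arriving now would be acted on.
func (g *termGuard) honorsTerm() bool {
	g.mu.Lock()
	defer g.mu.Unlock()
	return !g.active
}

// runUpdate models one update round. With releaseOnRebootless set to false
// it imitates the old condition (release only on error or reboot); with true
// it imitates the behavior after the condition is removed.
func runUpdate(g *termGuard, apply func() error, needsReboot, releaseOnRebootless bool) error {
	g.protect()
	err := apply()
	if err != nil || needsReboot || releaseOnRebootless {
		g.release()
	}
	return err
}

func main() {
	ok := func() error { return nil }

	old := &termGuard{}
	_ = runUpdate(old, ok, false, false)
	fmt.Println("old behavior honors SIGTERM after rebootless update:", old.honorsTerm())

	fixed := &termGuard{}
	_ = runUpdate(fixed, ok, false, true)
	fmt.Println("fixed behavior honors SIGTERM after rebootless update:", fixed.honorsTerm())
}
```

In the old branch the guard stays active after a successful rebootless update, which matches the "protects itself from SIGTERM forever" symptom; in the fixed branch it is released.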
Long term we should figure out the desired behavior and "proper" way to
organize this (maybe flatten it into a loop and get rid of the recursion, maybe
decide to just stop after one update round without recursing, etc.), but for right
now we at least need to fix the SIGTERM handler because we're negatively
impacting upgrades and rebootless changes.
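
As one possible shape for the "flatten it into a loop" idea, the sketch below replaces the recursion with a bounded loop. It is only an illustration of that option, not a proposal taken from the PR: `syncToDesired`, its callback parameters, the `maxRounds` bound, and the `rendered-N` config names are all assumptions made for the example.

```go
package main

import "fmt"

// syncToDesired is a hypothetical loop-based replacement for the recursive
// update chain. desired is a callback because the desired config can change
// while an update is in progress; apply performs one update round.
func syncToDesired(current string, desired func() string, apply func(target string) error, maxRounds int) (string, error) {
	for round := 0; ; round++ {
		target := desired()
		if current == target {
			return current, nil // already in the desired config
		}
		if round == maxRounds {
			return current, fmt.Errorf("not converged after %d rounds", maxRounds)
		}
		if err := apply(target); err != nil {
			return current, err
		}
		current = target
	}
}

func main() {
	// The desired config moves once while the first update is running.
	targets := []string{"rendered-2", "rendered-3"}
	calls := 0
	desired := func() string {
		i := calls
		if i >= len(targets) {
			i = len(targets) - 1
		}
		calls++
		return targets[i]
	}
	applied := 0
	apply := func(target string) error {
		applied++
		fmt.Println("applying", target)
		return nil
	}

	final, err := syncToDesired("rendered-1", desired, apply, 5)
	fmt.Println("final:", final, "rounds applied:", applied, "err:", err)
}
```

Because nothing recurses, a lock or a SIGTERM guard could be taken once around the whole loop and released on every exit path.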
Fixes: https://bugzilla.redhat.com/show_bug.cgi?id=1965992
Fixes some cases of: https://bugzilla.redhat.com/show_bug.cgi?id=1927041