Nodes 'Marking Degraded' due to old/absent machineconfig #2010
Comments
EDIT: nevermind, moving the "day 1" version to #2114
There's not an easy way to recover from this right now, unfortunately. Basically, you shouldn't currently ever try to delete a rendered machineconfig. In order to roll back, the correct thing is to delete the custom machineconfig you injected - the MCO will retarget the pool to the previous configuration.
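As a rough sketch of that rollback (the MachineConfig name below is a placeholder for whatever custom config you created):

```
# List the MachineConfigs; the custom one is whatever you created by hand,
# as opposed to the controller-generated rendered-* objects.
oc get machineconfig

# Deleting only the custom (non-rendered) MachineConfig makes the controller
# re-render the pool and retarget it at the previous rendered configuration.
oc delete machineconfig 99-master-custom-file

# Watch the pool converge back to the previous rendered config.
oc get machineconfigpool master -w
```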
To add a couple of things:
Deleting a rendered-machineconfig shouldn't really be necessary to fix a problem. If a bad rendered config gets generated, there are ways to recover the cluster, and the bad config will never be used again. If you really want to delete it, you must make sure nothing is referencing it first. You can triple-check that no node or pool still points at it, as in the sketch below:
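A minimal sketch of that check, assuming the `master` pool from this issue; the rendered config is only safe to delete if its name appears in none of this output:

```
# Rendered config the pool currently targets (spec) and reports as applied (status).
oc get machineconfigpool master -o jsonpath='{.spec.configuration.name}{"\n"}{.status.configuration.name}{"\n"}'

# Rendered configs the nodes still reference via the MCD-managed annotations.
oc get nodes -o yaml | grep -E 'machineconfiguration.openshift.io/(currentConfig|desiredConfig)'
```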
We don't have garbage collection today, but maybe we will add it at some point. For now it's fine, because the rendered config shouldn't be used again after you've deleted the bad machineconfig that generated it.
The steps there might work, but it's somewhat case-dependent. The MCD should tell you what the current error is when it comes to a rendered-MC issue like this. Recovery should be possible with a mix of: manual node annotation editing, forcing an update by skipping validation, and removing references to the bad config on the node, the pool, and the journal (most likely by flushing it, if there's a pending state). I'd advise caution unless you know for sure what the issue is and how to recover.
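To make that concrete, a cautious sketch of what such a recovery can look like; the node and MachineConfig names are placeholders, and the exact steps depend on the failure:

```
# 1. Delete the custom MachineConfig that produced the bad rendered config;
#    the controller re-renders and retargets the pool.
oc delete machineconfig <your-bad-custom-mc>

# 2. If a node is pinned to the bad rendered config, edit its
#    machineconfiguration.openshift.io/currentConfig and /desiredConfig
#    annotations back to a rendered config that still exists.
oc edit node <stuck-node>

# 3. If the MCD still refuses to sync because on-disk validation fails, force it
#    to skip validation on the next pass (only if you understand the delta involved).
oc debug node/<stuck-node> -- chroot /host touch /run/machine-config-daemon-force
```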
If the Machine Config Operator is unable to detect that a rendered config has been removed, regardless of the reason, and regenerate it, then that would be a bug. All operators should be in a constant reconciliation loop and willing to take as much action as necessary to realize the specified config. This clearly looks like a bug in the MCC. Am I overlooking something?
The MCC would not necessarily have the insight to regenerate a rendered-MC. It's there to render the complete state of the current set of MCs and generate a rendered-config based on that. I'm pretty sure if you delete the LATEST rendered config it would attempt to regenerate it when it realizes that there was a change to the MC (I'll have to double check this). Bugs like this happen when a user creates a bad MC (which generates a bad rendered-MC), deletes that bad MC, and then deletes that bad rendered-MC before the MCO can properly consolidate. The MCO can no longer generate that rendered MC because it no longer has the bad MC to generate it from. However, a node may still be referencing it via its currentConfig/desiredConfig annotations.
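For illustration, assuming the `master` pool: the controller builds the rendered config from the pool's source MachineConfigs, so you can see both sides of the problem like this (the node name is a placeholder):

```
# Source MachineConfigs the controller merged into the pool's current rendered
# config; once one of these is deleted, that exact rendered config can no longer
# be regenerated from scratch.
oc get machineconfigpool master -o jsonpath='{range .spec.configuration.source[*]}{.name}{"\n"}{end}'

# Meanwhile a node may still be pinned to an old rendered config via its annotations.
oc get node <node-name> -o yaml | grep -E 'currentConfig|desiredConfig'
```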
We could install a finalizer to prevent deletion of any rendered configs which are referenced by a node object.
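Purely to illustrate that idea (the finalizer name here is invented; the MCO does not set one today): a finalizer on a rendered config would make a delete pend until a controller confirmed no node still referenced it.

```
# Hypothetical: protect an in-use rendered config from deletion with a finalizer.
oc patch machineconfig rendered-master-b9b68ece3045f2312d3d8c77bd520822 --type=merge \
  -p '{"metadata":{"finalizers":["machineconfiguration.openshift.io/in-use"]}}'

# With the finalizer present, a delete only marks the object for deletion; it would
# remain until a controller removed the finalizer once no node referenced the config.
oc delete machineconfig rendered-master-b9b68ece3045f2312d3d8c77bd520822
```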
@yuqi-zhang ah! Yes, I was overlooking that case. Thanks for the explanation. I like @cgwalters' suggestion to use finalizers.
Issues go stale after 90d of inactivity. Mark the issue as fresh by commenting /remove-lifecycle stale. If this issue is safe to close now please do so with /close. /lifecycle stale
Stale issues rot after 30d of inactivity. Mark the issue as fresh by commenting /remove-lifecycle rotten. If this issue is safe to close now please do so with /close. /lifecycle rotten
/remove-lifecycle rotten
Still relevant in the future (probably as an overall rework of some sort eventually)
Issues go stale after 90d of inactivity. Mark the issue as fresh by commenting /remove-lifecycle stale. If this issue is safe to close now please do so with /close. /lifecycle stale
Stale issues rot after 30d of inactivity. Mark the issue as fresh by commenting /remove-lifecycle rotten. If this issue is safe to close now please do so with /close. /lifecycle rotten
Rotten issues close after 30d of inactivity. Reopen the issue by commenting /reopen. /close
@openshift-bot: Closing this issue. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
Description
I'm seeing this error in the machine-config-daemon of a certain node. The node is unable to apply any new machineconfig and keeps on reverting to this state:
The referenced missing machineconfig (`rendered-master-b9b68ece3045f2312d3d8c77bd520822`) does not exist anymore in the cluster (we deleted it, trying to solve another issue).

Steps to reproduce the issue:
I think:
Describe the results you received:
Nodes are unable to fetch new machineconfig. Each action taken results in the same state mentioned above.
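For reference, a sketch of how to inspect that stuck state (node and pod names are placeholders):

```
# The MCD-managed annotations record the Degraded state and, in the reason
# annotation, usually name the rendered config the daemon cannot find.
oc get node <node-name> -o yaml | grep -E 'machineconfiguration.openshift.io/(state|reason)'

# The machine-config-daemon pod for that node logs the same error on every sync.
oc -n openshift-machine-config-operator get pods -o wide | grep <node-name>
oc -n openshift-machine-config-operator logs <mcd-pod-name> -c machine-config-daemon
```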
Describe the results you expected:
Somehow a way to "remove" the faulty machineconfig (`rendered-master-b9b68ece3045f2312d3d8c77bd520822`) from the cache or wherever it is stored (etcd)?

Additional information you deem important (e.g. issue happens only occasionally):
Yesterday we encountered the same issue (again, by trying to write a file to a readonly filesystem). We were able to recover by fixing the machineconfig and deleting the old (faulty) one.
Output of `oc adm release info --commits | grep machine-config-operator`:

Additional environment details (platform, options, etc.):