
MCO-507: admin defined node disruption policy enhancement #1525

Merged

Conversation

yuqi-zhang
Contributor

@yuqi-zhang yuqi-zhang commented Dec 6, 2023

Enhancement for users to specify how MachineConfig updates should be acted upon.

See: https://issues.redhat.com/browse/OCPSTRAT-380

@openshift-ci-robot openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label Dec 6, 2023
@openshift-ci-robot

openshift-ci-robot commented Dec 6, 2023

@yuqi-zhang: This pull request references MCO-507 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the epic to target either version "4.15." or "openshift-4.15.", but it targets "openshift-4.16" instead.

In response to this:

Draft enhancement for users to specify how MachineConfig updates should be acted upon.

See: https://issues.redhat.com/browse/OCPSTRAT-380

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@openshift-ci openshift-ci bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Dec 6, 2023
Contributor

openshift-ci bot commented Dec 6, 2023

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

Comment on lines 130 to 146
The MachineConfigController’s drain controller will process any update to the object and validate the RebootPolicy for duplication, action validity (but not correctness), etc. If the object is valid, the drain controller will update the object’s status and observedGeneration to indicate success, and if not, update those and degrade the controller, example:

```
status:
  conditions:
  - lastTransitionTime: "xxx"
    message: ""
    status: "False"
    type: RebootPolicyValidationFailing
  - lastTransitionTime: "xxx"
    message: "Validation of the rebootpolicy has succeeded."
    status: "True"
    type: RebootPolicyValidated
  observedGeneration: x
```

The MachineConfigDaemon will process and apply the current RebootPolicy if it’s valid, indicated by generation == observedGeneration (MCC has processed the latest spec) and RebootPolicyValidated == true/RebootPolicyValidationFailing == false. If not, the daemon will also degrade (and queue a re-sync) waiting for the controller to validate the object.
Contributor

I like the idea of degrading the operator via conditions as this follows our existing pattern, but just to be super clear, are we talking about degrading the operator via new condition types on an object(say the RebootPolicy in this case), or a new condition type on the MCC itself? I don't think we do degrades via the MCC directly like that, but please correct me if I'm wrong!

You also mention degrading the daemon too, which makes sense. I assume both the controller and the daemon degrades will be simultaneous(barring sync delays), or would there be cases where each could happen independent of the other?

Contributor Author

So in this original model, I was thinking that the daemon would only error if it had an ongoing upgrade while the RebootPolicy was invalid. If it's not currently being used, the daemon wouldn't double-report the error. In the case where both are erroring, it would be like:

  1. controller is failing on validation of RebootPolicy
  2. daemon is failing on waiting for validation of RebootPolicy

Hopefully that's not too confusing. As for the controller error, I was thinking of just erroring out the sync loop as a start. But maybe having the operator object itself sync the above RebootPolicyValidationFailing would be better, now that I think about it.
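
For illustration, the failing counterpart to the success example quoted above might look something like this (a sketch only; the condition names follow the draft text and the message is made up):

```
status:
  conditions:
  - lastTransitionTime: "xxx"
    message: "rebootpolicy contains two conflicting entries for /etc/my-conf"
    status: "True"
    type: RebootPolicyValidationFailing
  - lastTransitionTime: "xxx"
    message: ""
    status: "False"
    type: RebootPolicyValidated
  observedGeneration: x
```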


The special MCO cert paths ("/etc/kubernetes/kubelet-ca.crt", "/etc/kubernetes/static-pod-resources/configmaps/cloud-config/ca-bundle.pem") are not listed in the clusterDefaultPolicies as they are no longer MachineConfig changes

In terms of the policy itself, in case a change has multiple matching policies (e.g. two files changed, both with a different policy), all policies will be applied in order (e.g. reload multiple services). If any change in the list does not match any defined policy, the MCO will default to rebooting and skip over all the policies.
Contributor

This is probably out of my wheelhouse but I'm curious - are actions going to be calculated and accumulated, and then applied at the end? Or would each action take place in the order it came (regardless of the type of action)? Is there a disadvantage to doing one over the other? I know you mentioned checking for duplicates earlier, but I just want to clarify whether that applies to actions or not.

For example, consider these actions from two reboot policies (not applied at the same time):

  • Policy 1
    reload crio
    reload crio
    reboot

  • Policy 2
    reload crio
    reboot

Would policy 1 cause two crio reloads? Or would the actions taken from both of them be the same (1 crio reload, 1 reboot)?

Contributor Author

Great question. I was hoping we could get away without any merge logic and only validate that you don't have 2 entries for the same object (e.g. one file defined twice, once as a no-action and once as a reload, which is self-contradicting). Then afterwards we just apply them in order from the daemon without any additional logic.

Is there a disadvantage to doing one over the other?

Other than doing unnecessary actions (reloading the same service multiple times, for example), there shouldn't be any real downsides. I guess there are potential scenarios where you need to reload services in a specific order, but I think we can preserve the definition ordering, so that shouldn't be a big issue.
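
To make that validation rule concrete, here is a sketch of the kind of self-contradicting definition that would be rejected (using the illustrative type/value/actions shape from the draft snippets above, not a final API):

```
policies:
- type: file
  value: /etc/my-conf
  actions:
  - type: none        # says: do nothing when this file changes
- type: file
  value: /etc/my-conf
  actions:
  - type: reload      # contradicts the entry above, so validation fails
    target: myservice
```

Two entries for different files, on the other hand, would simply be applied in the order they are defined.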

Comment on lines 58 to 59
apiVersion: machineconfiguration.openshift.io/v1
kind: RebootPolicy
Member

So this enhancement doesn't yet mention the bootc configmap approach (I know we aren't using it yet; it doesn't really exist).

But anyways, in there I was straw-manning that we have an annotation in the configmap which specifies the "update policy" for it (apply live vs reboot). There are some ugly bits around whether we need "apply live and execute this arbitrary command" or "apply live and reload these systemd units" that need to be defined.

But anyways to me, associating the update policy with the files makes a lot more sense than having another global list.

Now to implement this today we'd need to add a "reboot policy" section into the rendered config. But I think it would be a lot nicer in the future to get rid of the idea of the big "rendered config" and have the MCD fetch individual MachineConfig objects (like how kubelet fetches individual configmaps, there's no "rendered configmap!").

Doing that gets into versioning/race condition questions around "what happens when an input MC changes while we're doing a rollout" that I think are solvable, but do need some thought.

Now one thing @yuqi-zhang brought up when we were chatting about this is the overall question of "how would this work for files in the base image".

My offhand thoughts on that are that we may want to go in a direction of having "static configmaps" that we can inject into a container image in some way, similar to the arguments around containers/bootc#128

(These things connect because configmaps may specify quadlet units)

Note that bootc would then be the thing that executes on applying these.

Anyways... I'm not totally sure we really need to immediately handle policies for files embedded in the base image.

It feels like we'd get most of the way if we split out the things that need to be updated live outside of the base image.

Contributor Author

Great points 👍 , a few thoughts around this:

  1. The global approach can, in a sense, help us merge the paths of MachineConfig-defined and Configmap-defined objects. So if you define a rebootpolicy, it can apply to both update paths, and potentially, eventually, to container image diffs. So there is a singular API and the MCO can handle the specifics when doing the actual content update. I guess the downside would be that if bootc configmaps allow for that, we'd have multiple locations where this can be defined, but that can also be an MCO implementation detail and this should give us enough flexibility.
  2. The timeline for this is going to be 4.16, which will likely predate any full implementation of bootc live-applies and configmaps going into the MCO, since there is some urgency around this; otherwise I'd love to wait and see what we finalize on in terms of bootc-based updates.
  3. The Configmap approach doesn't map as well to MachineConfigs (at least not the annotations) since a lot of MachineConfigs are a giant bundle of different files/units, due to our existing architecture, and not one-file-per-machineconfig
  4. As a stopgap, I think the idea of extending the MachineConfig API instead of a new API object is an alternative approach we can pursue. This may actually be easier to conceptualise for some users, since I think most users think of "changing a machineconfig" as the catalyst to an update, and not necessarily "a new rendered config for the system was created and we're doing a diff update". I did consider this approach and originally rejected it because most of the "no-reboot updates" that are asked about are certificates, registry configuration, etc., which all come in the cluster-supplied 00-worker and 00-master MCs, which you can't edit. And you definitely should not be providing your own machineconfigs for those files.
  5. So then if we do opt for this instead I guess the user would end up defining a RebootPolicy as a separate MachineConfig still (for example, for image registry mirror changes, they would define a machineconfig that just has a "no-action" field). The downside would be... that it's a bit confusing for users to write a MachineConfig specifically to define the "how does my file reboot", but I don't see a great alternative. The upside is that it can be separated from the configmaps in the future and both would work separately.

I'm still on the fence about this. Do you think having the MCO write RebootPolicies into bootc configmaps instead eventually would work as well?

Member

I did consider this approach and originally rejected it because most of the "no-reboot updates" that are asked about are certificates, or registry configuration etc. which comes all in the cluster-supplied 00-worker and 00-master MCs, which you can't edit.

But we can easily split out those files into a distinct e.g. 01-worker-configs MC today and aim for stuff that changes in -configs to have distinct "apply" policies.

actions:
- type: restart
  target: myservice
clusterDefaultPolicies:
Member

I don't get why we'd mix "cluster" and "user" in one single CRD. Why not have RebootPolicy have multiple instances (like machineconfig and tons of other things) and just some of them are cluster owned (i.e. changes get overwritten)?

Contributor Author

I opted to define the initial approach this way for simplicity because it doesn't require us to:

  1. merge multiple objects
  2. validate separately

So in this way we can validate via the singular cluster object status, and the daemon can just react to that, and we can re-evaluate for GA. We have a similar schema for the "node.config.openshift.io" object (only one "cluster" object is allowed).

But it is definitely more flexible and safer to have multiple... we could implement a merging schema. I don't really want us to have to create a "rendered-rebootpolicy" object, but I guess we could have the controller validate each one separately, and the daemon apply them in... alphanumerical order? We may not even need a merging schema and could just have the daemon apply them in order, which at worst means a few duplicated commands.


### Goals

* Create a new RebootPolicy CRD to allow users to define how OS updates are actioned on after application
Contributor

Do we think this needs to be a new CRD or could it be part of the MachineConfiguration operator CR?

Contributor Author

I defaulted to a machineconfig-like approach of a new CRD since it's a bit easier to manage, but I think technically speaking there's no reason why it can't be part of MachineConfiguration.

So then our options would be:

  1. have it be part of MachineConfiguration knobs
  2. have it be a new CRD
  3. have it extend MachineConfig objects

I think if 1 is easier we can certainly take that approach, but I wonder if we'd be stuffing too many things into MachineConfiguration. What's the general guideline on what warrants a new CRD vs just a high level knob?

Contributor

Do you expect there to be one configuration per cluster for this? If you do, then I think containing it in the MachineConfiguration singleton probably makes sense.
If the config might vary per machinepool and you expect we might need many duplicates, then perhaps we look at something separate.

The guidance I'd give is about the consumer. Group things together that the same consumers would want to look at/configure. I expect this reboot configuration will fit in nicely within either the MachineConfiguration singleton, or within the MachinePool if you wanted the MachinePools to have independent configs

Contributor Author

I think, for the use cases we know of so far, there shouldn't be a need to specify this per-MachineConfigPool. Doing it as a separate CRD would give us more flexibility (and the CRD name itself is a bit more direct about what this is configuring, vs the overall MachineConfiguration object), but I think the end goal is perhaps to make this work better with images, so in that sense, doing it via the MachineConfiguration object for cluster-wide, individual MachineConfig effects works best as a first implementation.

Contributor

TBH, we can start with either approach during tech preview; there's no need to promote whatever we do in tech preview to GA. But perhaps we need to explore the pros and cons of each approach first.
The one thing to bear in mind, though, is that once a name has been used, even in tech preview, it cannot be re-used later with a different shape, so we should probably sync up on this at some point.

I think for now I lean towards seeing what the reboot policy would look like within the MachineConfiguration CR


An admin of an OCP cluster can at any time create/update/delete RebootPolicy objects to reconfigure how they wish updates to be applied. There isn’t any immediate effect so long as the RebootPolicy CR itself is a valid object.

The MachineConfigController’s drain controller will process any update to the object and validate the RebootPolicy for duplication, action validity (but not correctness), etc. If the object is valid, the drain controller will update the object’s status and observedGeneration to indicate success, and if not, update those and degrade the controller, example:
Contributor

Why does this need to be done via a controller rather than at admission time using CEL or a validating webhook?

Contributor Author

I wanted to leave this open for potentially merging multiple rebootpolicies, sort of like MachineConfigs, since I was a bit ambivalent on having just one.

So in that case, users can still provide any configuration of their own, cluster would be immutable and generated by default, and the controller will "merge" them in some fashion.

I suppose if we want to just have one object (or as a field in MachineConfiguration as mentioned above) having a CEL or validating webhook will be better. I assume those can't be used to cross-validate against other objects?

Contributor

There are options for cross validating against other objects, we can explore that if it is a requirement.

In general, having the reboot policy defined in a single place would likely prevent duplication anyway no? For example, we can use map keys to make sure certain keys within the lists have unique entries


The MachineConfigDaemon will process and apply the current RebootPolicy if it’s valid, indicated by generation == observedGeneration (MCC has processed the latest spec) and RebootPolicyValidated == true/RebootPolicyValidationFailing == false. If not, the daemon will also degrade (and queue a re-sync) waiting for the controller to validate the object.

For now, we only support one cluster-wide rebootpolicy, and as such any rebootpolicy not named “cluster” will be ignored and the MCO will report errors via degrading the operator.
Contributor

Typically we can prevent these objects from being created using validating webhooks, or now even a CEL validation. So perhaps restate this to say that CEL validation will prevent creating an object not named cluster
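
For reference, a minimal sketch of such a guard, assuming a dedicated CRD existed for this (CEL rules at the root of a custom resource can reference metadata.name):

```
# inside the CRD version's validation schema
openAPIV3Schema:
  type: object
  x-kubernetes-validations:
  - rule: "self.metadata.name == 'cluster'"
    message: "the only supported object name is 'cluster'"
```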


For now, we only support one cluster-wide rebootpolicy, and as such any rebootpolicy not named “cluster” will be ignored and the MCO will report errors via degrading the operator.

When the user first enables RebootPolicy via featuregate (or at some version, by default), the “cluster” RebootPolicy object will be generated with the existing MachineConfigDaemon defaults today, so the user can also view the default policies that came with the cluster. The user can update these fields if they wish to use a different workflow via userPolicies, but are not allowed to modify clusterDefaultPolicies, which is reserved for MCO to update these as we wish in the future. The fields that exist today are as highlighted above.
Contributor

How will you stop users modifying these, but allow the controller to update these? Perhaps this should be enforced by the "have a spec, copy to status, observe from status" pattern

Contributor Author

I intended to have the controller validate and overwrite any changes to those fields, much like any other object the MCO manages (for example, if you try to overwrite any of the controller-managed MachineConfig objects, the controller will simply re-apply the old one back. There is a delay so the rendered config doesn't get generated.)

The status will just reflect that the controller has ack'ed whatever is in the spec in this current pattern, so we can save re-listing all the policies in the status. But if the copy-to-status pattern is better, I think we can save a bit of waiting in the daemon (the current approach has the daemon wait for the latest, vs using the last-validated in the status in this new method) so it might be better?

Contributor

I would probably go for the copy to status approach.

The idea here is that the user can define what they want, and that is spec. What the user says should never be modified. A controller then validates the changes and copies to status, if the changes from the user are acceptable, rejecting and logging (via a condition maybe) why it's ignoring anything from the user spec, and merging anything that must exist.

Then whatever consumes the data, can always rely on the status fields, because they have already been validated.
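
A minimal sketch of that pattern, with illustrative field names only (not a final API):

```
spec:
  userPolicies:            # what the admin asked for; never modified by the controller
  - type: file
    value: /etc/my-conf
    actions:
    - type: reload
      target: myservice
status:
  conditions:
  - type: PolicyValidated
    status: "True"
  policies:                # validated and merged view that the daemon consumes
  - type: file
    value: /etc/my-conf
    actions:
    - type: reload
      target: myservice
```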

Contributor Author

Ok, as an MVP, I think we can try for MachineConfiguration spec and status.


The user thereby needs to understand that no additional verification is taking place, and that they accept any responsibility for RebootPolicies causing configuration not to take effect.

Given the popularity of using pause on MachineConfigPools, it is possible some users will exploit this to skip upgrade reboots. That will likely not break the cluster but will put the configuration in an awkward spot, as we can no longer guarantee that the configuration a node claims to be on is actually applied.
Contributor

What does this mean for support cases? It sounds like we need to have an internal list of what can and cannot skip reboots, and force reboots where they are actually required.

Contributor Author

I think at least initially, the usage of rebootpolicies means that you, the user, take responsibility for them, so from a support perspective this shouldn't be any different than, say, modifying a file manually via ssh and breaking your cluster as a result.

Contributor

When I SSH into a node, it tells me that this has been logged and the node is now out of support. Will you have something similar for reboot policies?

Contributor Author

When I SSH into a node, it tells me that this has been logged and the node is now out of support. Will you have something similar for reboot policies?

So I think we actually removed the ssh warning a few versions ago. The whole ssh saga was a long history of errors and sub-optimal behaviour.

The SSH warning was part of a node annotation (and, I guess, a daemon log entry) and didn't make it further than that. We can certainly add a node annotation when a rebootpolicy takes effect. Is that sufficient? We can also try to leverage the new MachineConfigNodes objects to reflect this in their status instead.

Contributor

I think we need to come up with something to define our boundary for support. If you think there are going to be complex scenarios if users use certain features in tandem, then prevent them from doing so, or, flag it in a status so that we can say, oh well you did XYZ which we told you not to do, that's why it broke.

If we don't do this, I expect you'll see a number of bugs come through that are complicated, will require investigation, and then will conclude, oh they did something unsupported


As noted in the non-goal sections, any RebootPolicy is not verified in the MCO, minus formatting. This means that if you accidentally define a change that needed a reboot to take effect, the cluster would “finish the update” without being able to use the new configuration.

The user thereby needs to understand that no additional verification is taking place, and that they accept any responsibility for RebootPolicies causing configuration not to take effect.
Contributor

How will you explain this to a user?

Contributor Author

Docs, hopefully :)

We can craft an alert or something to indicate when their rebootpolicy took effect upon an update. Without constraining the user to a list of tested "safe" policies (which goes back to what we already have today, and which we don't necessarily want to keep extending), there isn't a way that I know of to guarantee user-defined policies are always "safe".

Contributor

Providing some metric or alert that is easy for support to spot I think would have value
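
For example, something along these lines could work; the metric name below is an assumption for illustration only, not an existing MCO metric:

```
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: node-disruption-policy-alerts
  namespace: openshift-machine-config-operator
spec:
  groups:
  - name: node-disruption-policy
    rules:
    - alert: NodeDisruptionPolicyInUse
      # hypothetical metric exported by the MCO when a user policy alters the default actions
      expr: mco_user_node_disruption_policy_applied > 0
      labels:
        severity: info
      annotations:
        summary: A user-defined node disruption policy changed the default update actions on a node.
```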

Comment on lines 64 to 71
- type: file
  value: /etc/my-conf
  actions:
  - type: drain
  - type: reload
    target: myservice
  - type: reload
    target: myotherservice
Contributor

I would like to see the Go type for this, with godoc documentation and kubebuilder validations. Currently the API looks almost scary to me, in that I have very little context of what any of the things here mean. I'm concerned about how usable this will be to an average user so documentation on this API will be very important IMO. Perhaps we can discuss in the API extensions section

Member

This one looks very similar to https://docs.ansible.com/ansible/latest/playbook_guide/playbooks_handlers.html and would almost certainly face the same pressure to become similarly extensible.

Contributor

Given that, I think this API is going to need to be thought out very carefully; we can take a look at the extensibility built into the playbook API as inspiration. Thanks for the link!


Which, when applied, will define actions for all future changes to these MachineConfig fields based on the diff of the update (rendered MachineConfigs).

For this initial implementation we will support MachineConfig changes to:
Contributor

Are users expected to understand the changes introduced as part of the payload, and then add policies before they upgrade? If so, where will they get this information?

Contributor Author

Payload updates are not what this is intended for. Basically every payload update we have contains an OS update to RHCOS, which isn't rebootless.

The main use case is for specific machineconfig-defined file changes to skip this process, outside of major version upgrades

Contributor

Oh ok! I didn't get that when reading the motivation/goals, perhaps an update there might make that more obvious.

Given that, I think this probably alleviates my concern. The user is making changes to lay specific files down or change specific files, so they know what they need to do.

Is there an ordering issue here? Do they need to make sure that the reboot policy change is applied before they apply the MachineConfig change?

Contributor Author

perhaps an update there might make that more obvious

I'll add a snippet to indicate that this is not for OCP major/minor upgrades, but instead for individual MachineConfig updates performed by the user

Is there an ordering issue here? Do they need to make sure that the reboot policy change is applied before they apply the MachineConfig change?

With the validation step above, I guess the user must ensure that the rebootpolicy they want has properly made it into the status of the object (either the MachineConfiguration or a new CR) before starting their MachineConfig update.

In practice, if you apply them in quick succession, there is a small inbuilt delay (of ~15 seconds) in the MachineConfig rendering -> pool acquiring -> daemon processing flow, such that the rebootpolicy should take effect first. I don't think we need additional inbuilt checks for this.
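
As a sketch of what "made it into the status" means here (field and condition names are illustrative and may differ in the final API), the admin would check for something like:

```
metadata:
  generation: 4
status:
  observedGeneration: 4                    # controller has processed the latest spec
  conditions:
  - type: NodeDisruptionPolicyValidated    # hypothetical condition name
    status: "True"
```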

Comment on lines 64 to 65
- type: file
  value: /etc/my-conf
Contributor

This seems very very specific. Do we think users want this level of granularity, or do they want classes of changes to ignore reboots for?

Contributor Author

We can certainly support wildcarding paths, but this is mostly in line with the MachineConfig objects we have today.

What would you define as a "class of change" in this situation?

Contributor

I was thinking about this applying to all changes, I think if we are saying this is only applicable to everyday changes and not updates, then I don't think this comment really applies

superseded-by:
---

# Admin Defined Reboot Policy
Member

As an overall thing, I think we don't want to hardcode the term "reboot" here, because some changes today are still disruptive in terms of draining but don't actually involve a reboot, right? (Hm, the code flow in the MCO today is a bit confusing on this; I'd have expected postConfigChangeActionDrain but it's special cased)

Certainly in bootc I plan at some point to support systemd userspace-only restarts too which will act differently from reboots.

And then this whole enhancement is really about enabling special casing specific changes which are not reboots...

So maybe "Admin defined Host Change Apply Policy" ?

Contributor Author

I picked "reboots" since that's how most people view it :) I was considering the alternative of "Admin defined disruption policy" instead but disruption has multiple meanings

The title I can certainly change to be more descriptive; the naming of the API object (or other alternatives) is definitely more important.

@yuqi-zhang yuqi-zhang force-pushed the admin-defined-reboots branch from 38f527e to 977a2af Compare January 4, 2024 01:38
@yuqi-zhang
Contributor Author

yuqi-zhang commented Jan 4, 2024

Added 2 alternatives as discussed above, with some of the benefits and drawbacks to the approaches. Happy to discuss further

@yuqi-zhang yuqi-zhang force-pushed the admin-defined-reboots branch from 977a2af to b3ae862 Compare February 9, 2024 19:03
@yuqi-zhang yuqi-zhang changed the title MCO-507: admin defined reboot policy enhancement MCO-507: admin defined node disruption policy enhancement Feb 9, 2024
@yuqi-zhang yuqi-zhang marked this pull request as ready for review February 9, 2024 19:04
@openshift-ci openshift-ci bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Feb 9, 2024
@openshift-ci openshift-ci bot requested review from eparis and zaneb February 9, 2024 19:05
@yuqi-zhang
Contributor Author

Addressed most of the comments and updated the enhancement. Key points:

  1. now is a sub-section of MachineConfiguration object instead of a new CRD
  2. now called Node Disruption Policy instead of Reboot Policy, for clarity (since it encompasses drains, etc.)

@yuqi-zhang
Contributor Author

API draft here: openshift/api#1764

@yuqi-zhang yuqi-zhang force-pushed the admin-defined-reboots branch 2 times, most recently from 0516047 to 3ac9a98 Compare February 9, 2024 20:27
@dhellmann
Contributor

#1555 is changing the enhancement template in a way that will cause the header check in the linter job to fail for existing PRs. If this PR is merged within the development period for 4.16 you may override the linter if the only failures are caused by issues with the headers (please make sure the markdown formatting is correct). If this PR is not merged before 4.16 development closes, please update the enhancement to conform to the new template.


### Goals

* Create a NodeDisruptionPolicyConfig subfield in the MachineConfiguration CRD to allow users to define how OS updates are actioned on after application
Contributor

The ending of this sentence seems a bit confusing. Perhaps we wanted to write something after "application".

Contributor Author

Edited

For this initial implementation we will support MachineConfig changes to:
- Files
- Units
- kernelArguments
Contributor

how are we envisioning applying kargs without performing a reboot?

Contributor Author

Removed that from the list in the API and here


### API Extensions

- Adds fields to the MachineConfiguration CRD
Contributor

linking to API PR would make sense here.

Contributor Author

Done


The user thereby needs to understand that no additional verification is taking place, and that they accept any responsibility for NodeDisruptionPolicies. If any are applied to the cluster, a metric should be raised for debuggability.

Given the popularity of using pause on MachineConfigPools, it is possible some users will exploit this to skip upgrade disruption. That will likely not break the cluster but will put the configuration in an awkward spot, as we can no longer guarantee that the configuration a node claims to be on is actually applied.
Contributor

I am a bit confused about how pausing a pool has an impact here. Are we allowing admins to skip drain/reboot for OS updates, or is there something else I'm missing?

Contributor Author

Reworded to remove confusion. I didn't mean to talk about paused pools


- Have the user specify a global policy
- Allow users to wildcard paths
- Allow users to define policies per pool instead of globally
Contributor

Keeping policies at the cluster level makes sense to me initially, since in the regular use case, if a change is safe, it would be safe to apply on all nodes. Maybe something we can fine-tune afterward based on use cases.

The biggest question is the format and constraints we would like for the UX to be. Conceivably, we have the options to:

- Have the user specify a global policy
- Allow users to wildcard paths
Contributor

Can be useful for applying at the directory level, especially when adding drop-in files into a drop-in directory.

- Have the user specify a global policy
- Allow users to wildcard paths
- Allow users to define policies per pool instead of globally
- Provide users more flexibility in defining the actions (e.g. provide a shell script to run to change selinux context)
Contributor

A user can write a systemd unit to achieve that, and that unit itself can be defined in the policy :)
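
As a rough sketch of that workaround (the unit name, file path, and command below are examples only): ship a oneshot unit via MachineConfig, then point the policy's restart action for the watched file at that unit.

```
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  labels:
    machineconfiguration.openshift.io/role: worker
  name: 99-worker-relabel-my-conf
spec:
  config:
    ignition:
      version: 3.4.0
    systemd:
      units:
      - name: relabel-my-conf.service
        enabled: false
        contents: |
          [Unit]
          Description=Fix the selinux context of /etc/my-conf
          [Service]
          Type=oneshot
          ExecStart=/usr/bin/restorecon -v /etc/my-conf
```

A policy entry for /etc/my-conf with a restart action targeting relabel-my-conf.service would then run the equivalent of the script without a reboot.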

- Allow users to wildcard paths
- Allow users to define policies per pool instead of globally
- Provide users more flexibility in defining the actions (e.g. provide a shell script to run to change selinux context)
- Provide users more granularity on changes within a file (e.g. adding vs deleting image mirrors within registries.conf)
Contributor

ICSP changes are tricky. We may want to handle them as a special case. Something like: the clusterDefaultPolicy for this would be set to no drain/reboot for additions, and a userPolicy may specify behaviour for deletions.

Contributor

@JoelSpeed JoelSpeed left a comment

It's not clear to me (though I think from prior discussion this is true) that this policy only applies to users modifying MachineConfig objects, and doesn't apply in the general case for node upgrades. It would be good to clarify if users can prevent reboots if say, an upgrade says the kubelet service needs to be updated, and they specify to prevent that reboot or service restart.

Having users have the ability to prevent nodes from restarting for core updates to kubelet and crio could cause support concerns down the road.

Another thing I may have missed, but, will I be able to apply the policy to a subset of machine config pools? Is there any time where I may want to skip a reboot on workers, but I'm happy to do so on masters?

Comment on lines 151 to 153
status:
  validation: Failed
  reason: "Invalid NodeDisruptionPolicy spec: xxx"
Contributor

This probably wants to be a condition

Contributor Author

Updated and added a TODO in the API PR


When the user first enables the new nodeDisruptionPolicySpec via featuregate (or at some version, by default), the "default" MachineConfiguration object will be generated with the existing cluster defaults by the MachineConfigController, so the user can also view the default policies that came with the cluster. The user can update these fields if they wish to use a different workflow via userPolicies, but are not allowed to modify clusterDefaultPolicies, which is reserved for MCO to update these as we wish in the future. The controller will enforce this when generating the status. The fields that exist today are as highlighted above.

Note that /etc/containers/registries.conf has a “Special” keyword. This is due to the extra processing the MCO does today. No user defined policy may use the “Special” keyword.
Contributor

This will need to be spelt out clearly in the API. I would like to understand more about the Special keyword, so let's talk about that at some point: how it manifests and the implications of it.

Contributor Author

The "special" keyword is reserved for the MCO-defined actions and we will reject any user policies with this keyword.

I was hoping to use the phrasing in the description to flesh out what this means to the user. There is basically additional logic for registries, specifically around mirror-by-digest-only changes, that exists in the MCD today. This is just for user awareness and not something they can set themselves.


The special MCO cert paths ("/etc/kubernetes/kubelet-ca.crt", "/etc/kubernetes/static-pod-resources/configmaps/cloud-config/ca-bundle.pem") are not listed in the clusterDefaultPolicies as they are no longer MachineConfig changes.

In terms of the policy itself, in case a change has multiple matching policies (e.g. two files changed, both with a different policy), all policies will be applied in order (e.g. reload multiple services). If any change in the list does not match any defined policy, the MCO will default to rebooting and skip over all the policies.
Contributor

I think the API change to make path the map key will mean this statement is null

Contributor Author

Removed


In terms of the policy itself, in case a change has multiple matching policies (e.g. two files changed, both with a different policy), all policies will be applied in order (e.g. reload multiple services). If any change in the list does not match any defined policy, the MCO will default to rebooting and skip over all the policies.

The MachineConfigNodes CR will track the status of the upgrade, and will also be updated to indicate what the NodeDisruptionPolicy used for the update was.
Contributor

Do we have an API extension for this?

Contributor Author

Not today, since MCN is also tech preview.

Will make it a goal for GA


### API Extensions

- Adds fields to the MachineConfiguration CRD
Contributor

Please link to API PR at a minimum. Providing YAML examples to show usage would also be helpful for other readers. We can also talk about the validations that will apply here

Contributor Author

done. An example is provided above


The user thereby needs to understand that no additional verification is taking place, and that they accept any responsibility for NodeDisruptionPolicies. If any are applied to the cluster, a metric should be raised for debuggability.

Given the popularity of using pause on MachineConfigPools, it is possible some users will exploit this to skip upgrade disruption. That will likely not break the cluster but will put the configuration in an awkward spot, as we can no longer guarantee that the configuration a node claims to be on is actually applied.
Contributor

I didn't think the disruption policy would apply to cluster version upgrades?

Contributor Author

It doesn't. Removed

@yuqi-zhang yuqi-zhang force-pushed the admin-defined-reboots branch 3 times, most recently from 2cbafab to 0af2246 Compare February 16, 2024 00:06
@openshift-bot

Inactive enhancement proposals go stale after 28d of inactivity.

See https://github.com/openshift/enhancements#life-cycle for details.

Mark the proposal as fresh by commenting /remove-lifecycle stale.
Stale proposals rot after an additional 7d of inactivity and eventually close.
Exclude this proposal from closing by commenting /lifecycle frozen.

If this proposal is safe to close now please do so with /close.

/lifecycle stale

@openshift-ci openshift-ci bot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Mar 15, 2024
@openshift-bot

Stale enhancement proposals rot after 7d of inactivity.

See https://github.com/openshift/enhancements#life-cycle for details.

Mark the proposal as fresh by commenting /remove-lifecycle rotten.
Rotten proposals close after an additional 7d of inactivity.
Exclude this proposal from closing by commenting /lifecycle frozen.

If this proposal is safe to close now please do so with /close.

/lifecycle rotten
/remove-lifecycle stale

@openshift-ci openshift-ci bot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Mar 22, 2024
@yuqi-zhang yuqi-zhang force-pushed the admin-defined-reboots branch from 0af2246 to 0264993 Compare March 22, 2024 21:37
@yuqi-zhang
Contributor Author

/remove-lifecycle rotten

Updated based on updates to the API PR

@openshift-ci openshift-ci bot removed the lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. label Mar 22, 2024
Contributor

@djoshy djoshy left a comment

Overall looks good to me! Just a few minor nits and some questions about the future work. Thanks for working on this, Jerry!

- Files
- Units
- sshKeys
And actions for:
Contributor

Might need a newline here for better formatting

Contributor Author

done

In the future, we can extend this in two ways:
- Supporting more changes/actions (e.g. user defined script to run after a certain change)
- Supporting diffing between ostree images (for image based updates) and bootc managed systems (for bifrost)

Contributor

Probably worth adding directories/wildcarding plans here too?

Contributor Author

done


An admin of an OCP cluster can at any time create/update/delete nodeDisruptionPolicy fields in the MachineConfiguration object to reconfigure how they wish updates to be applied. There isn’t any immediate effect so long as the CR itself is valid.

The MachineConfigOperator pod will process any update to the object and validate the NodeDisruptionPolicy for potential issues, although most, if not all are covered by API-level validation. The MCO pod will then merge the user provided contents to the cluster defaults (user contents will always override cluster defaults) and populate the nodeDisruptionPolicyStatus. Success and errors will also be reflected in the status:
Contributor

I suppose this could change in the future, but as it stands we are completely guarded by API validation, no?

Contributor Author

Modified based on our current implementation.
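
For readers following along, the merged result described above would look roughly like this under the linked API draft (openshift/api#1764); exact field names may still shift:

```
apiVersion: operator.openshift.io/v1
kind: MachineConfiguration
metadata:
  name: cluster
spec:
  nodeDisruptionPolicy:
    files:
    - path: /etc/my-conf
      actions:
      - type: Restart
        restart:
          serviceName: my-service.service
status:
  nodeDisruptionPolicyStatus:
    clusterPolicies:        # cluster defaults merged with the user-provided policies
      files:
      - path: /etc/my-conf
        actions:
        - type: Restart
          restart:
            serviceName: my-service.service
```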

#### Hypershift [optional]

As noted above, Hypershift cannot directly tap into this. In a bit more detail, Hypershift has 2 modes of operation for nodepools:
- Replace upgrades - since all machines are deleted and recreated in this model, there is no method of live-application today
Contributor

Nit: better formatting

Contributor Author

done


In terms of the policy itself, in case a change has multiple matching policies (e.g. two files changed, both with a different policy), all policies will be applied in order (e.g. reload multiple services). Any invalid policy, such as invalid actions or duplicate entries, will be rejected at the API level.

The MachineConfigNodes CR will later be enhanced to track the status of the upgrade, and will also be updated to indicate what the NodeDisruptionPolicy used for the update was.
Contributor

This makes sense to me. Off the top of my head, I guess we need:

  • NodeDisruptionPolicyApplySuccessful - this will include the generation of the overall object and MCD can track/wait on this before starting to apply the actions in Status
  • NodeDisruptionPolicyExecutionSuccessful - perhaps list the target config and the actions applied
  • NodeDisruptionPolicyExecutionFailure - indicate why the failure happened and that the legacy path was taken. Right now I can only think of this being the status not being generated correctly/in time... or in a buggy format that we did not anticipate?

Contributor Author

I think we can start off with just a success/fail when the node update gets to that stage. But agreed, we can add a lot of detail here for users to see, if they care
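
For illustration, a success condition of the sort being discussed might look like this on the MachineConfigNode (the condition name is taken from the suggestion above; none of this is final):

```
status:
  conditions:
  - type: NodeDisruptionPolicyExecutionSuccessful
    status: "True"
    reason: ActionsApplied
    message: "Reloaded my-service.service for /etc/my-conf; no reboot required"
```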

@yuqi-zhang yuqi-zhang force-pushed the admin-defined-reboots branch from 0264993 to efff3e8 Compare April 18, 2024 20:52
@djoshy
Contributor

djoshy commented Apr 22, 2024

/lgtm
Thanks for working on this with me @yuqi-zhang, I believe this is now in line with our current MVP. I think the test can be overridden since we are still in the 4.16 development window, but I don't believe I have that power. Would you mind overriding, @JoelSpeed, assuming you have no other reservations? (:

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Apr 22, 2024
@yuqi-zhang
Contributor Author

Thanks David :) Based on the current failure, seeing if we can override based on #1436 (comment)

/override ci/prow/markdownlint

Contributor

openshift-ci bot commented Apr 23, 2024

@yuqi-zhang: yuqi-zhang unauthorized: /override is restricted to Repo administrators, approvers in top level OWNERS file, and the following github teams:openshift: openshift-release-oversight openshift-staff-engineers.

In response to this:

Thanks David :) Based on the current failure, seeing if we can override based on #1436 (comment)

/override ci/prow/markdownlint

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@openshift-ci-robot

openshift-ci-robot commented May 2, 2024

@yuqi-zhang: This pull request references MCO-507 which is a valid jira issue.

In response to this:

Enhancement for users to specify how MachineConfig updates should be acted upon.

See: https://issues.redhat.com/browse/OCPSTRAT-380

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@sinnykumari
Contributor

All existing comments have been addressed and the respective API and MCO implementations have been merged.
Let's take any follow-up feedback anyone may have later on in a follow-up PR.
/approve
Overriding test based on #1525 (comment)
/override ci/prow/markdownlint

Contributor

openshift-ci bot commented May 6, 2024

@sinnykumari: Overrode contexts on behalf of sinnykumari: ci/prow/markdownlint

In response to this:

All existing comments have been addressed and the respective API and MCO implementations have been merged.
Let's take any follow-up feedback anyone may have later on in a follow-up PR.
/approve
Overriding test based on #1525 (comment)
/override ci/prow/markdownlint

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

Contributor

openshift-ci bot commented May 6, 2024

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: sinnykumari

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

Contributor

openshift-ci bot commented May 6, 2024

@yuqi-zhang: all tests passed!

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label May 6, 2024
@openshift-merge-bot openshift-merge-bot bot merged commit c29c69d into openshift:master May 6, 2024
2 checks passed