Skip to content

Commit

Permalink
MCO: admin defined reboot policy enhancement
Browse files Browse the repository at this point in the history
  • Loading branch information
yuqi-zhang committed Jan 4, 2024
1 parent e06505a commit 977a2af
Showing 1 changed file with 292 additions and 0 deletions.
292 changes: 292 additions & 0 deletions enhancements/machine-config/admin-defined-reboot-policy.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,292 @@
---
title: admin-defined-reboot-policy
authors:
- "@yuqi-zhang"
reviewers:
- TBD
approvers:
- TBD
api-approvers:
- TBD
creation-date: yyyy-mm-dd
last-updated: yyyy-mm-dd
tracking-link:
- TBD
see-also:
- https://issues.redhat.com/browse/OCPSTRAT-380
- https://issues.redhat.com/browse/MCO-507
replaces:
superseded-by:
---

# Admin Defined Reboot Policy

## Summary

This enhancement outlines an API/mechanism for users to define what actions to take upon a MachineConfigOperator driven change (e.g. a file change via a MachineConfig object). By default, all changes to MachineConfig fields require a drain and reboot. The user can use this new API, a RebootPolicy, to specify which MachineConfig changes to not disrupt their workloads.

## Motivation

Historically, workload disruption (drains and reboots) can be very costly for OCP users, especially in bare-metal environments where the stopping and restarting of a node can take up to an hour. For small incremental changes to the host OS, many of them can take effect with a service reload or no action at all, and thus the ability to perform a non-disruptive update can be very beneficial to many users without any downside.

The MCO since OCP 4.7 has added the ability to perform disruptionless updates on select changes, see: [MCO doc on disruptionless updates](https://github.com/openshift/machine-config-operator/blob/master/docs/MachineConfigDaemon.md#rebootless-updates). This list however is by no means exhaustive, and it is not feasible for the MCO to hard-code all use cases for all users into the MCO. In addition, different users may have different expectations on what will cause a reboot or not. Thus there is a need for a user-defined, MCO operated list of "reboot policies" on what to do after an OS update has been applied.

### User Stories

* As an Openshift cluster administrator, I have an environment with thousands of ICSP/IDMS mirror entries. These entries will be added and deleted constantly on a weekly basis. I understand the potential dangers of removing mirrors (can cause an image to no longer be accessible if no other mirror hosts it), but I accept that risk and would like my changes to apply without disrupting my workloads and only by reloading the crio service.

* As an application developer, I have certificates for my applications I need on-disk, which can be rotated safely and hot-loaded into the applications without any additional action. I do not wish to disrupt other applications when these update via the MCO.

### Goals

* Create a new RebootPolicy CRD to allow users to define how OS updates are actioned on after application
* Have users be able to define no-action and service reloads to specific MachineConfig changes
* Have users be able to easily see existing cluster non-disruptive update cases

### Non-Goals

* Have RebootPolicy apply to non-MCO driven changes (e.g. SRIOV can still reboot nodes)
* Remove existing non-disruptive update paths (the user will be able to override cluster defaults)
* Design for image-based updates (live apply and bootc, will be considered in the future)
* Have the MCO validate whether a change can be successfully applied with the given RebootPolicy (i.e. it is up to the responsibility of the user to ensure the correctness of their defined actions)

## Proposal

Create a new rebootpolicy.machineconfiguration.openshift.io CRD, allowing specify user-provided rebootpolicies, and view current version cluster defaults, as follows:

```console
apiVersion: machineconfiguration.openshift.io/v1
kind: RebootPolicy
metadata:
name: cluster
spec:
userPolicies:
- type: file
value: /etc/my-conf
Actions:
- type: drain
- type: reload
target: myservice
- type: reload
target: myotherservice
- type: kernelArgument
value: myarg
actions:
- type: none
- type: service
value: myservice
actions:
- type: restart
target: myservice
clusterDefaultPolicies:
- type: file
value: /etc/mco/internal-registry-pull-secret.json
actions:
- type: none
- type: file
value: /var/lib/kubelet/config.json
actions:
- type: none
- type: file
value: /etc/machine-config-daemon/no-reboot/containers-gpg.pub
actions:
- type: reload
target: crio.service
- type: file
value: /etc/containers/policy.json
actions:
- type: reload
target: crio.service
- type: file
value: /etc/containers/registries.conf
actions:
- type: special
```

Which, when applied, will define actions for all future changes to these MachineConfig fields based on the diff of the update (rendered MachineConfigs).

For this initial implementation we will support MachineConfig changes to:
- Files
- Units
- kernelArguments
- sshKeys
And actions for:
- None
- Draining a node
- Reloading a service
- Daemon-reload and restarting of a service
- Reboot (default)
- Special (see workflow description)

In the future, we can extend this in two ways:
Supporting more changes/actions (e.g. user defined script to run after a certain change)
Supporting diffing between ostree images (for image based updates) and bootc managed systems (for bifrost)


### Workflow Description

An admin of an OCP cluster can at any time create/update/delete RebootPolicy objects to reconfigure how they wish updates to be applied. There isn’t any immediate effect so long as the RebootPolicy CR itself is a valid object.

The MachineConfigController’s drain controller will process any update to the object and validate the RebootPolicy for duplication, action validity (but not correctness), etc. If the object is valid, the drain controller will update the object’s status and observedGeneration to indicate success, and if not, update those and degrade the controller, example:

```
status:
conditions:
- lastTransitionTime: "xxx"
Message: “”
status: "False"
type: RebootPolicyValidationFailing
- lastTransitionTime: "xxx"
Message: “Validation of the rebootpolicy has succeed.”
status: "True"
type: RebootPolicyValidated
observedGeneration: x
```

The MachineConfigDaemon will process and apply the current RebootPolicy if it’s valid, indicated by generation == observedGeneration (MCC has processed the latest spec) and RebootPolicyValidated == true/RebootPolicyValidationFailing == false. If not, the daemon will also degrade (and queue a re-sync) waiting for the controller to validate the object.

For now, we only support one cluster-wide rebootpolicy, and as such any rebootpolicy not named “cluster” will be ignored and the MCO will report errors via degrading the operator.

When the user first enables RebootPolicy via featuregate (or at some version, by default), the “cluster” RebootPolicy object will be generated with the existing MachineConfigDaemon defaults today, so the user can also view the default policies that came with the cluster. The user can update these fields if they wish to use a different workflow via userPolicies, but are not allowed to modify clusterDefaultPolicies, which is reserved for MCO to update these as we wish in the future. The fields that exist today are as highlighted above.

Note that /etc/containers/registries.conf has a “Special” keyword. This is due to the extra processing the MCO does today. No user defined policy may use the “Special” keyword.
User-provided changes will overwrite the cluster defaults

The special MCO cert paths ("/etc/kubernetes/kubelet-ca.crt", "/etc/kubernetes/static-pod-resources/configmaps/cloud-config/ca-bundle.pem") are not listed in the clusterDefaultPolicies as they are no longer MachineConfig changes

In terms of the policy itself, in case a change has multiple matching policies (e.g. two files changed, both with a different policy), all policies will be applied in order (e.g. reload multiple services). If any change in the list does not match any defined policy, the MCO will default to rebooting and skip over all the policies.

The MachineConfigNodes CR will track the status of the upgrade, and will also be updated to indicate what the RebootPolicy used for the update was.

RebootPolicies are only processed in-cluster and cannot be used to skip the bootstrap pivot during cluster installation.

#### Variation and form factor considerations [optional]

This will work by default on standalone OCP and any other OCP variants that run the MCO in its entirety. (e.g. works for RHEL)

HyperShift deployments do not run the MCO pods directly and will not have RebootPolicy objects.

### API Extensions

- Adds the RebootPolicy CRD (only under MCO namespace and only affects MCO)

### Implementation Details/Notes/Constraints [optional]

This aims more to be a user-facing API to essentially map MachineConfig fields to what action should be taken on them, which aligns with existing MCO functionality to perform a hard-coded list of these policies. However, as we evolve the MCO to fit in with an image based OS update vision, we would need this to work with:
- Existing MachineConfig based entries
- User-provided base-image upgrade cases (live-apply)
- A more unified update interface with bootc

Which we can definitely translate under-the-hood to adapt to the system, but file/unit specific entries like this may not work as effectively when it comes to image-to-image updates, especially at the daemon level.

As there is a need for this in 4.16, we will proceed with the format that works best with the existing MCO architecture, while leaving it open to change for the future.

#### Hypershift [optional]

As noted above, Hypershift cannot directly tap into this. In a bit more detail, Hypershift has 2 modes of operation for nodepools:
Replace upgrades - since all machines are deleted and recreated in this model, there is no method of live-application today
Inplace upgrades - these actually use a special form of the daemon, and we can leverage that to update and perform necessary RebootPolicies. However we probably do not want hosted cluster admins to have direct control over this, much like MachineConfigs, so if we do want to add this we will need a new API in Hypershift to handle this. As such, we are not considering this at this time

### Risks and Mitigations

As noted in the non-goal sections, any RebootPolicy is not verified in the MCO, minus formatting. This means that if you accidentally define a change that needed a reboot to take effect, the cluster would “finish the update” without being able to use the new configuration.

The user thereby needs to understand that no additional verification is taking place, and that they accept any responsibility for RebootPolicies causing configuration not to take effect.

Given the popularity of using pause on MachineConfigPools, it is possible some users will exploit this to skip upgrade reboots. That will likely not break the cluster but will put the configuration in an awkward spot, as we can no longer guarantee that the configuration a node claims to be on is actually applied.

The MCO rebootless apply and reboot apply actually does have 1 major difference still: the application of proper selinux labels. The MCO doesn’t currently have any ability to properly label files it writes, and some users have worked around this by relabelling via a service, which would trigger during the early boot process. With this, we will break that workflow unless the MCO explicitly has the ability to handle selinux labels.

### Drawbacks

There shouldn’t be many drawbacks (this is a pretty lightweight sub-feature that we already in-part ship today) other than what was outlined in Risks and Mitigations.

## Design Details

### Open Questions [optional]

The biggest question is the format and constraints we would like for the UX to be. Conceivably, we have the options to:

- Have the user specify a global policy
- Allow users to wildcard paths
- Allow users to define policies per pool instead of globally
- Provide users more flexibility in defining the actions (e.g. provide a shell script to run to change selinux context)
- Provide users more granularity on changes within a file (e.g. adding vs deleting image mirrors within registries.conf)

This current outline is a bit more restrictive in UX but hopefully a bit more clearly defined.

The question of image based updates is also important, but discussed in the above Implementation Details section.

Selinux labels are discussed in risks and mitigations.

How we would like to handle certificat

### Test Plan

The MCO e2e tests will cover functionality for checking if the policy has taken effect.

### Graduation Criteria

We will add this behind a featuregate during the 4.16 timeframe. This at most only needs a tech preview phase, and then can be GA’ed since the feature is relatively isolated. If needed, we can likely ship as a feature directly in 4.16 and only featuregate during development.

#### Removing a deprecated feature



### Upgrade / Downgrade Strategy



### Version Skew Strategy



### Operational Aspects of API Extensions


#### Failure Modes



#### Support Procedures


## Implementation History


## Alternatives

#### Policy as a subset of MachineConfiguration object

This is essentially the same mechanism as outlined above, except instead of a new RebootPolicy object, we instead build it into a subset of MachineConfiguration object's spec and status.

Benefits:
1. does not require a new API type

Downsides:
1. may want to split into separate objects in the future (e.g. apply to different pools or nodes). MachineConfiguration is designed to be a singular "cluster" object for knobs and dials. Harder to extend
1. can grow to quite a bit list, and inflate the spec and status of the object

#### Per MachineConfig object annotation

Instead of having a new API object, we allow users to add annotations to existing MachineConfig objects in the cluster. When those object changes, the corresponding action is applied.

Benefits:
1. does not require any API changes
1. integrates better with how bootc envisions non-reboot updates

Downsides:
1. this does not fit with existing MCO architecture in two ways:
* most templated machineconfigs (that ship with the cluster) ships as one giant 00-master/00-worker config, which won't have the same level of nuance
* the daemons (performing the actual update) does the action calculation, and does it based on rendered config diffs instead of individual MachineConfigs
These two issues can be solved in the future but will not allow much flexibility with use cases today
1. users can specify the same MachineConfig fields in multiple MachineConfigs. The merge logic will get complex with multiple fields defined in the same file.

#### Manual control

We could ship a general flag for 4.16 that allows users to take over for the MCO and manually perform any necessary actions on the nodes directly, and implement a better method of this after image-based MCO is implemented. This might not be a very safe design, though.

## Infrastructure Needed [optional]



0 comments on commit 977a2af

Please sign in to comment.