enhancement: add ignition dual spec 2/3 support
Add an enhancement proposal for supporting both ignition spec 2
and 3 for OS provisioning/updating.

Signed-off-by: Yu Qi Zhang <[email protected]>
yuqi-zhang committed Nov 19, 2019
1 parent a16654e commit 3d6193d
Showing 1 changed file with 377 additions and 0 deletions.
377 changes: 377 additions & 0 deletions enhancements/ignition-spec-dual-support.md
@@ -0,0 +1,377 @@
---
title: Ignition Spec 2/3 dual support
authors:
- "@yuqi-zhang"
reviewers:
- "@ashcrow"
- "@cgwalters"
- "@crawford"
- "@LorbusChris"
- "@miabbott"
- "@mrguitar"
- "@runcom"
- "@vrutkovs"
approvers:
creation-date: 2019-11-04
last-updated: 2019-11-19
status: **provisional**|implementable|implemented|deferred|rejected|withdrawn|replaced
see-also: https://github.com/openshift/enhancements/pull/78
replaces:
superseded-by:
---

# Ignition Spec 2/3 dual support


## Release Signoff Checklist

- [ ] Enhancement is `implementable`
- [ ] Design details are appropriately documented from clear requirements
- [ ] Test plan is defined
- [ ] Graduation criteria for dev preview, tech preview, GA
- [ ] User-facing documentation is created in [openshift/docs]


## Summary

This enhancement proposal aims to add dual ignition specification version 2/3
(ignition version 0/2) support to OpenShift 4.x, which currently only supports
ignition version 0 spec 2 for OS provisioning and machine config updates. We
aim to introduce a non-breaking method to switch all new and existing clusters
to ignition spec version 3 at some version of the cluster, which will be
performed by the Machine-Config-Operator (henceforth MCO). The OpenShift
installer and underlying control plane/bootstrap configuration, as well as the
RHEL CoreOS (henceforth RHCOS) package version, will also be updated.

This overall migration will take place as a two-phase process:

Phase 1/OCP 4.4:
- The Machine Config Daemon (MCD) gains the ability to process both v2 and v3 configs
- The Machine Config Server (MCS) gains the ability to translate from v3 configs to v2 configs "on the fly"
- The Machine Config Controller (MCC) attempts to update all v2 configs to v3, leaving those which cannot be converted
- Installer and other components generate v3 configs

Phase 2/OCP 4.5:
- MCO enforces that all configs are v3 before allowing the CVO to start the update
- RHCOS bootimages switch to only accept v3 configs


## Motivation

Ignition released v2.0.0 (stable) on Jun 3rd, 2019, which has an updated
specification format (ignition spec version 3, henceforth “Spec 3”). This
change includes some important reworks for RHCOS provisioning, most
importantly the ability to specify and mount other filesystems and fixing
issues where ignition v0 spec was not declarative. Particularly, this is
required to support having /var on a separate partition or disk, an important
requirement for security/compliance purposes in OCP. The existing version on
RHCOS systems (ignition version v0.33) carries a spec version (spec version 2,
henceforth “Spec 2”) that is not compatible with Spec 3. Thus we would like to
update the ignition version on RHCOS/Installer/MCO to make use of the changes.

This proposal will also allow closer alignment with OKD, as OKD will be based
on Fedora CoreOS (Henceforth FCoS), which is already on, and only supports,
ignition spec 3. We want to do this in a way that can minimize deltas between
OKD and OCP.


### Goals

#### Phase 1:
- [ ] A config translator is created to translate from ignition spec 2 to spec 3
- [ ] A config translator is created to translate from ignition spec 3 to spec 2
- [ ] The MCO is updated to support both ignition spec 2 and 3, with:
  - [ ] a translator in the Machine Config Controller that can convert spec 2 cluster configuration to spec 3
  - [ ] a detector in the Machine Config Server that can serve the correct spec version based on bootimage version
- [ ] The OpenShift installer is updated to generate ignition spec 3 configs
- [ ] Successfully create new Spec 3 only IPI/UPI clusters
- [ ] Create a migration method to Spec 3 for Spec 2 clusters with incompatible (non-translatable) configs
- [ ] An alerting mechanism is put in place for outdated and incompatible/non-translatable configs
- [ ] Docs are updated to reflect the new config version

#### Phase 2:
- [ ] RHCOS bootimage is updated to accept ignition spec 3 configs
- [ ] Successfully upgrade existing 4.x clusters to Spec 3 clusters


### Non-Goals

- Support FCoS/OKD directly through this change
- API support for MCO, namely switching to RawExt formatted machineconfig
objects instead of explicitly referencing ignition, is not considered as
part of this proposal


## Proposal

This change is multi-component:


### Vendoring Changes

The MCO and Installer must switch from dep to Go modules for vendoring, as
ignition v2 requires Go modules. To support both spec 2 and spec 3, both
ignition versions must be vendored in for typing.


### Spec Translator

To handle both spec versions, as we need the ability to upgrade existing
clusters, we will create a translator between spec 2 and spec 3. This ensures
that a cluster only has one “desiredConfig” which will be translated to spec 3
when the MCO with dual support detects that the existing configuration of a
machine is on spec 2 (will happen only once for all existing and new nodes,
when the MCO with dual support is first deployed onto the cluster). This will
only be required as part of the MCO.

Note that there exist three types of spec 2 configs:
- Those that are directly translatable to spec 3. This is the case for all existing IPI configs.
- Those that we’re not 100% sure we can directly translate, but where we can infer what the user is doing and do a translation
- Those that we’re 100% sure we CANNOT translate directly, and that require user input for us to correctly translate

During phase 1 we should attempt to translate on a best-effort basis. If the
cluster can be directly translated to spec 3, we will do so and use the now
v3 spec version config in the MCD. If not, we will support both, and alert
the user that there are untranslatable configs.
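
The three categories above can be illustrated with one concrete difference between the specs: spec 2 file entries name the filesystem they live on, while spec 3 addresses files by absolute path only. The following is a minimal sketch under stated assumptions, not the actual translator. `FileV2` is a simplified stand-in for the vendored Ignition types, and the `mounts` map (filesystem name to mount path) is a hypothetical input that the cluster or the user would have to supply.

```go
package main

import (
	"fmt"
	"path"
)

// Translatability mirrors the three categories of spec 2 configs listed above.
type Translatability int

const (
	Direct         Translatability = iota // translatable as-is
	Inferred                              // translatable by inferring intent
	NeedsUserInput                        // cannot be translated without the user
)

// FileV2 is a simplified, hypothetical stand-in for a spec 2 file entry.
type FileV2 struct {
	Filesystem string
	Path       string
}

// classifyFile reports how a spec 2 file entry could be carried over to
// spec 3 and, when possible, the absolute path it would end up with.
func classifyFile(f FileV2, mounts map[string]string) (Translatability, string) {
	if f.Filesystem == "root" {
		return Direct, f.Path
	}
	if mountPoint, ok := mounts[f.Filesystem]; ok {
		return Inferred, path.Join(mountPoint, f.Path)
	}
	return NeedsUserInput, ""
}

// classifyConfig returns the worst category found across all file entries.
func classifyConfig(files []FileV2, mounts map[string]string) Translatability {
	worst := Direct
	for _, f := range files {
		if t, _ := classifyFile(f, mounts); t > worst {
			worst = t
		}
	}
	return worst
}

func main() {
	files := []FileV2{
		{Filesystem: "root", Path: "/etc/motd"},
		{Filesystem: "data", Path: "/db.conf"},
	}
	// With a known mount point the config can be translated by inference.
	fmt.Println(classifyConfig(files, map[string]string{"data": "/var/data"}))
	// Without it, the config needs user input.
	fmt.Println(classifyConfig(files, nil))
}
```

In phase 1, a result of `NeedsUserInput` would correspond to keeping the spec 2 config and alerting the user rather than failing outright.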

During phase 2, we should fail updates unless the cluster is fully on spec
3 config. This effectively means that UPI clusters are at risk when upgrading
to an MCO with dual support. Mitigation methods are discussed below.

For backwards support (spec 3 to spec 2) the configs are 100% translatable.


### RHCOS

The RHCOS bootimage needs to be updated to ignition package v2.0+. Required
dependencies are discussed below.


### Installer

Phase 1:

The installer needs to be updated to generate spec 3 configs for master and
worker nodes. These are, for now, immediately translated down by the MCS
when served to spec 2 bootimages. If the user has otherwise defined custom
machinesets with spec 3 images, those are served spec 3 configs. This means
that the installer will have to vendor both ignition v0 and v2 until phase 2.

Phase 2:

The bootstrap ignition configs are also updated to serve spec 3, and RHCOS
images pinned by the installer will be updated to ones with ignition v2 (spec 3).
All spec 2 references are stripped from the installer.


### MCO

The MCO and its subcomponents are the most affected by this change. The
aforementioned spec translator will be housed in the MCO. This means that the
MCO would need to simultaneously vendor both ignition v0 and v2, and translate
existing configs between the spec versions as needed.

Phase 1:

The MCD has the capability to understand both spec 2 and spec 3 configs,
and lay down files as needed.

The MCC has the capability to translate spec 2 to spec 3 configs. If the
translation is completely successful, the MCD will be instructed to use
the spec 3 configs. Otherwise the MCD will continue using the existing
spec 2 configs, and apply new spec 2/spec 3 configs. If at any point the
MCC notices it can fully translate to spec 3, that will be the version we
use.

The MCS will host both spec 3 and spec 2 configs, with a functionality to
translate spec 3 down to spec 2 configs, and spec 2 up to spec 3 configs.
The MCS will first check which ignition spec version a new node supports,
before serving a config.
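
A minimal sketch of the serving decision follows. It assumes the MCS has already learned the Ignition release version of the requesting node; how that detection happens is out of scope here. The mapping of Ignition v0 to spec 2 and Ignition v2 to spec 3 is the pairing used throughout this proposal, while `renderedConfigs` and `serve` are hypothetical names for illustration only.

```go
package main

import (
	"errors"
	"fmt"
	"strconv"
	"strings"
)

// specForIgnition maps an Ignition release version (e.g. "0.33.0" or "2.0.0")
// to the config spec major version it consumes: v0 -> spec 2, v2 -> spec 3.
func specForIgnition(version string) (int, error) {
	v := strings.TrimPrefix(version, "v")
	major, err := strconv.Atoi(strings.SplitN(v, ".", 2)[0])
	if err != nil {
		return 0, fmt.Errorf("unparseable ignition version %q: %w", version, err)
	}
	switch major {
	case 0:
		return 2, nil
	case 2:
		return 3, nil
	}
	return 0, fmt.Errorf("unsupported ignition major version %d", major)
}

// renderedConfigs holds one pool's config in both spec versions
// (hypothetical type, not part of the MCO API).
type renderedConfigs struct {
	Spec2 []byte
	Spec3 []byte
}

// serve picks the config matching the node's Ignition version.
func (r renderedConfigs) serve(ignitionVersion string) ([]byte, error) {
	spec, err := specForIgnition(ignitionVersion)
	if err != nil {
		return nil, err
	}
	cfg := r.Spec3
	if spec == 2 {
		cfg = r.Spec2
	}
	if cfg == nil {
		return nil, errors.New("no config available for requested spec version")
	}
	return cfg, nil
}

func main() {
	r := renderedConfigs{
		Spec2: []byte(`{"ignition":{"version":"2.2.0"}}`),
		Spec3: []byte(`{"ignition":{"version":"3.0.0"}}`),
	}
	for _, v := range []string{"0.33.0", "2.0.0"} {
		cfg, err := r.serve(v)
		fmt.Println(v, string(cfg), err)
	}
}
```

Returning an error when one of the two variants is missing models the case where a translation could not be produced, so a node is never handed a config in a spec it cannot parse.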

Phase 2:

The MCO will flat out reject spec 2 configs, and refuse to upgrade clusters
that have spec 2 bits.

The MCO will also throw an alert upon:
- An attempted update to the new ignition spec cannot be performed due to untranslatable configs, and the user must manually update/remove the untranslatable config before the update can continue
- A spec 2 machineconfig is applied to a cluster that has transitioned to phase 2 (pure spec 3), and the cluster rejects that machineconfig.

The MCO should also add the ability to reconcile broken spec 2 -> spec 3
updates, after manual intervention from the user.
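
The phase 2 check could look roughly like the sketch below. It relies only on the `ignition.version` field that every Ignition config carries (for example `2.2.0` for spec 2 and `3.0.0` for spec 3). The function names and the machineconfig names in `main` are hypothetical, and the real MCO would operate on MachineConfig objects rather than a plain map.

```go
package main

import (
	"encoding/json"
	"fmt"
	"sort"
	"strconv"
	"strings"
)

// specMajor reads ignition.version from a raw Ignition config and returns
// its spec major version.
func specMajor(raw []byte) (int, error) {
	var cfg struct {
		Ignition struct {
			Version string `json:"version"`
		} `json:"ignition"`
	}
	if err := json.Unmarshal(raw, &cfg); err != nil {
		return 0, fmt.Errorf("not a valid ignition config: %w", err)
	}
	major, err := strconv.Atoi(strings.SplitN(cfg.Ignition.Version, ".", 2)[0])
	if err != nil {
		return 0, fmt.Errorf("bad ignition.version %q: %w", cfg.Ignition.Version, err)
	}
	return major, nil
}

// blockingConfigs returns the sorted names of all configs that are not
// valid spec 3 configs and would therefore block a phase 2 update.
func blockingConfigs(configs map[string][]byte) []string {
	var blocking []string
	for name, raw := range configs {
		if major, err := specMajor(raw); err != nil || major != 3 {
			blocking = append(blocking, name)
		}
	}
	sort.Strings(blocking)
	return blocking
}

func main() {
	configs := map[string][]byte{
		"00-worker":     []byte(`{"ignition":{"version":"3.0.0"}}`),
		"99-custom-ntp": []byte(`{"ignition":{"version":"2.2.0"}}`),
	}
	if blocking := blockingConfigs(configs); len(blocking) > 0 {
		fmt.Println("update blocked by:", blocking)
	}
}
```

Listing the offending configs by name, rather than returning a bare failure, gives the alert enough detail for the admin to update or remove them.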


### User Stories

** As the admin of an existing 4.x cluster on spec 2, I’d like to upgrade to the newest version and use ignition spec 3 **

Acceptance criteria:
- The update completes without user intervention, if all machine configs existing on the cluster can be directly translated to spec 3
- The user receives an alert, if the update is unable to complete due to untranslatable configs
- The user is able to recover the cluster and finish the update if the untranslatable configs are manually translated or removed, or roll back to the old version
- The user should have received notification that the update will be changing spec version, as well as received necessary documentation on how to recover a failed update
- CI tests are put in place to make sure the existing versions can be updated to the new payload

** As a user of OpenShift, I’d like to install from a spec 2 bootimage and immediately update to a spec 3 payload **

Acceptance criteria:
- Essentially the same as story 1

** As a user of OpenShift, I’d like to install a fresh ignition spec 3 cluster **

Acceptance criteria:
- The workflow remains the same for an IPI cluster
- The workflow remains the same for a UPI cluster, minus custom specification changes
- The user should have good documentation, based on version, of how to set up user defined configs during install time

** As an admin of an existing spec 3 cluster, I’d like to apply a new machineconfig **

Acceptance criteria:
- The machineconfig is applied successfully, if the user has defined a correct spec 3 ignition snippet
- The user is properly alerted if they attempt to apply a spec 2 config, and the machineconfig fails to apply
- The user is given necessary docs to remove the undesired spec 2 config and to translate it to a spec 3 config

** As an admin of an existing spec 3 cluster, I’d like to autoscale a new node **

Acceptance criteria:
- The MCS can correctly detect which ignition version to serve the bootimage
- The bootimage boots correctly and pivots to align correctly to the rest of the cluster version


### Risks and Mitigations

** Failing to update a cluster **

The IPI configuration is fully translatable. UPI configurations, as well as
user-provided configuration applied as day 2 operations, are not workflows we
can guarantee. Some users will simply fail to update their cluster to a new
version. To mitigate, we must allow the user to recover and/or reconcile, or at
the bare minimum have comprehensive documentation on what to do in this
situation.

** Failing to apply a spec 2 machineconfig that worked prior to the final update **

Users will likely be unhappy that there is such a large breaking change. In
other similar cases, e.g. for auto-deployed metal clusters, the served ignition
configs must all be updated after a certain point of the bootimage to be able
to bring up new machines. To mitigate we should communicate this change well in
advance, and provide methods to translate ignition configs. Failed
installation/alerting systems must clearly communicate the source of error in
this case, as well as how to mitigate.


## Design Details

The implemented changes for the various components can be separate, with the
caveat that ignition spec 3 support for MCO must happen first (so that other
component changes can be tested in cluster). The MCO changes can be standalone,
as they serve to bring a spec 2 cluster to spec 3, or work as-is on a spec 3
cluster.

** RHCOS details: **

RHCOS must change to use ignition v2, which supports spec 3 configs. The actual
switching of the package is very easy on RHCOS. The building of ignition v2,
however, presents two issues:

- The util-linux package is old on RHEL, without support for “lsblk -o PTUUID”, which ignition uses. This will have to be reworked in RHCOS, or the package must be bumped and rebuilt for RHEL 8.1, or worked around as in https://github.com/coreos/ignition-dracut/pull/133
- Ignition-dracut has seen significant deltas between the spec2x and master (spec 3) branches, especially for initramfs network management. There are also minor details such as targets that need to be checked for existing units. There exists a need to merge some spec2x bits back into 3x before RHCOS can move to 3x

This change will be phase 2 only.

** Installer details: **

The installer only needs to support both ignition spec 2 and spec 3 in order
to have early support for spec 3 in place. Spec 3 will be generated for
worker/master during phase 1, but bootstrap will be on spec 2 until phase 2.

The generated v3 ignition configs are passed to the bootstrap MCS, which
immediately down-translates them to spec 2. Both versions will be served in
parallel, and the MCS will curl the node first to detect ignition version
before serving it the corresponding config.

At the time of writing this proposal, there exist FCoS/OKD branches for the
installer that are looking to move to spec 3, and have had success in installing
a cluster. This work can be integrated for OCP as well. The main remaining issue
is that, because of the necessity of moving to Go modules as the vendoring
method, there are failures in the Azure Terraform provider that seem to be
incompatible with this change.

** MCO details: **

A spec translator will first be implemented in the MCO, with the ability to
detect untranslatable configs. The MCO should then be updated to support both
ignition v0 (spec 2) and ignition v2 (spec 3). Since the MCO is responsible for the current
cluster nodes, it will be the only place at which spec translation is done.
The translation will happen when the version of MCO with dual support and
translator is first deployed; it will detect that the existing config is spec 2
and generate a new renderedconfig using the spec 2 to spec 3 translator. If
this translation is successful, it will instruct the MCD that the spec 3 config
is now the complete renderedconfig of the system. Future spec 2 configs applied
to the system will undergo the same translation. After phase 2 happens, spec 2
configs detected will be rejected and an error thrown.

If the translation fails, the MCO will throw an alert to the admin that the
cluster machineconfig will soon be switching to spec 3, and there are existing
configs that are not translatable. If the admin takes no action, eventually the
cluster will fail to upgrade. In phase 2, the admin will see a failed update
with reason as "Unupgradeable".
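
The success and failure paths described above can be summarized in a short sketch. This is an illustration under stated assumptions rather than the MCO's reconcile loop: `translateFunc` stands in for the spec 2 to spec 3 translator, `rejectEmpty` is a toy translator used only to exercise both paths, and the `outcome` fields are hypothetical names for "which spec the MCD is told to use", "which configs to alert on", and "whether the update may proceed".

```go
package main

import (
	"errors"
	"fmt"
	"sort"
)

// translateFunc is a hypothetical stand-in for the spec 2 -> spec 3 translator.
type translateFunc func(spec2 []byte) (spec3 []byte, err error)

// outcome summarizes what the MCO would do after attempting translation.
type outcome struct {
	SpecInUse      int      // spec version the MCD is instructed to use
	Untranslatable []string // configs to alert the admin about
	Upgradeable    bool     // false blocks the update to a spec 3 only release
}

// reconcile tries to translate every spec 2 config. Only if all of them
// translate does the cluster switch to spec 3; otherwise it stays on spec 2
// and the untranslatable configs are reported.
func reconcile(configs map[string][]byte, translate translateFunc) outcome {
	var failed []string
	for name, raw := range configs {
		if _, err := translate(raw); err != nil {
			failed = append(failed, name)
		}
	}
	sort.Strings(failed)
	if len(failed) == 0 {
		return outcome{SpecInUse: 3, Upgradeable: true}
	}
	return outcome{SpecInUse: 2, Untranslatable: failed, Upgradeable: false}
}

// rejectEmpty is a toy translator for demonstration: it fails on empty
// input and otherwise returns the input unchanged.
func rejectEmpty(spec2 []byte) ([]byte, error) {
	if len(spec2) == 0 {
		return nil, errors.New("empty config cannot be translated")
	}
	return spec2, nil
}

func main() {
	configs := map[string][]byte{
		"00-worker": []byte("{}"),
		"99-custom": nil,
	}
	fmt.Printf("%+v\n", reconcile(configs, rejectEmpty))
}
```

Treating translation as all-or-nothing keeps a single rendered config per pool, which matches the requirement that a cluster only has one desired config at a time.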

The spec translator will also translate the existing bootimage configs served
to new nodes joining the cluster. The MCS will check which config version is
currently being served, and will translate spec 2 to spec 3 and vice-versa,
so the MCS with dual support will always be able to serve both. Failed
spec 2 to spec 3 translations will also be handled as above, with warning
that at some point the cluster will refuse to upgrade. Spec 3 bootimages will
also fail to join the cluster.

The MCO should also add functionality to more easily reconcile broken
machineconfigs and ignition specs being served, thus allowing a cluster admin
to correctly recover/abort a failed update to spec 3.


** Other notes: **

Spec 2 -> Spec 3 translation has not been fully implemented anywhere before.
There could be many edge cases we have not yet considered. There are other
potential difficulties such as serving the correct ignition config. See above
section on risks and mitigations.

Starting from some version of OpenShift, likely v4.6, we can remove dual
support and be fully ignition spec 3.

Kubernetes 1.16 onwards has support for CRD versioning:
https://kubernetes.io/docs/tasks/access-kubernetes-api/custom-resources/custom-resource-definition-versioning/.
If we opt to delay this work, CRD versioning is potentially an alternative
method of implementation.


### Test Plan

Extensive testing of all possible paths, especially those outlined in the
user stories, is critical to the success of this major update. The existing
CI infrastructure is a good start for upgrade testing. There should be
additional tests added, especially in the MCO repo, for edge cases as
described in the user stories, to ensure we never break this behaviour.
Many existing tests will also have to be updated given the spec change.


### Upgrade / Downgrade Strategy

The spec translation will happen as part of an upgrade, when the new MCO
is deployed. See above discussions on alerts during upgrade. For clusters
that are already on spec 3, future upgrades will proceed as usual, much
like what we have in spec 2.


### Graduation Criteria

This is a high risk change. Success of this change will require extensive
testing during upgrades. UPI clusters are especially at risk since there
are potentially situations we cannot reconcile with spec translations.
Some of the exact details need further fleshing out during implementation,
and may prove infeasible. Existing user workflows will be disrupted, so
communication of these changes will also be very important.


## Infrastructure Needed [optional]

None extra
