blocked-edges: Block edges for storage.conf machine-config bug
The machine-config operator had a bug where MachineConfig entries led the machine-config daemon (MCD) to lay down a storage.conf that exactly matched the content installed by the containers-common RPM. On update, the RHCOS machine pivots to a new OSTree image (defined in the machine-os-content image referenced from the release image). Seeing storage.conf content that matched the old OSTree image, libostree replaced storage.conf with the version defined in the new OSTree image [1]. Then, when the MCD comes back up post-pivot, it sees the divergent storage.conf content and freaks out with logs like [2]:

  E1210 16:15:51.105286 11181 daemon.go:1350] content mismatch for file /etc/containers/storage.conf:

and the machine-config operator goes Degraded=True with RequiredPoolsFailed "nodes are reporting degraded status on sync" [3].

The narrow machine-config fix was to annotate the storage.conf that the MCD writes, so that the file no longer matches the RPM-installed content and libostree doesn't touch it on pivot [4]. This addresses the storage.conf case, but leaves the MCD vulnerable to other instances of "MCD writes exactly the OSTree contents to $FILE and expects it to remain untouched during an OSTree pivot that bumps the file". I'm not aware of a generic fix at the moment, although [5] might be related.

You can guard a cluster against the narrow bug by setting a MachineConfig [6] or a higher-level object such as a ContainerRuntimeConfig [7] that will cause the MCD to write a storage.conf that diverges (even just by a comment or whitespace) from the OSTree original.
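As a rough illustration of the MachineConfig guard described above, the following sketch renders a manifest that rewrites storage.conf with one extra trailing comment. It is not taken from the machine-config operator: the function name, the MachineConfig name, and the comment text are hypothetical, the Ignition spec version 2.2.0 is an assumption about what these releases accept, and the input file is assumed to be a copy of the storage.conf currently on the nodes.

```shell
#!/usr/bin/env bash
# Sketch only: names and comment text are hypothetical, not from the MCO.

# render_storage_conf_guard ROLE STORAGE_CONF
# Prints a MachineConfig manifest that writes STORAGE_CONF plus one trailing
# comment, so the file can no longer match the RPM-installed content exactly.
render_storage_conf_guard() {
  local role="$1" source_file="$2" payload
  # Append the marker comment, then base64-encode onto a single line.
  payload="$( { cat "$source_file"; printf '# Managed by MachineConfig; intentionally differs from the RPM default.\n'; } | base64 | tr -d '\n')"
  cat <<EOF
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  name: 99-${role}-storage-conf-guard
  labels:
    machineconfiguration.openshift.io/role: ${role}
spec:
  config:
    ignition:
      version: 2.2.0   # assumption: spec version accepted by these releases
    storage:
      files:
      - path: /etc/containers/storage.conf
        filesystem: root
        mode: 420      # decimal for 0644
        contents:
          source: data:text/plain;charset=utf-8;base64,${payload}
EOF
}
```

One possible use would be `render_storage_conf_guard worker ./storage.conf | oc apply -f -`, repeated for each machine-config pool role; review the rendered manifest before applying it to a real cluster.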
Tracking the narrow fix through the various z streams:

The 4.1 machine-config bug was introduced in d2c44d7 [8], which landed before 4.1.0-rc.0:

  $ oc adm release info --commits quay.io/openshift-release-dev/ocp-release:4.1.0-rc.0 | grep machine-config
    machine-config-controller https://github.com/openshift/machine-config-operator de9998eb37e90b3ee2fcfdbb3eda7ba26870ab6e
    machine-config-daemon https://github.com/openshift/machine-config-operator de9998eb37e90b3ee2fcfdbb3eda7ba26870ab6e
    machine-config-operator https://github.com/openshift/machine-config-operator de9998eb37e90b3ee2fcfdbb3eda7ba26870ab6e
    machine-config-server https://github.com/openshift/machine-config-operator de9998eb37e90b3ee2fcfdbb3eda7ba26870ab6e
    setup-etcd-environment https://github.com/openshift/machine-config-operator de9998eb37e90b3ee2fcfdbb3eda7ba26870ab6e
  $ git --no-pager log --oneline --first-parent de9998eb37 | grep d2c44d7
  d2c44d7c Merge pull request openshift#330 from umohnani8/runtime

The 4.1 machine-config fix was [9], landed in 1301934 [10], which is new in 4.1.34:

  $ oc adm release info --commits quay.io/openshift-release-dev/ocp-release:4.1.34-x86_64 | grep machine-config
    machine-config-controller https://github.com/openshift/machine-config-operator f56d736e74af8fb0dc85c4b1ee3cc8d1d1f6600b
    machine-config-daemon https://github.com/openshift/machine-config-operator f56d736e74af8fb0dc85c4b1ee3cc8d1d1f6600b
    machine-config-operator https://github.com/openshift/machine-config-operator f56d736e74af8fb0dc85c4b1ee3cc8d1d1f6600b
    machine-config-server https://github.com/openshift/machine-config-operator f56d736e74af8fb0dc85c4b1ee3cc8d1d1f6600b
    setup-etcd-environment https://github.com/openshift/machine-config-operator f56d736e74af8fb0dc85c4b1ee3cc8d1d1f6600b
  $ oc adm release info --commits quay.io/openshift-release-dev/ocp-release:4.1.31-x86_64 | grep machine-config
    machine-config-controller https://github.com/openshift/machine-config-operator b38afe6e5b79a3e11e881429dc4c7c70e8784e84
    machine-config-daemon https://github.com/openshift/machine-config-operator b38afe6e5b79a3e11e881429dc4c7c70e8784e84
    machine-config-operator https://github.com/openshift/machine-config-operator b38afe6e5b79a3e11e881429dc4c7c70e8784e84
    machine-config-server https://github.com/openshift/machine-config-operator b38afe6e5b79a3e11e881429dc4c7c70e8784e84
    setup-etcd-environment https://github.com/openshift/machine-config-operator b38afe6e5b79a3e11e881429dc4c7c70e8784e84
  $ git --no-pager log --oneline --first-parent -2 f56d736e74a
  f56d736e (origin/release-4.1) Merge pull request openshift#1147 from openshift-cherrypick-robot/cherry-pick-1114-to-release-4.1
  1301934a Merge pull request openshift#1382 from vrutkovs/4.1-containers-conf-generated

The 4.2 machine-config fix was [2], landed in bd358bb [11], which is new in 4.2.18:

  $ oc adm release info --commits quay.io/openshift-release-dev/ocp-release:4.2.16-x86_64 | grep machine-config
    machine-config-operator https://github.com/openshift/machine-config-operator 31fed93186c9f84708f5cdfd0227ffe4f79b31cd
  $ oc adm release info --commits quay.io/openshift-release-dev/ocp-release:4.2.18-x86_64 | grep machine-config
    machine-config-operator https://github.com/openshift/machine-config-operator 9366460085b2a24d825380759f554769ec5ab4f9
  $ git --no-pager log --oneline --first-parent -2 9366460085
  93664600 Merge pull request openshift#1362 from rphillips/fixes/1787581_4.2
  bd358bb7 Merge pull request openshift#1323 from openshift-cherrypick-robot/cherry-pick-1320-to-release-4.2

The 4.3 machine-config fix was [12], landed in 9fd53bd [13], which landed early enough for 4.3.0-rc.0:

  $ oc adm release info --commits quay.io/openshift-release-dev/ocp-release:4.3.0-rc.0-x86_64 | grep machine-config
    machine-config-operator https://github.com/openshift/machine-config-operator 23a6e6fb37e73501bc3216183ef5e6ebb15efc7a
  $ git --no-pager log --oneline --first-parent -8 23a6e6fb37
  23a6e6fb Merge pull request openshift#1348 from openshift-cherrypick-robot/cherry-pick-1285-to-release-4.3
  80c8aed7 Merge pull request openshift#1343 from retroflexer/cherry-pick-backup-restore-kube-static-resources
  269990a3 Merge pull request openshift#1344 from openshift-cherrypick-robot/cherry-pick-1296-to-release-4.3
  fd3ca395 Merge pull request openshift#1338 from runcom/fix-go-mod
  ba304dbb Merge pull request openshift#1333 from openshift-cherrypick-robot/cherry-pick-1278-to-release-4.3
  787f3fa9 Merge pull request openshift#1332 from runcom/reserved-cpus-4.3
  2b85d6ba Merge pull request openshift#1329 from openshift-cherrypick-robot/cherry-pick-1314-to-release-4.3
  9fd53bd5 Merge pull request openshift#1322 from openshift-cherrypick-robot/cherry-pick-1320-to-release-4.3

The 4.4 machine-config fix was [3] which has landed before any 4.4 RCs have been cut. Even in 4.4, the generated note was the first content touch to this template:

  $ git --no-pager log --oneline --follow origin/release-4.4 -- templates/common/_base/files/container-storage.yaml
  46c4e27a (origin/pr/1320) templates/container-storage: Add a "this is generated" note
  47a6321c templates: Move container-storage.yaml into common/
  74ae3b31 (origin/pr/330) Add ContainerRuntime CRD and Controller

(47a6321c was a pure rename).

So the MCD has been annotating storage.conf since 4.1.34, 4.2.18, and all 4.3 and later releases.

When has the RPM-installed storage.conf changed? Figuring this part out is a bit awkward, because we need to drill down machine-os-content -> RHCOS -> RPM -> file.
For example, from 4.2.16 -> 4.2.18 [14]:

  $ oc image info --output json $(oc adm release info --image-for=machine-os-content quay.io/openshift-release-dev/ocp-release:4.2.16-x86_64) | jq -r .config.config.Labels.version
  42.81.20200114.0
  $ oc image info --output json $(oc adm release info --image-for=machine-os-content quay.io/openshift-release-dev/ocp-release:4.2.18-x86_64) | jq -r .config.config.Labels.version
  42.81.20200203.1
  $ ./differ.py --first-endpoint art --first-version 42.81.20200114.0 --second-endpoint art --second-version 42.81.20200203.1 | jq -r '.diff | keys | sort[]'
  cri-o
  ignition
  libarchive
  machine-config-daemon
  openshift-clients
  openshift-hyperkube
  sqlite-libs

storage.conf is managed by the containers-common RPM, which is not in that list, so there was no storage.conf change from 4.2.16 to 4.2.18, and that update will safely pull in the fixed MCD without a surprising pivot change.

Here are our changes to the RPM across the various z streams:

  $ for OCP in 4.1.1 4.1.23 4.1.24 4.1.31-x86_64 4.1.34-x86_64; do RHCOS="$(oc image info --output json $(oc adm release info --image-for=machine-os-content "quay.io/openshift-release-dev/ocp-release:${OCP}") | jq -r .config.config.Labels.version)"; COMMON="$(curl -s "https://releases-rhcos-art.cloud.privileged.psi.redhat.com/storage/releases/rhcos-4.1/${RHCOS}/commitmeta.json" | jq -r '.["rpmostree.rpmdb.pkglist"][] | select(.[0] == "containers-common") | .[2]')"; echo "${RHCOS} ${COMMON} ${OCP}"; done
  410.8.20190606.0 0.1.32 4.1.1
  410.8.20191030.0 0.1.32 4.1.23
  410.81.20191112.2 0.1.37 4.1.24
  410.81.20200114.0 0.1.37 4.1.31-x86_64
  410.81.20200204.1 0.1.40 4.1.34-x86_64
  $ for OCP in 4.2.0-rc.0 4.2.2 4.2.4 4.2.16-x86_64 4.2.18-x86_64 4.2.19-x86_64; do RHCOS="$(oc image info --output json $(oc adm release info --image-for=machine-os-content "quay.io/openshift-release-dev/ocp-release:${OCP}") | jq -r .config.config.Labels.version)"; COMMON="$(curl -s "https://releases-rhcos-art.cloud.privileged.psi.redhat.com/storage/releases/rhcos-4.2/${RHCOS}/commitmeta.json" | jq -r '.["rpmostree.rpmdb.pkglist"][] | select(.[0] == "containers-common") | .[2]')"; echo "${RHCOS} ${COMMON} ${OCP}"; done
  42.80.20190930.1 0.1.32 4.2.0-rc.0
  42.80.20191022.0 0.1.32 4.2.2
  42.81.20191107.0 0.1.37 4.2.4
  42.81.20200114.0 0.1.37 4.2.16-x86_64
  42.81.20200203.1 0.1.37 4.2.18-x86_64
  42.81.20200210.0 0.1.40 4.2.19-x86_64
  $ for OCP in 4.3.0-rc.0-x86_64 4.3.3-x86_64; do RHCOS="$(oc image info --output json $(oc adm release info --image-for=machine-os-content "quay.io/openshift-release-dev/ocp-release:${OCP}") | jq -r .config.config.Labels.version)"; COMMON="$(curl -s "https://releases-rhcos-art.cloud.privileged.psi.redhat.com/storage/releases/rhcos-4.3/${RHCOS}/x86_64/commitmeta.json" | jq -r '.["rpmostree.rpmdb.pkglist"][] | select(.[0] == "containers-common") | .[2]')"; echo "${RHCOS} ${COMMON} ${OCP}"; done
  43.81.202001072253.0 0.1.40 4.3.0-rc.0-x86_64
  43.81.202002170853.0 0.1.40 4.3.3-x86_64

Fetching a source RPM for containers-common, e.g. from [15,16], shows the source packages coming from skopeo.
Checking [17]:

  $ git --no-pager log --follow --oneline --stat=200 -M50% -- vendor/github.com/containers/storage/storage.conf
  afaa9e7f Bump github.com/containers/storage from 1.15.1 to 1.15.2
   vendor/github.com/containers/storage/storage.conf | 3 ---
   1 file changed, 3 deletions(-)
  39ff039b Image encryption/decryption support in skopeo
   vendor/github.com/containers/storage/storage.conf | 44 +++++++++++++++++++++++++-------------------
   1 file changed, 25 insertions(+), 19 deletions(-)
  05ae513b Bump github.com/containers/buildah from 1.8.4 to 1.11.4
   vendor/github.com/containers/storage/storage.conf | 7 -------
   1 file changed, 7 deletions(-)
  700b3102 update github.com/containers/{image,storage}
   vendor/github.com/containers/storage/storage.conf | 8 ++++++++
   1 file changed, 8 insertions(+)
  033b2902 migrate to go modules
   vendor/github.com/containers/storage/storage.conf | 130 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
   1 file changed, 130 insertions(+)
  $ git --no-pager log --follow --oneline --stat=200 -M50% 033b2902^ -- contrib/storage.conf
  fe259105 add storage.conf and manpage in contrib/
   contrib/storage.conf | 28 ++++++++++++++++++++++++++++
   1 file changed, 28 insertions(+)
  $ for HASH in fe259105 033b2902 700b3102 05ae513b 39ff039b afaa9e7f; do git describe --contains "${HASH}"; done
  v0.1.29~3^2
  v0.1.38~14^2~2
  v0.1.39~1
  v0.1.41~25^2
  v0.1.41~21^2
  v0.1.41~12^2

So changes may have been made in 0.1.29 (when the file landed for the first time, likely from wherever we store post-Git patches), and were likely made in 0.1.38, 0.1.39, and 0.1.41. Comparing with our machine-os-content, that means vulnerable transitions are:

* 4.1.* -> 4.1.34, since 4.1.31 -> 4.1.34 takes containers-common from 0.1.37 to 0.1.40, picking up the v0.1.38~14^2~2 and v0.1.39~1 bumps. There may be no safe way to get to 4.1.34.
* 4.1.* -> 4.2... FIXME
* 4.2.16 and earlier -> 4.2.19, since 4.2.18 -> 4.2.19 takes containers-common from 0.1.37 to 0.1.40, picking up the v0.1.38~14^2~2 and v0.1.39~1 bumps. 4.2.16 and earlier -> 4.2.18 is fine, because there were no RPM-induced storage.conf bumps. 4.2.18 -> 4.2.* is fine, because 4.2.18 has the patched machine-config source.
* 4.2.16 and earlier -> 4.3, since 4.2.18 -> 4.3 takes containers-common from 0.1.37 to 0.1.40, picking up the v0.1.38~14^2~2 and v0.1.39~1 bumps. 4.2.18 -> 4.3 is fine, because 4.2.18 has the patched machine-config source.
* 4.3 -> 4.3 updates are fine, since all 4.3 releases have the patched machine-config source.

So ideally this pull would block edges from 4.2.16 and earlier into 4.3. But because blocked-edges requires an explicit "to", I've just added the 4.3.0 blocker (other 4.3.z releases either already blocked 4.2.* or only give 4.2.18+ as update sources). I've also dropped 4.2.16 from the *-4.3 channels with a comment about this bug. There shouldn't be much pushback on pulling the edge, because users can still move from 4.2 to 4.3 via 4.2.19 -> 4.3.2.

Also simplify the wording on the GCP bug 1793635, which remains unfixed.
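The transition rules above can be condensed into a small check. This is only a sketch under stated assumptions: the function names are hypothetical, the patched-MCD thresholds (4.1.34, 4.2.18, any 4.3+) and the storage.conf bump versions (0.1.38, 0.1.39, 0.1.41) are copied from the analysis above, GNU `sort -V` is assumed for version ordering, and only the containers-common version field is compared (not the RPM release, and not whether a cluster already carries a diverging storage.conf).

```shell
#!/usr/bin/env bash
# Sketch only: hypothetical helper names; thresholds copied from the analysis.

# Succeeds when version $1 sorts strictly before version $2.
version_lt() {
  [ "$1" != "$2" ] && [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$1" ]
}

# Succeeds when the release's MCD annotates storage.conf:
# 4.1.34 and later, 4.2.18 and later, or any 4.3+ release.
mcd_is_patched() {
  local version="${1%%-*}"   # drop suffixes such as -rc.0 or -x86_64
  local minor patch
  minor="$(printf '%s' "$version" | cut -d. -f2)"
  patch="$(printf '%s' "$version" | cut -d. -f3)"
  case "$minor" in
    1) [ "$patch" -ge 34 ] ;;
    2) [ "$patch" -ge 18 ] ;;
    *) [ "$minor" -ge 3 ] ;;
  esac
}

# skopeo versions in which the packaged storage.conf likely changed.
STORAGE_CONF_BUMPS="0.1.38 0.1.39 0.1.41"

# edge_is_vulnerable FROM_RELEASE FROM_COMMON TO_COMMON
# Succeeds when the source release has an unpatched MCD and the update
# crosses at least one storage.conf bump, i.e. FROM_COMMON < bump <= TO_COMMON.
edge_is_vulnerable() {
  local from_release="$1" from_common="$2" to_common="$3" bump
  if mcd_is_patched "$from_release"; then
    return 1
  fi
  for bump in $STORAGE_CONF_BUMPS; do
    if version_lt "$from_common" "$bump" && ! version_lt "$to_common" "$bump"; then
      return 0
    fi
  done
  return 1
}
```

For example, `edge_is_vulnerable 4.2.16-x86_64 0.1.37 0.1.40` succeeds (the 4.2.16 -> 4.2.19 or 4.3 case), while the same call with `4.2.18-x86_64` as the source, or with `0.1.37` as the target containers-common version, fails.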
[1]: openshift/machine-config-operator#1320 (comment)
[2]: https://bugzilla.redhat.com/show_bug.cgi?id=1782152#c5
[3]: https://bugzilla.redhat.com/show_bug.cgi?id=1781708#c0
[4]: https://github.com/openshift/machine-config-operator/pull/1320/files
[5]: openshift/machine-config-operator#1190
[6]: https://github.com/openshift/machine-config-operator/blob/13f0dda734262c3edbd23c007e42b7704125e88f/docs/MachineConfiguration.md
[7]: https://github.com/openshift/machine-config-operator/blob/13f0dda734262c3edbd23c007e42b7704125e88f/docs/ContainerRuntimeConfigDesign.md
[8]: openshift/machine-config-operator#330 (comment)
[9]: https://bugzilla.redhat.com/show_bug.cgi?id=1782153
[10]: openshift/machine-config-operator#1382 (comment)
[11]: openshift/machine-config-operator#1323 (comment)
[12]: https://bugzilla.redhat.com/show_bug.cgi?id=1782149
[13]: openshift/machine-config-operator#1322 (comment)
[14]: https://gitlab.cee.redhat.com/coretools/differ
      Internal link, sorry :/ But you can also browse the history at:
        https://releases-rhcos-art.cloud.privileged.psi.redhat.com/?stream=releases/rhcos-4.2&release=42.81.20200114.0
      etc.
[15]: https://access.redhat.com/downloads/content/290/ver=4.2/rhel---8/4.2.0/x86_64/packages
[16]: https://access.redhat.com/downloads/content/rhel---8/x86_64/8841/containers-common/0.1.32-5.git1715c90.el8/x86_64/fd431d51/package
[17]: https://github.com/containers/skopeo/