blocked-edges: Block edges for storage.conf machine-config bug

The machine-config operator had a bug where MachineConfig entries led
the machine-config daemon (MCD) to lay down a storage.conf that
exactly matched the content installed by the containers-common RPM.
On update, the RHCOS machine pivots to a new OSTree image (defined in
the machine-os-content image referenced from the release image).
Seeing storage.conf content that matched the old OSTree image,
libostree replaced storage.conf with the version defined in the new
OSTree image [1].  Then, when the MCD comes back up post-pivot, it
sees the divergent storage.conf content and freaks out with logs like
[2]:

  E1210 16:15:51.105286   11181 daemon.go:1350] content mismatch for file /etc/containers/storage.conf:

and the machine-config operator goes Degraded=True with
RequiredPoolsFailed "nodes are reporting degraded status on sync" [3].
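
The failure mode can be reproduced in miniature.  The sketch below is
a toy model, not libostree or MCD code: pivot_etc_file and
mcd_validate are hypothetical helpers, and the file contents are
made-up placeholders.  It assumes the usual OSTree rule for /etc, where
a file identical to the old image's default counts as unmodified and
is replaced by the new image's default.

```python
def pivot_etc_file(on_disk: str, old_default: str, new_default: str) -> str:
    """Toy model of how an /etc file is treated during an OSTree pivot."""
    if on_disk == old_default:
        # The file looks unmodified, so the new image's default wins.
        return new_default
    # Locally modified files are left alone.
    return on_disk


def mcd_validate(expected: str, on_disk: str) -> bool:
    """Toy model of the post-pivot check: the file must match what was written."""
    return expected == on_disk


# Placeholder contents, not the real storage.conf.
old_default = '[storage]\ndriver = "overlay"\n'
new_default = old_default + "# option added by a newer RPM\n"

# Buggy case: the MCD wrote exactly the old RPM content.
written = old_default
after_pivot = pivot_etc_file(written, old_default, new_default)
print(mcd_validate(written, after_pivot))  # False: content mismatch

# A file that diverges from the old default, even by one comment, survives.
divergent = "# written by the MCD\n" + old_default
after_pivot_divergent = pivot_etc_file(divergent, old_default, new_default)
print(mcd_validate(divergent, after_pivot_divergent))  # True
```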

The narrow machine-config fix was to annotate the storage.conf that it
writes, so libostree doesn't touch the file on pivot [4].  This
addresses the storage.conf case, but leaves the MCD vulnerable to
other instances of "MCD writes exactly the OSTree contents to $FILE
and expects it to remain untouched during an OSTree pivot that bumps
the file".  I'm not aware of a generic fix at the moment, although [5]
might be related.  You can guard a cluster against the narrow bug by
setting a MachineConfig [6] or higher level object such as a
ContainerRuntimeConfig [7] that will cause the MCD to write a
storage.conf that diverges (even just by a comment or whitespace) from
the OSTree original.
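
As a rough sketch of the MachineConfig guard, the helper below builds
a manifest that writes storage.conf with caller-supplied contents.
The function name, the object name, and the sample contents are
hypothetical, and the Ignition spec version (2.2.0) is an assumption
about what these releases accept.  Pass in your cluster's real
storage.conf plus a distinguishing comment, not the placeholder shown
here.

```python
import json
from urllib.parse import quote


def storage_conf_machineconfig(name: str, role: str, contents: str) -> dict:
    """Build a MachineConfig that writes /etc/containers/storage.conf."""
    return {
        "apiVersion": "machineconfiguration.openshift.io/v1",
        "kind": "MachineConfig",
        "metadata": {
            "name": name,
            "labels": {"machineconfiguration.openshift.io/role": role},
        },
        "spec": {
            "config": {
                # Assumed Ignition spec version; adjust for your release.
                "ignition": {"version": "2.2.0"},
                "storage": {
                    "files": [
                        {
                            "path": "/etc/containers/storage.conf",
                            "filesystem": "root",
                            "mode": 0o644,
                            # Percent-encoded data URL carrying the file body.
                            "contents": {"source": "data:," + quote(contents)},
                        }
                    ]
                },
            }
        },
    }


# Placeholder body: a leading comment is enough to diverge from the RPM copy.
guard = storage_conf_machineconfig(
    "99-worker-storage-conf-guard", "worker", "# guard\n[storage]\n"
)
print(json.dumps(guard, indent=2))
```

The JSON output is also valid YAML, so it can be saved to a file and
reviewed before being applied to a cluster.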

Tracking the narrow fix through the various z streams:

The 4.1 machine-config bug was introduced in d2c44d7 [8], which landed
before 4.1.0-rc.0:

  $ oc adm release info --commits quay.io/openshift-release-dev/ocp-release:4.1.0-rc.0 | grep machine-config
    machine-config-controller                     https://github.com/openshift/machine-config-operator                       de9998eb37e90b3ee2fcfdbb3eda7ba26870ab6e
    machine-config-daemon                         https://github.com/openshift/machine-config-operator                       de9998eb37e90b3ee2fcfdbb3eda7ba26870ab6e
    machine-config-operator                       https://github.com/openshift/machine-config-operator                       de9998eb37e90b3ee2fcfdbb3eda7ba26870ab6e
    machine-config-server                         https://github.com/openshift/machine-config-operator                       de9998eb37e90b3ee2fcfdbb3eda7ba26870ab6e
    setup-etcd-environment                        https://github.com/openshift/machine-config-operator                       de9998eb37e90b3ee2fcfdbb3eda7ba26870ab6e
  $ git --no-pager log --oneline --first-parent de9998eb37 | grep d2c44d7
  d2c44d7c Merge pull request openshift#330 from umohnani8/runtime

The 4.1 machine-config fix was [9], landed in 1301934 [10], which is
new in 4.1.34:

  $ oc adm release info --commits quay.io/openshift-release-dev/ocp-release:4.1.34-x86_64 | grep machine-config
    machine-config-controller                     https://github.com/openshift/machine-config-operator                       f56d736e74af8fb0dc85c4b1ee3cc8d1d1f6600b
    machine-config-daemon                         https://github.com/openshift/machine-config-operator                       f56d736e74af8fb0dc85c4b1ee3cc8d1d1f6600b
    machine-config-operator                       https://github.com/openshift/machine-config-operator                       f56d736e74af8fb0dc85c4b1ee3cc8d1d1f6600b
    machine-config-server                         https://github.com/openshift/machine-config-operator                       f56d736e74af8fb0dc85c4b1ee3cc8d1d1f6600b
    setup-etcd-environment                        https://github.com/openshift/machine-config-operator                       f56d736e74af8fb0dc85c4b1ee3cc8d1d1f6600b
  $ oc adm release info --commits quay.io/openshift-release-dev/ocp-release:4.1.31-x86_64 | grep machine-config
    machine-config-controller                     https://github.com/openshift/machine-config-operator                       b38afe6e5b79a3e11e881429dc4c7c70e8784e84
    machine-config-daemon                         https://github.com/openshift/machine-config-operator                       b38afe6e5b79a3e11e881429dc4c7c70e8784e84
    machine-config-operator                       https://github.com/openshift/machine-config-operator                       b38afe6e5b79a3e11e881429dc4c7c70e8784e84
    machine-config-server                         https://github.com/openshift/machine-config-operator                       b38afe6e5b79a3e11e881429dc4c7c70e8784e84
    setup-etcd-environment                        https://github.com/openshift/machine-config-operator                       b38afe6e5b79a3e11e881429dc4c7c70e8784e84
  $ git --no-pager log --oneline --first-parent -2 f56d736e74a
  f56d736e (origin/release-4.1) Merge pull request openshift#1147 from openshift-cherrypick-robot/cherry-pick-1114-to-release-4.1
  1301934a Merge pull request openshift#1382 from vrutkovs/4.1-containers-conf-generated

The 4.2 machine-config fix was [2], landed in bd358bb [11], which is new
in 4.2.18:

  $ oc adm release info --commits quay.io/openshift-release-dev/ocp-release:4.2.16-x86_64 | grep machine-config
    machine-config-operator                       https://github.com/openshift/machine-config-operator                       31fed93186c9f84708f5cdfd0227ffe4f79b31cd
  $ oc adm release info --commits quay.io/openshift-release-dev/ocp-release:4.2.18-x86_64 | grep machine-config
    machine-config-operator                       https://github.com/openshift/machine-config-operator                       9366460085b2a24d825380759f554769ec5ab4f9
  $ git --no-pager log --oneline --first-parent -2 9366460085
  93664600 Merge pull request openshift#1362 from rphillips/fixes/1787581_4.2
  bd358bb7 Merge pull request openshift#1323 from openshift-cherrypick-robot/cherry-pick-1320-to-release-4.2

The 4.3 machine-config fix was [12], landed in 9fd53bd [13], which
landed early enough for 4.3.0-rc.0:

  $ oc adm release info --commits quay.io/openshift-release-dev/ocp-release:4.3.0-rc.0-x86_64 | grep machine-config
    machine-config-operator                       https://github.com/openshift/machine-config-operator                       23a6e6fb37e73501bc3216183ef5e6ebb15efc7a
  $ git --no-pager log --oneline --first-parent -8 23a6e6fb37
  23a6e6fb Merge pull request openshift#1348 from openshift-cherrypick-robot/cherry-pick-1285-to-release-4.3
  80c8aed7 Merge pull request openshift#1343 from retroflexer/cherry-pick-backup-restore-kube-static-resources
  269990a3 Merge pull request openshift#1344 from openshift-cherrypick-robot/cherry-pick-1296-to-release-4.3
  fd3ca395 Merge pull request openshift#1338 from runcom/fix-go-mod
  ba304dbb Merge pull request openshift#1333 from openshift-cherrypick-robot/cherry-pick-1278-to-release-4.3
  787f3fa9 Merge pull request openshift#1332 from runcom/reserved-cpus-4.3
  2b85d6ba Merge pull request openshift#1329 from openshift-cherrypick-robot/cherry-pick-1314-to-release-4.3
  9fd53bd5 Merge pull request openshift#1322 from openshift-cherrypick-robot/cherry-pick-1320-to-release-4.3

The 4.4 machine-config fix was [3], which has landed before any 4.4 RCs
have been cut.  Even in 4.4, the generated note was the first content
touch to this template:

  $ git --no-pager log --oneline --follow origin/release-4.4 -- templates/common/_base/files/container-storage.yaml
  46c4e27a (origin/pr/1320) templates/container-storage: Add a "this is generated" note
  47a6321c templates: Move container-storage.yaml into common/
  74ae3b31 (origin/pr/330) Add ContainerRuntime CRD and Controller

(47a6321c was a pure rename).

So the MCD has been annotating storage.conf since 4.1.34, 4.2.18, and
all 4.3 and later releases.  When has the RPM-installed storage.conf
changed?  Figuring this part out is a bit awkward, because we need to
drill down machine-os-content -> RHCOS -> RPM -> file.  For example,
from 4.2.16 -> 4.2.18 [14]:

  $ oc image info --output json $(oc adm release info --image-for=machine-os-content quay.io/openshift-release-dev/ocp-release:4.2.16-x86_64) | jq -r .config.config.Labels.version
  42.81.20200114.0
  $ oc image info --output json $(oc adm release info --image-for=machine-os-content quay.io/openshift-release-dev/ocp-release:4.2.18-x86_64) | jq -r .config.config.Labels.version
  42.81.20200203.1
  $ ./differ.py --first-endpoint art --first-version 42.81.20200114.0 --second-endpoint art --second-version 42.81.20200203.1 | jq -r '.diff | keys | sort[]'
  cri-o
  ignition
  libarchive
  machine-config-daemon
  openshift-clients
  openshift-hyperkube
  sqlite-libs

storage.conf is managed by the containers-common RPM, so no change
from 4.2.16 to 4.2.18, and that update will safely pull in the fixed
MCD without a surprising pivot change.  Here are our changes to the
RPM across the various z streams:

  $ for OCP in 4.1.1 4.1.23 4.1.24 4.1.31-x86_64 4.1.34-x86_64; do RHCOS="$(oc image info --output json $(oc adm release info --image-for=machine-os-content "quay.io/openshift-release-dev/ocp-release:${OCP}") | jq -r .config.config.Labels.version)"; COMMON="$(curl -s "https://releases-rhcos-art.cloud.privileged.psi.redhat.com/storage/releases/rhcos-4.1/${RHCOS}/commitmeta.json" | jq -r '.["rpmostree.rpmdb.pkglist"][] | select(.[0] == "containers-common") | .[2]')"; echo "${RHCOS} ${COMMON} ${OCP}"; done
  410.8.20190606.0 0.1.32 4.1.1
  410.8.20191030.0 0.1.32 4.1.23
  410.81.20191112.2 0.1.37 4.1.24
  410.81.20200114.0 0.1.37 4.1.31-x86_64
  410.81.20200204.1 0.1.40 4.1.34-x86_64
  $ for OCP in 4.2.0-rc.0 4.2.2 4.2.4 4.2.16-x86_64 4.2.18-x86_64 4.2.19-x86_64; do RHCOS="$(oc image info --output json $(oc adm release info --image-for=machine-os-content "quay.io/openshift-release-dev/ocp-release:${OCP}") | jq -r .config.config.Labels.version)"; COMMON="$(curl -s "https://releases-rhcos-art.cloud.privileged.psi.redhat.com/storage/releases/rhcos-4.2/${RHCOS}/commitmeta.json" | jq -r '.["rpmostree.rpmdb.pkglist"][] | select(.[0] == "containers-common") | .[2]')"; echo "${RHCOS} ${COMMON} ${OCP}"; done
  42.80.20190930.1 0.1.32 4.2.0-rc.0
  42.80.20191022.0 0.1.32 4.2.2
  42.81.20191107.0 0.1.37 4.2.4
  42.81.20200114.0 0.1.37 4.2.16-x86_64
  42.81.20200203.1 0.1.37 4.2.18-x86_64
  42.81.20200210.0 0.1.40 4.2.19-x86_64
  $ for OCP in 4.3.0-rc.0-x86_64 4.3.3-x86_64; do RHCOS="$(oc image info --output json $(oc adm release info --image-for=machine-os-content "quay.io/openshift-release-dev/ocp-release:${OCP}") | jq -r .config.config.Labels.version)"; COMMON="$(curl -s "https://releases-rhcos-art.cloud.privileged.psi.redhat.com/storage/releases/rhcos-4.3/${RHCOS}/x86_64/commitmeta.json" | jq -r '.["rpmostree.rpmdb.pkglist"][] | select(.[0] == "containers-common") | .[2]')"; echo "${RHCOS} ${COMMON} ${OCP}"; done
  43.81.202001072253.0 0.1.40 4.3.0-rc.0-x86_64
  43.81.202002170853.0 0.1.40 4.3.3-x86_64
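
For readers without jq, here is a minimal Python sketch of the same
lookup, plus a helper that reports where the version changes between
consecutive sampled releases.  Both function names are hypothetical.
The sample commitmeta fragment is invented for illustration, and only
fields 0 (name) and 2 (version) of each entry are relied on, matching
the jq filter; the meaning of the other fields is an assumption.

```python
from typing import Optional


def package_version(commitmeta: dict, package: str) -> Optional[str]:
    """Python equivalent of the jq filter: match on field 0, return field 2."""
    for entry in commitmeta.get("rpmostree.rpmdb.pkglist", []):
        if entry[0] == package:
            return entry[2]
    return None


def version_bumps(rows: list) -> list:
    """Return (previous release, release, old version, new version) per change."""
    return [
        (prev_ocp, ocp, prev_ver, ver)
        for (prev_ocp, prev_ver), (ocp, ver) in zip(rows, rows[1:])
        if prev_ver != ver
    ]


# Hypothetical commitmeta.json fragment, for illustration only.
sample_meta = {
    "rpmostree.rpmdb.pkglist": [
        ["conmon", "2", "2.0.0", "1.el8", "x86_64"],
        ["containers-common", "1", "0.1.37", "5.el8", "x86_64"],
    ]
}
print(package_version(sample_meta, "containers-common"))

# (release, containers-common version) pairs copied from the 4.2 listing.
rows_42 = [
    ("4.2.0-rc.0", "0.1.32"),
    ("4.2.2", "0.1.32"),
    ("4.2.4", "0.1.37"),
    ("4.2.16", "0.1.37"),
    ("4.2.18", "0.1.37"),
    ("4.2.19", "0.1.40"),
]
for bump in version_bumps(rows_42):
    print(bump)
```

Because only some releases were sampled, each reported bump happened
somewhere between the two listed releases, not necessarily in the
later one.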

Fetching a source RPM for containers-common, e.g. from [15,16] shows
the source packages coming from skopeo.  Checking [17]:

  $ git --no-pager log --follow --oneline --stat=200 -M50% -- vendor/github.com/containers/storage/storage.conf
  afaa9e7f Bump github.com/containers/storage from 1.15.1 to 1.15.2
   vendor/github.com/containers/storage/storage.conf | 3 ---
   1 file changed, 3 deletions(-)
  39ff039b Image encryption/decryption support in skopeo
   vendor/github.com/containers/storage/storage.conf | 44 +++++++++++++++++++++++++-------------------
   1 file changed, 25 insertions(+), 19 deletions(-)
  05ae513b Bump github.com/containers/buildah from 1.8.4 to 1.11.4
   vendor/github.com/containers/storage/storage.conf | 7 -------
   1 file changed, 7 deletions(-)
  700b3102 update github.com/containers/{image,storage}
   vendor/github.com/containers/storage/storage.conf | 8 ++++++++
   1 file changed, 8 insertions(+)
  033b2902 migrate to go modules
   vendor/github.com/containers/storage/storage.conf | 130 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
   1 file changed, 130 insertions(+)
  $ git --no-pager log --follow --oneline --stat=200 -M50% 033b2902^ -- contrib/storage.conf
  fe259105 add storage.conf and manpage in contrib/
   contrib/storage.conf | 28 ++++++++++++++++++++++++++++
   1 file changed, 28 insertions(+)
  $ for HASH in fe259105 033b2902 700b3102 05ae513b 39ff039b afaa9e7f; do git describe --contains "${HASH}"; done
  v0.1.29~3^2
  v0.1.38~14^2~2
  v0.1.39~1
  v0.1.41~25^2
  v0.1.41~21^2
  v0.1.41~12^2

So changes may have been made in 0.1.29 (when the file landed for the
first time, likely from wherever we store post-Git patches), and were
likely made in 0.1.38, 0.1.39, and 0.1.41.
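
The step from describe output to release tags is mechanical: drop the
~N and ^N ancestry suffix and keep the tag name.  A small sketch, with
a hypothetical helper name, using the hashes and outputs listed above:

```python
import re


def containing_tag(describe: str) -> str:
    """Strip the ~N / ^N ancestry suffix from `git describe --contains` output."""
    return re.split(r"[~^]", describe, maxsplit=1)[0]


# Commit -> describe output, copied from the transcript.
described = {
    "fe259105": "v0.1.29~3^2",
    "033b2902": "v0.1.38~14^2~2",
    "700b3102": "v0.1.39~1",
    "05ae513b": "v0.1.41~25^2",
    "39ff039b": "v0.1.41~21^2",
    "afaa9e7f": "v0.1.41~12^2",
}

# Deduplicate while keeping the order in which tags first appear.
tags = list(dict.fromkeys(containing_tag(d) for d in described.values()))
print(tags)  # ['v0.1.29', 'v0.1.38', 'v0.1.39', 'v0.1.41']
```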

Comparing with our machine-os-content, that means vulnerable
transitions are:

* 4.1.* -> 4.1.34, since 4.1.31 -> 4.1.34 takes containers-common from
  0.1.37 to 0.1.40, picking up the v0.1.38~14^2~2 and v0.1.39~1 bumps.
  There may be no safe way to get to 4.1.34.

* 4.1.* -> 4.2...  FIXME

* 4.2.16 and earlier -> 4.2.19, since 4.2.18 -> 4.2.19 takes
  containers-common from 0.1.37 to 0.1.40, picking up the
  v0.1.38~14^2~2 and v0.1.39~1 bumps.  4.2.16 and earlier -> 4.2.18 is
  fine, because there were no RPM-induced storage.conf bumps.  4.2.18
  -> 4.2.* is fine, because 4.2.18 has the patched machine-config
  source.

* 4.2.16 and earlier -> 4.3, since 4.2.16 -> 4.3 takes
  containers-common from 0.1.37 to 0.1.40, picking up the
  v0.1.38~14^2~2 and v0.1.39~1 bumps.  4.2.18 -> 4.3 is fine, because
  4.2.18 has the patched machine-config source.

* 4.3 -> 4.3 are fine, since they all have the patched machine-config
  source.
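
The reasoning in this list can be written down as a small predicate:
an edge is treated as vulnerable when the source release has an
unpatched MCD and the target's containers-common crosses one of the
versions where storage.conf likely changed.  This is a sketch under
those assumptions, not a complete model.  It only knows the sampled
releases from the listings above, it ignores any downstream RPM
patches, it does not cover the 4.1 -> 4.2 case, and all names are
hypothetical.

```python
import re

# containers-common version per sampled release, from the listings above.
CONTAINERS_COMMON = {
    "4.1.1": "0.1.32", "4.1.23": "0.1.32", "4.1.24": "0.1.37",
    "4.1.31": "0.1.37", "4.1.34": "0.1.40",
    "4.2.0-rc.0": "0.1.32", "4.2.2": "0.1.32", "4.2.4": "0.1.37",
    "4.2.16": "0.1.37", "4.2.18": "0.1.37", "4.2.19": "0.1.40",
    "4.3.0-rc.0": "0.1.40", "4.3.3": "0.1.40",
}
# skopeo releases in which the vendored storage.conf likely changed.
STORAGE_CONF_CHANGES = ["0.1.38", "0.1.39", "0.1.41"]
# First patched z-stream release per 4.y minor; 4.3 and later are all patched.
FIRST_PATCHED = {1: 34, 2: 18}


def as_tuple(version: str) -> tuple:
    """Parse the leading dotted integers, ignoring suffixes such as -rc.0."""
    numbers = re.match(r"\d+(?:\.\d+)*", version).group()
    return tuple(int(part) for part in numbers.split("."))


def mcd_patched(release: str) -> bool:
    """True if the release's MCD annotates the storage.conf it writes."""
    _, minor, patch = as_tuple(release)
    return minor >= 3 or patch >= FIRST_PATCHED[minor]


def storage_conf_bumped(old: str, new: str) -> bool:
    """True if a storage.conf-changing version lies in the range (old, new]."""
    low, high = as_tuple(old), as_tuple(new)
    return any(low < as_tuple(change) <= high for change in STORAGE_CONF_CHANGES)


def vulnerable(src: str, dst: str) -> bool:
    """True if updating src -> dst can hit the content-mismatch bug."""
    return not mcd_patched(src) and storage_conf_bumped(
        CONTAINERS_COMMON[src], CONTAINERS_COMMON[dst]
    )


for edge in [("4.1.31", "4.1.34"), ("4.2.16", "4.2.18"), ("4.2.16", "4.2.19"),
             ("4.2.18", "4.2.19"), ("4.2.16", "4.3.0-rc.0"), ("4.2.18", "4.3.3")]:
    print(edge, vulnerable(*edge))
```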

So ideally this pull would block edges from 4.2.16 and earlier into
4.3.  But because blocked-edges requires an explicit "to", I've just added
the 4.3.0 blocker (other 4.3.z releases either already blocked 4.2.*
or only give 4.2.18+ as update sources).  I've also dropped 4.2.16
from the *-4.3 channels with a comment about this bug.  There
shouldn't be much pushback on pulling the edge, because users can
still move from 4.2 to 4.3 via 4.2.19 -> 4.3.2.
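
To illustrate why an explicit "to" is needed, here is a minimal sketch
of how one blocked-edges entry could be evaluated.  The helper name is
hypothetical, and anchoring the "from" regular expression with a full
match is an assumption; the real consumer may anchor differently and
may see versions with architecture suffixes.

```python
import re


def edge_blocked(block: dict, src: str, dst: str) -> bool:
    """True if one blocked-edges entry applies to the update src -> dst."""
    # "to" is an exact version; "from" is a regular expression over sources.
    return dst == block["to"] and re.fullmatch(block["from"], src) is not None


# Mirrors the new blocked-edges/4.3.0.yaml entry.
block_430 = {"to": "4.3.0", "from": r"4\.2\..*"}

print(edge_blocked(block_430, "4.2.16", "4.3.0"))  # True
print(edge_blocked(block_430, "4.2.19", "4.3.2"))  # False: different target
```

Each entry names exactly one target, so blocking a source range into
every 4.3 release would take one file per target.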

Also simplify the wording on the GCP bug 1793635, which remains
unfixed.

[1]: openshift/machine-config-operator#1320 (comment)
[2]: https://bugzilla.redhat.com/show_bug.cgi?id=1782152#c5
[3]: https://bugzilla.redhat.com/show_bug.cgi?id=1781708#c0
[4]: https://github.com/openshift/machine-config-operator/pull/1320/files
[5]: openshift/machine-config-operator#1190
[6]: https://github.com/openshift/machine-config-operator/blob/13f0dda734262c3edbd23c007e42b7704125e88f/docs/MachineConfiguration.md
[7]: https://github.com/openshift/machine-config-operator/blob/13f0dda734262c3edbd23c007e42b7704125e88f/docs/ContainerRuntimeConfigDesign.md
[8]: openshift/machine-config-operator#330 (comment)
[9]: https://bugzilla.redhat.com/show_bug.cgi?id=1782153
[10]: openshift/machine-config-operator#1382 (comment)
[11]: openshift/machine-config-operator#1323 (comment)
[12]: https://bugzilla.redhat.com/show_bug.cgi?id=1782149
[13]: openshift/machine-config-operator#1322 (comment)
[14]: https://gitlab.cee.redhat.com/coretools/differ
      Internal link, sorry :/  But you can also browse the history at:
      https://releases-rhcos-art.cloud.privileged.psi.redhat.com/?stream=releases/rhcos-4.2&release=42.81.20200114.0 etc.
[15]: https://access.redhat.com/downloads/content/290/ver=4.2/rhel---8/4.2.0/x86_64/packages
[16]: https://access.redhat.com/downloads/content/rhel---8/x86_64/8841/containers-common/0.1.32-5.git1715c90.el8/x86_64/fd431d51/package
[17]: https://github.com/containers/skopeo/
wking committed Feb 20, 2020
1 parent efe157e commit 1870ff9
Showing 7 changed files with 12 additions and 4 deletions.
3 changes: 2 additions & 1 deletion blocked-edges/4.3.0-rc.0.yaml
@@ -2,5 +2,6 @@ to: 4.3.0-rc.0
 from: .*
 # 4.2 -> 4.3 updates occasionally hit FailedCreatePodSandBox events, fixed in rc.3, but in neither 4.2.16 nor rc.0: https://bugzilla.redhat.com/show_bug.cgi?id=1787635
 # 4.2 -> 4.3 updates occasionally hit RequiredPoolsFailed degradation, fixed in 4.2.16 and rc.3, but in neither 4.2.13 nor rc.0: https://bugzilla.redhat.com/show_bug.cgi?id=1786993
+# 4.2 -> 4.3 updates occasionally hit RequiredPoolsFailed degradation, fixed in 4.2.18 and rc.0, but not in 4.2.16: https://bugzilla.redhat.com/show_bug.cgi?id=1782152 https://bugzilla.redhat.com/show_bug.cgi?id=1782149
 # 4.2 -> 4.3 updates occasionally hit RouteHealthDegraded degradation, fixed in rc.0, but not in 4.2.16: https://bugzilla.redhat.com/show_bug.cgi?id=1790704
-# 4.2.* -> 4.3.0-rc.0 Sometimes workloads on GCP are unreachable during 4.2.x to 4.3.0 upgrade sometimes: https://bugzilla.redhat.com/show_bug.cgi?id=1793635
+# 4.2 -> 4.3 updates on GCP occasionally have unreachable workloads: https://bugzilla.redhat.com/show_bug.cgi?id=1793635
3 changes: 2 additions & 1 deletion blocked-edges/4.3.0-rc.3.yaml
@@ -1,5 +1,6 @@
 to: 4.3.0-rc.3
 from: 4\.2\..*
 # 4.2 -> 4.3 updates occasionally hit FailedCreatePodSandBox events, fixed in rc.3, but not in 4.2.16: https://bugzilla.redhat.com/show_bug.cgi?id=1787635
+# 4.2 -> 4.3 updates occasionally hit RequiredPoolsFailed degradation, fixed in 4.2.18 and rc.0, but not in 4.2.16: https://bugzilla.redhat.com/show_bug.cgi?id=1782152 https://bugzilla.redhat.com/show_bug.cgi?id=1782149
 # 4.2 -> 4.3 updates occasionally hit RouteHealthDegraded degradation, fixed in rc.0, but not in 4.2.16: https://bugzilla.redhat.com/show_bug.cgi?id=1790704
-# 4.2.* -> 4.3.0-rc.3 Sometimes workloads on GCP are unreachable during 4.2.x to 4.3.0 upgrade sometimes: https://bugzilla.redhat.com/show_bug.cgi?id=1793635
+# 4.2 -> 4.3 updates on GCP occasionally have unreachable workloads: https://bugzilla.redhat.com/show_bug.cgi?id=1793635
4 changes: 4 additions & 0 deletions blocked-edges/4.3.0.yaml
@@ -0,0 +1,4 @@
+to: 4.3.0
+from: 4\.2\..*
+# 4.2 -> 4.3 updates occasionally hit RequiredPoolsFailed, fixed in 4.2.18 and rc.0, but not in 4.2.16: https://bugzilla.redhat.com/show_bug.cgi?id=1782152 https://bugzilla.redhat.com/show_bug.cgi?id=1782149
+# 4.2 -> 4.3 updates on GCP occasionally have unreachable workloads: https://bugzilla.redhat.com/show_bug.cgi?id=1793635
1 change: 1 addition & 0 deletions blocked-edges/4.3.1.yaml
@@ -1,3 +1,4 @@
 to: 4.3.1
 from: 4\.2\.18
 # 4.2.18 is baked into 4.3.1 as a recommended update source, but we don't have a 4.2.18 release yet. Block until we have a release, to avoid accidentally adding 4.2.18 -> 4.3.1 to channels if 4.2.18 ends up being a dud.
+# 4.2 -> 4.3 updates on GCP occasionally have unreachable workloads: https://bugzilla.redhat.com/show_bug.cgi?id=1793635
2 changes: 1 addition & 1 deletion channels/candidate-4.3.yaml
@@ -1,7 +1,7 @@
 name: candidate-4.3
 versions:
 # until s390 is released on 4.3 we may not want to include it in 4.3 channels
-- 4.2.16+amd64
+# 4.2 -> 4.3 updates occasionally hit RequiredPoolsFailed, fixed in 4.2.18 and rc.0, but not in 4.2.16: https://bugzilla.redhat.com/show_bug.cgi?id=1782152 https://bugzilla.redhat.com/show_bug.cgi?id=1782149
 # not 4.2.17 because we had a long quiet time after 4.2.16 with no releases
 - 4.2.18+amd64
 - 4.2.19+amd64
2 changes: 1 addition & 1 deletion channels/fast-4.3.yaml
@@ -1,7 +1,7 @@
 name: fast-4.3
 versions:
 # until s390 is released on 4.3 we may not want to include it in 4.3 channels
-- 4.2.16+amd64
+# 4.2 -> 4.3 updates occasionally hit RequiredPoolsFailed, fixed in 4.2.18 and rc.0, but not in 4.2.16: https://bugzilla.redhat.com/show_bug.cgi?id=1782152 https://bugzilla.redhat.com/show_bug.cgi?id=1782149
 # not 4.2.17 because we had a long quiet time after 4.2.16 with no releases
 - 4.2.18+amd64
 - 4.3.0
1 change: 1 addition & 0 deletions channels/stable-4.3.yaml
@@ -1,6 +1,7 @@
 name: stable-4.3
 versions:
 # until s390 is released on 4.3 we may not want to include it in 4.3 channels
+# 4.2 -> 4.3 updates occasionally hit RequiredPoolsFailed, fixed in 4.2.18 and rc.0, but not in 4.2.16: https://bugzilla.redhat.com/show_bug.cgi?id=1782152 https://bugzilla.redhat.com/show_bug.cgi?id=1782149
 # not 4.2.17 because we had a long quiet time after 4.2.16 with no releases
 - 4.3.0
 - 4.3.1
