Cincinnati Graph Data

Cincinnati is an update protocol designed to facilitate automatic updates. This repository manages the Cincinnati graph for OpenShift.

All of this data feeds the OpenShift Update Service, and this README describes the schema and APIs that graph-data administrators can use to configure that service. Nothing in this repository sets policy on which updates are supported for which clusters (that is downstream of the update service). Nothing in this repository sets policy for how graph-data administrators decide to use the available graph-data (that policy is internal, but the public commitments are covered in product docs like these).

Workflow

The contributing documentation covers licencing and the usual Git flow.

  1. Create a PR.
  2. Merge the PR to master.
  3. Update your local master branch.

Cincinnati is configured to track the master branch, so it will automatically react to updates made to this repository.

Schema Version

The layout of this repository is versioned via a version file, which contains the Semantic Version of the schema. As a schema version, the patch level is likely to remain 0, but the minor version will be incremented if backwards-compatible features are added, and the major version will be incremented if backwards-incompatible changes are made. Consumers, such as Cincinnati, who support x.y.0 may safely consume this repository when the stated major version matches the understood x and the stated minor version is less than or equal to the understood y. For example, a consumer that supports 1.3.0 and 2.1.0 could safely consume 1.2.0, 1.3.0, 2.0.0, 2.1.0, etc., but could not safely consume 1.4.0, 2.2.0, 3.0.0, etc.
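The compatibility rule above can be sketched in a few lines of Python. This is only an illustration: the function names are hypothetical and not part of this repository, and the parser assumes plain x.y.z strings with no pre-release or build metadata.

```python
def parse_version(text):
    """Split 'x.y.z' into a tuple of ints (plain versions only)."""
    major, minor, patch = (int(part) for part in text.strip().split("."))
    return major, minor, patch


def can_consume(supported, stated):
    """Return True if a consumer supporting `supported` may consume schema `stated`."""
    stated_major, stated_minor, _ = parse_version(stated)
    for version in supported:
        major, minor, _ = parse_version(version)
        # Same major version, and the stated minor is not newer than what we understand.
        if major == stated_major and stated_minor <= minor:
            return True
    return False


# The example from the text: a consumer that supports 1.3.0 and 2.1.0.
supported = ["1.3.0", "2.1.0"]
for stated in ["1.2.0", "1.3.0", "2.0.0", "2.1.0", "1.4.0", "2.2.0", "3.0.0"]:
    print(stated, can_consume(supported, stated))
```

The patch level is parsed but ignored, since only the major and minor versions take part in the rule.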

Release Names

Release names are used for adding releases to channels and blocking edges. Architecture-agnostic names will apply to all images with that exact name in the version property of the release-metadata file included in the release image. Names with SemVer build metadata will apply only to releases whose exact name in the version property matches the release with the build metadata removed and whose referenced image architecture matches the given build metadata. For example, 4.2.14 will apply to both the amd64 and s390x release images, since those both have 4.2.14 in version. And 4.2.14+amd64 would only apply to the amd64 release image.
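As a rough sketch of that matching rule, the hypothetical helper below takes a release name from this repository plus the version and architecture of a release image (both assumed to be supplied by the caller) and reports whether the name applies. It is not the consumer's actual implementation.

```python
def name_applies(name, version, architecture):
    """Return True if a release name from this repository applies to a release image.

    `version` is the version property from the image's release-metadata and
    `architecture` is the architecture of the referenced image.
    """
    base, plus, build = name.partition("+")
    if not plus:
        # Architecture-agnostic name: the exact version must match.
        return name == version
    # Name with build metadata: the version must match with the metadata removed,
    # and the image architecture must match the metadata.
    return base == version and build == architecture


print(name_applies("4.2.14", "4.2.14", "amd64"))
print(name_applies("4.2.14", "4.2.14", "s390x"))
print(name_applies("4.2.14+amd64", "4.2.14", "amd64"))
print(name_applies("4.2.14+amd64", "4.2.14", "s390x"))
```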

Add Releases To Channels

Edit the appropriate file in channels/ (1.0.0). For example, to add a release to candidate-4.2 you would edit channels/candidate-4.2.yaml.

The file contains a list of versions. Please keep the versions in order. And LEAVE COMMENTS if you skip a version.

Feeder Channels

Channel semantics, as documented here, show nodes and edges being promoted to successive channels as they prove their stability. For example, a 4.2.z release will appear in candidate-4.2 first. Upon proving itself sufficiently stable in the candidate channel, it will be promoted into fast-4.2. Some time after landing in fast-4.2, it will appear in stable-4.2.

Note: Once we have phased release rollouts, we will drop the fast/stable distinction from this repository and promote to a unified fast/stable channel with a start time and rollout duration. Until then, we are using fast channels to feed stable channels with a delay, just like candidate channels feed fast channels.

In this repository, the intended promotion flow is reflected by a feeder property in the channel declaration, since version 1.0.0. For example, for channels/fast-4.2.yaml:

feeder:
  name: candidate-4.2
  errata: public
  filter: 4\.[0-9]+\.[0-9]+(.*hotfix.*|\+amd64|-s390x)?

which declares the intention that nodes and edges will be considered for promotion from candidate-4.2 into fast-4.2 after the errata becomes public. The optional errata property (1.0.0) only accepts one value, public, and marks a public errata as sufficient, but not necessary, for promoting a feeder node. The filter value (1.0.0) excludes 4.2.0-rc.5 and other releases, while allowing for 4.2.0-0.hotfix-2020-09-19-234758 and 4.2.10-s390x and 4.2.14+amd64.
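To check which release names a filter lets through, something like the following can help. It assumes the filter has to match the whole release name (a partial match would let 4.2.0-rc.5 through on its 4.2.0 prefix, contradicting the description above); the helper name is made up for this sketch.

```python
import re

# The filter value from channels/fast-4.2.yaml above.
FILTER = r"4\.[0-9]+\.[0-9]+(.*hotfix.*|\+amd64|-s390x)?"


def passes_filter(release, pattern=FILTER):
    # Assumption: the filter must match the entire release name.
    return re.fullmatch(pattern, release) is not None


for release in [
    "4.2.0-rc.5",
    "4.2.0-0.hotfix-2020-09-19-234758",
    "4.2.10-s390x",
    "4.2.14+amd64",
    "4.2.14",
]:
    print(release, passes_filter(release))
```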

Another example is channels/stable-4.2.yaml:

feeder:
  name: fast-4.2
  delay: PT48H

which declares the intention that nodes and edges will be considered for promotion from fast-4.2 into stable-4.2 after a delay of 48 hours. The delay value (1.0.0) is an ISO 8601 duration, and spending sufficient time in the feeder channel is sufficient, but not necessary, for promoting the feeder node.

If both errata and delay are set, the feeder nodes will be promoted when delay has elapsed or the release errata becomes public, whichever comes first.
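The "whichever comes first" rule can be sketched as below. This is not how hack/stabilization-change.py is written; the function names, the arguments (when the node landed in the feeder channel, whether its errata is public), and the minimal duration parser are all assumptions made for illustration. The parser covers only days, hours, minutes, and seconds, and a feeder with neither property is treated as never ready.

```python
import re
from datetime import datetime, timedelta, timezone

# Minimal subset of ISO 8601 durations: days, hours, minutes, and seconds only.
DURATION = re.compile(r"P(?:(\d+)D)?(?:T(?:(\d+)H)?(?:(\d+)M)?(?:(\d+)S)?)?")


def parse_duration(text):
    """Convert a duration such as 'PT48H' into a timedelta."""
    match = DURATION.fullmatch(text)
    if match is None:
        raise ValueError(f"unsupported duration: {text!r}")
    days, hours, minutes, seconds = (int(group or 0) for group in match.groups())
    return timedelta(days=days, hours=hours, minutes=minutes, seconds=seconds)


def ready_for_promotion(feeder, in_feeder_since, errata_public, now):
    """Either condition is sufficient: a public errata, or enough time in the feeder."""
    if feeder.get("errata") == "public" and errata_public:
        return True
    if "delay" in feeder:
        return now - in_feeder_since >= parse_duration(feeder["delay"])
    # Assumption for this sketch: with neither property set, nothing is ready.
    return False


feeder = {"name": "fast-4.2", "delay": "PT48H"}
landed = datetime(2020, 1, 1, tzinfo=timezone.utc)  # made-up timestamp
print(ready_for_promotion(feeder, landed, False, landed + timedelta(hours=47)))
print(ready_for_promotion(feeder, landed, False, landed + timedelta(hours=48)))
```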

To see recommended feeder promotions, run:

$ hack/stabilization-change.py

Tombstones

Removing a node from a channel can strand existing clusters with a VersionNotFound error. To avoid that, unstable nodes are left in their existing channels, but should not be promoted to additional channels. This is reflected through entries in the optional tombstones property, since version 1.0.0. For example, channels/candidate-4.2.yaml has:

tombstones:
- 4.1.18
- 4.1.20

declaring that, while 4.1.18 and 4.1.20 are in candidate-4.2, they should not be promoted to subsequent channels (in this case, fast-4.2).
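A sketch of how tombstones might factor into promotion candidates follows. The dictionary layout (a versions list next to tombstones) and the version data are assumptions for illustration, and the helper name is hypothetical.

```python
def promotion_candidates(feeder_channel, target_versions):
    """List feeder versions that could be promoted into the target channel.

    Tombstoned versions stay in the feeder channel but are never candidates,
    and versions already present in the target channel are skipped.
    """
    tombstones = set(feeder_channel.get("tombstones", []))
    present = set(target_versions)
    return [
        version
        for version in feeder_channel["versions"]
        if version not in tombstones and version not in present
    ]


# Made-up data, loosely modeled on channels/candidate-4.2.yaml.
candidate = {
    "versions": ["4.1.18", "4.1.20", "4.1.21", "4.2.0"],
    "tombstones": ["4.1.18", "4.1.20"],
}
print(promotion_candidates(candidate, target_versions=["4.2.0"]))
```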

Block Edges

Create/edit an appropriate file in blocked-edges/, since version 1.0.0, to drop update edges. Since version 1.1.0, additional properties are available to declare an update conditional and express any known risks, instead of dropping it completely.

  • to (1.0.0, required, string) is the release which has the existing incoming edges.

  • from (1.0.0, required, string) is a regex for the previous release versions. If you want to require from to match the full version string (and not just a substring), you must include explicit ^ and $ anchors. Release version strings will receive the architecture-suffix before being compared to this regular expression.

  • url (1.1.0, optional, string), with a URI documenting the blocking reason. For example, this could link to a bug's impact statement or knowledge-base article.

  • name (1.1.0, optional, string), with a CamelCase reason suitable for a ClusterOperatorStatusCondition reason property.

  • fixedIn (1.1.0, optional, string), with the update-target release where the exposure was fixed, either directly, or because that target is newer than the 4.(y-1).z release where the exposure was fixed. This feeds risk-extension guards that require either a fixedIn declaration or an extension of unfixed risks to later releases to avoid shipping a release that is still exposed to a risk without declaring that risk.

  • autoExtend (1.1.0, optional, string), with a URI tracking dropping the risk from new releases. Some risks have "has this been fixed?" deferred to other teams. Setting this property records that deferral, so graph-data administrators know when they do not need to perform that assessment themselves with each new 4.y patch release.

  • message (1.1.0, optional, string), with a human-oriented message describing the blocking reason, suitable for a ClusterOperatorStatusCondition message property.

  • matchingRules (1.1.0, optional, array), defining conditions for deciding which clusters have the update recommended and which do not. The array is ordered by decreasing precedence. Consumers should walk the array in order. For a given entry, if a condition type is unrecognized, or fails to evaluate, consumers should proceed to the next entry. If a condition successfully evaluates (either as a match or as an explicit does-not-match), that result is used, and no further entries should be attempted. If no condition can be successfully evaluated, the update should not be recommended. Each entry must be an object with at least the following property:

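    • type (1.1.0, required, string), identifying the condition type in the cluster-condition type registry. For example, type: PromQL.

    Additional properties for each entry are defined in the cluster-condition type registry.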
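The matchingRules walk described above might look roughly like this in a consumer. It is only a sketch: evaluators is a hypothetical mapping from condition type to a callable that returns True for a match, returns False for an explicit does-not-match, and raises when evaluation fails. It also assumes that a match means the cluster is exposed to the risk, so the update is recommended only on an explicit does-not-match.

```python
def evaluate_matching_rules(matching_rules, evaluators):
    """Walk matchingRules in order and return the first successful evaluation.

    Returns True (match), False (explicit does-not-match), or None when no
    entry could be evaluated.
    """
    for entry in matching_rules:
        evaluator = evaluators.get(entry.get("type"))
        if evaluator is None:
            continue  # unrecognized condition type: try the next entry
        try:
            return bool(evaluator(entry))
        except Exception:
            continue  # failed to evaluate: try the next entry
    return None


def update_recommended(matching_rules, evaluators):
    # Assumption: a match means the risk applies. Only an explicit
    # does-not-match recommends the update; "could not evaluate" does not.
    return evaluate_matching_rules(matching_rules, evaluators) is False


# Stub evaluator standing in for a real PromQL query.
rules = [{"type": "PromQL", "promql": {"promql": "..."}}]
print(update_recommended(rules, {"PromQL": lambda entry: False}))
print(update_recommended(rules, {}))
```

Returning None separately from False keeps "could not evaluate" distinguishable from "does not match", even though both callers here only care about the explicit does-not-match case.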

For example, to block all incoming edges to a release, create a file such as blocked-edges/4.2.11.yaml containing:

to: 4.2.11
from: .*

To block only specific edges, the file might look like:

to: 4.2.0-rc.5
from: ^4\.1\.(18|20)[+].*$

The [+].* portion absorbs the architecture-suffix from the release name that consumers will use for comparisons.
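To see why the suffix matters, the sketch below mimics the comparison with Python's re module. It assumes the architecture-suffix takes the form +ARCH (as in 4.2.14+amd64) and that the regex is applied as a substring search unless anchored; the helper name is made up.

```python
import re


def edge_blocked(from_regex, source_version, architecture):
    """Check a source release against a blocked-edge `from` regex."""
    # The architecture-suffix is appended before comparison, e.g. 4.1.18+amd64.
    return re.search(from_regex, f"{source_version}+{architecture}") is not None


specific = r"^4\.1\.(18|20)[+].*$"
print(edge_blocked(specific, "4.1.18", "amd64"))
print(edge_blocked(specific, "4.1.19", "amd64"))
# Without [+].* the anchored regex never matches, because of the suffix.
print(edge_blocked(r"^4\.1\.18$", "4.1.18", "amd64"))
# Without anchors the regex matches substrings, so 4.1.1 also catches 4.1.18.
print(edge_blocked(r"4\.1\.1", "4.1.18", "amd64"))
```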

Risks for managed clusters

If site reliability engineers want to declare a risk for managed clusters updating into a release:

  1. For issues expected to be related to the core OpenShift payload, ensure there is an OCPBUGS issue, opening a new issue if necessary.
  2. Add the UpgradeBlocker label to the bug to initiate generic risk assessment.

If you want to let that process cook, you're done :). If you want to declare your own risk in the meantime (or instead, if generic risk assessment decides the risk is not worth declaring, or that the confusion generated by a declaration outweighs the increased visibility a declaration would deliver):

  1. Pick an impacted target release for to, e.g. 4.13.4.
  2. Build a regular expression for relevant source releases (which would pick up the risk by updating into to), e.g. .* for "all releases", for from.
  3. Find (or create) a URI documenting the risk, e.g. https://access.redhat.com/solutions/7024726 or similar KCS, for url. The linked document should explain that the declaration is specific to managed clusters, so customer-managed cluster administrators who hear about the declaration can easily see that it is not aimed at them.
  4. Create a PascalCaseSlug for the risk, e.g. MultiNetworkAttachmentsWhereaboutsVersion for name. See existing names for inspiration; you want something that is unique to the issue, and unlikely to overlap with future issues. Again, avoid giving the impression that you are talking to customer-managed cluster administrators, e.g. ARO... instead of Azure... for a declaration aimed at the ARO fleet (one managed subset of the Azure fleet).
  5. Create a sentence or two summarizing the risk for message. An English version of the PromQL filter is a good start, e.g. "All GCP OSD clusters...", and then finish with a quick summary of the impact. Again, avoid giving the impression that you are talking to customer-managed cluster administrators.

And then create a file blocked-edges/${TO}-${NAME}.yaml, e.g. blocked-edges/4.13.4-MultiNetworkAttachmentsWhereaboutsVersion.yaml with the following content:

to: FIXME
from: FIXME
url: FIXME
name: FIXME
message: |-
  FIXME
matchingRules:
- type: PromQL
  promql:
    promql:
      group(sre:telemetry:managed_labels{_id="",sre="true"})
      or
      0 * group(cluster_version{_id=""})

to declare that risk only for managed clusters. See here for an example where the values are filled in, although that is using different PromQL, and not the managed-cluster selecting PromQL from the above template.

sre:telemetry:managed_labels does not exist in ARO, so for ARO clusters use:

matchingRules:
- type: PromQL
  promql:
    promql:
      group(cluster_operator_conditions{_id="",name="aro"})
      or
      0 * group(cluster_operator_conditions{_id=""})

If the risk applies to multiple target releases, create multiple files with different to.
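When a risk spans several targets, the per-target files differ only in to and in the file name. A small hypothetical helper, following the blocked-edges/${TO}-${NAME}.yaml naming convention above, could derive them from one declaration. The second target release and the placeholder field values below are made up for illustration.

```python
def per_target_declarations(risk, targets):
    """Return {path: declaration}, with one copy of `risk` per target release."""
    files = {}
    for to in targets:
        # Copy the shared declaration and override only the target release.
        declaration = dict(risk, to=to)
        files[f"blocked-edges/{to}-{risk['name']}.yaml"] = declaration
    return files


risk = {
    "to": "FIXME",
    "from": ".*",
    "url": "https://access.redhat.com/solutions/7024726",
    "name": "MultiNetworkAttachmentsWhereaboutsVersion",
    "message": "FIXME",
}
for path, declaration in per_target_declarations(risk, ["4.13.4", "4.13.5"]).items():
    print(path, declaration["to"])
```

Serializing each declaration to YAML is left out here, since that depends on the tooling at hand.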

Signatures

Add release signatures under signatures/{algorithm}/{digest}/signature-{number} (1.2.0). For example, the amd64 4.12.0 release image is sha256:4c5a7e26d707780be6466ddc9591865beb2e3baa5556432d23e8d57966a2dd18 (errata), and would have signatures stored in signatures/sha256/4c5a7e26d707780be6466ddc9591865beb2e3baa5556432d23e8d57966a2dd18/signature-1 (optionally with additional signatures as signature-2, etc.).
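For scripting, the path can be derived from an image digest along these lines. The helper name is hypothetical; it assumes the digest is given in the usual algorithm:hex form.

```python
def signature_path(digest, number=1):
    """Map an image digest such as 'sha256:4c5a...' to its signature path."""
    algorithm, _, value = digest.partition(":")
    if not algorithm or not value:
        raise ValueError(f"expected ALGORITHM:DIGEST, got {digest!r}")
    return f"signatures/{algorithm}/{value}/signature-{number}"


digest = "sha256:4c5a7e26d707780be6466ddc9591865beb2e3baa5556432d23e8d57966a2dd18"
print(signature_path(digest))
print(signature_path(digest, number=2))
```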