---
title: Synchronized Upgrades Between Clusters
authors:
  - "@danwinship"
reviewers:
  - "@dhellmann"
  - "@sdodson"
  - "@wking"
  - "@romfreiman"
  - "@zshi"
approvers:
  - TBD
api-approvers: # in case of new or modified APIs or API extensions (CRDs, aggregated apiservers, webhooks, finalizers)
  - TBD
creation-date: 2021-01-11
last-updated: 2021-01-24
tracking-link:
  - https://issues.redhat.com/browse/SDN-2603
see-also:
  - "/enhancements/network/dpu/overview.md"
---

# Synchronized Upgrades Between Clusters

## Summary

Our architecture for [using DPUs (eg, BlueField-2 NICs) in
OCP](../network/dpu/overview.md) involves creating one ordinary
"tenant" OCP cluster consisting of x86 hosts, and a second "infra" OCP
cluster consisting of the ARM systems running on the NICs installed in
the x86 hosts.

In the current Dev Preview of DPU support, there is no coordination
between the infra and tenant clusters during upgrades. Thus, when
upgrading the infra cluster, each NIC will end up rebooting at some
point, after it has drained _its own_ pods, but without having made
any attempt to drain the pods of its x86 tenant node. When this
happens, the x86 node (and its pods) will lose network connectivity
until the NIC finishes rebooting.

In order to make upgrades in DPU clusters work smoothly, we need some
way to synchronize the reboots between the two clusters, so that the
NICs get rebooted during the period when their corresponding x86 hosts
are cordoned and drained.

## Glossary

(Copied from the [DPU Overview
Enhancement](../network/dpu/overview.md).)

- **DPU** / **Data Processing Unit** - a Smart NIC with a full CPU,
  RAM, storage, etc, running a full operating system, able to offload
  network processing and potentially other tasks from its host system.
  For example, the NVIDIA BlueField-2.

- **DPU NIC** - refers to the DPU and its CPU, OS, etc (as opposed to
  the CPU, OS, etc, of the x86 server that the DPU is installed in). A
  **DPU NIC node** or **DPU NIC worker** is an OCP worker node
  running on a DPU NIC.

- **DPU Host** - refers to an x86 server (and its CPU, OS, etc) which
  contains a DPU. A **DPU Host node** or **DPU Host worker** is an OCP
  worker node running on a DPU Host.

- **Two-Cluster Design** - any architecture for deploying an OCP
  cluster with (some) nodes containing DPUs, where the DPUs are nodes
  in a second OCP cluster. (For purposes of this term, the
  architecture still counts as "two-cluster" even if HyperShift is
  involved and there are actually three clusters.)

- **Infra Cluster** - in a Two-Cluster Design, the OCP cluster which
  contains the DPU NIC workers. (It may also include master nodes
  and/or non-DPU workers, depending on the particular details of the
  two-cluster design.)

- **Tenant Cluster** - in a Two-Cluster Design, the OCP cluster
  containing the DPU Host workers. (It may also include master nodes
  and/or workers without DPUs.)

## Motivation

### Goals

- Make upgrades work smoothly in clusters running with DPU support, by
  synchronizing the reboots of nodes between the infra cluster and the
  tenant cluster.

### Non-Goals

- Supporting synchronized upgrades of more than 2 clusters at once.

## Proposal

### User Stories

As the administrator of a cluster using DPUs, I want to be able to do
y-stream and z-stream upgrades without causing unnecessary network
outages.

### API Extensions

TBD

### Implementation Details/Notes/Constraints

#### Setup

We can set some things up at install time (eg, creating credentials to
allow certain operators in the two clusters to talk to each other).

#### Initiating Upgrades

As part of the DPU security model, the tenant cluster cannot have any
power over the infra cluster. In particular, it must not be possible
for an administrator in the tenant cluster to force the infra cluster
to upgrade/downgrade to any particular version.

Thus, the infra cluster upgrade must be initiated on the infra cluster
side. In theory, we could have the infra cluster then initiate the
tenant cluster upgrade, but that would require having a tenant-cluster
cluster-admin credential lying around in the infra cluster somewhat
unnecessarily. It probably makes more sense to require the
administrator to initiate the upgrade on the tenant side as well.

#### Inter-Cluster Version Skew

There is very little explicit communication between the two clusters:

1. The dpu-operator in the infra cluster talks to the apiserver in
   the tenant cluster.

2. The ovn-kubernetes components in the two clusters do not
   communicate directly, but they do make some assumptions about
   each other's behavior (which is necessary to coordinate the
   plumbing of pod networking).

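For illustration only, here is a minimal sketch of how an
infra-cluster operator such as the dpu-operator might construct a
client for the tenant cluster's apiserver from a kubeconfig created at
install time (see "Setup" above). The Secret name and namespace here
are assumptions, not the actual dpu-operator implementation:

```go
// Hypothetical sketch: build a tenant-cluster client from a
// kubeconfig that the installer stored in a Secret in the infra
// cluster. The "openshift-dpu-operator"/"tenant-kubeconfig" names
// are invented for illustration.
package dpusync

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func tenantClient(ctx context.Context, infra kubernetes.Interface) (kubernetes.Interface, error) {
	sec, err := infra.CoreV1().Secrets("openshift-dpu-operator").
		Get(ctx, "tenant-kubeconfig", metav1.GetOptions{})
	if err != nil {
		return nil, fmt.Errorf("fetching tenant kubeconfig: %w", err)
	}
	cfg, err := clientcmd.RESTConfigFromKubeConfig(sec.Data["kubeconfig"])
	if err != nil {
		return nil, fmt.Errorf("parsing tenant kubeconfig: %w", err)
	}
	return kubernetes.NewForConfig(cfg)
}
```
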
The apiserver communication is very ordinary and boring and not likely
to be subject to any interesting version skew problems (even if we
ended up with, say, a 3-minor-version skew between the clusters).

Thus, other than ovn-kubernetes, no OCP component needs to be
concerned about version skew between the two clusters, because the
components in the two clusters are completely unaware of each other.

For ovn-kubernetes, if we ever change the details of the cross-cluster
communication, then we will need to add proper checks to enforce
tolerable cross-cluster skew at that time.

## Design Details

The upgrade plan takes advantage of two things:

- Other than the reboots, no part of the upgrade needs to be
  synchronized between the two clusters.

- Each infra node only runs pods to support the pods on its
  corresponding tenant node. Thus, if a tenant node reboots, it is
  safe to immediately reboot its infra node as well without needing
  to cordon and drain it, because it is guaranteed that none of its
  pods are doing anything important at that point.

So, both clusters' upgrades can proceed normally up until the MCO
upgrade.

In the infra cluster, rather than cordoning, draining, and rebooting
nodes one-by-one, the MCO will instead queue up the new RHCOS image on
each node, and then mark the node (somehow) as "waiting for a reboot".
Then the MCO will watch the nodes and wait for each one to reboot (of
its own accord), and then finally mark itself fully-upgraded after
that happens.

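The exact form of the "waiting for a reboot" marker is an open
question (see below); purely as a sketch, it could be a node
annotation that the infra MCO sets once the new image is staged. The
annotation key here is hypothetical:

```go
// Hypothetical sketch only: mark an infra node as "waiting for a
// reboot" via a node annotation after the new RHCOS image has been
// queued. The annotation key is invented for illustration; the real
// mechanism is an open question in this proposal.
package dpusync

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/types"
	"k8s.io/client-go/kubernetes"
)

const waitingForReboot = "machineconfiguration.openshift.io/waiting-for-reboot"

func markWaitingForReboot(ctx context.Context, c kubernetes.Interface, node string) error {
	patch := []byte(`{"metadata":{"annotations":{"` + waitingForReboot + `":"true"}}}`)
	_, err := c.CoreV1().Nodes().Patch(
		ctx, node, types.MergePatchType, patch, metav1.PatchOptions{})
	return err
}
```
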
On the infra nodes, when they are "waiting for a reboot", they will
monitor the state of the ovn-kubernetes network; when they determine
that the tenant node is rebooting, the infra node will then also
reboot.

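In pseudocode terms, the per-node wait loop might look like the sketch
below; how tenant-reboot detection actually works is deliberately left
abstract here, since it is one of the open questions:

```go
// Sketch of the "waiting for a reboot" loop on an infra node. How
// tenantIsRebooting() would be implemented (eg, by watching
// ovn-kubernetes network state) is an open question in this
// proposal, so it is left as a stub.
package dpusync

import (
	"os/exec"
	"time"
)

// tenantIsRebooting is a placeholder for whatever signal the infra
// node uses to decide that its tenant node is going down.
func tenantIsRebooting() bool {
	return false
}

func waitAndReboot() error {
	for !tenantIsRebooting() {
		time.Sleep(10 * time.Second)
	}
	// The new RHCOS image was already queued by the MCO, so a plain
	// reboot picks it up.
	return exec.Command("systemctl", "reboot").Run()
}
```
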
In terms of synchronization, this means that we want to avoid the
tenant cluster reaching the MCO upgrade stage when the infra cluster
is not ready for the tenant cluster to start rebooting nodes:

- One way to do this would be to not let a tenant cluster start an
  upgrade until the infra cluster has already reached the "waiting
  for reboots" stage. This could be done by having an operator in
  the tenant cluster whose status is always `Upgradeable: False`
  except when the infra cluster tells it that it is safe to
  proceed. (A sketch of this approach follows the list below.)

- Another possibility would be to have an operator that sits just
  before the MCO in the upgrade order, which checks if the infra
  cluster is in the middle of an upgrade, and if so, blocks until
  the infra cluster is ready for the tenant cluster to proceed. (Or
  rather, the infra cluster would notify it that it should/shouldn't
  block the upgrade.)

  - This would allow for the possibility of doing tenant cluster
    upgrades without corresponding infra cluster upgrades, which
    might be useful (see Open Questions below).

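To make the first option concrete, here is a hedged sketch (using the
real `openshift/api` types, but with a hypothetical operator name) of
how a tenant-cluster operator could publish the `Upgradeable`
condition based on a signal from the infra cluster:

```go
// Hedged sketch (not an existing operator): hold a tenant-cluster
// ClusterOperator at Upgradeable=False until the infra cluster
// signals that it is safe to proceed. The operator name "dpu" is
// hypothetical.
package dpusync

import (
	"context"

	configv1 "github.com/openshift/api/config/v1"
	configclient "github.com/openshift/client-go/config/clientset/versioned"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

func setUpgradeable(ctx context.Context, c configclient.Interface, infraReady bool) error {
	co, err := c.ConfigV1().ClusterOperators().Get(ctx, "dpu", metav1.GetOptions{})
	if err != nil {
		return err
	}
	cond := configv1.ClusterOperatorStatusCondition{
		Type:               configv1.OperatorUpgradeable,
		Status:             configv1.ConditionFalse,
		LastTransitionTime: metav1.Now(),
		Reason:             "InfraClusterNotReady",
		Message:            "waiting for the infra cluster to reach the waiting-for-reboots stage",
	}
	if infraReady {
		cond.Status = configv1.ConditionTrue
		cond.Reason = "InfraClusterReady"
		cond.Message = "infra cluster is waiting for reboots; tenant upgrade may proceed"
	}
	// A real operator would merge into the existing conditions list
	// rather than overwriting it.
	co.Status.Conditions = []configv1.ClusterOperatorStatusCondition{cond}
	_, err = c.ConfigV1().ClusterOperators().UpdateStatus(ctx, co, metav1.UpdateOptions{})
	return err
}
```

Note that the CVO only honors `Upgradeable: False` for minor-version
upgrades, not z-stream updates, which ties into the z-stream open
question below.
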
### Open Questions

- How exactly will the coordination between the two MCO upgrades
  occur?

- How exactly will infra nodes detect their tenant node rebooting?

- Do we want to allow (z-stream) upgrades of the tenant cluster
  without also upgrading the infra cluster?

### Risks and Mitigations

The infra cluster could remain stuck at the pre-reboot stage of the
upgrade for arbitrarily long. In theory this should not cause any
problems, as all operators must be able to handle running with the old
RHCOS version anyway. But we should make sure that alerts eventually
get fired in this case.

More TBD

### Test Plan

TBD

### Graduation Criteria

TBD

#### Dev Preview -> Tech Preview

#### Tech Preview -> GA

#### Removing a deprecated feature

### Upgrade / Downgrade Strategy

This is a modification to the upgrade process, not something that can
be upgraded or downgraded on its own.

TBD, as the details depend on the eventual design.

### Version Skew Strategy

TBD, as the details depend on the eventual design.

### Operational Aspects of API Extensions

TBD

#### Failure Modes

- The system might get confused and spuriously block upgrades that
  should be allowed.

- Communications failures might lead to upgrades failing without the
  tenant cluster being able to figure out why they failed.

- TBD

#### Support Procedures

TBD

## Implementation History

- Initial proposal: 2021-01-11
- Updated for initial feedback: 2021-01-24

## Drawbacks

This makes the upgrade process more complicated, which risks rendering
clusters un-upgradeable without manual intervention.

However, without some form of synchronization, it is impossible to
have non-disruptive tenant cluster upgrades.

## Alternatives

### Never Reboot the DPUs

This implies never upgrading OCP on the DPUs. I don't see how this
could work.

### Don't Have an Infra Cluster

If the DPUs were not all part of a single OCP cluster (for example, if
they were just "bare" RHCOS hosts, or they were each running
Single-Node OpenShift), then it might be simpler to synchronize the
DPU upgrades with the tenant upgrades, because then each tenant could
coordinate the actions of its own DPU by itself.

The big problem with this is that, for security reasons, we don't want
the tenants to have any control over their DPUs. (For some future use
cases, the DPUs will be used to enforce security policies on their
tenants.)

### More Closely Coordinated Reboots

In the original proposal, the infra and tenant reboots were more
closely coordinated, with the infra and tenant MCOs communicating so
that the infra MCO could reboot each infra node at the same time as
the corresponding tenant node rebooted. But this (probably) turns out
to be unnecessary, as it should always be safe to just reboot the
infra nodes when their tenant nodes reboot, without needing to have
"planned" the reboot.