diff --git a/enhancements/update/synchronized-upgrades.md b/enhancements/update/synchronized-upgrades.md new file mode 100644 index 00000000000..b3ab9d1345f --- /dev/null +++ b/enhancements/update/synchronized-upgrades.md @@ -0,0 +1,291 @@ +--- +title: Synchronized Upgrades Between Clusters +authors: + - "@danwinship" +reviewers: + - "@dhellmann" + - "@sdodson" + - "@wking" + - "@romfreiman" + - "@zshi" +approvers: + - TBD +api-approvers: # in case of new or modified APIs or API extensions (CRDs, aggregated apiservers, webhooks, finalizers) + - TBD +creation-date: 2021-01-11 +last-updated: 2021-01-11 +tracking-link: + - https://issues.redhat.com/browse/SDN-2603 +see-also: + - "/enhancements/network/dpu/overview.md" +--- + +# Synchronized Upgrades Between Clusters + +## Summary + +In a [cluster with DPUs](../network/dpu/overview.md) (eg, BlueField-2 +NICs), the x86 hosts form one OCP cluster, and the DPU ARM systems +form a second OCP cluster. This makes upgrades to new OCP releases +complicated, because there is currently no way to synchronize upgrades +between the two clusters, but rebooting the DPU systems as part of the +MCO upgrade will cause a network outage on the x86 systems. In order +for upgrades to work smoothly, we need to synchronize the reboots +between the two clusters, so that the DPU systems are only rebooted +when their corresponding x86 hosts have been cordoned and drained. + +## Glossary + +(Copied from the [DPU Overview +Enhancement](../network/dpu/overview.md).) + +- **DPU** / **Data Processing Unit** - a Smart NIC with a full CPU, + RAM, storage, etc, running a full operating system, able to offload + network processing and potentially other tasks from its host system. + For example, the NVIDIA BlueField-2. + +- **DPU NIC** - refers to the DPU and its CPU, OS, etc (as opposed to + the CPU, OS, etc, of the x86 server that the DPU is installed in). A + **DPU NIC node** or **DPU NIC worker** is an OCP worker node + running on a DPU NIC. 
+ +- **DPU Host** - refers to an x86 server (and its CPU, OS, etc) which + contains a DPU. A **DPU Host node** or **DPU Host worker** is an OCP + worker node running on a DPU Host. + +- **Two-Cluster Design** - any architecture for deploying an OCP + cluster with (some) nodes containing DPUs, where the DPUs are nodes + in a second OCP cluster. (For purposes of this term, the + architecture still counts as "two-cluster" even if HyperShift is + involved and there are actually three clusters.) + +- **Infra Cluster** - in a Two-Cluster Design, the OCP cluster which + contains the DPU NIC workers. (It may also include master nodes + and/or non-DPU workers, depending on the particular details of the + two-cluster design.) + +- **Tenant Cluster** - in a Two-Cluster Design, the OCP cluster + containing the DPU Host workers. (It may also include master nodes + and/or workers without DPUs.) + +## Motivation + +### Goals + +- Make upgrades work smoothly in clusters running with DPU support, by + synchronizing the reboots of nodes between the infra cluster and the + tenant cluster. + +### Non-Goals + +- Supporting synchronized upgrades of more than 2 clusters at once. + +## Proposal + +### User Stories + +As the administrator of a cluster using DPUs, I want to be able to do +y-stream and z-stream upgrades without causing unnecessary network outages. + +### API Extensions + +TBD + +### Implementation Details/Notes/Constraints + +#### Setup + +We can set some things up at install time (eg, creating credentials to +allow certain operators in the two clusters to talk to each other). + +#### Initiating Upgrades + +As part of the DPU security model, the tenant cluster cannot have any +power over the infra cluster. In particular, it can't be possible for +an administrator in the tenant cluster to force the infra cluster to +upgrade/downgrade to any particular version. + +Thus, the infra cluster upgrade must be initiated on the infra cluster +side. 
In theory, we could have the infra cluster then initiate the +tenant cluster upgrade, but that would require having a tenant cluster +cluster-admin credential lying around in the infra cluster +somewhat unnecessarily. It probably makes more sense to require the +administrator to initiate the upgrade on the tenant side as well. + +#### Inter-Cluster Version Skew + +There is very little explicit communication between the two clusters: + + 1. The dpu-operator in the infra cluster talks to the apiserver in + the tenant cluster. + + 2. The ovn-kubernetes components in the two clusters do not + communicate directly, but they do make some assumptions about + each other's behavior (which is necessary to coordinate the + plumbing of pod networking). + +The apiserver communication is very ordinary and boring and not likely +to be subject to any interesting version skew problems (even if we +ended up with, say, a 3-minor-version skew between the clusters). + +Thus, other than ovn-kubernetes, no OCP component needs to be +concerned about version skew between the two clusters, because the +components in the two clusters are completely unaware of each other. + +For ovn-kubernetes, if we ever change the details of the cross-cluster +communication, then we will need to add proper checks to enforce +tolerable cross-cluster skew at that time. + +## Design Details + +The upgrade plan takes advantage of two things: + + - Other than the reboots, no part of the upgrade needs to be + synchronized between the two clusters. + + - Each infra node only runs pods to support the pods on its + corresponding tenant node. Thus, if a tenant node reboots, it is + safe to immediately reboot its infra node as well without needing + to cordon and drain it, because it is guaranteed that none of its + pods are doing anything important at that point. + +So, both clusters' upgrades can proceed normally up until the MCO +upgrade. 
+ +In the infra cluster, rather than cordoning, draining, and rebooting +nodes one-by-one, the MCO will instead queue up the new RHCOS image on +each node, and then mark the node (somehow) as "waiting for a reboot". +Then the MCO will watch the nodes and wait for each one to reboot (of +its own accord), and then finally mark itself fully-upgraded after +that happens. + +On the infra nodes, when they are "waiting for a reboot", they will +monitor the state of the ovn-kubernetes network; when they determine +that the tenant node is rebooting, the infra node will then also +reboot. + +In terms of synchronization, this means that we want to avoid the +tenant cluster reaching the MCO upgrade stage when the infra cluster +is not ready for the tenant cluster to start rebooting nodes: + + - One way to do this would be to not let a tenant cluster start an + upgrade until the infra cluster has already reached the "waiting + for reboots" stage. This could be done by having an operator in + the tenant cluster whose status is always `Upgradeable: False` + except when the infra cluster tells it that it is safe to + proceed. + + - Another possibility would be to have an operator that sits just + before the MCO in the upgrade order, which checks if the infra + cluster is in the middle of an upgrade, and if so, blocks until + the infra cluster is ready for the tenant cluster to proceed. (Or + rather, the infra cluster would notify it that it should/shouldn't + block the upgrade.) + + - This would allow for the possibility of doing tenant cluster + upgrades without corresponding infra cluster upgrades, which + might be useful? + +### Open Questions + +- How exactly will the coordination between the two MCO upgrades + occur? + +- How exactly will infra nodes detect their tenant node rebooting? + +- Do we want to allow (z-stream) upgrades of the tenant cluster + without also upgrading the infra cluster? 
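The two gating options and the node-level reboot rule described under Design Details can be pictured with a small Python sketch. It is illustrative only. The state names, the `InfraNode` fields, and every function name below are hypothetical and do not correspond to any existing OCP or MCO API. How the infra cluster actually reports its state, and how an infra node detects that its tenant node is rebooting, remain open questions.

```python
from dataclasses import dataclass
from enum import Enum


class InfraUpgradeState(Enum):
    """Hypothetical states the infra cluster might report to the tenant cluster."""
    IDLE = "idle"                        # no infra upgrade in progress
    UPGRADING = "upgrading"              # upgrade started, not yet ready for reboots
    WAITING_FOR_REBOOTS = "waiting"      # new RHCOS image queued on every infra node


def upgradeable_option_one(state: InfraUpgradeState) -> bool:
    """First option: the tenant-side operator reports Upgradeable=False
    except when the infra cluster has said it is safe to proceed."""
    return state is InfraUpgradeState.WAITING_FOR_REBOOTS


def may_proceed_option_two(state: InfraUpgradeState) -> bool:
    """Second option: an operator just before the MCO blocks only while
    the infra cluster is mid-upgrade and not yet ready for reboots.
    This also permits tenant-only upgrades when the infra cluster is idle."""
    return state is not InfraUpgradeState.UPGRADING


@dataclass
class InfraNode:
    """Hypothetical per-node bookkeeping on the infra side."""
    name: str
    waiting_for_reboot: bool = False     # set once the new image is queued
    rebooted: bool = False               # set once the node comes back up


def infra_node_should_reboot(node: InfraNode, tenant_rebooting: bool) -> bool:
    """An infra node reboots only if it is waiting for a reboot and it
    has observed its tenant node rebooting. No cordon or drain is needed."""
    return node.waiting_for_reboot and tenant_rebooting


def infra_upgrade_complete(nodes: list[InfraNode]) -> bool:
    """The infra MCO marks itself fully upgraded once every node has rebooted."""
    return all(node.rebooted for node in nodes)
```

The two gate functions differ only in how they treat an idle infra cluster: the first blocks the tenant upgrade, the second lets it through. That difference is exactly the open question of whether tenant-only upgrades should be allowed.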
+ +### Risks and Mitigations + +The infra cluster could remain stuck at the pre-reboot stage of the +upgrade for an arbitrarily long time. In theory this should not cause any +problems, as all operators must be able to handle running with the old +RHCOS version anyway. But we should make sure that alerts eventually +get fired in this case. + +More TBD + +### Test Plan + +TBD + +### Graduation Criteria + +TBD + +#### Dev Preview -> Tech Preview + +#### Tech Preview -> GA + +### Upgrade / Downgrade Strategy + +This is a modification to the upgrade process, not something that can +be upgraded or downgraded on its own. + +TBD, as the details depend on the eventual design. + +### Version Skew Strategy + +TBD, as the details depend on the eventual design. + +### Operational Aspects of API Extensions + +TBD + +#### Failure Modes + +- The system might get confused and spuriously block upgrades that + should be allowed. + +- Communication failures might lead to upgrades failing without the + tenant cluster being able to figure out why they failed. + +- TBD + +#### Support Procedures + +TBD + +## Implementation History + +- Initial proposal: 2021-01-11 +- Updated for initial feedback: 2021-01-24 + +## Drawbacks + +This makes the upgrade process more complicated, which risks rendering +clusters un-upgradeable without manual intervention. + +However, without some form of synchronization, it is impossible to +have non-disruptive tenant cluster upgrades. + +## Alternatives + +### Never Reboot the DPUs + +This implies never upgrading OCP on the DPUs. I don't see how this +could work. + +### Don't Have an Infra Cluster + +If the DPUs were not all part of a single OCP cluster (for example, +they were just "bare" RHCOS hosts, or they were each running +Single-Node OpenShift), then it might be simpler to synchronize the +DPU upgrades with the tenant upgrades, because then each tenant could +coordinate the actions of its own DPU by itself. 
+ +The big problem with this is that, for security reasons, we don't want +the tenants to have any control over their DPUs. (For some future use +cases, the DPUs will be used to enforce security policies on their +tenants.) + +### More closely coordinated reboots + +In the original proposal, the infra and tenant reboots were more +closely coordinated, with the infra and tenant MCOs communicating so +that the infra MCO could reboot each infra node at the same time as +the corresponding tenant node rebooted. But this (probably) turns out +to be unnecessary, as it should always be safe to just reboot the +infra nodes when their tenant nodes reboot, without needing to have +"planned" the reboot.