---
title: Synchronized Upgrades Between Clusters
authors:
- "@danwinship"
reviewers:
- "@dhellmann"
- "@sdodson"
- "@wking"
- "@romfreiman"
- "@zshi"
approvers:
- TBD
api-approvers: # in case of new or modified APIs or API extensions (CRDs, aggregated apiservers, webhooks, finalizers)
- TBD
creation-date: 2021-01-11
last-updated: 2021-01-11
tracking-link:
- https://issues.redhat.com/browse/SDN-2603
see-also:
- "/enhancements/network/dpu/overview.md"
---

# Synchronized Upgrades Between Clusters

## Summary

Our architecture for [using DPUs (e.g., BlueField-2 NICs) in
OCP](../network/dpu/overview.md) involves creating one ordinary
"tenant" OCP cluster consisting of x86 hosts, and a second "infra" OCP
cluster consisting of the ARM systems running on the NICs installed in
the x86 hosts.

In the current Dev Preview of DPU support, there is no coordination
between the infra and tenant clusters during upgrades. Thus, when
upgrading the infra cluster, each NIC will end up rebooting at some
point, after it has drained _its own_ pods, but without having made
any attempt to drain the pods of its x86 tenant node. When this
happens, the x86 node (and its pods) will lose network connectivity
until the NIC finishes rebooting.

In order to make upgrades in DPU clusters work smoothly, we need some
way to synchronize the reboots between the two clusters, so that the
NICs get rebooted during the period when their corresponding x86 hosts
are cordoned and drained.

## Glossary

(Copied from the [DPU Overview
Enhancement](../network/dpu/overview.md).)

- **DPU** / **Data Processing Unit** - a Smart NIC with a full CPU,
  RAM, storage, etc, running a full operating system, able to offload
  network processing and potentially other tasks from its host system.
  For example, the NVIDIA BlueField-2.

- **DPU NIC** - refers to the DPU and its CPU, OS, etc (as opposed to
  the CPU, OS, etc, of the x86 server that the DPU is installed in). A
  **DPU NIC node** or **DPU NIC worker** is an OCP worker node
  running on a DPU NIC.

- **DPU Host** - refers to an x86 server (and its CPU, OS, etc) which
  contains a DPU. A **DPU Host node** or **DPU Host worker** is an OCP
  worker node running on a DPU Host.

- **Two-Cluster Design** - any architecture for deploying an OCP
  cluster with (some) nodes containing DPUs, where the DPUs are nodes
  in a second OCP cluster. (For purposes of this term, the
  architecture still counts as "two-cluster" even if HyperShift is
  involved and there are actually three clusters.)

- **Infra Cluster** - in a Two-Cluster Design, the OCP cluster which
  contains the DPU NIC workers. (It may also include master nodes
  and/or non-DPU workers, depending on the particular details of the
  two-cluster design.)

- **Tenant Cluster** - in a Two-Cluster Design, the OCP cluster
  containing the DPU Host workers. (It may also include master nodes
  and/or workers without DPUs.)

## Motivation

### Goals

- Make upgrades work smoothly in clusters running with DPU support, by
  synchronizing the reboots of nodes between the infra cluster and the
  tenant cluster.

### Non-Goals

- Supporting synchronized upgrades of more than 2 clusters at once.

## Proposal

### User Stories

As the administrator of a cluster using DPUs, I want to be able to do
y-stream and z-stream upgrades without causing unnecessary network outages.

### API Extensions

TBD

### Implementation Details/Notes/Constraints

#### Setup

We can set some things up at install time (e.g., creating credentials to
allow certain operators in the two clusters to talk to each other).

#### Initiating Upgrades

As part of the DPU security model, the tenant cluster cannot have any
power over the infra cluster. In particular, it must not be possible for
an administrator in the tenant cluster to force the infra cluster to
upgrade/downgrade to any particular version.

Thus, the infra cluster upgrade must be initiated on the infra cluster
side. In theory, we could have the infra cluster then initiate the
tenant cluster upgrade, but that would require having a tenant-cluster
cluster-admin credential lying around in the infra cluster
somewhat unnecessarily. It probably makes more sense to require the
administrator to initiate the upgrade on the tenant side as well.

#### Inter-Cluster Version Skew

There is very little explicit communication between the two clusters:

1. The dpu-operator in the infra cluster talks to the apiserver in
   the tenant cluster.

2. The ovn-kubernetes components in the two clusters do not
   communicate directly, but they do make some assumptions about
   each other's behavior (which is necessary to coordinate the
   plumbing of pod networking).

The apiserver communication is very ordinary and boring and not likely
to be subject to any interesting version skew problems (even if we
ended up with, say, a 3-minor-version skew between the clusters).

Thus, other than ovn-kubernetes, no OCP component needs to be
concerned about version skew between the two clusters, because the
remaining components in the two clusters are completely unaware of
each other.

For ovn-kubernetes, if we ever change the details of the cross-cluster
communication, then we will need to add proper checks to enforce
tolerable cross-cluster skew at that time.

## Design Details

The upgrade plan takes advantage of two things:

- Other than the reboots, no part of the upgrade needs to be
  synchronized between the two clusters.

- Each infra node only runs pods to support the pods on its
  corresponding tenant node. Thus, if a tenant node reboots, it is
  safe to immediately reboot its infra node as well without needing
  to cordon and drain it, because it is guaranteed that none of its
  pods are doing anything important at that point.

So, both clusters' upgrades can proceed normally up until the MCO
upgrade.

In the infra cluster, rather than cordoning, draining, and rebooting
nodes one-by-one, the MCO will instead queue up the new RHCOS image on
each node, and then mark the node (somehow) as "waiting for a reboot".
Then the MCO will watch the nodes and wait for each one to reboot (of
its own accord), and then finally mark itself fully-upgraded after
that happens.

On the infra nodes, when they are "waiting for a reboot", they will
monitor the state of the ovn-kubernetes network; when they determine
that the tenant node is rebooting, the infra node will then also
reboot.
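The per-node behavior described above can be modeled as a small state machine. The following is a sketch only: the state names, the `InfraNode` class, and the `tenant_rebooting` signal are hypothetical, since how a node is marked as "waiting for a reboot" and how it detects its tenant rebooting are still open questions.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Iterable


class InfraNodeState(Enum):
    RUNNING = "running"                        # no upgrade pending
    WAITING_FOR_REBOOT = "waiting-for-reboot"  # new RHCOS image staged by the MCO
    REBOOTING = "rebooting"                    # following the tenant node down
    UPGRADED = "upgraded"                      # booted into the new image


@dataclass
class InfraNode:
    name: str
    state: InfraNodeState = InfraNodeState.RUNNING

    def stage_image(self) -> None:
        """The MCO queues the new image instead of draining and rebooting."""
        if self.state is InfraNodeState.RUNNING:
            self.state = InfraNodeState.WAITING_FOR_REBOOT

    def observe_tenant(self, tenant_rebooting: bool) -> None:
        """Reboot only if an image is staged and the tenant node is rebooting."""
        if self.state is InfraNodeState.WAITING_FOR_REBOOT and tenant_rebooting:
            self.state = InfraNodeState.REBOOTING

    def finish_boot(self) -> None:
        """Called once the node has come back up on the new image."""
        if self.state is InfraNodeState.REBOOTING:
            self.state = InfraNodeState.UPGRADED


def upgrade_complete(nodes: Iterable[InfraNode]) -> bool:
    """The infra MCO reports fully-upgraded once every node has rebooted."""
    return all(node.state is InfraNodeState.UPGRADED for node in nodes)
```

Note that a tenant reboot observed while the node is still `RUNNING` is ignored in this model; the infra node only follows its tenant down when it has a staged image to boot into.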

In terms of synchronization, this means that we want to avoid the
tenant cluster reaching the MCO upgrade stage when the infra cluster
is not ready for the tenant cluster to start rebooting nodes:

- One way to do this would be to not let a tenant cluster start an
  upgrade until the infra cluster has already reached the "waiting
  for reboots" stage. This could be done by having an operator in
  the tenant cluster whose status is always `Upgradeable: False`
  except when the infra cluster tells it that it is safe to
  proceed.

- Another possibility would be to have an operator that sits just
  before the MCO in the upgrade order, which checks if the infra
  cluster is in the middle of an upgrade, and if so, blocks until
  the infra cluster is ready for the tenant cluster to proceed. (Or
  rather, the infra cluster would notify it that it should/shouldn't
  block the upgrade.)

  - This would allow for the possibility of doing tenant cluster
    upgrades without corresponding infra cluster upgrades, which
    might be useful?
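The decision logic shared by both options can be sketched as a single function that maps the last state reported by the infra cluster to a condition on the tenant-side operator. This is a hypothetical illustration: the `InfraPhase` values, the reason strings, and the `allow_tenant_only` switch are assumptions made for the example, not part of any existing operator.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class InfraPhase(Enum):
    IDLE = "idle"                                # no infra upgrade in progress
    UPGRADING = "upgrading"                      # upgrading, images not yet staged
    WAITING_FOR_REBOOTS = "waiting-for-reboots"  # safe for tenant nodes to reboot


@dataclass(frozen=True)
class Condition:
    type: str
    status: str
    reason: str


def upgradeable_condition(infra_phase: Optional[InfraPhase],
                          allow_tenant_only: bool = False) -> Condition:
    """Compute the tenant-side gate from the infra cluster's reported phase.

    infra_phase is None when nothing has been heard from the infra cluster.
    allow_tenant_only models the second option, where a tenant upgrade may
    proceed even though no infra upgrade is in progress.
    """
    if infra_phase is None:
        return Condition("Upgradeable", "False", "InfraClusterUnreachable")
    if infra_phase is InfraPhase.WAITING_FOR_REBOOTS:
        return Condition("Upgradeable", "True", "InfraWaitingForReboots")
    if infra_phase is InfraPhase.IDLE and allow_tenant_only:
        return Condition("Upgradeable", "True", "NoInfraUpgradeInProgress")
    if infra_phase is InfraPhase.IDLE:
        return Condition("Upgradeable", "False", "InfraUpgradeNotStarted")
    return Condition("Upgradeable", "False", "InfraUpgradeInProgress")
```

Reporting a distinct reason when the infra cluster is unreachable would at least tell the tenant administrator why the upgrade is blocked.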

### Open Questions

- How exactly will the coordination between the two MCO upgrades
  occur?

- How exactly will infra nodes detect their tenant node rebooting?

- Do we want to allow (z-stream) upgrades of the tenant cluster
  without also upgrading the infra cluster?

### Risks and Mitigations

The infra cluster could remain stuck at the pre-reboot stage of the
upgrade for an arbitrarily long time. In theory this should not cause
any problems, as all operators must be able to handle running with the
old RHCOS version anyway. But we should make sure that alerts
eventually fire in this case.
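As a rough illustration of the alerting condition, the check below flags nodes that have been waiting for a reboot longer than some threshold. The function name, the per-node timestamp map, and the threshold are all hypothetical; the proposal does not yet say where the "waiting for a reboot" marker or its timestamp would be stored.

```python
from datetime import datetime, timedelta, timezone
from typing import Dict, List


def nodes_stuck_waiting(waiting_since: Dict[str, datetime],
                        now: datetime,
                        threshold: timedelta) -> List[str]:
    """Return the names of nodes that have waited longer than threshold.

    waiting_since maps an infra node name to the time at which the MCO
    marked it as waiting for a reboot (a hypothetical record).
    """
    return sorted(
        name for name, since in waiting_since.items()
        if now - since > threshold
    )


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    waiting = {
        "dpu-0": now - timedelta(hours=30),
        "dpu-1": now - timedelta(hours=2),
    }
    # With a 24-hour threshold, only dpu-0 would trigger an alert.
    print(nodes_stuck_waiting(waiting, now, timedelta(hours=24)))
```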

More TBD

### Test Plan

TBD

### Graduation Criteria

TBD

#### Dev Preview -> Tech Preview

#### Tech Preview -> GA

#### Removing a deprecated feature

### Upgrade / Downgrade Strategy

This is a modification to the upgrade process, not something that can
be upgraded or downgraded on its own.

TBD, as the details depend on the eventual design.

### Version Skew Strategy

TBD, as the details depend on the eventual design.

### Operational Aspects of API Extensions

TBD

#### Failure Modes

- The system might get confused and spuriously block upgrades that
  should be allowed.

- Communications failures might lead to upgrades failing without the
  tenant cluster being able to figure out why they failed.

- TBD

#### Support Procedures

TBD

## Implementation History

- Initial proposal: 2021-01-11
- Updated for initial feedback: 2021-01-24

## Drawbacks

This makes the upgrade process more complicated, which risks rendering
clusters un-upgradeable without manual intervention.

However, without some form of synchronization, it is impossible to
have non-disruptive tenant cluster upgrades.

## Alternatives

### Never Reboot the DPUs

This implies never upgrading OCP on the DPUs. I don't see how this
could work.

### Don't Have an Infra Cluster

If the DPUs were not all part of a single OCP cluster (for example,
they were just "bare" RHCOS hosts, or they were each running
Single-Node OpenShift), then it might be simpler to synchronize the
DPU upgrades with the tenant upgrades, because then each tenant could
coordinate the actions of its own DPU by itself.

The big problem with this is that, for security reasons, we don't want
the tenants to have any control over their DPUs. (For some future use
cases, the DPUs will be used to enforce security policies on their
tenants.)

### More Closely Coordinated Reboots

In the original proposal, the infra and tenant reboots were more
closely coordinated, with the infra and tenant MCOs communicating so
that the infra MCO could reboot each infra node at the same time as
the corresponding tenant node rebooted. But this (probably) turns out
to be unnecessary, as it should always be safe to just reboot the
infra nodes when their tenant nodes reboot, without needing to have
"planned" the reboot.