---
title: Synchronized Upgrades Between Clusters
authors:
- "@danwinship"
reviewers:
- "@dhellmann"
- "@sdodson"
- "@wking"
- "@romfreiman"
- "@zshi"
approvers:
- TBD
api-approvers: # in case of new or modified APIs or API extensions (CRDs, aggregated apiservers, webhooks, finalizers)
- TBD
creation-date: 2021-01-11
last-updated: 2021-01-11
tracking-link:
- https://issues.redhat.com/browse/SDN-2603
see-also:
- "/enhancements/network/dpu/overview.md"
---

# Synchronized Upgrades Between Clusters

## Summary

Our architecture for [using DPUs (eg, BlueField-2 NICs) in
OCP](../network/dpu/overview.md) involves creating one ordinary
"tenant" OCP cluster consisting of x86 hosts, and a second "infra" OCP
cluster consisting of the ARM systems running on the NICs installed in
the x86 hosts.

In the current Dev Preview of DPU support, there is no coordination
between the infra and tenant clusters during upgrades. Thus, when
upgrading the infra cluster, each NIC will end up rebooting at some
point, after it has drained _its own_ pods, but without having made
any attempt to drain the pods of its x86 tenant node. When this
happens, the x86 node (and its pods) will lose network connectivity
until the NIC finishes rebooting.

In order to make upgrades in DPU clusters work smoothly, we need some
way to synchronize the reboots between the two clusters, so that the
NICs get rebooted during the period when their corresponding x86 hosts
are cordoned and drained.

## Glossary

(Copied from the [DPU Overview
Enhancement](../network/dpu/overview.md).)

- **DPU** / **Data Processing Unit** - a Smart NIC with a full CPU,
RAM, storage, etc, running a full operating system, able to offload
network processing and potentially other tasks from its host system.
For example, the NVIDIA BlueField-2.

- **DPU NIC** - refers to the DPU and its CPU, OS, etc (as opposed to
the CPU, OS, etc, of the x86 server that the DPU is installed in). A
**DPU NIC node** or **DPU NIC worker** is an OCP worker node
running on a DPU NIC.

- **DPU Host** - refers to an x86 server (and its CPU, OS, etc) which
contains a DPU. A **DPU Host node** or **DPU Host worker** is an OCP
worker node running on a DPU Host.

- **Two-Cluster Design** - any architecture for deploying an OCP
cluster with (some) nodes containing DPUs, where the DPUs are nodes
in a second OCP cluster. (For purposes of this term, the
architecture still counts as "two-cluster" even if HyperShift is
involved and there are actually three clusters.)

- **Infra Cluster** - in a Two-Cluster Design, the OCP cluster which
contains the DPU NIC workers. (It may also include master nodes
and/or non-DPU workers, depending on the particular details of the
two-cluster design.)

- **Tenant Cluster** - in a Two-Cluster Design, the OCP cluster
containing the DPU Host workers. (It may also include master nodes
and/or workers without DPUs.)

## Motivation

### Goals

- Make upgrades work smoothly in clusters running with DPU support, by
synchronizing the reboots of nodes between the infra cluster and the
tenant cluster.

### Non-Goals

- Supporting synchronized upgrades of more than 2 clusters at once.

## Proposal

### User Stories

As the administrator of a cluster using DPUs, I want to be able to do
y-stream and z-stream upgrades without causing unnecessary network outages.

### API Extensions

TBD

### Implementation Details/Notes/Constraints

#### Setup

We can set some things up at install time (eg, creating credentials to
allow certain operators in the two clusters to talk to each other).
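
Purely as an illustration, here is a rough sketch of how an infra-cluster
operator might consume such a credential, assuming the installer stores a
tenant-cluster kubeconfig as a Secret in the infra cluster. The namespace
(`dpu-operator`), Secret name (`tenant-kubeconfig`), and data key are
placeholders, not a settled API:

```go
// Hypothetical sketch: build a client for the tenant cluster's apiserver
// from a kubeconfig that was stored as a Secret in the infra cluster at
// install time. All names here are illustrative.
package dpucoordination

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

// tenantClientFromSecret builds a tenant-cluster client using a
// credential created at install time.
func tenantClientFromSecret(ctx context.Context, infraClient kubernetes.Interface) (kubernetes.Interface, error) {
	secret, err := infraClient.CoreV1().Secrets("dpu-operator").Get(ctx, "tenant-kubeconfig", metav1.GetOptions{})
	if err != nil {
		return nil, fmt.Errorf("reading tenant credential: %w", err)
	}
	restConfig, err := clientcmd.RESTConfigFromKubeConfig(secret.Data["kubeconfig"])
	if err != nil {
		return nil, err
	}
	return kubernetes.NewForConfig(restConfig)
}
```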

#### Initiating Upgrades

As part of the DPU security model, the tenant cluster cannot have any
power over the infra cluster. In particular, it can't be possible for
an administrator in the tenant cluster to force the infra cluster to
upgrade/downgrade to any particular version.

Thus, the infra cluster upgrade must be initiated on the infra cluster
side. In theory, we could have the infra cluster then initiate the
tenant cluster upgrade, but that would require having a tenant cluster
cluster-admin credential lying around in the infra cluster
somewhat unnecessarily. It probably makes more sense to require the
administrator to initiate the upgrade on the tenant side as well.

#### Inter-Cluster Version Skew

There is very little explicit communication between the two clusters:

1. The dpu-operator in the infra cluster talks to the apiserver in
the tenant cluster.

2. The ovn-kubernetes components in the two clusters do not
communicate directly, but they do make some assumptions about
each other's behavior (which is necessary to coordinate the
plumbing of pod networking).

The apiserver communication is very ordinary and boring and not likely
to be subject to any interesting version skew problems (even if we
ended up with, say, a 3-minor-version skew between the clusters).

Thus, other than ovn-kubernetes, no OCP component needs to be
concerned about version skew between the two clusters, because the
components in the two clusters are completely unaware of each other.

For ovn-kubernetes, if we ever change the details of the cross-cluster
communication, then we will need to add proper checks to enforce
tolerable cross-cluster skew at that time.
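
Purely as an illustration of what such a check might look like (the "at
most one minor version of skew" policy here is an assumption, not
something this proposal defines):

```go
// Illustrative sketch of a skew check ovn-kubernetes could perform if the
// cross-cluster communication ever becomes version-sensitive.
package dpucoordination

import (
	"fmt"
	"strconv"
	"strings"
)

// minorVersion extracts Y from an OCP version string like "4.11.3".
func minorVersion(version string) (int, error) {
	parts := strings.Split(version, ".")
	if len(parts) < 2 {
		return 0, fmt.Errorf("unparseable version %q", version)
	}
	return strconv.Atoi(parts[1])
}

// CheckClusterSkew returns an error if the infra and tenant clusters are
// more than one minor version apart (an assumed, not settled, policy).
func CheckClusterSkew(infraVersion, tenantVersion string) error {
	infraMinor, err := minorVersion(infraVersion)
	if err != nil {
		return err
	}
	tenantMinor, err := minorVersion(tenantVersion)
	if err != nil {
		return err
	}
	if skew := infraMinor - tenantMinor; skew < -1 || skew > 1 {
		return fmt.Errorf("unsupported skew: infra %s vs tenant %s", infraVersion, tenantVersion)
	}
	return nil
}
```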

## Design Details

The upgrade plan takes advantage of two things:

- Other than the reboots, no part of the upgrade needs to be
synchronized between the two clusters.

- Each infra node only runs pods to support the pods on its
corresponding tenant node. Thus, if a tenant node reboots, it is
safe to immediately reboot its infra node as well without needing
to cordon and drain it, because it is guaranteed that none of its
pods are doing anything important at that point.

So, both clusters' upgrades can proceed normally up until the MCO
upgrade.

In the infra cluster, rather than cordoning, draining, and rebooting
nodes one-by-one, the MCO will instead queue up the new RHCOS image on
each node, and then mark the node (somehow) as "waiting for a reboot".
Then the MCO will watch the nodes and wait for each one to reboot (of
its own accord), and then finally mark itself fully-upgraded after
that happens.
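
The exact marking mechanism is deliberately left open; one possibility,
sketched below with a made-up annotation key, is a node annotation that
the MCO sets once the new image has been queued:

```go
// One conceivable way for the infra-cluster MCO to mark a node as
// "waiting for a reboot": a node annotation. The annotation key is
// hypothetical, not a defined MCO API.
package dpucoordination

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/types"
	"k8s.io/client-go/kubernetes"
)

// hypothetical annotation key, for illustration only
const waitingForRebootAnnotation = "machineconfiguration.openshift.io/waiting-for-reboot"

// markWaitingForReboot records that the new RHCOS image has been queued
// on the node and that the MCO is now waiting for the node to reboot of
// its own accord.
func markWaitingForReboot(ctx context.Context, client kubernetes.Interface, nodeName string) error {
	patch := []byte(fmt.Sprintf(`{"metadata":{"annotations":{%q:"true"}}}`, waitingForRebootAnnotation))
	_, err := client.CoreV1().Nodes().Patch(ctx, nodeName, types.MergePatchType, patch, metav1.PatchOptions{})
	return err
}
```

A label or a MachineConfigPool-level state would work equally well; the
sketch just shows the general shape.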

On the infra nodes, when they are "waiting for a reboot", they will
monitor the state of the ovn-kubernetes network; when they determine
that the tenant node is rebooting, the infra node will then also
reboot.
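
How that detection works is an open question (see below); one conceivable
shape for it is a probe loop along these lines, where the address and
timings are placeholders rather than a defined interface:

```go
// Hypothetical sketch: while "waiting for a reboot", the infra node
// probes a health endpoint on the tenant host over the DPU's internal
// interface and reboots itself once the tenant stops answering.
package dpucoordination

import (
	"net"
	"os/exec"
	"time"
)

const (
	tenantHealthAddr = "169.254.169.1:9440" // placeholder tenant-side endpoint
	downThreshold    = 5                    // consecutive failed probes before acting
)

// rebootWhenTenantReboots blocks until the tenant host stops answering
// probes (taken to mean it is rebooting), then reboots the infra node so
// that both come back up together.
func rebootWhenTenantReboots() error {
	failures := 0
	for failures < downThreshold {
		conn, err := net.DialTimeout("tcp", tenantHealthAddr, 2*time.Second)
		if err != nil {
			failures++
		} else {
			conn.Close()
			failures = 0
		}
		time.Sleep(5 * time.Second)
	}
	return exec.Command("systemctl", "reboot").Run()
}
```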

In terms of synchronization, this means that we want to avoid the
tenant cluster reaching the MCO upgrade stage when the infra cluster
is not ready for the tenant cluster to start rebooting nodes:

- One way to do this would be to not let a tenant cluster start an
upgrade until the infra cluster has already reached the "waiting
for reboots" stage. This could be done by having an operator in
the tenant cluster whose status is always `Upgradeable: False`
except when the infra cluster tells it that it is safe to
  proceed (see the sketch after this list).

- Another possibility would be to have an operator that sits just
before the MCO in the upgrade order, which checks if the infra
cluster is in the middle of an upgrade, and if so, blocks until
the infra cluster is ready for the tenant cluster to proceed. (Or
rather, the infra cluster would notify it that it should/shouldn't
block the upgrade.)

- This would allow for the possibility of doing tenant cluster
upgrades without corresponding infra cluster upgrades, which
might be useful?
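
For the first option, a rough sketch of how the tenant-cluster operator
might publish its `Upgradeable` condition follows. The
`dpu-upgrade-coordinator` ClusterOperator name and the helper's shape are
hypothetical:

```go
// Sketch of a tenant-cluster operator that reports Upgradeable=False on
// its ClusterOperator until the infra cluster signals that it is safe to
// proceed. Names are illustrative, not a settled API.
package dpucoordination

import (
	"context"

	configv1 "github.com/openshift/api/config/v1"
	configclient "github.com/openshift/client-go/config/clientset/versioned"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// setUpgradeable publishes whether the tenant cluster may begin an
// upgrade, based on what the infra cluster has told us.
func setUpgradeable(ctx context.Context, client configclient.Interface, infraReady bool) error {
	co, err := client.ConfigV1().ClusterOperators().Get(ctx, "dpu-upgrade-coordinator", metav1.GetOptions{})
	if err != nil {
		return err
	}

	cond := configv1.ClusterOperatorStatusCondition{
		Type:               configv1.OperatorUpgradeable,
		Status:             configv1.ConditionFalse,
		Reason:             "InfraClusterNotReady",
		Message:            "Waiting for the infra cluster to reach the waiting-for-reboots stage",
		LastTransitionTime: metav1.Now(),
	}
	if infraReady {
		cond.Status = configv1.ConditionTrue
		cond.Reason = "InfraClusterReady"
		cond.Message = "The infra cluster is ready for tenant nodes to reboot"
	}

	// Replace any existing Upgradeable condition with the new one.
	var conditions []configv1.ClusterOperatorStatusCondition
	for _, c := range co.Status.Conditions {
		if c.Type != configv1.OperatorUpgradeable {
			conditions = append(conditions, c)
		}
	}
	co.Status.Conditions = append(conditions, cond)

	_, err = client.ConfigV1().ClusterOperators().UpdateStatus(ctx, co, metav1.UpdateOptions{})
	return err
}
```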

### Open Questions

- How exactly will the coordination between the two MCO upgrades
occur?

- How exactly will infra nodes detect their tenant node rebooting?

- Do we want to allow (z-stream) upgrades of the tenant cluster
without also upgrading the infra cluster?

### Risks and Mitigations

The infra cluster could remain stuck at the pre-reboot stage of the
upgrade for an arbitrarily long time. In theory this should not cause any
problems, as all operators must be able to handle running with the old
RHCOS version anyway. But we should make sure that alerts eventually
get fired in this case.

More TBD

### Test Plan

TBD

### Graduation Criteria

TBD

#### Dev Preview -> Tech Preview

#### Tech Preview -> GA

#### Removing a deprecated feature

### Upgrade / Downgrade Strategy

This is a modification to the upgrade process, not something that can
be upgraded or downgraded on its own.

TBD, as the details depend on the eventual design.

### Version Skew Strategy

TBD, as the details depend on the eventual design.

### Operational Aspects of API Extensions

TBD

#### Failure Modes

- The system might get confused and spuriously block upgrades that
should be allowed.

- Communications failures might lead to upgrades failing without the
tenant cluster being able to figure out why they failed.

- TBD

#### Support Procedures

TBD

## Implementation History

- Initial proposal: 2021-01-11
- Updated for initial feedback: 2021-01-24

## Drawbacks

This makes the upgrade process more complicated, which risks rendering
clusters un-upgradeable without manual intervention.

However, without some form of synchronization, it is impossible to
have non-disruptive tenant cluster upgrades.

## Alternatives

### Never Reboot the DPUs

This implies never upgrading OCP on the DPUs. I don't see how this
could work.

### Don't Have an Infra Cluster

If the DPUs were not all part of a single OCP cluster (for example,
if they were just "bare" RHCOS hosts, or they were each running
Single-Node OpenShift), then it might be simpler to synchronize the
DPU upgrades with the tenant upgrades, because then each tenant could
coordinate the actions of its own DPU by itself.

The big problem with this is that, for security reasons, we don't want
the tenants to have any control over their DPUs. (For some future use
cases, the DPUs will be used to enforce security policies on their
tenants.)

### More Closely Coordinated Reboots

In the original proposal, the infra and tenant reboots were more
closely coordinated, with the infra and tenant MCOs communicating so
that the infra MCO could reboot each infra node at the same time as
the corresponding tenant node rebooted. But this (probably) turns out
to be unnecessary, as it should always be safe to just reboot the
infra nodes when their tenant nodes reboot, without needing to have
"planned" the reboot.
