
Non-disruptive upgrades in DPU clusters #1005

Closed

Conversation

Contributor

@danwinship danwinship commented Jan 12, 2022

In a cluster with DPUs (e.g., BlueField-2 NICs), the x86 hosts form one OCP cluster, and the DPU ARM systems form a second OCP cluster. This makes upgrades to new OCP releases complicated: there is currently no way to synchronize upgrades between the two clusters, and rebooting the BF-2 systems as part of the MCO upgrade causes a network outage on the x86 systems. For upgrades to work smoothly, we need some way to synchronize the reboots between the two clusters, so that the BF-2 systems are only rebooted after their corresponding x86 hosts have been cordoned and drained, ensuring that tenant nodes do not lose networking while they are running user workloads.

cc @zshi @pliurh @Billy99

authors:
- "@danwinship"
reviewers:
- TBD
Contributor

@sdodson and someone from the OTA team like @LalatenduMohanty or @wking should be included as reviewers

## Alternatives

The fundamental problem is that rebooting the DPU causes a network
outage on the tenant.
Contributor

If the tenant cluster is not running any workloads, does that network outage cause any real issues? Could this problem be addressed by an outside orchestration tool cordoning and draining both the tenant and DPU clusters, upgrading the tenant cluster, then upgrading the DPU cluster, then removing the cordon from both?

Contributor Author

The tenant cluster is the x86 hosts; it's the infra cluster (the NICs) that doesn't run user workloads.

(Probably I should copy some of the glossary in from the DPU overview doc rather than just referencing it...)

Could this problem be addressed by an outside orchestration tool cordoning and draining both the tenant and DPU clusters

That would be an even larger outage... this doesn't work for the same reason we can't do normal OCP upgrades by first cordoning and draining the entire cluster. (The tenant cluster is basically a totally ordinary OCP cluster that people are doing totally ordinary OCP things on, except for the fact that most/all of the worker nodes in the tenant cluster have NICs that also run Linux and need to be managed.)

they were just "bare" RHCOS hosts, or they were each running
Single-Node OpenShift), then it might be simpler to synchronize the
DPU upgrades with the tenant upgrades, because then each tenant could
coordinate the actions of its own DPU by itself.
Contributor

Are the DPUs linked together to form a multi-node cluster across DPUs in a host? Or across multiple hosts?

Contributor Author

@danwinship danwinship Jan 13, 2022

Multiple hosts. Every worker in the tenant cluster has 0 or 1 DPUs (and if they have a DPU, it's their primary network interface). All of the DPUs together (plus, currently, 3 dedicated ARM masters) form the infra cluster. (Eventually HyperShift will get rid of the need for the ARM masters.)

@romfreiman

@tsorya


### Open Questions

Basically everything...


There is an effort underway to build a central upgrade service for Telco. I assume we can reuse some bits from it.

Contributor Author

got any pointers to that?

- The version to upgrade to must be available to both clusters
(ie, it must be available for both x86 and ARM).

- This could be implemented via some sort of "dpu-cluster-upgrade"


I would not have communication between the clusters, given that the goal of the separation is security. Why not have an external upgrade operator (running on the hub) that can correlate and orchestrate upgrade flows between the clusters?


The goal is to provide central management, not a standalone system that orchestrates itself.

Contributor Author

What "hub"?

You can't completely isolate the clusters, since that would imply, eg, that the infra cluster wouldn't know anything about what pods are running in the tenant cluster, and so it wouldn't be able to route pod traffic correctly.

The goal is to make sure that in places where the clusters do need to communicate, that the infra cluster is in control, not the tenant cluster.

Eventually we can move everything to HyperShift, and maybe in that case the upgrade could be managed externally to the cluster. It should be easy to write the code so that for now the infra dpu-cluster-upgrade operator and infra MCO run in "leader" mode and the tenant dpu-cluster-upgrade operator/MCO run in "follower" mode, but in the future, they could both run in follower mode, with the leader being external.

NICs), the x86 hosts form one OCP cluster, and the DPU ARM systems
form a second OCP cluster. This makes upgrades to new OCP releases
complicated, because there is currently no way to synchronize upgrades
between the two clusters, but rebooting the BF-2 systems as part of
Member

nit: the overview.md link was helpful, but can we stick to terms from the glossary (like "DPU NIC") instead of using "BF-2" (a specific subset of possible DPU NICs)?

### User Stories

As the administrator of a cluster using DPUs, I want to be able to do
z-stream upgrades without causing unnecessary network outages.
Member

Do minor version upgrades not benefit from this effort as well? If so what's unique about them that requires they be disruptive?

Contributor Author

No, this applies to y-stream upgrades too. It's just that you might be able to convince people that y-stream upgrades are allowed to have a little bit of disruption since they're rare, but you can't make that argument for z-stream upgrades.

Comment on lines 28 to 32
the MCO upgrade will cause a network outage on the x86 systems. In
order for upgrades to work smoothly, we need to synchronize the
reboots between the two clusters, so that the BF-2 systems are only
rebooted when their corresponding x86 hosts have been cordoned and
drained.
Member

Here we're talking about synchronizing the reboots. Does anything else need to be synchronized? Could everything on the infra cluster including SDN/OVN pods be updated ahead of host cluster SDN/OVN pods or are there other couplings to be concerned with?

Contributor Author

Currently (in dev preview) the only coupling between the two clusters is with ovn-kubernetes. This enhancement may result in other components becoming coupled too (MCO, "dpu-cluster-upgrade" operator, etc).

I'm just assuming there's no way we can avoid there ever being skewed versions between the clusters, and hence we will just have to cope with that (eg, any API changes between them will have to be eased in over multiple releases).

So yes, everything on the infra cluster could be updated ahead of OVN on the host cluster.

Comment on lines 128 to 133
- The Infra MCO will then upgrade the infra node (causing it to
reboot and temporarily break network connectivity to the tenant
node).

- Once the infra node upgrade completes, the Tenant MCO will
reboot and upgrade the tenant node.
Member

Does rebooting the tenant node implicitly restart the infra node or does the infra node remain online during a tenant node reboot? Does this vary by either infra node or tenant node hardware or even configuration such as BIOS settings?

Contributor Author

@danwinship danwinship Jan 14, 2022

In our lab, rebooting the tenant does not restart the NIC, and AFAIK this does not vary by hardware. (i.e., ACPI/EFI say that on reboot you send a PCI reset, but you don't cut power to the bus.) IIRC an earlier (prerelease) version of the BF-2 did reboot when the host rebooted (I guess because it was designed to reboot on PCI reset), but the GA ones do not. [EDIT: It appears that this was just user error, and the BF-2s never automatically rebooted.]

If there are situations where the NIC reboots on host reboot (including the possibility of other non-BF-2 DPUs in the future that behave that way), then it would only be a small change to the procedure here; instead of draining both nodes, rebooting the NIC, and then rebooting the host, we would drain both nodes, then reboot the host without explicitly rebooting the NIC.

Comment on lines 80 to 88
- As part of the DPU security model, the tenant cluster cannot have
any power over the infra cluster. (In particular, it can't be
possible for an administrator in the tenant cluster to force the
infra cluster to upgrade/downgrade to any particular version.) Thus,
the upgrade must be initiated on the infra cluster side, and the
infra side will tell the tenant cluster to upgrade as well.
(Alternatively, the upgrade must be simultaneous-ish-ly initiated in
both clusters, if we don't want the infra cluster to have to have a
credential that lets it initiate an upgrade in the tenant cluster.)
Member

However, tenant node state inhibiting updates of infra node is not a violation of the security model?

Contributor Author

A good point. We should think about alerts in the case where the tenant node is thwarting the infra node.

(I'm kind of assuming that any human who has cluster-admin in the infra cluster will also have cluster-admin in the tenant cluster, and so the same person would kick off the upgrade in both clusters. In the future there may be reasonable use cases where that isn't true...)

Comment on lines 137 to 140
- One way to do this would be to have a CRD with an array of hosts,
indicating the ordering, and the current status of each host, and
the two MCOs could update and watch for updates to monitor each
other's progress.
Member

How do the infra and tenant nodes known which correspond to one another?

Contributor Author

I forget, but it's already implemented. (They need that info to make OVN offload function.) I think currently the tenant nodes annotate their Node with the MAC of the DPU or something, and then the infra nodes can match themselves up. In the future when we have a more synchronized install system as well, then we can just figure it all out at install time.
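For illustration, here is a minimal sketch of how the infra side might match itself to its tenant Node via a DPU-MAC annotation, as described above. The annotation key, kubeconfig path, and MAC value are assumptions for the example, not the actual implementation:

```python
from kubernetes import client, config

TENANT_KUBECONFIG = "/etc/dpu/tenant-kubeconfig"   # assumed path to a tenant-cluster kubeconfig
DPU_MAC_ANNOTATION = "dpu.openshift.io/dpu-mac"    # hypothetical annotation key

def find_tenant_node(local_dpu_mac: str):
    """Return the name of the tenant Node whose annotation matches this DPU's MAC."""
    config.load_kube_config(config_file=TENANT_KUBECONFIG)
    v1 = client.CoreV1Api()
    for node in v1.list_node().items:
        annotations = node.metadata.annotations or {}
        if annotations.get(DPU_MAC_ANNOTATION, "").lower() == local_dpu_mac.lower():
            return node.metadata.name
    return None

if __name__ == "__main__":
    # MAC shown here is made up; in practice it would be read from the DPU itself.
    print(find_tenant_node("0c:42:a1:00:00:01"))
```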

However, without some form of synchronization, it is impossible to
have non-disruptive tenant cluster upgrades.

## Alternatives
Member

If the infra node workload only services the local tenant node then the infra cluster could update including staging machineconfig and RHCOS changes. Once that completes the tenant cluster updates and the MCO reboots tenant nodes which reboots the infra node into the updated RHCOS?

If an unexpected reboot seems like a concern perhaps the MCO on the infra node could be made to only apply the new desired configuration on a soft restart assuming rebooting the tenant triggers a soft reboot of the infra node?

Contributor Author

So as above, rebooting the tenant doesn't reboot the infra node, although I suspect there's probably some way we can detect when the host reboots.

So I guess with what you're suggesting, the only coordination that would be needed would be that you have to run the upgrade to almost-completion in the infra cluster, and then you start the upgrade in the tenant cluster, and as the tenant cluster upgrade completes it will also cause the completion of the infra cluster upgrade. Right? There'd still need to be (small) MCO changes to prevent the automatic reboot in the infra cluster.

Member

Yeah that's what I was thinking.

Contributor Author

It looks like it might not be possible for the BF-2 to directly detect a host reboot, at least when the card is in "isolated" mode.

But the infra node knows what pod-network pods are running on the tenant node, so it can infer that the tenant has been drained when (most of?) the pod-network pods go away. (If necessary, the dpu-operator could confirm this by looking at the Node object in the tenant cluster.) So we could say, if we have a reboot pending, and the tenant gets drained, then reboot. Even if we guessed wrong and the drain wasn't actually because of the MCO upgrade on the tenant, we still would at least avoid most of the disruption.
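A rough sketch of that "infer the tenant was drained" heuristic: count the pod-network (non-hostNetwork), non-DaemonSet pods on the tenant node and treat the node as drained when almost none remain. The kubeconfig path and the threshold are illustrative assumptions, not part of the design:

```python
from kubernetes import client, config

def tenant_looks_drained(tenant_node: str, threshold: int = 1) -> bool:
    """Heuristic: the tenant node looks drained when few pod-network pods remain."""
    config.load_kube_config(config_file="/etc/dpu/tenant-kubeconfig")  # assumed path
    v1 = client.CoreV1Api()
    pods = v1.list_pod_for_all_namespaces(
        field_selector=f"spec.nodeName={tenant_node},status.phase=Running"
    ).items
    pod_network = [
        p for p in pods
        if not p.spec.host_network
        and not any(o.kind == "DaemonSet" for o in (p.metadata.owner_references or []))
    ]
    return len(pod_network) <= threshold
```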

Member

@mrunalp mrunalp Feb 23, 2022

Are there any files shared between the two? If so, the tenant node could write a file that the infra node watches/polls to detect when it is ready to reboot.

Member

Reading a bit more, it doesn't look like there is.

Member

The two systems don't share a kernel (or a common CPU architecture even), so they can't share files.


- The Infra MCO will cordon and drain that host's infra node, and
the Tenant MCO will cordon and drain that host's tenant node.
(This can happen in parallel.)
Contributor

This may impose another synchronization requirement at the application-pod level, concerning which destination nodes the drained pods (infra and host) are placed on, since the DPU pods need to be rescheduled to the destination DPU node whose x86 host the application pods will be rescheduled to.

- An upgrade should not be able to start unless both clusters are able
to upgrade.

- In particular:
Contributor

Do we want to mention version compatibility for ovn-k8s or RHCOS (driver) during upgrade, or is that too detailed to be mentioned in this enhancement?


## Design Details

### Open Questions
Contributor

Will this apply to hypershift as well?

@danwinship danwinship force-pushed the synchronized-upgrades branch from 8033a92 to 6d2e3f9 Compare January 24, 2022 21:59
@danwinship
Contributor Author

Updated to reflect various comments. In particular, it now uses a less-synchronized synchronization, as suggested by Scott.

@danwinship danwinship force-pushed the synchronized-upgrades branch 2 times, most recently from 5b78f34 to b465086 Compare February 17, 2022 17:00
@danwinship
Contributor Author

/cc @cgwalters @kikisdeliveryservice @sinnykumari @yuqi-zhang

Need some MCO input on this.

Someone also pointed me to https://issues.redhat.com/browse/MCO-206 which seems like it could end up being useful here...

Contributor

@kikisdeliveryservice kikisdeliveryservice left a comment

I have a lot of questions on this (and I'm sure other team members also will once they read the proposal) since I have no background on this feature outside of the enhancement.

I really think there needs to be a conversation with the MCO as this proposes major changes to fundamental behaviors and needs some scrutiny, discussion and planning (outside of discussion on the actual enhancement PR).

cc: @craychee

So, both clusters' upgrades can proceed normally up until the MCO
upgrade.

In the infra cluster, rather than cordoning, draining, and rebooting
Contributor

@kikisdeliveryservice kikisdeliveryservice Feb 17, 2022

I don't quite understand what this paragraph is intending. It doesn't seem like having a bunch of nodes half-updated makes sense (which is what waiting only for the reboot would seem to imply), since an upgrade isn't just applying a new osImageURL (and some upgrades don't have a new osImageURL). Couldn't we pause the pool, or do something more in line with the current MCO?

Member

I think the phrase "queue up the new image" is assuming we have openshift/machine-config-operator#1190 but unfortunately we don't today. So I'd agree we need to do a per pool/node pause approach.
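For reference, pausing a pool is already possible by setting spec.paused on the MachineConfigPool, which stops the MCO from rolling configuration (and reboots) onto its nodes. A hedged sketch using the Kubernetes Python client; the pool name is illustrative and whether this is the mechanism the enhancement would use is an open question:

```python
from kubernetes import client, config

def set_pool_paused(pool: str, paused: bool):
    """Pause or unpause a MachineConfigPool so the MCO stops (or resumes) rolling nodes."""
    config.load_kube_config()
    custom = client.CustomObjectsApi()
    custom.patch_cluster_custom_object(
        group="machineconfiguration.openshift.io",
        version="v1",
        plural="machineconfigpools",
        name=pool,
        body={"spec": {"paused": paused}},
    )

# e.g. set_pool_paused("dpu-worker", True) before staging the upgrade,
# then set_pool_paused("dpu-worker", False) once reboots are allowed.
```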


### Risks and Mitigations

The infra cluster could remain stuck at the pre-reboot stage of the
Contributor

@kikisdeliveryservice kikisdeliveryservice Feb 17, 2022

The pre-reboot stage means that the upgrade has been performed except for the final reboot; that doesn't seem like a state we want all nodes to stay in at all. Also, what happens if the update is bad? All nodes would be scheduled to apply the update, as opposed to the rolling mechanism that we have now.

Contributor Author

this doesn't seem like a state we want all nodes to stay in at all?

Yeah, note that this is under "Risks"

Contributor Author

Also what happens if the update is bad? All nodes would be scheduled to make the update as opposed to a rolling mechanism that we have now.

Yeah... so if you look at the first draft of the enhancement, I had suggested a more complicated system, where the MCOs of the two clusters would agree on a particular order to upgrade nodes in, and then each individual node upgrade would be synchronized. Scott suggested that we didn't actually need that much synchronization and I rewrote the enhancement to the current form. But maybe that won't work...

@kikisdeliveryservice
Contributor

Also cc: @aravindhp @mrunalp

@danwinship
Contributor Author

I have a lot of questions on this (and I'm sure other team members also will once they read the proposal) since I have no background on this feature outside of the enhancement.

So, abstractly, there is a "tenant" cluster, which is more-or-less a normal OCP cluster, except that its worker nodes aren't connected directly to the network; instead, each one is stuck behind its own personal firewall that monitors and filters its network access:

  +-----------------+    +-----------------+   +-----------------+
  | Tenant Master 1 |    | Tenant Worker A |   | Tenant Worker B |  ...
  +-------+---------+    +--------+--------+   +--------+--------+
          |                       |                     |
          |                       |                     |
          |              +--------+--------+   +--------+--------+
          |              |   Firewall A    |   |   Firewall B    |  ...
          |              +--------+--------+   +--------+--------+
          |                       |                     |         
          |                       |                     |         
----------+-----------------------+---------------------+-----------

We need some way to deal with installing and managing and upgrading all of these firewall hosts, and it turns out that OCP is a good way to do that... so, we create a second OCP cluster (the "infra" cluster), and then each firewall host is actually a worker node in that cluster:


  +-----------------+    +-----------------+   +-----------------+
  | Tenant Master 1 |    | Tenant Worker A |   | Tenant Worker B |  ...
  +-------+---------+    +--------+--------+   +--------+--------+
          |                       |                     |
          |                       |                     |
          |              +--------+--------+   +--------+--------+    +----------------+
          |              | Infra Worker A  |   | Infra Worker B  |    | Infra Master 1 |  ...
          |              +--------+--------+   +--------+--------+    +------+---------+
          |                       |                     |                    |
          |                       |                     |                    |
----------+-----------------------+---------------------+--------------------+-----------

So the problem (for purposes of this enhancement) is that when we want to do an upgrade of the infra cluster, we (generally) have to reboot each of the infra workers. But when an infra worker reboots, its corresponding tenant loses network connectivity until it comes back. So the goal is: only reboot infra workers when their tenant nodes are also rebooting. (The infra nodes do not run user workloads, they only run the stuff we deploy on them, so we know it's fine to just randomly reboot them whenever we want to, as long as their tenant is idle.)

(Of course, the infra workers are not actually physically separate hosts; they're ARM systems running on the NICs of the tenant workers. But that doesn't really matter for this enhancement; the important part is just that we have to be careful when we reboot them.)

@dhellmann
Contributor

I have a lot of questions on this (and I'm sure other team members also will once they read the proposal) since I have no background on this feature outside of the enhancement.

So, abstractly, there is a "tenant" cluster, which is more-or-less a normal OCP cluster, except that its worker nodes aren't connected directly to the network; instead, each one is stuck behind its own personal firewall that monitors and filters its network access:

Including some version of that explanation & the diagrams in the doc would be good. You could do them as ASCII art, or experiment with GitHub's new support for Mermaid (https://github.blog/2022-02-14-include-diagrams-markdown-files-mermaid/).

@derekwaynecarr
Member

is there a way we could model this around Leases?

https://github.com/kubernetes/enhancements/tree/master/keps/sig-node/589-efficient-node-heartbeats

could the infra cluster take a lease on the tenant node, and the mco would not reboot a node with an active lease?

other related art around node maintenance holds:
https://github.com/kubevirt/node-maintenance-operator
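As a sketch of that Lease idea: the infra-side operator could hold a coordination.k8s.io Lease named after the tenant node, and a (hypothetical) MCO check would decline to reboot a node whose Lease is still being renewed. The namespace, naming scheme, and the MCO honoring such Leases are all assumptions here:

```python
import datetime
from kubernetes import client, config

def hold_maintenance_lease(tenant_node: str, holder: str = "dpu-operator"):
    """Create/renew a Lease signalling 'do not reboot this node right now'."""
    config.load_kube_config(config_file="/etc/dpu/tenant-kubeconfig")  # assumed path
    coord = client.CoordinationV1Api()
    lease = client.V1Lease(
        metadata=client.V1ObjectMeta(name=f"maintenance-{tenant_node}"),
        spec=client.V1LeaseSpec(
            holder_identity=holder,
            lease_duration_seconds=60,
            renew_time=datetime.datetime.now(datetime.timezone.utc),
        ),
    )
    coord.create_namespaced_lease(namespace="openshift-dpu-operator", body=lease)
```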

@derekwaynecarr
Member

just recording input from our discussion:

  • ideally we can do this with no MCO changes
  • run a canary pod on the infra cluster that will cordon the tenant node when it is going down
  • if the canary goes down, we should project some condition back to the tenant node (e.g. NetworkNotReady) so new pods don't get scheduled to that tenant node

basically, let's not assume a shared maintenance window for the MVP.
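A sketch of the "project a condition back to the tenant node" idea: set a NetworkUnavailable-style condition on the tenant Node so the scheduler stops placing new pods there (NetworkUnavailable is the standard node condition that triggers the network-unavailable taint). The condition reason, kubeconfig path, and required credentials are assumptions:

```python
import datetime
from kubernetes import client, config

def mark_tenant_network_unavailable(tenant_node: str):
    """Patch a NetworkUnavailable=True condition onto the tenant Node."""
    config.load_kube_config(config_file="/etc/dpu/tenant-kubeconfig")  # assumed path
    v1 = client.CoreV1Api()
    now = datetime.datetime.now(datetime.timezone.utc).isoformat()
    patch = {"status": {"conditions": [{
        "type": "NetworkUnavailable",
        "status": "True",
        "reason": "DPURebooting",                          # illustrative reason
        "message": "infra (DPU) node is rebooting for an upgrade",
        "lastHeartbeatTime": now,
        "lastTransitionTime": now,
    }]}}
    v1.patch_node_status(tenant_node, patch)
```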

Comment on lines 117 to 122
Thus, the infra cluster upgrade must be initiated on the infra cluster
side. In theory, we could have the infra cluster then initiate the
tenant cluster upgrade, but that would require having a tenant cluster
cluster-admin admin credential lying around in the infra cluster
somewhat unnecessarily. It probably makes more sense to require the
administrator to initiate the upgrade on the tenant side as well.
Member

What credentials does the dpu-operator have? I'm guessing it's at least somewhat privileged today? Glancing at e.g. https://github.com/openshift/dpu-network-operator/blob/7750537749e734dfb6edfd25fc3e674398eef548/config/rbac/role.yaml - that's equivalent to cluster admin (although maybe only on the infra cluster).

ISTM it'd be a better user experience if the admin only needs to initiate upgrades in one place, and we can certainly design a system to do that securely.

If we don't require the admin to do this manually by default (or script it) then we don't need to be debugging failure cases where they didn't do it.

Comment on lines 155 to 159
- Each infra node only runs pods to support the pods on its
corresponding tenant node. Thus, if a tenant node reboots, it is
safe to immediately reboot its infra node as well without needing
to cordon and drain it, because it is guaranteed that none of its
pods are doing anything important at that point.
Member

OK...I have a potentially dumb question - when the tenant reboots, the infra node can (by default) just continue running? I would have thought that there'd be some platform initialization (e.g. of the PCI bus) that might break that, but is it really more that the infra node maintains electrical power and just keeps running?

Member

Sorry, I am now more carefully reading earlier conversation and this was already answered at #1005 (comment)


Comment on lines 171 to 174
On the infra nodes, when they are "waiting for a reboot", they will
monitor the state of the ovn-kubernetes network; when they determine
that the tenant node is rebooting, the infra node will then also
reboot.
Member

Is this reboot also a drain? Or can we assume there are no non-daemonset workloads on the infra nodes, and we can just kill everything and reboot?

Comment on lines 299 to 302
the corresponding tenant node rebooted. But this (probably) turns out
to be unnecessary, as it should always be safe to just reboot the
infra nodes when their tenant nodes reboot, without needing to have
"planned" the reboot.
Member

OK, this is answering my question above: infra nodes don't need to drain (or at least, the drain should be predictable and fast).

But it also seems like, in the long term, this is heading toward a conflict with the above goal of doing more on the DPUs.

@danwinship danwinship force-pushed the synchronized-upgrades branch from b465086 to c31566e Compare March 2, 2022 14:40
@danwinship danwinship changed the title Synchronized upgrades in DPU clusters Non-disruptive upgrades in DPU clusters Mar 2, 2022
@openshift-ci
Contributor

openshift-ci bot commented Mar 2, 2022

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
To complete the pull request process, please ask for approval from danwinship after the PR has been reviewed.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@danwinship
Contributor Author

OK, updated based on a call with Mrunal, Derek, and Scott; we no longer need any MCO/CVO changes. Instead, we just ensure that if an infra node is being drained (e.g., by the Infra MCO as part of an infra cluster upgrade), then we also drain its corresponding tenant node, so that if/when the infra node gets rebooted, its tenant will already have been drained.

Since this no longer needs any changes in other components, I've moved the enhancement from enhancements/update/ to enhancements/network/dpu/
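To make that drain-mirroring step concrete, a very rough sketch of cordoning the tenant node and evicting its non-DaemonSet pods. Error handling, retries, PDB-aware backoff, and the tenant kubeconfig path are omitted or assumed; this is not the operator's actual code:

```python
from kubernetes import client, config

def mirror_drain(tenant_node: str):
    """Cordon the tenant node and evict its non-DaemonSet pods."""
    config.load_kube_config(config_file="/etc/dpu/tenant-kubeconfig")  # assumed path
    v1 = client.CoreV1Api()
    # Cordon: mark the tenant node unschedulable.
    v1.patch_node(tenant_node, {"spec": {"unschedulable": True}})
    # Evict every non-DaemonSet pod running on the tenant node.
    pods = v1.list_pod_for_all_namespaces(
        field_selector=f"spec.nodeName={tenant_node}"
    ).items
    for pod in pods:
        owners = pod.metadata.owner_references or []
        if any(o.kind == "DaemonSet" for o in owners):
            continue
        eviction = client.V1Eviction(
            metadata=client.V1ObjectMeta(
                name=pod.metadata.name, namespace=pod.metadata.namespace
            )
        )
        v1.create_namespaced_pod_eviction(
            pod.metadata.name, pod.metadata.namespace, eviction
        )
```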

Comment on lines +173 to +175
Annoyingly, although we need this pod to run on every worker node, it
can't be deployed by a DaemonSet, because DaemonSet pods are not
subject to draining. Thus, we will have to have the DPU operator
Member

The daemonset draining is configurable I think; if we wanted to do that we could. See e.g. https://github.com/openshift/machine-config-operator/blob/62fb49f2d14519e215125faa6b817662cdb7ffd0/pkg/daemon/daemon.go#L316

That said, we'd probably need to change the MCO to opt-in specific daemonsets to being drained. This also relates to openshift/machine-config-operator#2978 I think.

Member

(And in practice, we probably do need to care about things other than the MCO doing draining, so relying on a daemonset drain is probably not viable in the near future)

Contributor Author

The choice isn't "drain DaemonSets" or "don't drain DaemonSets". It's "bail out with an error if there are DaemonSets since they can't be drained" or "don't bail out with an error if there are DaemonSets". They never get drained.


So, the "drain mirroring" controller of the DPU operator will:

- Deploy a "drain blocker" pod to each infra worker node. This pod
Member

nit: linter has some list-indent opinions:

enhancements/network/dpu/upgrades.md:183:1 MD007/ul-indent Unordered list indentation [Expected: 0; Actual: 2]
enhancements/network/dpu/upgrades.md:188:1 MD007/ul-indent Unordered list indentation [Expected: 0; Actual: 2]
...

After some discussion, we decided we didn't need that much
synchronization.

[the original proposal]: https://github.com/openshift/enhancements/commit/8033a923
Member

A definite downside of the github-markdown thing we have is that a lot of important commentary and information lives only in PR review, and that earlier revisions get lost. As far as I understand things, since that commit is not part of any branch, it will be subject to GC and potentially lost.

Maybe we could handle this a bit crudely by having explicit versions, something like sticking that content at enhancements/network/dpu/upgrades-v0.md?

Dunno.

Contributor Author

In my experience, github doesn't ever GC revisions that were commented on as part of a PR.

(Yes, there's that "I can't find those revisions anywhere" error but I think that's more about it trying to stop you from accidentally following out-of-date links.)

Member

@wking wking Mar 4, 2022

Instead of relying on GitHub's history preservation, we can just inline the relevant bits in this alternative section.

In this case "we didn't need that" is a bit fuzzy. I think the relevant bits were something like:

We realized that all we actually needed was "before I drain the Infra Node, drain its Tenant Node", and that we could achieve that fairly simply. Adding simultaneous, node-for-node matched updates was going to be extremely complicated, and didn't bring any additional value.

Comment on lines +156 to +159
Therefore, if an Infra Node is being cordoned and drained, we should
respond by cordoning and draining its Tenant Node, and not allowing
the Infra Node to be fully drained until the Tenant has also been
drained.
Member

But the MCO on the infra node is still going to drain all the other pods on the infra node except for the drain-mirror pod; wouldn't that lead to disruption? Or is it that the main workload today on the infra node is the SDN DaemonSet and hence won't be drained?

Contributor Author

DPU NIC nodes do not currently run any non-DaemonSet pods, and that's not expected to change. So in fact, the drain-blocker pod is actually the only pod that will get drained.

(I've been imprecise with terminology here... I talk about "infra nodes" and "infra workers", but there are actually three machinesets in the infra cluster: "master", "worker", and "dpu-worker", and it's specifically the "dpu-worker" nodes we're talking about here, and they don't run any general-purpose pods. They only run pods that are specifically part of providing functionality to the corresponding DPU Host Node.)
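One plausible way (the enhancement's actual mechanism is not shown in the excerpts above) for the drain-blocker pod to hold up the infra-node drain is a PodDisruptionBudget with maxUnavailable: 0 selecting that pod: eviction fails until the DPU operator relaxes the PDB after the tenant node has been drained. All names below are assumptions:

```python
from kubernetes import client, config

NAMESPACE = "openshift-dpu-operator"  # assumed namespace

def block_drain(infra_node: str):
    """Create a PDB that makes the node's drain-blocker pod un-evictable."""
    config.load_kube_config()
    pdb = client.V1PodDisruptionBudget(
        metadata=client.V1ObjectMeta(name=f"drain-blocker-{infra_node}"),
        spec=client.V1PodDisruptionBudgetSpec(
            max_unavailable=0,
            selector=client.V1LabelSelector(
                match_labels={"app": "drain-blocker", "node": infra_node}
            ),
        ),
    )
    client.PolicyV1Api().create_namespaced_pod_disruption_budget(NAMESPACE, pdb)

def unblock_drain(infra_node: str):
    """Called once the tenant node has been cordoned and drained; lets the drain finish."""
    config.load_kube_config()
    client.PolicyV1Api().delete_namespaced_pod_disruption_budget(
        f"drain-blocker-{infra_node}", NAMESPACE
    )
```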


In order to prevent MCO from rebooting an Infra node until its Tenant
has been drained, we will need to run at least one drainable pod on
each node, so that we can refuse to let that pod be drained until the
Member

Let's clarify here and say "infra node". In general, throughout the text I think we need to be very clear about which type of node we mean. Perhaps a shorthand like "inode" and "tnode"? Or "infra-node" and "tenant-node"?

drained.

Given that behavior, when an Infra Cluster upgrade occurs, if it is
necessary to reboot nodes, then each node's tenant should be drained
Contributor

Whenever we reboot a DPU, we also have to reboot the host, so that the VFs can be recreated and the VF representors can be discovered by the DPU. Without the VF representors, the ovnkube-node pod cannot start on the DPU.


- If the infra node does not currently have a "drain blocker"
pod, then deploy a new one to it.

Contributor

As I mentioned above, we need to add a step to reboot the tenant node after the DPU reboot. However, I am not quite sure how DPU can do that. One option is to add a reboot node API to MCO and let DPU invoke that.

Contributor Author

Hm... do we actually need to reboot? It seems to me like there are two possibilities here:

  1. Reboot the tenant when we reboot the infra
  2. Make the tenant notice when the infra reboots, and properly recreate its VFs and stuff when that happens

Right? We don't have the code to do 2 now, but since we don't have the code to do 1 now either, we have to write code either way. And it seems like making the tenant more able to recover from unexpected infra problems would be a good thing in general?

Contributor

OK, if we want to do 2, shall we also describe the logic in this doc?
The SR-IOV network operator needs to be involved: we need to update it to watch PF connectivity and recreate the VFs after the infra node comes back from a reboot.

@openshift-bot

Inactive enhancement proposals go stale after 28d of inactivity.

See https://github.com/openshift/enhancements#life-cycle for details.

Mark the proposal as fresh by commenting /remove-lifecycle stale.
Stale proposals rot after an additional 7d of inactivity and eventually close.
Exclude this proposal from closing by commenting /lifecycle frozen.

If this proposal is safe to close now please do so with /close.

/lifecycle stale

@openshift-ci openshift-ci bot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Apr 14, 2022
@openshift-bot

Stale enhancement proposals rot after 7d of inactivity.

See https://github.com/openshift/enhancements#life-cycle for details.

Mark the proposal as fresh by commenting /remove-lifecycle rotten.
Rotten proposals close after an additional 7d of inactivity.
Exclude this proposal from closing by commenting /lifecycle frozen.

If this proposal is safe to close now please do so with /close.

/lifecycle rotten
/remove-lifecycle stale

@openshift-ci openshift-ci bot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Apr 21, 2022
@openshift-bot

Rotten enhancement proposals close after 7d of inactivity.

See https://github.com/openshift/enhancements#life-cycle for details.

Reopen the proposal by commenting /reopen.
Mark the proposal as fresh by commenting /remove-lifecycle rotten.
Exclude this proposal from closing again by commenting /lifecycle frozen.

/close

@openshift-ci openshift-ci bot closed this Apr 28, 2022
@openshift-ci
Contributor

openshift-ci bot commented Apr 28, 2022

@openshift-bot: Closed this PR.

In response to this:

Rotten enhancement proposals close after 7d of inactivity.

See https://github.com/openshift/enhancements#life-cycle for details.

Reopen the proposal by commenting /reopen.
Mark the proposal as fresh by commenting /remove-lifecycle rotten.
Exclude this proposal from closing again by commenting /lifecycle frozen.

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@danwinship
Contributor Author

/reopen
/remove-lifecycle rotten

@openshift-ci openshift-ci bot reopened this May 3, 2022
@openshift-ci
Contributor

openshift-ci bot commented May 3, 2022

@danwinship: Reopened this PR.

In response to this:

/reopen
/remove-lifecycle rotten

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@openshift-ci openshift-ci bot removed the lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. label May 3, 2022
@openshift-ci
Contributor

openshift-ci bot commented May 3, 2022

@danwinship: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
| --- | --- | --- | --- | --- |
| ci/prow/markdownlint | c31566e | link | true | /test markdownlint |

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@openshift-bot

Inactive enhancement proposals go stale after 28d of inactivity.

See https://github.com/openshift/enhancements#life-cycle for details.

Mark the proposal as fresh by commenting /remove-lifecycle stale.
Stale proposals rot after an additional 7d of inactivity and eventually close.
Exclude this proposal from closing by commenting /lifecycle frozen.

If this proposal is safe to close now please do so with /close.

/lifecycle stale

@openshift-ci openshift-ci bot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label May 31, 2022
@openshift-bot

Stale enhancement proposals rot after 7d of inactivity.

See https://github.com/openshift/enhancements#life-cycle for details.

Mark the proposal as fresh by commenting /remove-lifecycle rotten.
Rotten proposals close after an additional 7d of inactivity.
Exclude this proposal from closing by commenting /lifecycle frozen.

If this proposal is safe to close now please do so with /close.

/lifecycle rotten
/remove-lifecycle stale

@openshift-ci openshift-ci bot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Jun 7, 2022
@openshift-bot

Rotten enhancement proposals close after 7d of inactivity.

See https://github.com/openshift/enhancements#life-cycle for details.

Reopen the proposal by commenting /reopen.
Mark the proposal as fresh by commenting /remove-lifecycle rotten.
Exclude this proposal from closing again by commenting /lifecycle frozen.

/close

@openshift-ci openshift-ci bot closed this Jun 14, 2022
@openshift-ci
Contributor

openshift-ci bot commented Jun 14, 2022

@openshift-bot: Closed this PR.

In response to this:

Rotten enhancement proposals close after 7d of inactivity.

See https://github.com/openshift/enhancements#life-cycle for details.

Reopen the proposal by commenting /reopen.
Mark the proposal as fresh by commenting /remove-lifecycle rotten.
Exclude this proposal from closing again by commenting /lifecycle frozen.

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@tsorya

tsorya commented Nov 10, 2022

@danwinship can we reopen the enhancement?
