
Non-disruptive upgrades in DPU clusters #1005

Closed

Conversation

Contributor

@danwinship danwinship commented Jan 12, 2022

In a cluster with DPUs (e.g., BlueField-2 NICs), the x86 hosts form one OCP cluster, and the DPU ARM systems form a second OCP cluster. This makes upgrades to new OCP releases complicated: there is currently no way to synchronize upgrades between the two clusters, and rebooting the BF-2 systems as part of the MCO upgrade causes a network outage on the x86 systems. For upgrades to work smoothly, we need some way to synchronize the reboots between the two clusters, so that the BF-2 systems are only rebooted after their corresponding x86 hosts have been cordoned and drained, ensuring that tenant nodes do not lose networking while they are running user workloads.

cc @zshi @pliurh @Billy99

authors:
- "@danwinship"
reviewers:
- TBD
Contributor

@sdodson and someone from the OTA team like @LalatenduMohanty or @wking should be included as reviewers

## Alternatives

The fundamental problem is that rebooting the DPU causes a network
outage on the tenant.
Contributor

If the tenant cluster is not running any workloads, does that network outage cause any real issues? Could this problem be addressed by an outside orchestration tool cordoning and draining both the tenant and DPU clusters, upgrading the tenant cluster, then upgrading the DPU cluster, then removing the cordon from both?

Contributor Author

The tenant cluster is the x86 hosts; it's the infra cluster (the NICs) that doesn't run user workloads.

(Probably I should copy some of the glossary in from the DPU overview doc rather than just referencing it...)

Could this problem be addressed by an outside orchestration tool cordoning and draining both the tenant and DPU clusters

That would be an even larger outage... this doesn't work for the same reason we can't do normal OCP upgrades by first cordoning and draining the entire cluster. (The tenant cluster is basically a totally ordinary OCP cluster that people are doing totally ordinary OCP things on, except for the fact that most/all of the worker nodes in the tenant cluster have NICs that also run Linux and need to be managed.)

they were just "bare" RHCOS hosts, or they were each running
Single-Node OpenShift), then it might be simpler to synchronize the
DPU upgrades with the tenant upgrades, because then each tenant could
coordinate the actions of its own DPU by itself.
Contributor

Are the DPUs linked together to form a multi-node cluster across DPUs in a host? Or across multiple hosts?

Contributor Author

@danwinship danwinship Jan 13, 2022

Multiple hosts. Every worker in the tenant cluster has 0 or 1 DPUs (and if they have a DPU, it's their primary network interface). All of the DPUs together (plus, currently, 3 dedicated ARM masters) form the infra cluster. (Eventually HyperShift will get rid of the need for the ARM masters.)

@romfreiman

@tsorya


### Open Questions

Basically everything...


There is an effort underway to build a central upgrade service for Telco. I assume we can reuse some bits from it.

Contributor Author

got any pointers to that?

- The version to upgrade to must be available to both clusters
(ie, it must be available for both x86 and ARM).

- This could be implemented via some sort of "dpu-cluster-upgrade"


I would not have communication between the clusters, given that the goal of the separation is security. Why not have an external upgrade operator (running on the hub) that can correlate and orchestrate upgrade flows between the clusters?


The goal is to provide central management, not a standalone system that orchestrates itself.

Contributor Author

What "hub"?

You can't completely isolate the clusters, since that would imply, eg, that the infra cluster wouldn't know anything about what pods are running in the tenant cluster, and so it wouldn't be able to route pod traffic correctly.

The goal is to make sure that in places where the clusters do need to communicate, that the infra cluster is in control, not the tenant cluster.

Eventually we can move everything to HyperShift, and maybe in that case the upgrade could be managed externally to the cluster. It should be easy to write the code so that for now the infra dpu-cluster-upgrade operator and infra MCO run in "leader" mode and the tenant dpu-cluster-upgrade operator/MCO run in "follower" mode, but in the future, they could both run in follower mode, with the leader being external.

NICs), the x86 hosts form one OCP cluster, and the DPU ARM systems
form a second OCP cluster. This makes upgrades to new OCP releases
complicated, because there is currently no way to synchronize upgrades
between the two clusters, but rebooting the BF-2 systems as part of
Member

nit: the overview.md link was helpful, but can we stick to terms from the glossary (like "DPU NIC") instead of using "BF-2" (a specific subset of possible DPU NICs)?

### User Stories

As the administrator of a cluster using DPUs, I want to be able to do
z-stream upgrades without causing unnecessary network outages.
Member

Do minor version upgrades not benefit from this effort as well? If so what's unique about them that requires they be disruptive?

Contributor Author

No, this applies to y-stream upgrades too. It's just that you might be able to convince people that y-stream upgrades are allowed to have a little bit of disruption since they're rare, but you can't make that argument for z-stream upgrades.

Comment on lines 28 to 32
the MCO upgrade will cause a network outage on the x86 systems. In
order for upgrades to work smoothly, we need to synchronize the
reboots between the two clusters, so that the BF-2 systems are only
rebooted when their corresponding x86 hosts have been cordoned and
drained.
Member

Here we're talking about synchronizing the reboots. Does anything else need to be synchronized? Could everything on the infra cluster including SDN/OVN pods be updated ahead of host cluster SDN/OVN pods or are there other couplings to be concerned with?

Contributor Author

Currently (in dev preview) the only coupling between the two clusters is with ovn-kubernetes. This enhancement may result in other components becoming coupled too (MCO, "dpu-cluster-upgrade" operator, etc).

I'm just assuming there's no way we can avoid there ever being skewed versions between the clusters, and hence we will just have to cope with that (eg, any API changes between them will have to be eased in over multiple releases).

So yes, everything on the infra cluster could be updated ahead of OVN on the host cluster.

Comment on lines 128 to 133
- The Infra MCO will then upgrade the infra node (causing it to
reboot and temporarily break network connectivity to the tenant
node).

- Once the infra node upgrade completes, the Tenant MCO will
reboot and upgrade the tenant node.
Member

Does rebooting the tenant node implicitly restart the infra node or does the infra node remain online during a tenant node reboot? Does this vary by either infra node or tenant node hardware or even configuration such as BIOS settings?

Contributor Author

@danwinship danwinship Jan 14, 2022

In our lab, rebooting the tenant does not restart the NIC, and AFAIK this does not vary by hardware. (i.e., ACPI/EFI say that on reboot you send a PCI reset, but you don't cut power to the bus.) IIRC an earlier (prerelease) version of the BF-2 did reboot when the host rebooted (I guess because it was designed to reboot on PCI reset), but the GA ones do not. [EDIT: It appears that this was just user error, and the BF-2s never automatically rebooted.]

If there are situations where the NIC reboots on host reboot (including the possibility of other non-BF-2 DPUs in the future that behave that way), then it would only be a small change to the procedure here; instead of draining both nodes, rebooting the NIC, and then rebooting the host, we would drain both nodes, then reboot the host without explicitly rebooting the NIC.

Comment on lines 80 to 88
- As part of the DPU security model, the tenant cluster cannot have
any power over the infra cluster. (In particular, it can't be
possible for an administrator in the tenant cluster to force the
infra cluster to upgrade/downgrade to any particular version.) Thus,
the upgrade must be initiated on the infra cluster side, and the
infra side will tell the tenant cluster to upgrade as well.
(Alternatively, the upgrade must be simultaneous-ish-ly initiated in
both clusters, if we don't want the infra cluster to have to have a
credential that lets it initiate an upgrade in the tenant cluster.)
Member

However, tenant node state inhibiting updates of infra node is not a violation of the security model?

Contributor Author

A good point. We should think about alerts in the case where the tenant node is thwarting the infra node.

(I'm kind of assuming that any human who has cluster-admin in the infra cluster will also have cluster-admin in the tenant cluster, and so the same person would kick off the upgrade in both clusters. In the future there may be reasonable use cases where that isn't true...)

Comment on lines 137 to 140
- One way to do this would be to have a CRD with an array of hosts,
indicating the ordering, and the current status of each host, and
the two MCOs could update and watch for updates to monitor each
other's progress.
Member

How do the infra and tenant nodes known which correspond to one another?

Contributor Author

I forget, but it's already implemented. (They need that info to make OVN offload function.) I think currently the tenant nodes annotate their Node with the MAC of the DPU or something, and then the infra nodes can match themselves up. In the future when we have a more synchronized install system as well, then we can just figure it all out at install time.
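For illustration, here is a minimal sketch of how the infra side might match itself to its tenant Node via a DPU-MAC annotation, as described above. The annotation key, kubeconfig path, and MAC value are assumptions for the example, not the actual implementation:

```python
from kubernetes import client, config

TENANT_KUBECONFIG = "/etc/dpu/tenant-kubeconfig"   # assumed path to a tenant-cluster kubeconfig
DPU_MAC_ANNOTATION = "dpu.openshift.io/dpu-mac"    # hypothetical annotation key

def find_tenant_node(local_dpu_mac: str):
    """Return the name of the tenant Node whose annotation matches this DPU's MAC."""
    config.load_kube_config(config_file=TENANT_KUBECONFIG)
    v1 = client.CoreV1Api()
    for node in v1.list_node().items:
        annotations = node.metadata.annotations or {}
        if annotations.get(DPU_MAC_ANNOTATION, "").lower() == local_dpu_mac.lower():
            return node.metadata.name
    return None

if __name__ == "__main__":
    # MAC shown here is made up; in practice it would be read from the DPU itself.
    print(find_tenant_node("0c:42:a1:00:00:01"))
```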

However, without some form of synchronization, it is impossible to
have non-disruptive tenant cluster upgrades.

## Alternatives
Member

If the infra node workload only services the local tenant node then the infra cluster could update including staging machineconfig and RHCOS changes. Once that completes the tenant cluster updates and the MCO reboots tenant nodes which reboots the infra node into the updated RHCOS?

If an unexpected reboot seems like a concern perhaps the MCO on the infra node could be made to only apply the new desired configuration on a soft restart assuming rebooting the tenant triggers a soft reboot of the infra node?

Contributor Author

So as above, rebooting the tenant doesn't reboot the infra node, although I suspect there's probably some way we can detect when the host reboots.

So I guess with what you're suggesting, the only coordination that would be needed would be that you have to run the upgrade to almost-completion in the infra cluster, and then you start the upgrade in the tenant cluster, and as the tenant cluster upgrade completes it will also cause the completion of the infra cluster upgrade. Right? There'd still need to be (small) MCO changes to prevent the automatic reboot in the infra cluster.

Member

Yeah that's what I was thinking.

Contributor Author

It looks like it might not be possible for the BF-2 to directly detect a host reboot, at least when the card is in "isolated" mode.

But the infra node knows what pod-network pods are running on the tenant node, so it can infer that the tenant has been drained when (most of?) the pod-network pods go away. (If necessary, the dpu-operator could confirm this by looking at the Node object in the tenant cluster.) So we could say, if we have a reboot pending, and the tenant gets drained, then reboot. Even if we guessed wrong and the drain wasn't actually because of the MCO upgrade on the tenant, we still would at least avoid most of the disruption.
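A rough sketch of that "infer the tenant was drained" heuristic: count the pod-network (non-hostNetwork), non-DaemonSet pods on the tenant node and treat the node as drained when almost none remain. The kubeconfig path and the threshold are illustrative assumptions, not part of the design:

```python
from kubernetes import client, config

def tenant_looks_drained(tenant_node: str, threshold: int = 1) -> bool:
    """Heuristic: the tenant node looks drained when few pod-network pods remain."""
    config.load_kube_config(config_file="/etc/dpu/tenant-kubeconfig")  # assumed path
    v1 = client.CoreV1Api()
    pods = v1.list_pod_for_all_namespaces(
        field_selector=f"spec.nodeName={tenant_node},status.phase=Running"
    ).items
    pod_network = [
        p for p in pods
        if not p.spec.host_network
        and not any(o.kind == "DaemonSet" for o in (p.metadata.owner_references or []))
    ]
    return len(pod_network) <= threshold
```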

Member

@mrunalp mrunalp Feb 23, 2022

Are there any files shared between the two? If so, the tenant node could write a file that the infra node watches/polls to detect when it is ready to reboot.

Member

Reading a bit more, it doesn't look like there is.

Member

The two systems don't share a kernel (or a common CPU architecture even), so they can't share files.


- The Infra MCO will cordon and drain that host's infra node, and
the Tenant MCO will cordon and drain that host's tenant node.
(This can happen in parallel.)
Contributor

This may impose another synchronization requirement at the application-pod level, concerning which destination nodes the drained pods (infra and host) are placed on, since the DPU pods need to be rescheduled to the destination DPU node whose x86 host the application pods will be rescheduled to.

- An upgrade should not be able to start unless both clusters are able
to upgrade.

- In particular:
Contributor

Do we want to mention version compatibility for ovn-k8s or RHCOS (driver) during upgrade, or is that too detailed to be mentioned in this enhancement?


## Design Details

### Open Questions
Contributor

Will this apply to hypershift as well?

@danwinship danwinship force-pushed the synchronized-upgrades branch from 8033a92 to 6d2e3f9 Compare January 24, 2022 21:59
@danwinship
Contributor Author

Updated to reflect various comments. In particular, it now uses a less-synchronized synchronization, as suggested by Scott.

@danwinship danwinship force-pushed the synchronized-upgrades branch 2 times, most recently from 5b78f34 to b465086 Compare February 17, 2022 17:00
@danwinship
Contributor Author

/cc @cgwalters @kikisdeliveryservice @sinnykumari @yuqi-zhang

Need some MCO input on this.

Someone also pointed me to https://issues.redhat.com/browse/MCO-206 which seems like it could end up being useful here...

Contributor

@kikisdeliveryservice kikisdeliveryservice left a comment

I have a lot of questions on this (and I'm sure other team members also will once they read the proposal) since I have no background on this feature outside of the enhancement.

I really think there needs to be a conversation with the MCO as this proposes major changes to fundamental behaviors and needs some scrutiny, discussion and planning (outside of discussion on the actual enhancement PR).

cc: @craychee

So, both clusters' upgrades can proceed normally up until the MCO
upgrade.

In the infra cluster, rather than cordoning, draining, and rebooting
Contributor

@kikisdeliveryservice kikisdeliveryservice Feb 17, 2022

I don't quite understand what this paragraph is intending. It doesn't seem like having a bunch of nodes half-updated makes sense (which is what waiting only for the reboot would seem to imply), since an upgrade isn't just applying a new osImageURL (and some upgrades don't have a new osImageURL). Couldn't we pause the pool, or do something more in line with the current MCO?

Member

I think the phrase "queue up the new image" is assuming we have openshift/machine-config-operator#1190 but unfortunately we don't today. So I'd agree we need to do a per pool/node pause approach.
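For reference, pausing a pool is already possible by setting spec.paused on the MachineConfigPool, which stops the MCO from rolling configuration (and reboots) onto its nodes. A hedged sketch using the Kubernetes Python client; the pool name is illustrative and whether this is the mechanism the enhancement would use is an open question:

```python
from kubernetes import client, config

def set_pool_paused(pool: str, paused: bool):
    """Pause or unpause a MachineConfigPool so the MCO stops (or resumes) rolling nodes."""
    config.load_kube_config()
    custom = client.CustomObjectsApi()
    custom.patch_cluster_custom_object(
        group="machineconfiguration.openshift.io",
        version="v1",
        plural="machineconfigpools",
        name=pool,
        body={"spec": {"paused": paused}},
    )

# e.g. set_pool_paused("dpu-worker", True) before staging the upgrade,
# then set_pool_paused("dpu-worker", False) once reboots are allowed.
```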


### Risks and Mitigations

The infra cluster could remain stuck at the pre-reboot stage of the
Contributor

@kikisdeliveryservice kikisdeliveryservice Feb 17, 2022

The pre-reboot stage means that the upgrade has been performed except for the final reboot; that doesn't seem like a state we want all nodes to stay in at all. Also, what happens if the update is bad? All nodes would be scheduled to apply the update, as opposed to the rolling mechanism that we have now.

Contributor Author

this doesn't seem like a state we want all nodes to stay in at all?

Yeah, note that this is under "Risks"

Contributor Author

Also what happens if the update is bad? All nodes would be scheduled to make the update as opposed to a rolling mechanism that we have now.

Yeah... so if you look at the first draft of the enhancement, I had suggested a more complicated system, where the MCOs of the two clusters would agree on a particular order to upgrade nodes in, and then each individual node upgrade would be synchronized. Scott suggested that we didn't actually need that much synchronization and I rewrote the enhancement to the current form. But maybe that won't work...

@kikisdeliveryservice
Contributor

Also cc: @aravindhp @mrunalp

@danwinship
Contributor Author

I have a lot of questions on this (and I'm sure other team members also will once they read the proposal) since I have no background on this feature outside of the enhancement.

So, abstractly, there is a "tenant" cluster, which is more-or-less a normal OCP cluster, except that its worker nodes aren't connected directly to the network; instead, each one is stuck behind its own personal firewall that monitors and filters its network access:

  +-----------------+    +-----------------+   +-----------------+
  | Tenant Master 1 |    | Tenant Worker A |   | Tenant Worker B |  ...
  +-------+---------+    +--------+--------+   +--------+--------+
          |                       |                     |
          |                       |                     |
          |              +--------+--------+   +--------+--------+
          |              |   Firewall A    |   |   Firewall B    |  ...
          |              +--------+--------+   +--------+--------+
          |                       |                     |         
          |                       |                     |         
----------+-----------------------+---------------------+-----------

We need some way to deal with installing and managing and upgrading all of these firewall hosts, and it turns out that OCP is a good way to do that... so, we create a second OCP cluster (the "infra" cluster), and then each firewall host is actually a worker node in that cluster:


  +-----------------+    +-----------------+   +-----------------+
  | Tenant Master 1 |    | Tenant Worker A |   | Tenant Worker B |  ...
  +-------+---------+    +--------+--------+   +--------+--------+
          |                       |                     |
          |                       |                     |
          |              +--------+--------+   +--------+--------+    +----------------+
          |              | Infra Worker A  |   | Infra Worker B  |    | Infra Master 1 |  ...
          |              +--------+--------+   +--------+--------+    +------+---------+
          |                       |                     |                    |
          |                       |                     |                    |
----------+-----------------------+---------------------+--------------------+-----------

So the problem (for purposes of this enhancement) is that when we want to do an upgrade of the infra cluster, we (generally) have to reboot each of the infra workers. But when an infra worker reboots, its corresponding tenant loses network connectivity until it comes back. So the goal is: only reboot infra workers when their tenant nodes are also rebooting. (The infra nodes do not run user workloads, they only run the stuff we deploy on them, so we know it's fine to just randomly reboot them whenever we want to, as long as their tenant is idle.)

(Of course, the infra workers are not actually physically separate hosts; they're ARM systems running on the NICs of the tenant workers. But that doesn't really matter for this enhancement; the important part is just that we have to be careful when we reboot them.)

@dhellmann
Contributor

I have a lot of questions on this (and I'm sure other team members also will once they read the proposal) since I have no background on this feature outside of the enhancement.

So, abstractly, there is a "tenant" cluster, which is more-or-less a normal OCP cluster, except that its worker nodes aren't connected directly to the network; instead, each one is stuck behind its own personal firewall that monitors and filters its network access:

Including some version of that explanation & the diagrams in the doc would be good. You could do them as ASCII art, or experiment with GitHub's new support for Mermaid (https://github.blog/2022-02-14-include-diagrams-markdown-files-mermaid/).

@derekwaynecarr
Member

is there a way we could model this around Leases?

https://github.com/kubernetes/enhancements/tree/master/keps/sig-node/589-efficient-node-heartbeats

could the infra cluster take a lease on the tenant node, and the mco would not reboot a node with an active lease?

other related art around node maintenance holds:
https://github.com/kubevirt/node-maintenance-operator
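As a sketch of that Lease idea: the infra-side operator could hold a coordination.k8s.io Lease named after the tenant node, and a (hypothetical) MCO check would decline to reboot a node whose Lease is still being renewed. The namespace, naming scheme, and the MCO honoring such Leases are all assumptions here:

```python
import datetime
from kubernetes import client, config

def hold_maintenance_lease(tenant_node: str, holder: str = "dpu-operator"):
    """Create/renew a Lease signalling 'do not reboot this node right now'."""
    config.load_kube_config(config_file="/etc/dpu/tenant-kubeconfig")  # assumed path
    coord = client.CoordinationV1Api()
    lease = client.V1Lease(
        metadata=client.V1ObjectMeta(name=f"maintenance-{tenant_node}"),
        spec=client.V1LeaseSpec(
            holder_identity=holder,
            lease_duration_seconds=60,
            renew_time=datetime.datetime.now(datetime.timezone.utc),
        ),
    )
    coord.create_namespaced_lease(namespace="openshift-dpu-operator", body=lease)
```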

@derekwaynecarr
Member

just recording input from our discussion:

  • ideally we can do this with no MCO changes
  • run a canary pod on the infra cluster that will cordon the tenant node when it is going down
  • if the canary goes down, we should project some condition back to the tenant node (e.g. NetworkNotReady) so new pods don't get scheduled to that tenant node

basically, let's not assume a shared maintenance window for the MVP.
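A sketch of the "project a condition back to the tenant node" idea: set a NetworkUnavailable-style condition on the tenant Node so the scheduler stops placing new pods there (NetworkUnavailable is the standard node condition that triggers the network-unavailable taint). The condition reason, kubeconfig path, and required credentials are assumptions:

```python
import datetime
from kubernetes import client, config

def mark_tenant_network_unavailable(tenant_node: str):
    """Patch a NetworkUnavailable=True condition onto the tenant Node."""
    config.load_kube_config(config_file="/etc/dpu/tenant-kubeconfig")  # assumed path
    v1 = client.CoreV1Api()
    now = datetime.datetime.now(datetime.timezone.utc).isoformat()
    patch = {"status": {"conditions": [{
        "type": "NetworkUnavailable",
        "status": "True",
        "reason": "DPURebooting",                          # illustrative reason
        "message": "infra (DPU) node is rebooting for an upgrade",
        "lastHeartbeatTime": now,
        "lastTransitionTime": now,
    }]}}
    v1.patch_node_status(tenant_node, patch)
```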

Comment on lines 117 to 122
Thus, the infra cluster upgrade must be initiated on the infra cluster
side. In theory, we could have the infra cluster then initiate the
tenant cluster upgrade, but that would require having a tenant cluster
cluster-admin admin credential lying around in the infra cluster
somewhat unnecessarily. It probably makes more sense to require the
administrator to initiate the upgrade on the tenant side as well.
Member

What credentials does the dpu-operator have? I'm guessing it's at least somewhat privileged today? Glancing at e.g. https://github.com/openshift/dpu-network-operator/blob/7750537749e734dfb6edfd25fc3e674398eef548/config/rbac/role.yaml - that's equivalent to cluster admin (although maybe only on the infra cluster).

ISTM it'd be a better user experience if the admin only needs to initiate upgrades in one place, and we can certainly design a system to do that securely.

If we don't require the admin to do this manually by default (or script it) then we don't need to be debugging failure cases where they didn't do it.

Comment on lines 155 to 159
- Each infra node only runs pods to support the pods on its
corresponding tenant node. Thus, if a tenant node reboots, it is
safe to immediately reboot its infra node as well without needing
to cordon and drain it, because it is guaranteed that none of its
pods are doing anything important at that point.
Member

OK...I have a potentially dumb question - when the tenant reboots, the infra node can (by default) just continue running? I would have thought that there'd be some platform initialization (e.g. of the PCI bus) that might break that, but is it really more that the infra node maintains electrical power and just keeps running?

Member

Sorry, I am now more carefully reading earlier conversation and this was already answered at #1005 (comment)


Comment on lines 171 to 174
On the infra nodes, when they are "waiting for a reboot", they will
monitor the state of the ovn-kubernetes network; when they determine
that the tenant node is rebooting, the infra node will then also
reboot.
Member

Is this reboot also a drain? Or can we assume there are no non-daemonset workloads on the infra nodes, and we can just kill everything and reboot?

Comment on lines 299 to 302
the corresponding tenant node rebooted. But this (probably) turns out
to be unnecessary, as it should always be safe to just reboot the
infra nodes when their tenant nodes reboot, without needing to have
"planned" the reboot.
Member

OK, this is answering my question above: infra nodes don't need to drain (or at least, the drain should be predictable and fast).

But it also seems like, in the long term, this is heading toward a conflict with the above goal of doing more on the DPUs.

@danwinship danwinship force-pushed the synchronized-upgrades branch from b465086 to c31566e Compare March 2, 2022 14:40
@danwinship danwinship changed the title Synchronized upgrades in DPU clusters Non-disruptive upgrades in DPU clusters Mar 2, 2022
@openshift-ci
Contributor

openshift-ci bot commented Mar 2, 2022

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
To complete the pull request process, please ask for approval from danwinship after the PR has been reviewed.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@danwinship
Contributor Author

OK, updated based on a call with Mrunal, Derek, and Scott; we no longer need any MCO/CVO changes. Instead, we just ensure that if an infra node is being drained (e.g., by the Infra MCO as part of an infra cluster upgrade), then we also drain its corresponding tenant node, so that if/when the infra node gets rebooted, its tenant will already have been drained.

Since this no longer needs any changes in other components, I've moved the enhancement from enhancements/update/ to enhancements/network/dpu/
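To make that drain-mirroring step concrete, a very rough sketch of cordoning the tenant node and evicting its non-DaemonSet pods. Error handling, retries, PDB-aware backoff, and the tenant kubeconfig path are omitted or assumed; this is not the operator's actual code:

```python
from kubernetes import client, config

def mirror_drain(tenant_node: str):
    """Cordon the tenant node and evict its non-DaemonSet pods."""
    config.load_kube_config(config_file="/etc/dpu/tenant-kubeconfig")  # assumed path
    v1 = client.CoreV1Api()
    # Cordon: mark the tenant node unschedulable.
    v1.patch_node(tenant_node, {"spec": {"unschedulable": True}})
    # Evict every non-DaemonSet pod running on the tenant node.
    pods = v1.list_pod_for_all_namespaces(
        field_selector=f"spec.nodeName={tenant_node}"
    ).items
    for pod in pods:
        owners = pod.metadata.owner_references or []
        if any(o.kind == "DaemonSet" for o in owners):
            continue
        eviction = client.V1Eviction(
            metadata=client.V1ObjectMeta(
                name=pod.metadata.name, namespace=pod.metadata.namespace
            )
        )
        v1.create_namespaced_pod_eviction(
            pod.metadata.name, pod.metadata.namespace, eviction
        )
```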

Comment on lines +173 to +175
Annoyingly, although we need this pod to run on every worker node, it
can't be deployed by a DaemonSet, because DaemonSet pods are not
subject to draining. Thus, we will have to have the DPU operator
Member

The daemonset draining is configurable I think; if we wanted to do that we could. See e.g. https://github.com/openshift/machine-config-operator/blob/62fb49f2d14519e215125faa6b817662cdb7ffd0/pkg/daemon/daemon.go#L316

That said, we'd probably need to change the MCO to opt-in specific daemonsets to being drained. This also relates to openshift/machine-config-operator#2978 I think.

Member

(And in practice, we probably do need to care about things other than the MCO doing draining, so relying on a daemonset drain is probably not viable in the near future)

Contributor Author

The choice isn't "drain DaemonSets" or "don't drain DaemonSets". It's "bail out with an error if there are DaemonSets since they can't be drained" or "don't bail out with an error if there are DaemonSets". They never get drained.


So, the "drain mirroring" controller of the DPU operator will:

- Deploy a "drain blocker" pod to each infra worker node. This pod
Member

nit: linter has some list-indent opinions:

enhancements/network/dpu/upgrades.md:183:1 MD007/ul-indent Unordered list indentation [Expected: 0; Actual: 2]
enhancements/network/dpu/upgrades.md:188:1 MD007/ul-indent Unordered list indentation [Expected: 0; Actual: 2]
...

After some discussion, we decided we didn't need that much
synchronization.

[the original proposal]: https://github.com/openshift/enhancements/commit/8033a923
Member

A definite downside of the github-markdown thing we have is that a lot of important commentary and information lives only in PR review, and that earlier revisions get lost. As far as I understand things, since that commit is not part of any branch, it will be subject to GC and potentially lost.

Maybe we could handle this a bit crudely by having explicit versions, something like sticking that content at enhancements/network/dpu/upgrades-v0.md?

Dunno.

Contributor Author

In my experience, github doesn't ever GC revisions that were commented on as part of a PR.

(Yes, there's that "I can't find those revisions anywhere" error but I think that's more about it trying to stop you from accidentally following out-of-date links.)

Member

@wking wking Mar 4, 2022

Instead of relying on GitHub's history preservation, we can just inline the relevant bits in this alternative section.

In this case "we didn't need that" is a bit fuzzy. I think the relevant bits were something like:

We realized that all we actually needed was "before I drain the Infra Node, drain its Tenant Node", and that we could achieve that fairly simply. Adding simultaneous, node-for-node matched updates was going to be extremely complicated, and didn't bring any additional value.

Comment on lines +156 to +159
Therefore, if an Infra Node is being cordoned and drained, we should
respond by cordoning and draining its Tenant Node, and not allowing
the Infra Node to be fully drained until the Tenant has also been
drained.
Member

But the MCO on the infra node is still going to drain all the other pods on the infra node except for the drain-mirror pod; wouldn't that lead to disruption? Or is it that the main workload today on the infra node is the SDN DaemonSet and hence won't be drained?

Contributor Author

DPU NIC nodes do not currently run any non-DaemonSet pods, and that's not expected to change. So in fact, the drain-blocker pod is actually the only pod that will get drained.

(I've been imprecise with terminology here... I talk about "infra nodes" and "infra workers", but there are actually three machinesets in the infra cluster: "master", "worker", and "dpu-worker", and it's specifically the "dpu-worker" nodes we're talking about here, and they don't run any general-purpose pods. They only run pods that are specifically part of providing functionality to the corresponding DPU Host Node.)
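One plausible way (the enhancement's actual mechanism is not shown in the excerpts above) for the drain-blocker pod to hold up the infra-node drain is a PodDisruptionBudget with maxUnavailable: 0 selecting that pod: eviction fails until the DPU operator relaxes the PDB after the tenant node has been drained. All names below are assumptions:

```python
from kubernetes import client, config

NAMESPACE = "openshift-dpu-operator"  # assumed namespace

def block_drain(infra_node: str):
    """Create a PDB that makes the node's drain-blocker pod un-evictable."""
    config.load_kube_config()
    pdb = client.V1PodDisruptionBudget(
        metadata=client.V1ObjectMeta(name=f"drain-blocker-{infra_node}"),
        spec=client.V1PodDisruptionBudgetSpec(
            max_unavailable=0,
            selector=client.V1LabelSelector(
                match_labels={"app": "drain-blocker", "node": infra_node}
            ),
        ),
    )
    client.PolicyV1Api().create_namespaced_pod_disruption_budget(NAMESPACE, pdb)

def unblock_drain(infra_node: str):
    """Called once the tenant node has been cordoned and drained; lets the drain finish."""
    config.load_kube_config()
    client.PolicyV1Api().delete_namespaced_pod_disruption_budget(
        f"drain-blocker-{infra_node}", NAMESPACE
    )
```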


In order to prevent MCO from rebooting an Infra node until its Tenant
has been drained, we will need to run at least one drainable pod on
each node, so that we can refuse to let that pod be drained until the
Member

Let's clarify here and say "infra node". In general, throughout the text I think we need to be very clear about which type of node we mean. Perhaps a shorthand like "inode" and "tnode"? Or "infra-node" and "tenant-node"?

drained.

Given that behavior, when an Infra Cluster upgrade occurs, if it is
necessary to reboot nodes, then each node's tenant should be drained
Contributor

Whenever we reboot a DPU, we also have to reboot the host, so that the VFs can be recreated and the VF representors can be discovered by the DPU. Without the VF representors, the ovnkube-node pod cannot start on the DPU.


- If the infra node does not currently have a "drain blocker"
pod, then deploy a new one to it.

Contributor

As I mentioned above, we need to add a step to reboot the tenant node after the DPU reboot. However, I am not quite sure how DPU can do that. One option is to add a reboot node API to MCO and let DPU invoke that.

Contributor Author

Hm... do we actually need to reboot? It seems to me like there are two possibilities here:

  1. Reboot the tenant when we reboot the infra
  2. Make the tenant notice when the infra reboots, and properly recreate its VFs and stuff when that happens

Right? We don't have the code to do 2 now, but since we don't have the code to do 1 now either, we have to write code either way. And it seems like making the tenant more able to recover from unexpected infra problems would be a good thing in general?

Contributor

OK, if we want to do 2, shall we also describe the logic in this doc?
The SR-IOV network operator needs to be involved: we need to update it to watch PF connectivity and recreate the VFs after the infra node comes back from a reboot.

@openshift-bot

Inactive enhancement proposals go stale after 28d of inactivity.

See https://github.com/openshift/enhancements#life-cycle for details.

Mark the proposal as fresh by commenting /remove-lifecycle stale.
Stale proposals rot after an additional 7d of inactivity and eventually close.
Exclude this proposal from closing by commenting /lifecycle frozen.

If this proposal is safe to close now please do so with /close.

/lifecycle stale

@openshift-ci openshift-ci bot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Apr 14, 2022
@openshift-bot

Stale enhancement proposals rot after 7d of inactivity.

See https://github.com/openshift/enhancements#life-cycle for details.

Mark the proposal as fresh by commenting /remove-lifecycle rotten.
Rotten proposals close after an additional 7d of inactivity.
Exclude this proposal from closing by commenting /lifecycle frozen.

If this proposal is safe to close now please do so with /close.

/lifecycle rotten
/remove-lifecycle stale

@openshift-ci openshift-ci bot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Apr 21, 2022
@openshift-bot

Rotten enhancement proposals close after 7d of inactivity.

See https://github.com/openshift/enhancements#life-cycle for details.

Reopen the proposal by commenting /reopen.
Mark the proposal as fresh by commenting /remove-lifecycle rotten.
Exclude this proposal from closing again by commenting /lifecycle frozen.

/close

@openshift-ci openshift-ci bot closed this Apr 28, 2022
@openshift-ci
Contributor

openshift-ci bot commented Apr 28, 2022

@openshift-bot: Closed this PR.

In response to this:

Rotten enhancement proposals close after 7d of inactivity.

See https://github.com/openshift/enhancements#life-cycle for details.

Reopen the proposal by commenting /reopen.
Mark the proposal as fresh by commenting /remove-lifecycle rotten.
Exclude this proposal from closing again by commenting /lifecycle frozen.

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@danwinship
Contributor Author

/reopen
/remove-lifecycle rotten

@openshift-ci openshift-ci bot reopened this May 3, 2022
@openshift-ci
Contributor

openshift-ci bot commented May 3, 2022

@danwinship: Reopened this PR.

In response to this:

/reopen
/remove-lifecycle rotten

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@openshift-ci openshift-ci bot removed the lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. label May 3, 2022
@openshift-ci
Contributor

openshift-ci bot commented May 3, 2022

@danwinship: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
| --- | --- | --- | --- | --- |
| ci/prow/markdownlint | c31566e | link | true | /test markdownlint |

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@openshift-bot

Inactive enhancement proposals go stale after 28d of inactivity.

See https://github.com/openshift/enhancements#life-cycle for details.

Mark the proposal as fresh by commenting /remove-lifecycle stale.
Stale proposals rot after an additional 7d of inactivity and eventually close.
Exclude this proposal from closing by commenting /lifecycle frozen.

If this proposal is safe to close now please do so with /close.

/lifecycle stale

@openshift-ci openshift-ci bot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label May 31, 2022
@openshift-bot

Stale enhancement proposals rot after 7d of inactivity.

See https://github.com/openshift/enhancements#life-cycle for details.

Mark the proposal as fresh by commenting /remove-lifecycle rotten.
Rotten proposals close after an additional 7d of inactivity.
Exclude this proposal from closing by commenting /lifecycle frozen.

If this proposal is safe to close now please do so with /close.

/lifecycle rotten
/remove-lifecycle stale

@openshift-ci openshift-ci bot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Jun 7, 2022
@openshift-bot

Rotten enhancement proposals close after 7d of inactivity.

See https://github.com/openshift/enhancements#life-cycle for details.

Reopen the proposal by commenting /reopen.
Mark the proposal as fresh by commenting /remove-lifecycle rotten.
Exclude this proposal from closing again by commenting /lifecycle frozen.

/close

@openshift-ci openshift-ci bot closed this Jun 14, 2022
@openshift-ci
Contributor

openshift-ci bot commented Jun 14, 2022

@openshift-bot: Closed this PR.

In response to this:

Rotten enhancement proposals close after 7d of inactivity.

See https://github.com/openshift/enhancements#life-cycle for details.

Reopen the proposal by commenting /reopen.
Mark the proposal as fresh by commenting /remove-lifecycle rotten.
Exclude this proposal from closing again by commenting /lifecycle frozen.

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@tsorya

tsorya commented Nov 10, 2022

@danwinship can we reopen the enhancement?
