Non-disruptive upgrades in DPU clusters #1005
Conversation
> authors:
>   - "@danwinship"
> reviewers:
>   - TBD
@sdodson and someone from the OTA team like @LalatenduMohanty or @wking should be included as reviewers
> ## Alternatives
>
> The fundamental problem is that rebooting the DPU causes a network
> outage on the tenant.
If the tenant cluster is not running any workloads, does that network outage cause any real issues? Could this problem be addressed by an outside orchestration tool cordoning and draining both the tenant and DPU clusters, upgrading the tenant cluster, then upgrading the DPU cluster, then removing the cordon from both?

---

The tenant cluster is the x86 hosts; it's the infra cluster (the NICs) that doesn't run user workloads.
(Probably I should copy some of the glossary in from the DPU overview doc rather than just referencing it...)
> Could this problem be addressed by an outside orchestration tool cordoning and draining both the tenant and DPU clusters
That would be an even larger outage... this doesn't work for the same reason we can't do normal OCP upgrades by first cordoning and draining the entire cluster. (The tenant cluster is basically a totally ordinary OCP cluster that people are doing totally ordinary OCP things on, except for the fact that most/all of the worker nodes in the tenant cluster have NICs that also run Linux and need to be managed.)
> they were just "bare" RHCOS hosts, or they were each running
> Single-Node OpenShift), then it might be simpler to synchronize the
> DPU upgrades with the tenant upgrades, because then each tenant could
> coordinate the actions of its own DPU by itself.
Are the DPUs linked together to form a multi-node cluster across DPUs in a host? Or across multiple hosts?

---

Multiple hosts. Every worker in the tenant cluster has 0 or 1 DPUs (and if they have a DPU, it's their primary network interface). All of the DPUs together (plus, currently, 3 dedicated ARM masters) form the infra cluster. (Eventually HyperShift will get rid of the need for the ARM masters.)
> ### Open Questions
>
> Basically everything...
There is an effort around a central upgrade service happening for Telco. I assume we can reuse some bits from there.

---

got any pointers to that?
> - The version to upgrade to must be available to both clusters
>   (ie, it must be available for both x86 and ARM).
>
> - This could be implemented via some sort of "dpu-cluster-upgrade"
I would not have communications between the clusters, given that the goal of the separation is security. Why not have an external upgrade operator (running on the hub) that can correlate and orchestrate upgrade flows between the clusters?

---

The goal is to provide central management, not a standalone system that orchestrates itself.

---

What "hub"?
You can't completely isolate the clusters, since that would imply, eg, that the infra cluster wouldn't know anything about what pods are running in the tenant cluster, and so it wouldn't be able to route pod traffic correctly.
The goal is to make sure that in places where the clusters do need to communicate, that the infra cluster is in control, not the tenant cluster.
Eventually we can move everything to HyperShift, and maybe in that case the upgrade could be managed externally to the cluster. It should be easy to write the code so that for now the infra dpu-cluster-upgrade operator and infra MCO run in "leader" mode and the tenant dpu-cluster-upgrade operator/MCO run in "follower" mode, but in the future, they could both run in follower mode, with the leader being external.
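To make the leader/follower idea above concrete, here is a purely hypothetical sketch of what a "dpu-cluster-upgrade" CR might look like. Neither the CRD, the API group, nor any field name here exists today; this is only an illustration of the mode switch described in the comment:

```yaml
# Hypothetical sketch only -- no such CRD exists today.
apiVersion: dpu.openshift.io/v1alpha1   # invented API group/version
kind: DPUClusterUpgrade
metadata:
  name: cluster
spec:
  # "Leader" drives the coordinated upgrade; "Follower" waits to be told.
  # Today the infra cluster would run as Leader and the tenant cluster
  # as Follower; under HyperShift both could run as Follower, with the
  # leader living outside the cluster.
  mode: Leader
  desiredVersion: "4.11.5"   # illustrative version string
```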
> NICs), the x86 hosts form one OCP cluster, and the DPU ARM systems
> form a second OCP cluster. This makes upgrades to new OCP releases
> complicated, because there is currently no way to synchronize upgrades
> between the two clusters, but rebooting the BF-2 systems as part of
nit: overview.md
link was helpful, but can we stick to terms from the glossary (like "DPU NIC") instead of using "BF-2" (a specific subset of possible DPU NICs)?
> ### User Stories
>
> As the administrator of a cluster using DPUs, I want to be able to do
> z-stream upgrades without causing unnecessary network outages.
Do minor version upgrades not benefit from this effort as well? If so what's unique about them that requires they be disruptive?

---

No, this applies to y-stream upgrades too. It's just that you might be able to convince people that y-stream upgrades are allowed to have a little bit of disruption since they're rare, but you can't make that argument for z-stream upgrades.
> the MCO upgrade will cause a network outage on the x86 systems. In
> order for upgrades to work smoothly, we need to synchronize the
> reboots between the two clusters, so that the BF-2 systems are only
> rebooted when their corresponding x86 hosts have been cordoned and
> drained.
Here we're talking about synchronizing the reboots. Does anything else need to be synchronized? Could everything on the infra cluster including SDN/OVN pods be updated ahead of host cluster SDN/OVN pods or are there other couplings to be concerned with?

---

Currently (in dev preview) the only coupling between the two clusters is with ovn-kubernetes. This enhancement may result in other components becoming coupled too (MCO, "dpu-cluster-upgrade" operator, etc).
I'm just assuming there's no way we can avoid there ever being skewed versions between the clusters, and hence we will just have to cope with that (eg, any API changes between them will have to be eased in over multiple releases).
So yes, everything on the infra cluster could be updated ahead of OVN on the host cluster.
> - The Infra MCO will then upgrade the infra node (causing it to
>   reboot and temporarily break network connectivity to the tenant
>   node).
>
> - Once the infra node upgrade completes, the Tenant MCO will
>   reboot and upgrade the tenant node.
Does rebooting the tenant node implicitly restart the infra node or does the infra node remain online during a tenant node reboot? Does this vary by either infra node or tenant node hardware or even configuration such as BIOS settings?

---

In our lab, rebooting the tenant does not restart the NIC, and AFAIK this does not vary by hardware. (ie, ACPI/EFI say that on reboot you send a PCI reset but you don't cut power to the bus.) IIRC an earlier (prerelease) version of the BF-2 did reboot when the host rebooted (I guess because it was designed to reboot on PCI reset) but the GA ones do not. [EDIT: It appears that this was just user error, and the BF-2s never automatically rebooted.]
If there are situations where the NIC reboots on host reboot (including the possibility of other non-BF-2 DPUs in the future that behave that way), then it would only be a small change to the procedure here; instead of draining both nodes, rebooting the NIC, and then rebooting the host, we would drain both nodes, then reboot the host without explicitly rebooting the NIC.
> - As part of the DPU security model, the tenant cluster cannot have
>   any power over the infra cluster. (In particular, it can't be
>   possible for an administrator in the tenant cluster to force the
>   infra cluster to upgrade/downgrade to any particular version.) Thus,
>   the upgrade must be initiated on the infra cluster side, and the
>   infra side will tell the tenant cluster to upgrade as well.
>   (Alternatively, the upgrade must be simultaneous-ish-ly initiated in
>   both clusters, if we don't want the infra cluster to have to have a
>   credential that lets it initiate an upgrade in the tenant cluster.)
However, tenant node state inhibiting updates of infra node is not a violation of the security model?

---

A good point. We should think about alerts in the case where the tenant node is thwarting the infra node.
(I'm kind of assuming that any human who has cluster-admin in the infra cluster will also have cluster-admin in the tenant cluster, and so the same person would kick off the upgrade in both clusters. In the future there may be reasonable use cases where that isn't true...)
> - One way to do this would be to have a CRD with an array of hosts,
>   indicating the ordering, and the current status of each host, and
>   the two MCOs could update and watch for updates to monitor each
>   other's progress.
How do the infra and tenant nodes know which correspond to one another?

---

I forget, but it's already implemented. (They need that info to make OVN offload function.) I think currently the tenant nodes annotate their Node with the MAC of the DPU or something, and then the infra nodes can match themselves up. In the future when we have a more synchronized install system as well, then we can just figure it all out at install time.
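For illustration, the pairing mechanism described above (a tenant node advertising its DPU's MAC on its Node object) might look something like this; the annotation key here is invented, not the one the dpu-operator actually uses:

```yaml
# Illustrative only: the real annotation key is not specified in this
# thread; "dpu.openshift.io/dpu-mac" is made up.
apiVersion: v1
kind: Node
metadata:
  name: tenant-worker-0
  annotations:
    dpu.openshift.io/dpu-mac: "0c:42:a1:00:00:01"  # MAC of this host's DPU
```

Infra nodes could then match themselves up by comparing that MAC against their own interfaces.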
> However, without some form of synchronization, it is impossible to
> have non-disruptive tenant cluster upgrades.
>
> ## Alternatives
If the infra node workload only services the local tenant node then the infra cluster could update including staging machineconfig and RHCOS changes. Once that completes the tenant cluster updates and the MCO reboots tenant nodes which reboots the infra node into the updated RHCOS?
If an unexpected reboot seems like a concern perhaps the MCO on the infra node could be made to only apply the new desired configuration on a soft restart assuming rebooting the tenant triggers a soft reboot of the infra node?

---

So as above, rebooting the tenant doesn't reboot the infra node, although I suspect there's probably some way we can detect when the host reboots.
So I guess with what you're suggesting, the only coordination that would be needed would be that you have to run the upgrade to almost-completion in the infra cluster, and then you start the upgrade in the tenant cluster, and as the tenant cluster upgrade completes it will also cause the completion of the infra cluster upgrade. Right? There'd still need to be (small) MCO changes to prevent the automatic reboot in the infra cluster.

---

Yeah that's what I was thinking.

---

It looks like it might not be possible for the BF-2 to directly detect a host reboot, at least when the card is in "isolated" mode.
But the infra node knows what pod-network pods are running on the tenant node, so it can infer that the tenant has been drained when (most of?) the pod-network pods go away. (If necessary, the dpu-operator could confirm this by looking at the Node object in the tenant cluster.) So we could say, if we have a reboot pending, and the tenant gets drained, then reboot. Even if we guessed wrong and the drain wasn't actually because of the MCO upgrade on the tenant, we still would at least avoid most of the disruption.
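The inference described here can be sketched as a small decision function. The data shapes, names, and threshold below are illustrative assumptions, not the dpu-operator's actual logic:

```python
# Hedged sketch: infer that the tenant node has been drained by watching
# how many pod-network (non-host-network) pods remain on it, and only
# reboot the infra node when a reboot is pending AND the tenant looks
# drained. All names and the threshold are invented for illustration.

def tenant_drained(pods, threshold=0):
    """pods: list of dicts like {"name": str, "host_network": bool}.

    The infra node knows about the tenant's pod-network pods (it routes
    their traffic), so when (nearly) all of them disappear we infer that
    the tenant node has been drained.
    """
    pod_network = [p for p in pods if not p.get("host_network", False)]
    return len(pod_network) <= threshold


def should_reboot(reboot_pending, pods):
    # Only reboot the infra node if an upgrade reboot is queued AND the
    # tenant appears drained; otherwise keep waiting. Even if the drain
    # wasn't actually the MCO upgrade, rebooting now avoids most of the
    # disruption.
    return reboot_pending and tenant_drained(pods)
```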

---

Are there any files shared between the two? The tenant node could write a file that the infra node watches/polls to detect when it is ready to reboot.

---

Reading a bit more, it doesn't look like there is.

---

The two systems don't share a kernel (or a common CPU architecture even), so they can't share files.
> - The Infra MCO will cordon and drain that host's infra node, and
>   the Tenant MCO will cordon and drain that host's tenant node.
>   (This can happen in parallel.)
This may impose another synchronization at the application-pod level, over which destination nodes the drained pods (infra and host) are placed on: the DPU pods need to be rescheduled to the destination DPU node whose x86 host the application pods will be rescheduled to.
> - An upgrade should not be able to start unless both clusters are able
>   to upgrade.
>
> - In particular:
Do we want to mention version compatibility for ovn-k8s or rhcos (driver) during upgrade or is it too detailed to be mentioned in this enhancement?
> ## Design Details
>
> ### Open Questions
Will this apply to HyperShift as well?
Updated to reflect various comments. In particular, it now uses a less-synchronized synchronization, as suggested by Scott.
/cc @cgwalters @kikisdeliveryservice @sinnykumari @yuqi-zhang Need some MCO input on this. Someone also pointed me to https://issues.redhat.com/browse/MCO-206 which seems like it could end up being useful here...

---

I have a lot of questions on this (and I'm sure other team members also will once they read the proposal) since I have no background on this feature outside of the enhancement.
I really think there needs to be a conversation with the MCO as this proposes major changes to fundamental behaviors and needs some scrutiny, discussion and planning (outside of discussion on the actual enhancement PR).
cc: @craychee
> So, both clusters' upgrades can proceed normally up until the MCO
> upgrade.
>
> In the infra cluster, rather than cordoning, draining, and rebooting
I don't quite understand what this paragraph intends. It doesn't seem like having a bunch of nodes half-updated makes sense (which is what waiting only for the reboot would seem to imply), since an upgrade isn't just applying a new osImageURL (and some upgrades don't have a new osImageURL). Couldn't we pause the pool, something more in line with the current MCO?

---

I think the phrase "queue up the new image" is assuming we have openshift/machine-config-operator#1190 but unfortunately we don't today. So I'd agree we need to do a per pool/node pause approach.
> ### Risks and Mitigations
>
> The infra cluster could remain stuck at the pre-reboot stage of the
The pre-reboot stage means that the upgrade has been performed except for the final reboot; this doesn't seem like a state we want all nodes to stay in at all? Also what happens if the update is bad? All nodes would be scheduled to make the update as opposed to a rolling mechanism that we have now.

---

> this doesn't seem like a state we want all nodes to stay in at all?
Yeah, note that this is under "Risks"

---

> Also what happens if the update is bad? All nodes would be scheduled to make the update as opposed to a rolling mechanism that we have now.
Yeah... so if you look at the first draft of the enhancement, I had suggested a more complicated system, where the MCOs of the two clusters would agree on a particular order to upgrade nodes in, and then each individual node upgrade would be synchronized. Scott suggested that we didn't actually need that much synchronization and I rewrote the enhancement to the current form. But maybe that won't work...
Also cc: @aravindhp @mrunalp
So, abstractly, there is a "tenant" cluster, which is more-or-less a normal OCP cluster, except that its worker nodes aren't connected directly to the network; instead, each one is stuck behind its own personal firewall that monitors and filters its network access:
We need some way to deal with installing and managing and upgrading all of these firewall hosts, and it turns out that OCP is a good way to do that... so, we create a second OCP cluster (the "infra" cluster), and then each firewall host is actually a worker node in that cluster:
So the problem (for purposes of this enhancement) is that when we want to do an upgrade of the infra cluster, we (generally) have to reboot each of the infra workers. But when an infra worker reboots, its corresponding tenant loses network connectivity until it comes back. So the goal is: only reboot infra workers when their tenant nodes are also rebooting. (The infra nodes do not run user workloads, they only run the stuff we deploy on them, so we know it's fine to just randomly reboot them whenever we want to, as long as their tenant is idle.) (Of course, the infra workers are not actually physically separate hosts; they're ARM systems running on the NICs of the tenant workers. But that doesn't really matter for this enhancement; the important part is just that we have to be careful when we reboot them.)
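A rough Mermaid sketch of the topology described above (node names and counts are illustrative; the original diagrams were not preserved in this transcript):

```mermaid
flowchart LR
  subgraph tenant [Tenant cluster - x86 hosts]
    T1[tenant worker 1]
    T2[tenant worker 2]
  end
  subgraph infra [Infra cluster - DPU NICs]
    I1[infra worker 1]
    I2[infra worker 2]
  end
  T1 -->|all traffic| I1
  T2 -->|all traffic| I2
  I1 --> NET[(network)]
  I2 --> NET
```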
Including some version of that explanation & the diagrams in the doc would be good. You could do them as ASCII art, or experiment with GitHub's new support for Mermaid (https://github.blog/2022-02-14-include-diagrams-markdown-files-mermaid/).
Is there a way we could model this around https://github.com/kubernetes/enhancements/tree/master/keps/sig-node/589-efficient-node-heartbeats? Could the infra cluster take a lease on the tenant node, with the MCO not rebooting a node that has an active lease? Other related art around node maintenance holds:
Just recording input from our discussion:
Basically, let's not assume a shared maintenance window for the MVP.
> Thus, the infra cluster upgrade must be initiated on the infra cluster
> side. In theory, we could have the infra cluster then initiate the
> tenant cluster upgrade, but that would require having a tenant cluster
> cluster-admin credential lying around in the infra cluster
> somewhat unnecessarily. It probably makes more sense to require the
> administrator to initiate the upgrade on the tenant side as well.
What credentials does the dpu-operator have? I'm guessing it's at least somewhat privileged today? Glancing at e.g. https://github.com/openshift/dpu-network-operator/blob/7750537749e734dfb6edfd25fc3e674398eef548/config/rbac/role.yaml - that's equivalent to cluster admin (although maybe only on the infra cluster).
ISTM it'd be a better user experience if the admin only needs to initiate upgrades in one place, and we can certainly design a system to do that securely.
If we don't require the admin to do this manually by default (or script it) then we don't need to be debugging failure cases where they didn't do it.
> - Each infra node only runs pods to support the pods on its
>   corresponding tenant node. Thus, if a tenant node reboots, it is
>   safe to immediately reboot its infra node as well without needing
>   to cordon and drain it, because it is guaranteed that none of its
>   pods are doing anything important at that point.
OK...I have a potentially dumb question - when the tenant reboots, the infra node can (by default) just continue running? I would have thought that there'd be some platform initialization (e.g. of the PCI bus) that might break that, but is it really more that the infra node maintains electrical power and just keeps running?

---

Sorry, I am now more carefully reading earlier conversation and this was already answered at #1005 (comment)
> On the infra nodes, when they are "waiting for a reboot", they will
> monitor the state of the ovn-kubernetes network; when they determine
> that the tenant node is rebooting, the infra node will then also
> reboot.
Is this reboot also a drain? Or can we assume there are no non-daemonset workloads on the infra nodes, and we can just kill everything and reboot?
> the corresponding tenant node rebooted. But this (probably) turns out
> to be unnecessary, as it should always be safe to just reboot the
> infra nodes when their tenant nodes reboot, without needing to have
> "planned" the reboot.
OK this is answering my above question - infra nodes don't need to drain (or at least, the drain should be predictable and fast).
But it also seems like it's heading long term to conflict with the above goal of doing more on the DPUs.
[APPROVALNOTIFIER] This PR is NOT APPROVED.
OK, updated based on call with Mrunal, Derek, and Scott; now we no longer need any MCO/CVO changes. Instead, we just ensure that if an infra node is being drained (eg, by the Infra MCO as part of an infra cluster upgrade), then we also drain its corresponding tenant node, so that if/when the infra node gets rebooted, we will have drained its tenant first. Since this no longer needs any changes in other components, I've moved the enhancement from …
> Annoyingly, although we need this pod to run on every worker node, it
> can't be deployed by a DaemonSet, because DaemonSet pods are not
> subject to draining. Thus, we will have to have the DPU operator
The daemonset draining is configurable I think; if we wanted to do that we could. See e.g. https://github.com/openshift/machine-config-operator/blob/62fb49f2d14519e215125faa6b817662cdb7ffd0/pkg/daemon/daemon.go#L316
That said, we'd probably need to change the MCO to opt-in specific daemonsets to being drained. This also relates to openshift/machine-config-operator#2978 I think.

---

(And in practice, we probably do need to care about things other than the MCO doing draining, so relying on a daemonset drain is probably not viable in the near future)

---

The choice isn't "drain DaemonSets" or "don't drain DaemonSets". It's "bail out with an error if there are DaemonSets since they can't be drained" or "don't bail out with an error if there are DaemonSets". They never get drained.
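To make that distinction concrete, here is a toy model of the drain behavior being described: DaemonSet-owned pods are never evicted, and the only choice is whether their presence is an error or is ignored. This is an illustration of the semantics, not kubectl's or the MCO's actual code, and all names are invented:

```python
# Toy model of drain semantics: non-DaemonSet pods are evicted;
# DaemonSet pods either block the drain with an error or are skipped.

def pods_to_evict(pods, ignore_daemonsets=False):
    """pods: list of dicts like {"name": str, "owner_kind": str}."""
    evict, blocked = [], []
    for p in pods:
        if p.get("owner_kind") == "DaemonSet":
            if not ignore_daemonsets:
                blocked.append(p["name"])  # drain bails out with an error
            # with ignore_daemonsets the pod is simply skipped, never evicted
        else:
            evict.append(p["name"])
    if blocked:
        raise RuntimeError(f"cannot drain: DaemonSet pods present: {blocked}")
    return evict
```

This is why the "drain blocker" pod must not be DaemonSet-managed: only the non-DaemonSet branch above actually gets evicted, so only a non-DaemonSet pod can hold up a drain.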
> So, the "drain mirroring" controller of the DPU operator will:
>
> - Deploy a "drain blocker" pod to each infra worker node. This pod
nit: linter has some list-indent opinions:
enhancements/network/dpu/upgrades.md:183:1 MD007/ul-indent Unordered list indentation [Expected: 0; Actual: 2]
enhancements/network/dpu/upgrades.md:188:1 MD007/ul-indent Unordered list indentation [Expected: 0; Actual: 2]
...
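One plausible way to implement the "drain blocker" pod (an assumption on my part, not something the thread specifies) is to pair a pinned, non-DaemonSet pod with a PodDisruptionBudget that permits zero disruptions; the mirroring controller would relax the PDB only once the tenant node has been drained. Every name below is invented:

```yaml
# Assumed mechanism, names invented. With maxUnavailable: 0, the eviction
# API refuses to evict the matching pod, so the drain keeps retrying
# until the controller relaxes or deletes the PDB.
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: drain-blocker-worker-0
  namespace: openshift-dpu-operator
spec:
  maxUnavailable: 0
  selector:
    matchLabels:
      app: drain-blocker
      dpu.openshift.io/infra-node: worker-0
---
# In practice this pod would be created/owned by the DPU operator rather
# than applied by hand; it exists only so there is something to drain.
apiVersion: v1
kind: Pod
metadata:
  name: drain-blocker-worker-0
  namespace: openshift-dpu-operator
  labels:
    app: drain-blocker
    dpu.openshift.io/infra-node: worker-0
spec:
  nodeName: worker-0   # pinned to one infra node; not DaemonSet-managed
  containers:
  - name: pause
    image: registry.k8s.io/pause:3.9
```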
> After some discussion, we decided we didn't need that much
> synchronization.
>
> [the original proposal]: https://github.com/openshift/enhancements/commit/8033a923
A definite downside of the github-markdown thing we have is that a lot of important commentary and information lives only in PR review, and that earlier revisions get lost. As far as I understand things, since that commit is not part of any branch, it will be subject to GC and potentially lost.
Maybe we could handle this a bit crudely by having explicit versions, something like sticking that content at enhancements/network/dpu/upgrades-v0.md
?
Dunno.

---

In my experience, github doesn't ever GC revisions that were commented on as part of a PR.
(Yes, there's that "I can't find those revisions anywhere" error but I think that's more about it trying to stop you from accidentally following out-of-date links.)

---

Instead of relying on GitHub's history preservation, we can just inline the relevant bits in this alternative section.
In this case "we didn't need that" is a bit fuzzy. I think the relevant bits were something like:
We realized that all we actually needed was "before I drain the Infra Node, drain its Tenant Node", and that we could achieve that fairly simply. Adding simultaneous, node-for-node matched updates was going to be extremely complicated, and didn't bring any additional value.
> Therefore, if an Infra Node is being cordoned and drained, we should
> respond by cordoning and draining its Tenant Node, and not allowing
> the Infra Node to be fully drained until the Tenant has also been
> drained.
But the MCO on the infra node is still going to drain all the other pods on the infra node except for the drain mirror pod - wouldn't that lead to disruption? Or is it that the main workload today on the infra pod is SDN daemonset and hence won't be drained?

---

DPU NIC nodes do not currently run any non-DaemonSet pods, and that's not expected to change. So the drain-blocker pod is in fact the only pod that will get drained.
(I've been imprecise with terminology here... I talk about "infra nodes" and "infra workers", but there are actually three machinesets in the infra cluster: "master", "worker", and "dpu-worker", and it's specifically the "dpu-worker" nodes we're talking about here, and they don't run any general-purpose pods. They only run pods that are specifically part of providing functionality to the corresponding DPU Host Node.)
> In order to prevent MCO from rebooting an Infra node until its Tenant
> has been drained, we will need to run at least one drainable pod on
> each node, so that we can refuse to let that pod be drained until the
Let's clarify here and say "infra node"? In general throughout the text I think we need to be very clear for type of node. Perhaps a shorthand "inode" and "tnode"? Or "infra-node" and "tenant-node"?
> drained.
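One well-known way to make a pod "refuse" to be drained is to cover it with a PodDisruptionBudget that permits no disruptions; `kubectl drain` and the MCO use the eviction API, which respects PDBs. The manifests below are only a sketch of that idea; the names, namespace, and image are illustrative, not the actual design:

```yaml
# Sketch only: a per-infra-node "drain blocker" pod protected by a PDB.
# With maxUnavailable: 0, any eviction attempt against this pod is
# refused until the operator relaxes the budget or deletes the pod.
apiVersion: v1
kind: Pod
metadata:
  name: drain-blocker
  namespace: dpu-operator        # hypothetical namespace
  labels:
    app: drain-blocker
spec:
  nodeName: dpu-worker-1         # one blocker pod per infra node
  containers:
  - name: pause
    image: registry.k8s.io/pause:3.9
---
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: drain-blocker
  namespace: dpu-operator
spec:
  maxUnavailable: 0
  selector:
    matchLabels:
      app: drain-blocker
```

How the blocker is actually released (PDB update vs. pod deletion, and who decides the tenant is drained) is exactly the coordination logic the enhancement has to define.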
> Given that behavior, when an Infra Cluster upgrade occurs, if it is
> necessary to reboot nodes, then each node's tenant should be drained
Whenever we reboot a DPU, we also have to reboot the host, so that VFs can be recreated and VF representors can be discovered by the DPU. Without the VF reps, the ovnkube-node pod cannot start on the DPU.
> - If the infra node does not currently have a "drain blocker"
>   pod, then deploy a new one to it.
As I mentioned above, we need to add a step to reboot the tenant node after the DPU reboot. However, I am not quite sure how the DPU can do that. One option is to add a reboot-node API to MCO and let the DPU invoke it.
Hm... do we actually need to reboot? It seems to me like there are two possibilities here:
- Reboot the tenant when we reboot the infra
- Make the tenant notice when the infra reboots, and properly recreate its VFs and stuff when that happens
Right? We don't have the code to do 2 now, but since we don't have the code to do 1 now either, we have to write code either way. And it seems like making the tenant better able to recover from unexpected infra problems would be a good thing in general?
Ok, if we want to do 2. Shall we also describe the logic in this doc?
The SRIOV network operator needs to be involved: we need to update it to watch PF connectivity and recreate the VFs after the infra node comes back from a reboot.
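The core of "watch PF connectivity, then recreate the VFs" could look roughly like the sketch below. The sysfs files (`operstate`, `device/sriov_numvfs`) are the standard Linux interfaces, but the function names are hypothetical and the real logic would live in the SR-IOV network operator; the directory is parameterized here only so the sketch is testable outside a real node:

```python
import os

def pf_has_link(pf_netdev_dir):
    """Sketch of the 'watch PF connectivity' side: report whether the
    PF's operational state is 'up' (normally /sys/class/net/<pf>)."""
    with open(os.path.join(pf_netdev_dir, "operstate")) as f:
        return f.read().strip() == "up"

def recreate_vfs(pf_netdev_dir, num_vfs):
    """Sketch: recreate SR-IOV VFs on a tenant-host PF after the DPU
    comes back from a reboot. The kernel requires writing 0 to
    sriov_numvfs before a new nonzero value is accepted."""
    path = os.path.join(pf_netdev_dir, "device", "sriov_numvfs")
    with open(path, "w") as f:
        f.write("0")
    with open(path, "w") as f:
        f.write(str(num_vfs))
```

The operator would poll (or subscribe to netlink events for) the PF link state, and invoke the recreate step once the link returns after a DPU reboot.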
Inactive enhancement proposals go stale after 28d of inactivity. See https://github.com/openshift/enhancements#life-cycle for details. Mark the proposal as fresh by commenting `/remove-lifecycle stale`. If this proposal is safe to close now please do so with `/close`. /lifecycle stale
Stale enhancement proposals rot after 7d of inactivity. See https://github.com/openshift/enhancements#life-cycle for details. Mark the proposal as fresh by commenting `/remove-lifecycle rotten`. If this proposal is safe to close now please do so with `/close`. /lifecycle rotten
Rotten enhancement proposals close after 7d of inactivity. See https://github.com/openshift/enhancements#life-cycle for details. Reopen the proposal by commenting `/reopen`. /close
@openshift-bot: Closed this PR. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
/reopen |
@danwinship: Reopened this PR. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
@danwinship: The following test failed, say `/retest` to rerun all failed tests or `/retest-required` to rerun all mandatory failed tests:
Full PR test history. Your PR dashboard. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.
Inactive enhancement proposals go stale after 28d of inactivity. See https://github.com/openshift/enhancements#life-cycle for details. Mark the proposal as fresh by commenting `/remove-lifecycle stale`. If this proposal is safe to close now please do so with `/close`. /lifecycle stale
Stale enhancement proposals rot after 7d of inactivity. See https://github.com/openshift/enhancements#life-cycle for details. Mark the proposal as fresh by commenting `/remove-lifecycle rotten`. If this proposal is safe to close now please do so with `/close`. /lifecycle rotten
Rotten enhancement proposals close after 7d of inactivity. See https://github.com/openshift/enhancements#life-cycle for details. Reopen the proposal by commenting `/reopen`. /close
@openshift-bot: Closed this PR. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
@danwinship can we reopen this enhancement?
In a cluster with DPUs (eg, BlueField-2 NICs), the x86 hosts form one OCP cluster, and the DPU ARM systems form a second OCP cluster. This makes upgrades to new OCP releases complicated, because there is currently no way to synchronize upgrades between the two clusters, but rebooting the BF-2 systems as part of the MCO upgrade will cause a network outage on the x86 systems. In order for upgrades to work smoothly, we need some way to
~~synchronize the reboots between the two clusters, so that the BF-2 systems are only rebooted when their corresponding x86 hosts have been cordoned and drained~~ ensure that tenant nodes do not get rebooted when they are running user workloads.

cc @zshi @pliurh @Billy99