IP and Interface Selection #1179
Conversation
There are a few sections still to be filled out and I think there are some more details to be worked out, but I wanted to get this posted publicly to give the other interested parties a chance to see what direction it's headed.

> ### User Stories
>
> As a network administrator, I want cluster traffic to be on one NIC and the
- As an example BZ where users might benefit from the feature: https://bugzilla.redhat.com/show_bug.cgi?id=2064131
- We have customers with multiple NICs attached to their servers and they want control over which NIC to use.
- It seems that this is a very common and important need for SNO deployments.
- I wouldn't be surprised if this was important for OpenStack deployments with manila (= 2 interfaces).
@tsorya @rbbratta might help fill in the user story?
Also: https://coreos.slack.com/archives/CDCP2LA9L/p1652189014050609
In an SNO deployment with 2 or more NICs, we want to be sure that configure-ovs and nodeip-configuration choose the same IP, the one the user asked for; if they don't, installation fails because the router is not reachable.
There was a customer request for dual-stack to separate IPv4 and IPv6 on different NICs, so we need to be aware of dual-stack issues.
Another BZ, this one about adding NICs post-install: "Adding multiple IPs from vSphere to OCP nodes changes the node's host IP and internal IP" - BZ 2070224
Hmm, separating dual-stack IPs may be out of scope for this iteration, but we should keep it in mind. That's probably another argument in favor of having this work based solely on IP and not having the interface file at all.
> In some cases (local DNS) it may not matter if the IP selected is consistent
> with the other cases, but Node IP, configure-ovs, and Keepalived all need to
> match because their functionality depends on it. I'm less familiar with the
> requirements for Etcd, but it seems likely that should match as well. In
@rbbratta had run into issues with etcd when the master nodes selected the wrong interface, so etcd is definitely something to consider.
I tried to track it down once, but I'm not an etcd expert.
https://bugzilla.redhat.com/show_bug.cgi?id=2057160#c24
cluster-etcd-operator uses `GetPreferredInternalIPAddressForNodeName`, which iterates over `node.Status.Addresses`:
https://github.com/openshift/cluster-etcd-operator/blob/master/pkg/dnshelpers/util.go#L40
```go
for _, currAddress := range node.Status.Addresses {
	if currAddress.Type == corev1.NodeInternalIP {
		switch ipFamily {
		case "tcp4":
			// ... (excerpt truncated)
```

`node.Status.Addresses` is not sorted?
It looks like kubelet's `nodeAddressesFunc` provides `node.Status.Addresses`:
```go
// if cloud is not nil, we expect the cloud resource sync manager to exist
var nodeAddressesFunc func() ([]v1.NodeAddress, error)
if kl.cloud != nil {
	nodeAddressesFunc = kl.cloudResourceSyncManager.NodeAddresses
}
```
Looks like etcd expects stable IPs for certs.
Even then, if some component is hardcoded to connect to etcd on IP X but etcd has certificates for IP Y, that component will always expect certificates for IP X and will reject the etcd IP Y certificates, no matter how they communicate from a networking standpoint.
Unless you reconfigure that component to talk to etcd on IP Y, or reconfigure etcd to present certificates for IP X, they won't talk to each other, regardless of how the packets are routed.
In other words, it's an application problem, not really a networking one: the component chose one interface to find etcd on, etcd chose another interface to generate certs for, and there's a mismatch.
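As a purely illustrative way to see that mismatch on a node, one could inspect which IPs the etcd serving certificate actually lists; this is a hedged sketch (the placeholder address 192.0.2.10 and the assumption that the cert is retrievable this way are not taken from the thread):

```bash
# Hedged sketch: dump the Subject Alternative Names of the cert presented on
# the standard etcd client port (2379) and compare them to the IP you expected.
echo | openssl s_client -connect 192.0.2.10:2379 2>/dev/null \
  | openssl x509 -noout -text \
  | grep -A1 "Subject Alternative Name"
```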
> - /run/nodeip-configuration/ipv6
> - /run/nodeip-configuration/interface
> - When configure-ovs runs, it looks for the interface file written by
>   nodeip-configuration. If found, br-ex will be created on that interface. If
Question: Wouldn't configure-ovs be able to rely on the fact that nodeip-configuration always chooses an interface? That way, configure-ovs also doesn't have to deal with things such as stable interface selection... all of that work would always be done by nodeip-configuration. So the "if not" case should never happen?
Ok but configure-ovs still needs to be able to select an interface because of IPI deployments, correct?
After reboot, nodeip-configuration will run and choose br-ex as the interface, but configure-ovs cleans up bridges before configuration. In that case we should either not update the interface file and leave the old value, or change the configure-ovs logic.
nodeip-configuration isn't enabled on all platforms, specifically any non-on-prem IPI ones (e.g. AWS IPI, GCP IPI). The service is only enabled for the None platform: https://github.com/openshift/machine-config-operator/blob/a2bfe8b66bf0f26d6f47f75c9e6d186f388db5f8/templates/common/_base/units/nodeip-configuration.service.yaml#L2
There's a separate service for on-prem that overwrites that one so you also get it for baremetal, vsphere, etc. The on-prem version doesn't allow overriding of the IP selection logic anyway because it's required to match the VIP, but that's okay. It's still worthwhile to make sure every other service uses the same address.
We could, of course, enable nodeip-configuration for all platforms, but I'm hesitant to mess with the platforms where things are working fine right now. If there's consensus that that is the right way to go and we are reasonably confident it won't cause problems I'm good with that option too though.
As a first drop, I think we can use this logic only in the case where node-ip-hint was set. This will unblock SNO and will give us some time to get used to this new logic without breaking anything else. WDYT?
On the other hand, in a regular BM installation we will have the same problem: the kubelet IP will be set according to the VIP, but the OVS script will select something else.
I think I prefer keeping the logic in configure-ovs simpler over limiting the impact of this change. I also probably need to re-title this since, as you noted, IPI is affected even if it can't use KUBELET_NODEIP_HINT.
> **Note:** *Section not required until targeted at a release.*
>
> Consider the following in developing a test plan for this enhancement:
> - Will there be e2e and integration tests, in addition to unit tests?
We need tests for both configure-ovs as well as for this entire mechanism, ideally covering some nasty corner cases. Right now, due to the absence of tests, we are constantly at risk of introducing further issues while trying to fix something.
> ### User Stories
>
> As a network administrator, I want cluster traffic to be on one NIC and the
> default route on a different NIC so external traffic is segregated from
s/default route/default route(s)/
There can be more than one default route, per IP stack, or also multi-homed with a metric.
https://bugzilla.redhat.com/show_bug.cgi?id=2057160
```
Feb 22 18:33:12 master-0 configure-ovs.sh[2775]: default via 192.168.221.1 dev enp3s0 proto dhcp metric 101
Feb 22 18:33:12 master-0 configure-ovs.sh[2775]: default via 192.168.222.1 dev enp4s0 proto dhcp metric 102
Feb 22 18:33:12 master-0 configure-ovs.sh[2775]: default via 192.168.111.1 dev enp2s0 proto dhcp metric 103
```
On OSP:

```yaml
name: master
platform:
  openstack:
    additionalNetworkIDs: &1
    - ab3a8289-ca41-4825-a48a-8f8a11faaac7
    - aa7dcdfa-2b6a-4b7a-bba9-3dabc0972bb5
```

OSP additionalNetworkIDs is the only case of multi-homed IPI that I know of.
Right, this should probably say primary default route, because for the purposes of IP selection we only care about the highest priority default route. We can only select one IP/interface so the others are ignored.
I'm not at all certain how that additionalNetworkIDs feature works today. We don't support deployment across multiple subnets so all of your masters have to be in the same network. That's part of why I'm focused on UPI in this document. We don't support overriding the IP selection logic for IPI anyway, it's always based on the VIP.
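For illustration, a hedged sketch of what "primary default route" means in practice: the lowest-metric default route reported by iproute2. The command is standard, but the parsing is an assumption about the output format rather than anything from the project:

```bash
# List default routes; the kernel orders same-prefix routes by metric, so the
# first line is the primary (lowest-metric) default route.
primary_dev=$(ip -4 route show default | head -n1 \
  | awk '{for (i = 1; i < NF; i++) if ($i == "dev") print $(i + 1)}')
echo "IP selection would follow the default route via: ${primary_dev}"
```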
> ### Open Questions [optional]
>
> - Currently this only applies to UPI clusters deployed using platform None.
>   Do we need something similar for IPI?
Yes, see comment above about OSP IPI additionalNetworkIDs.
> #### Failure Modes
>
> NA
This might not be relevant, but a common failure mode is when NetworkManager fails to activate `bond0` due to DHCP issues or misconfiguration, and the node is presented with multiple slave interfaces in the same subnet, each with their own DHCP lease, so the node has multiple IPs for the first time.
There is also the case of `fail_over_mac=1`, where the `bond0` MAC will change and potentially get a different DHCP lease.
These are both arguably DHCP network design issues.
Yeah, I think that's kind of tangential. If the IP changes out from under us that's a problem regardless of how we're selecting the IP.
> internal traffic.
>
> TODO: I'd like to get a more specific user story since I know we have a
> number of requests for this.
In ZTPFW we have a couple of similar (yet different) use cases for this. The original one is:
There's an enclosure, which has N blades (with multiple NICs) in it that are all inter-connected through internal switches. There are 2 main networks: Internal, and External.
The internal network is meant for everything that should not be exposed to users of the enclosure. It should also be considered static. The OCP cluster should be deployed on this network.
The external network is for all other traffic coming into the cluster, including customer workloads. This traffic is managed using MetalLB, which is installed in the OCP cluster.
As of today, it's not possible to deploy such a cluster without several workarounds, due to how the cluster interface and IP are selected. One of those workarounds is to set multiple default routes and make sure the one from the external network has a lower metric. This workaround is one of several needed to get the cluster deployed.
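For context, a hedged sketch of that route-metric workaround using NetworkManager; the connection names `internal` and `external` are hypothetical, and this illustrates only the one workaround mentioned above, not a supported recipe:

```bash
# Give the external network's default route a lower metric (higher priority)
# than the internal one, as described in the workaround above.
nmcli connection modify external ipv4.route-metric 100
nmcli connection modify internal ipv4.route-metric 200

# Re-activate the connections so the new metrics take effect.
nmcli connection up external
nmcli connection up internal
```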
Perfect, thanks. That's a more specific use case than what I have here now.
> - When keepalived runs it will read the IP from the primary-ip file and the
>   interface from the interface file instead of [re-running the selection
>   logic](https://github.com/openshift/baremetal-runtimecfg/blob/62d6b95649882178ebc183ba4e1eecdb4cad7289/pkg/config/net.go#L11)
> - TODO: Etcd? I believe it has a copy of the IP selection logic from runtimecfg
As far as I know, etcd will generate certificates according to the kubelet IP.
> as far as i know etcd will generate certificates according to kubelet-ip

I believe this is correct.
Okay, I'll make a note of that in the next revision.
> - /run/nodeip-configuration/primary-ip
> - /run/nodeip-configuration/ipv4
> - /run/nodeip-configuration/ipv6
> - /run/nodeip-configuration/interface
Saving files in /run will create an issue with ovs-configuration, since nodeip will find br-ex as the interface, and ovs will clean up everything as a first step and only then try to reconfigure. Maybe we should save those files in some other place and not override the interface file if br-ex was found? Or we will need to split the ovs logic and do the cleanup before nodeip?
That's a good point, and now I'm really wondering if we should skip the interface part of this entirely and have it work solely on IPs. It's also a problem on first boot when the IP is still on the underlying interface. After configure-ovs runs the interface here will no longer be valid because the IP will have moved to br-ex.
Okay, thanks for all of the discussion! Based on that and my own musings, I think we need to drop the interface from this and just persist the selected IP address(es). That way if an IP gets moved we don't have to worry about stale data, and the IP isn't allowed to change on the fly (or at all) so we don't have to worry about that. Interrupts willing, I will try to get a new version of this pushed tomorrow that addresses all/most of the comments so far.
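A minimal sketch of what "persist only the IP" could look like for a consumer such as configure-ovs, assuming the /run/nodeip-configuration/primary-ip path proposed above; this is illustrative only, not the agreed implementation:

```bash
NODEIP_FILE=/run/nodeip-configuration/primary-ip

if [ -s "$NODEIP_FILE" ]; then
  node_ip=$(cat "$NODEIP_FILE")
  # Look up which interface currently holds that IP at the moment it is
  # needed, so nothing goes stale when the IP later moves onto br-ex.
  iface=$(ip -o addr show to "$node_ip" | awk '{print $2; exit}')
  echo "Using persisted node IP ${node_ip} (currently on ${iface:-unknown})"
else
  echo "No persisted node IP; falling back to the existing selection logic"
fi
```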
If we want to use just the IP, I created 2 drafts for this; maybe they will help. This works on IPv4 and allows setting any IP I want, even if it doesn't have a default route.
Though it opens one more question for the OVN team: why should we set a low metric for the br-ex route when it is created? As a user I would prefer, if I know what I want, to leave the same metrics we had before creating br-ex.
Yep, thanks for proposing those. I anticipate using those patches for the implementation of the design we agree on here.
Someone from SDN can correct me if I'm wrong, but I think it was to avoid conflicts with other default routes on the cluster. That might be something we need to revisit as part of supporting OVNK on an isolated interface.
Force-pushed from a4768e5 to d553138.
Okay, I've filled out the rest of the sections (although there are still one or two open questions) and tried to address all of the comments so far. Let me know what you think.
Looks good to me.
LGTM
> - @cybertron
> reviewers:
> - @jcaamano
> - @tsorya
In case you want/can/need to, add me as a reviewer too :)
Done, thanks!
Is `.yaml` the correct extension for this document? Shouldn't it be a markdown file?
> We may want to rename KUBELET_NODEIP_HINT to reflect the fact that it will now
> affect more than just Kubelet.
I agree. The variable name should be updated to reflect its new "capabilities".
Added this to open questions to highlight that we need to agree on a name.
> Can we just require them to use this mechanism if they want custom IP
> selection?
We're waiting for an answer to this question, right?
Personally, it makes sense to me, but I see this as part of unifying IP and/or interface selection.
I guess we could proceed without making this statement, but I think we'll continue to have problems when people use other mechanisms to set the node ip.
Force-pushed from 765c959 to 5312803.
Oops, you're right. I'll fix that. I added Dan and Tim as approvers since I think they're the most familiar with this. Also linked the epic we have for this work. There are two remaining open questions that we should probably answer. I briefly commented about them elsewhere, but please let me know what you think and pull in anyone you think might have a vested interest in them.
> ## Alternatives
>
> Leave things basically as they are, but teach services like configure-ovs and
Right now a user should be able to configure the network how they want, although it's rather undocumented. It involves:
- Deploying their own keyfiles configuring br-ex on their own and using different filenames than what configure-ovs would use
- In 4.10 release or later, disabling configure-ovs (the service itself, or touching /run/configure-ovs-boot-done)
I guess one alternative would be to refine & document this.
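A hedged sketch of that undocumented path; the keyfile name is illustrative and the systemd unit name is an assumption, so treat this as a description of the idea rather than a verified recipe:

```bash
# 1. Provide your own NetworkManager keyfile that builds br-ex the way you
#    want, using a filename configure-ovs itself would not generate, e.g.:
#    /etc/NetworkManager/system-connections/my-custom-br-ex.nmconnection

# 2. In 4.10 or later, keep configure-ovs from reworking the network, either
#    by marking this boot as already configured...
touch /run/configure-ovs-boot-done

# ...or by disabling the service outright (unit name assumed).
systemctl disable ovs-configuration.service
```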
Yeah, I guess that in combination with KUBELET_NODEIP_HINT might handle a number of the use cases, but I feel like manual br-ex creation is a big hammer for a relatively small nail. I can add that to the alternatives though in case we decide to pursue that instead.
> I feel like manual br-ex creation is a big hammer for a relatively small nail. I can add that to the alternatives though in case we decide to pursue that instead.

I certainly agree with this feeling. Especially because, as it is now, it overrides the user's network configuration. This is counterintuitive when, for example, an nmstate config is passed through assisted service and the resulting network configuration doesn't match the desired topology.
> As a deployer, I want my cluster traffic to stay on an isolated network with
> no external gateway. External traffic will travel on a different interface
> that is managed by a service like MetalLB.
Would there be this other possibility?
- running kubelet on one interface
- running the pod network on a different interface
I don't know if it is reasonable at all; it probably does not make a lot of sense. But in case it does, we would not want to use the nodeip-configuration decision, I guess.
Yeah, my thought is that for the purposes of this enhancement we make sure everything is consistently picking the same interface, then as a followup we come up with a more complete solution that allows deployers to pick what they want to run where. I think that's going to be a lot harder to design so I think it's worth doing it in two steps.
This is becoming a more and more critical change for us, as NVIDIA is currently stuck on the same issue and would be very glad to see it pushed forward.
/lgtm
@jcaamano @cybertron are there other things to discuss here?
Wondering if @danwinship and @trozet may have some spare time to review/approve this enhancement. Would love to see progress being made here as this is affecting customers and it's quite difficult to work around.
> I feel like manual br-ex creation is a big hammer for a relatively small nail. I can add that to the alternatives though in case we decide to pursue that instead.

> I certainly agree with this feeling. Especially because, as it is now, it overrides the user's network configuration. This is counterintuitive when, for example, an nmstate config is passed through assisted service and the resulting network configuration doesn't match the desired topology.
@flaper87 can you explain this a bit more? How would it override the user's network configuration if manual br-ex configuration would be the user's network configuration, one and the same?
Oh, I think I misread your original message. I understood you were suggesting to just let the ovs script do what it does. I see now that you are saying that users, today, can create their own br-ex configuration. I do agree with @cybertron that it is a bit of a big hammer, though. I see this approach as useful for much more complex network configurations.
Actually, @jcaamano, following up on your comment:
Is the above even supported? As in, if a user disables the service and/or configures the br-ex bridge on their own?
It should work, but it is not documented and not being verified, so I wouldn't say it is supported. I don't think that listing alternatives in this enhancement means that they are supported, but correct me if I am wrong.
You are not wrong. That was mostly for my own curiosity and learning, since I have been dealing with this issue quite a lot lately, and knowing all of the alternatives definitely helps approach this in the best way possible. @cybertron, it may be worth listing these alternatives, and maybe we can try to get this PR across the finish line. What do you think?
Force-pushed from 5312803 to ef82069.
Oops, forgot to add the alternative we discussed earlier. New version coming shortly with the remaining open questions answered (let me know if you disagree with my answers though) and the new alternative. I think with all that taken care of we should be good to proceed with this.
Force-pushed from ef82069 to b2136d2.
@cybertron the markdown linter doesn't seem to agree with you :D Otherwise, lgtm.
As OpenShift is deployed in increasingly complex networking environments, we have gotten many requests for more control over which interface is used for the primary node IP. We provided a basic mechanism for this with [KUBELET_NODEIP_HINT](openshift/machine-config-operator#2888) but as users have started to exercise that some significant limitations have come to light.
Force-pushed from b2136d2 to 67c8e0e.
Turns out you need to quote the usernames or the linter complains. This one should pass (or at least it does locally).
@cybertron: all tests passed!
/lgtm
/lgtm
/approve
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: knobunc.
Per the design discussed in openshift/enhancements#1179 we are going to use this hint variable for more than just Kubelet IP selection, so it makes more sense for it to have a non-service-specific name. This changes the variable reference so it will initially look for $NODEIP_HINT, and failing that it will fall back to the previous name $KUBELET_NODEIP_HINT. Note that the old name was never formally documented because it was only intended to be used in exceptional cases. This new name will be included in the official documentation, but we maintain compatibility with the old name to avoid breaking existing users on upgrade.
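A hedged sketch of the fallback described above (illustrative shell only, not the actual machine-config-operator script):

```bash
# Prefer the new, non-service-specific variable; fall back to the legacy name
# so existing users keep working across an upgrade.
if [ -n "${NODEIP_HINT:-}" ]; then
  ip_hint="${NODEIP_HINT}"
elif [ -n "${KUBELET_NODEIP_HINT:-}" ]; then
  ip_hint="${KUBELET_NODEIP_HINT}"
else
  ip_hint=""
fi
echo "Node IP hint: ${ip_hint:-<unset>}"
```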