diff --git a/enhancements/network/ip-interface-selection.yaml b/enhancements/network/ip-interface-selection.yaml
new file mode 100644
index 00000000000..28cf2fde2e8
--- /dev/null
+++ b/enhancements/network/ip-interface-selection.yaml
@@ -0,0 +1,302 @@
---
title: ip-interface-selection
authors:
  - "@cybertron"
reviewers:
  - "@jcaamano"
  - "@tsorya"
approvers:
  - TBD
api-approvers:
  - "None"
creation-date: 2022-07-07
last-updated: 2022-07-07
tracking-link: # link to the tracking ticket (for example: Jira Feature or Epic ticket) that corresponds to this enhancement
  - TBD
see-also:
  - https://github.com/openshift/baremetal-runtimecfg/issues/119
replaces:
superseded-by:
---

# IP and Interface Selection

## Summary

As OpenShift is deployed in increasingly complex networking environments, we
have received many requests for more control over which interface is used for
the primary node IP. We provided a basic mechanism for this with
[KUBELET_NODEIP_HINT](https://github.com/openshift/machine-config-operator/pull/2888),
but as users have started to exercise that mechanism, some significant
limitations have come to light.

## Motivation

Some users want a great deal of control over how their network traffic is
routed. Because we use the default route for interface and IP selection, in
some cases they are not able to route traffic the way they want.

### User Stories

As a network administrator, I want cluster traffic on one NIC and the default
route on a different NIC so that external traffic is segregated from internal
traffic.

TODO: I'd like to get a more specific user story since I know we have a
number of requests for this.

### Goals

Ensure that all host-networked services on a node have consistent interface
and IP selection.

### Non-Goals

Support for platforms other than UPI (platform None).

Complete support for multiple NICs, with full control over what traffic gets
routed where. However, this work should be able to serve as the basis for a
broader multi-NIC feature, so we should avoid designs that would limit future
work in this area.

## Proposal

The following is a possibly incomplete list of places where we do
IP/interface selection. Ideally these should all use a single mechanism so
that they are consistent with each other:

- Node IP (Kubelet and CRIO)
- configure-ovs
- resolv-prepender
- Keepalived
- Etcd

In some cases (local DNS) it may not matter whether the selected IP is
consistent with the other cases, but Node IP, configure-ovs, and Keepalived
all need to match because their functionality depends on it. I'm less
familiar with the requirements for Etcd, but it seems likely that it should
match as well. In general, it seems best if all IP selection logic comes from
one place, whether that is strictly required or not.

### Workflow Description

At deployment time the cluster administrator will include a manifest that
sets KUBELET_NODEIP_HINT appropriately. The nodeip-configuration service
(which will now be set as a dependency of all other services that need
IP/interface selection) will use that value to determine the desired IP and
interface for all services on the node. It will write the results of this
selection to a well-known location which the other services will consume.
This way we don't need to duplicate the selection logic in multiple places;
it will happen once and be reused as necessary.
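For illustration, here is a minimal sketch of what the administrator-provided
manifest would ultimately land on each node. The file path and example
address are assumptions based on the existing KUBELET_NODEIP_HINT mechanism,
and the variable may be renamed (see the Variation section below):

```bash
# Hypothetical contents of /etc/default/nodeip-configuration, delivered via a
# MachineConfig manifest at install time. nodeip-configuration would prefer
# the local address (and therefore the interface) matching the hinted subnet.
KUBELET_NODEIP_HINT=192.0.2.10
```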
#### Example

- resolv-prepender has to run before any other node IP selection can take
  place. Without resolv.conf populated, the nodeip-configuration service
  cannot pull the runtimecfg image.
  - Note: This is not relevant for plain UPI deployments, but there are some
    deployment mechanisms that use the IPI-style resolv.conf management.
    Since this is not present in all deployments, we will likely need to
    leave it as it is today. The prepender script uses the runtimecfg node-ip
    logic, so although it won't be able to consume the output of
    nodeip-configuration, there won't be any duplicate logic either.
- nodeip-configuration runs and selects one or more IPs. It writes them to
  the Kubelet and CRIO configuration files (this is the existing behavior).
- nodeip-configuration also writes the following files (new behavior):
  - /run/nodeip-configuration/primary-ip
  - /run/nodeip-configuration/ipv4
  - /run/nodeip-configuration/ipv6
  - /run/nodeip-configuration/interface
- When configure-ovs runs, it looks for the interface file written by
  nodeip-configuration. If found, br-ex will be created on that interface.
  If not, the existing logic will be used.
- When keepalived runs, it will read the IP from the primary-ip file and the
  interface from the interface file instead of [re-running the selection
  logic](https://github.com/openshift/baremetal-runtimecfg/blob/62d6b95649882178ebc183ba4e1eecdb4cad7289/pkg/config/net.go#L11).
  A sketch of this consumer-side behavior follows this list.
- TODO: Etcd? I believe it has a copy of the IP selection logic from
  runtimecfg, so it's possible it could use these files.
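As a rough, hypothetical sketch (not the actual implementation), a consumer
such as configure-ovs or the keepalived startup logic could pick up the
selection results and fall back to today's behavior when the files are
absent:

```bash
#!/bin/bash
# Hypothetical consumer-side sketch; the file names match the list above and
# the fallback mirrors the current default-route-based selection.
rundir=/run/nodeip-configuration

if [ -s "${rundir}/interface" ]; then
  # Use the interface chosen by nodeip-configuration (e.g. to build br-ex on).
  iface="$(cat "${rundir}/interface")"
else
  # Fall back to the interface carrying the default route.
  iface="$(ip route show default | \
    awk '{for (i = 1; i < NF; i++) if ($i == "dev") {print $(i+1); exit}}')"
fi

node_ip=""
if [ -s "${rundir}/primary-ip" ]; then
  node_ip="$(cat "${rundir}/primary-ip")"
fi

echo "Selected interface: ${iface}, node IP: ${node_ip:-unknown}"
```

Keeping the fallback identical to today's behavior means the new files are
purely additive: a node where nodeip-configuration has not written them
behaves exactly as it does now.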
#### Variation [optional]

We may want to rename KUBELET_NODEIP_HINT to reflect the fact that it will
now affect more than just Kubelet.

### API Extensions

NA

### Implementation Details/Notes/Constraints [optional]

Currently configure-ovs runs before nodeip-configuration. In this design we
would need to reverse that order (a sketch of what the ordering change might
look like follows at the end of this section). There are currently no
dependencies between the two services that would prevent such a change.

As noted above, we want to make sure we don't do anything that would further
complicate the implementation of a higher-level multi-NIC feature. The
current design should not be a problem for that. For example, if at some
point we add a feature allowing deployers to specify that they want cluster
traffic on eth0, external traffic on eth1, and storage traffic on eth2, that
feature would simply need to appropriately populate the KUBELET_NODEIP_HINT
file that, in the current design, is created directly by the deployer. By
providing a common interface for configuring host-networked services, this
should actually simplify any such future enhancements.
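A minimal sketch of the reversed ordering, assuming the existing unit names
(which are assumptions here); the real change would live in the
machine-config-operator templates rather than in a drop-in like this:

```ini
# Hypothetical drop-in, e.g. /etc/systemd/system/ovs-configuration.service.d/10-after-nodeip.conf
# It orders configure-ovs after nodeip-configuration so the interface file is
# available before br-ex is created.
[Unit]
After=nodeip-configuration.service
Wants=nodeip-configuration.service
```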
### Risks and Mitigations

- There is some risk in changing the order of critical system services like
  nodeip-configuration and configure-ovs. This will not affect deployments
  that do not use OVNKubernetes as their CNI, but since we intend that to be
  the default going forward, it is a significant concern.

  We intend to mitigate this risk by first merging the order change without
  any of the other changes included in this design. This way, if any races
  between services are found once the change is being tested more broadly,
  it will be easy to revert.

  We will also test the ordering change as much as possible before merging
  it, but it's unlikely we can exercise it to the same degree that running
  across hundreds of CI jobs per day will.

- Currently all host services are expected to listen on the same IP and
  interface. If at some point in the future we need host services listening
  on multiple different interfaces, this may not work. However, because we
  are centralizing all IP selection logic in nodeip-configuration, it should
  be possible to extend that to handle multiple interfaces if necessary.

### Drawbacks

This design only considers host networking, and it's likely that in the
future we will want a broader feature that provides an interface for
configuring pod traffic routing as well. However, if/when such a feature is
implemented, it should be able to use the same configuration interface for
host services that deployers will use directly once this enhancement is
implemented.

Additionally, there are already ways to [implement traffic steering for pod
networking](https://youtu.be/EpbUWwjadYM). We may at some point want to
integrate them more closely, but host networking is currently a much bigger
pain point and worth addressing on its own.

## Design Details

### Open Questions [optional]

- Currently this only applies to UPI clusters deployed using platform None.
  Do we need something similar for IPI?

- In UPI deployments it is also possible to set the Node IP by manually
  writing configuration files for Kubelet and CRIO. Trying to look for all
  possible ways a user may have configured those services seems complex and
  error-prone. Can we just require them to use this mechanism if they want
  custom IP selection?

### Test Plan

**Note:** *Section not required until targeted at a release.*

Consider the following in developing a test plan for this enhancement:
- Will there be e2e and integration tests, in addition to unit tests?
- How will it be tested in isolation vs with other components?
- What additional testing is necessary to support managed OpenShift
  service-based offerings?

No need to outline all of the test cases, just the general strategy. Anything
that would count as tricky in the implementation, and anything particularly
challenging to test, should be called out.

All code is expected to have adequate tests (eventually with coverage
expectations).

### Graduation Criteria

NA

#### Dev Preview -> Tech Preview

NA

#### Tech Preview -> GA

NA

#### Removing a deprecated feature

NA

### Upgrade / Downgrade Strategy

NA

### Version Skew Strategy

From version to version the selection process must remain consistent in order
to avoid IPs and interfaces changing. As a result, version skew should not be
a problem.

### Operational Aspects of API Extensions

NA

#### Failure Modes

NA

#### Support Procedures

Describe how to
- detect the failure modes in a support situation, describe possible symptoms
  (events, metrics, alerts, which log output in which component)

  Examples:
  - If the webhook is not running, kube-apiserver logs will show errors like
    "failed to call admission webhook xyz".
  - Operator X will degrade with message "Failed to launch webhook server"
    and reason "WebhookServerFailed".
  - The metric `webhook_admission_duration_seconds("openpolicyagent-admission", "mutating", "put", "false")`
    will show >1s latency and alert `WebhookAdmissionLatencyHigh` will fire.

- disable the API extension (e.g. remove MutatingWebhookConfiguration `xyz`,
  remove APIService `foo`)

  - What consequences does it have on the cluster health?

    Examples:
    - Garbage collection in kube-controller-manager will stop working.
    - Quota will be wrongly computed.
    - Disabling/removing the CRD is not possible without removing the CR
      instances. Customer will lose data. Disabling the conversion webhook
      will break garbage collection.

  - What consequences does it have on existing, running workloads?

    Examples:
    - New namespaces won't get the finalizer "xyz" and hence might leak
      resource X when deleted.
    - SDN pod-to-pod routing will stop updating, potentially breaking
      pod-to-pod communication after some minutes.

  - What consequences does it have for newly created workloads?

    Examples:
    - New pods in a namespace with Istio support will not get sidecars
      injected, breaking their networking.

- Does functionality fail gracefully and will work resume when re-enabled
  without risking consistency?

  Examples:
  - The mutating admission webhook "xyz" has FailPolicy=Ignore and hence will
    not block the creation or update of objects when it fails. When the
    webhook comes back online, there is a controller reconciling all objects,
    applying labels that were not applied during admission webhook downtime.
  - Namespace deletion will not delete all objects in etcd, leading to zombie
    objects when another namespace with the same name is created.

## Implementation History

Major milestones in the life cycle of a proposal should be tracked in
`Implementation History`.

## Alternatives

Similar to the `Drawbacks` section, the `Alternatives` section is used to
highlight and record other possible approaches to delivering the value
proposed by an enhancement.

## Infrastructure Needed [optional]

NA