---
title: ip-interface-selection
authors:
  - @cybertron
reviewers:
  - @jcaamano
  - @tsorya
approvers:
  - TBD
api-approvers:
  - "None"
creation-date: 2022-07-07
last-updated: 2022-07-07
tracking-link: # link to the tracking ticket (for example: Jira Feature or Epic ticket) that corresponds to this enhancement
  - TBD
see-also:
  - https://github.com/openshift/baremetal-runtimecfg/issues/119
replaces:
superseded-by:
---

# IP and Interface Selection

## Summary
As OpenShift is deployed in increasingly complex networking environments, we
have received many requests for more control over which interface is used for
the primary node IP. We provided a basic mechanism for this with
[KUBELET_NODEIP_HINT](https://github.com/openshift/machine-config-operator/pull/2888),
but as users have started to exercise that mechanism, some significant
limitations have come to light.
## Motivation

Some users want a great deal of control over how their network traffic is
routed. Because we use the default route for interface and IP selection, in
some cases they are not able to route traffic the way they want.
### User Stories

As a network administrator, I want cluster traffic on one NIC and the
default route on a different NIC so that external traffic is segregated from
internal traffic.

TODO: I'd like to get a more specific user story since I know we have a
number of requests for this.
### Goals

Ensure that all host-networked services on a node have consistent interface
and IP selection.

### Non-Goals

Support for platforms other than UPI (platform `None`).

Complete support for multiple NICs, with full control over what traffic
gets routed where. However, this work should be able to serve as the basis
for a broader multi-NIC feature, so we should avoid designs that would limit
future work in this area.
## Proposal

The following is a possibly incomplete list of places where we do IP/interface
selection. Ideally these should all use a single mechanism so they are
consistent with each other:

- Node IP (Kubelet and CRI-O)
- configure-ovs
- resolv-prepender
- Keepalived
- Etcd

In some cases (local DNS) it may not matter if the IP selected is consistent
with the other cases, but Node IP, configure-ovs, and Keepalived all need to
match because their functionality depends on it. I'm less familiar with the
requirements for Etcd, but it seems likely that it should match as well. In
general, it seems best if all IP selection logic comes from one place, whether
that is strictly required or not.
### Workflow Description

At deployment time the cluster administrator will include a manifest that sets
KUBELET_NODEIP_HINT appropriately. The nodeip-configuration service (which
will now be set as a dependency of all other services that need IP/interface
selection) will use that value to determine the desired IP and interface for
all services on the node. It will write the results of this selection to a
well-known location, which the other services will consume. This way we don't
need to duplicate the selection logic in multiple places; it will happen once
and be reused as necessary.
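A minimal sketch of such a deployer-provided manifest follows. The
MachineConfig name, role, and hint address are illustrative assumptions; the
env-file path follows the existing KUBELET_NODEIP_HINT mechanism:

```yaml
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  labels:
    machineconfiguration.openshift.io/role: worker
  name: 98-nodeip-hint-worker   # hypothetical name
spec:
  config:
    ignition:
      version: 3.2.0
    storage:
      files:
        - path: /etc/default/nodeip-configuration
          mode: 420  # 0644
          overwrite: true
          contents:
            # Decodes to: KUBELET_NODEIP_HINT=192.0.2.10 (example address)
            source: data:,KUBELET_NODEIP_HINT%3D192.0.2.10
```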
#### Example

- resolv-prepender has to run before any other node IP selection can take
  place. Without resolv.conf populated, the nodeip-configuration service cannot
  pull the runtimecfg image.
  - Note: This is not relevant for plain UPI deployments, but there are some
    deployment mechanisms that use the IPI-style resolv.conf management. Since
    this is not present in all deployments, we will likely need to leave it
    as it is today. The prepender script uses the runtimecfg node-ip logic,
    so although it won't be able to consume the output of nodeip-configuration,
    there won't be any duplicated logic either.
- nodeip-configuration runs and selects one or more IPs. It writes them to
  the Kubelet and CRI-O configuration files (this is the existing behavior).
- nodeip-configuration also writes the following files (new behavior):
  - /run/nodeip-configuration/primary-ip
  - /run/nodeip-configuration/ipv4
  - /run/nodeip-configuration/ipv6
  - /run/nodeip-configuration/interface
- When configure-ovs runs, it looks for the interface file written by
  nodeip-configuration. If found, br-ex will be created on that interface. If
  not, the existing logic will be used.
- When keepalived runs, it will read the IP from the primary-ip file and the
  interface from the interface file instead of [re-running the selection
  logic](https://github.com/openshift/baremetal-runtimecfg/blob/62d6b95649882178ebc183ba4e1eecdb4cad7289/pkg/config/net.go#L11).
- TODO: Etcd? I believe it has a copy of the IP selection logic from runtimecfg,
  so it's possible it could use these files.
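The consumer side of the steps above could be sketched roughly as follows. This
is a hypothetical helper, not the actual configure-ovs or keepalived code
(those live in machine-config-operator and baremetal-runtimecfg); the function
name and the default-route fallback are assumptions:

```shell
#!/bin/sh
# Prefer the interface selected by nodeip-configuration; otherwise fall
# back to the interface carrying the default route, as services do today.
select_interface() {
    rundir=${1:-/run/nodeip-configuration}
    if [ -r "$rundir/interface" ]; then
        # nodeip-configuration already ran; reuse its selection
        cat "$rundir/interface"
    else
        # Existing behavior: first interface with a default route
        ip route show default 2>/dev/null | awk '/^default/ {print $5; exit}'
    fi
}
```

configure-ovs would then create br-ex on `$(select_interface)`, and keepalived
would read `primary-ip` the same way, so both stay consistent with Kubelet.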
#### Variation [optional]

We may want to rename KUBELET_NODEIP_HINT to reflect the fact that it will now
affect more than just Kubelet.

### API Extensions

NA

### Implementation Details/Notes/Constraints [optional]
Currently configure-ovs runs before nodeip-configuration. In this design we
would need to reverse that order. There are currently no dependencies between
the two services that would prevent such a change.

As noted above, we want to make sure we don't do anything that would further
complicate the implementation of a higher-level multi-NIC feature. The current
design should not be a problem there. For example, if at some point we add
a feature allowing deployers to specify that they want cluster traffic on
eth0, external traffic on eth1, and storage traffic on eth2, that feature
would simply need to populate the KUBELET_NODEIP_HINT file that, in the
current design, the deployer creates directly. By providing a common
interface for configuring host-networked services, this work should actually
simplify any such future enhancements.
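The reversed ordering could be expressed as a systemd drop-in along these
lines. The unit and drop-in file names are assumptions based on the current
service names; the actual change would land in the machine-config-operator
templates:

```ini
# Hypothetical drop-in:
# /etc/systemd/system/ovs-configuration.service.d/10-after-nodeip.conf
# Run configure-ovs only after nodeip-configuration has written its
# selection files under /run/nodeip-configuration.
[Unit]
Wants=nodeip-configuration.service
After=nodeip-configuration.service
```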
### Risks and Mitigations

- There is some risk in changing the order of critical system services like
  nodeip-configuration and configure-ovs. This will not affect deployments
  that do not use OVNKubernetes as their CNI, but since we intend that to be
  the default going forward it is a significant concern.

  We intend to mitigate this risk by first merging the order change without
  any of the other changes included in this design. This way, if any races
  between services are found once the change is being tested more broadly,
  it will be easy to revert.

  We will also test the ordering change as much as possible before merging
  it, but it's unlikely we can exercise it to the same degree that running
  across hundreds of CI jobs per day will.

- Currently all host services are expected to listen on the same IP and
  interface. If at some point in the future we need host services listening
  on multiple different interfaces, this may not work. However, because we
  are centralizing all IP selection logic in nodeip-configuration, it should
  be possible to extend that logic to handle multiple interfaces if necessary.
### Drawbacks

This design only considers host networking, and it's likely that in the future
we will want a broader feature that provides an interface to configure pod
traffic routing as well. However, if/when such a feature is implemented, it
should be able to use the same configuration interface for host services
that deployers would use directly after this enhancement is implemented.

Additionally, there are already ways to [implement traffic steering for pod
networking](https://youtu.be/EpbUWwjadYM). We may at some point want to
integrate them more closely, but host networking is currently a much bigger
pain point and worth addressing on its own.
## Design Details

### Open Questions [optional]

- Currently this only applies to UPI clusters deployed using platform None.
  Do we need something similar for IPI?

- In UPI deployments it is also possible to set the Node IP by manually writing
  configuration files for Kubelet and CRI-O. Trying to look for all possible
  ways a user may have configured those services seems complex and error-prone.
  Can we just require them to use this mechanism if they want custom IP
  selection?
### Test Plan

**Note:** *Section not required until targeted at a release.*

Consider the following in developing a test plan for this enhancement:
- Will there be e2e and integration tests, in addition to unit tests?
- How will it be tested in isolation vs with other components?
- What additional testing is necessary to support managed OpenShift service-based offerings?

No need to outline all of the test cases, just the general strategy. Anything
that would count as tricky in the implementation and anything particularly
challenging to test should be called out.

All code is expected to have adequate tests (eventually with coverage
expectations).
### Graduation Criteria

NA

#### Dev Preview -> Tech Preview

NA

#### Tech Preview -> GA

NA

#### Removing a deprecated feature

NA

### Upgrade / Downgrade Strategy

NA

### Version Skew Strategy

From version to version the selection process must remain consistent in order
to avoid IPs and interfaces changing. As a result, version skew should not be
a problem.
### Operational Aspects of API Extensions

NA

#### Failure Modes

NA

#### Support Procedures
Describe how to
- detect the failure modes in a support situation, describe possible symptoms (events, metrics,
  alerts, which log output in which component)

  Examples:
  - If the webhook is not running, kube-apiserver logs will show errors like "failed to call admission webhook xyz".
  - Operator X will degrade with message "Failed to launch webhook server" and reason "WebhookServerFailed".
  - The metric `webhook_admission_duration_seconds("openpolicyagent-admission", "mutating", "put", "false")`
    will show >1s latency and alert `WebhookAdmissionLatencyHigh` will fire.

- disable the API extension (e.g. remove MutatingWebhookConfiguration `xyz`, remove APIService `foo`)

  - What consequences does it have on the cluster health?

    Examples:
    - Garbage collection in kube-controller-manager will stop working.
    - Quota will be wrongly computed.
    - Disabling/removing the CRD is not possible without removing the CR instances. Customer will lose data.
      Disabling the conversion webhook will break garbage collection.

  - What consequences does it have on existing, running workloads?

    Examples:
    - New namespaces won't get the finalizer "xyz" and hence might leak resource X
      when deleted.
    - SDN pod-to-pod routing will stop updating, potentially breaking pod-to-pod
      communication after some minutes.

  - What consequences does it have for newly created workloads?

    Examples:
    - New pods in namespace with Istio support will not get sidecars injected, breaking
      their networking.

- Does functionality fail gracefully and will work resume when re-enabled without risking
  consistency?

  Examples:
  - The mutating admission webhook "xyz" has FailPolicy=Ignore and hence
    will not block the creation or updates on objects when it fails. When the
    webhook comes back online, there is a controller reconciling all objects, applying
    labels that were not applied during admission webhook downtime.
  - Namespaces deletion will not delete all objects in etcd, leading to zombie
    objects when another namespace with the same name is created.
## Implementation History

Major milestones in the life cycle of a proposal should be tracked in `Implementation
History`.

## Alternatives

Similar to the `Drawbacks` section the `Alternatives` section is used to
highlight and record other possible approaches to delivering the value proposed
by an enhancement.
## Infrastructure Needed [optional]

NA