---
title: ip-interface-selection
authors:
- "@cybertron"
reviewers:
- "@jcaamano"
- "@tsorya"
approvers:
- TBD
api-approvers:
- "None"
creation-date: 2022-07-07
last-updated: 2022-07-07
tracking-link: # link to the tracking ticket (for example: Jira Feature or Epic ticket) that corresponds to this enhancement
- TBD
see-also:
- https://github.com/openshift/baremetal-runtimecfg/issues/119
replaces:
superseded-by:
---

# IP and Interface Selection

## Summary

As OpenShift is deployed in increasingly complex networking environments, we
have received many requests for more control over which interface is used for
the primary node IP. We provided a basic mechanism for this with
[KUBELET_NODEIP_HINT](https://github.com/openshift/machine-config-operator/pull/2888),
but as users have started to exercise it, some significant limitations have
come to light.

## Motivation

Some users want to have a great deal of control over how their network traffic
is routed. Because we use the default route for interface and IP selection,
in some cases they are not able to route traffic the way they want.

### User Stories

As a network administrator, I want cluster traffic to be on one NIC and the
default route on a different NIC so external traffic is segregated from
internal traffic.

TODO: I'd like to get a more specific user story since I know we have a
number of requests for this.

### Goals

Ensure that all host networked services on a node have consistent interface
and IP selection.

### Non-Goals

Support for platforms other than UPI (platform `None`).

Complete support for multiple NICs, with full control over what traffic
gets routed where. However, this work should be able to serve as the basis
for a broader multi-NIC feature, so we should avoid designs that would limit
future work in this area.

## Proposal

The following is a possibly incomplete list of places where we do IP/interface
selection. Ideally these should all use a single mechanism so that they are
consistent with each other.

- Node IP (Kubelet and CRIO)
- configure-ovs
- resolv-prepender
- Keepalived
- Etcd

In some cases (for example, local DNS) it may not matter if the IP selected is
consistent with the other cases, but Node IP, configure-ovs, and Keepalived all
need to match because their functionality depends on it. I'm less familiar with
the requirements for Etcd, but it seems likely that it should match as well. In
general, it seems best if all IP selection logic comes from one place, whether
that is strictly required or not.

### Workflow Description

At deployment time the cluster administrator will include a manifest that sets
KUBELET_NODEIP_HINT appropriately. The nodeip-configuration service (which
will now be set as a dependency for all other services that need IP/interface
selection) will use that value to determine the desired IP and interface for
all services on the node. It will write the results of this selection to a
well-known location, which the other services will consume. This way, we don't
need to duplicate the selection logic in multiple places. It will happen once
and be reused as necessary.
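
For illustration, one way the administrator's manifest might deliver the hint
is as a small environment file on each node, which is how the existing
KUBELET_NODEIP_HINT mechanism is consumed today. This is a minimal sketch; the
path and variable name follow the current mechanism, and the address is purely
illustrative.

```bash
# Hypothetical contents of /etc/default/nodeip-configuration, delivered via a
# MachineConfig manifest at install time. nodeip-configuration uses this hint
# to prefer an address (and therefore an interface) matching the hinted
# subnet rather than the default route. The address below is an example only.
KUBELET_NODEIP_HINT=192.0.2.1
```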

#### Example

- resolv-prepender has to run before any other node IP selection can take
place. Without resolv.conf populated, the nodeip-configuration service cannot
pull the runtimecfg image.
  - Note: This is not relevant for plain UPI deployments, but there are some
deployment mechanisms that use the IPI-style resolv.conf management. Since
this is not present in all deployments, we will likely need to leave this
as it is today. The prepender script uses the runtimecfg node-ip logic,
so although it won't be able to consume the output of nodeip-configuration,
there won't be any duplicated logic either.
- nodeip-configuration runs and selects one or more IPs. It writes them to
the Kubelet and CRIO configuration files (this is the existing behavior).
- nodeip-configuration also writes the following files (new behavior):
- /run/nodeip-configuration/primary-ip
- /run/nodeip-configuration/ipv4
- /run/nodeip-configuration/ipv6
- /run/nodeip-configuration/interface
- When configure-ovs runs, it looks for the interface file written by
nodeip-configuration. If found, br-ex will be created on that interface. If
not, the existing logic will be used (see the sketch after this list).
- When keepalived runs, it will read the IP from the primary-ip file and the
interface from the interface file instead of [re-running the selection
logic](https://github.com/openshift/baremetal-runtimecfg/blob/62d6b95649882178ebc183ba4e1eecdb4cad7289/pkg/config/net.go#L11).
- TODO: Etcd? I believe it has a copy of the IP selection logic from runtimecfg,
so it's possible it could use these files.
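
To make the consumption model concrete, here is a minimal sketch of how a
consumer such as configure-ovs might use the interface file, falling back to
today's default-route-based selection when the file is absent. The file path
is the one proposed above; everything else (variable names, the fallback
command) is illustrative rather than the actual configure-ovs implementation.

```bash
#!/bin/bash
# Hypothetical consumer-side logic (not the real configure-ovs script):
# prefer the interface selected by nodeip-configuration, otherwise fall
# back to the interface that carries the default route.
iface_file=/run/nodeip-configuration/interface

if [ -s "${iface_file}" ]; then
  iface=$(cat "${iface_file}")
  echo "Using interface ${iface} selected by nodeip-configuration"
else
  # Existing behavior: derive the interface from the default route.
  iface=$(ip route show default | awk '/^default/ {print $5; exit}')
  echo "No selection file found, falling back to ${iface} from the default route"
fi
```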

#### Variation [optional]

We may want to rename KUBELET_NODEIP_HINT to reflect the fact that it will now
affect more than just Kubelet.

### API Extensions

NA

### Implementation Details/Notes/Constraints [optional]

Currently configure-ovs runs before nodeip-configuration. In this design we
would need to reverse that. There are currently no dependencies between the
two services that would prevent such a change.
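
For reference, the reversed ordering could be expressed with a systemd drop-in
along these lines. This is a rough sketch only: it assumes the current unit
names (ovs-configuration.service for configure-ovs and
nodeip-configuration.service), and the drop-in path and contents are
illustrative rather than the final implementation.

```bash
# Hypothetical: make configure-ovs wait for nodeip-configuration by adding a
# systemd drop-in on each node. Unit names assume the current naming.
mkdir -p /etc/systemd/system/ovs-configuration.service.d
cat > /etc/systemd/system/ovs-configuration.service.d/10-after-nodeip.conf <<'EOF'
[Unit]
# Run configure-ovs only after node IP/interface selection has completed.
After=nodeip-configuration.service
Wants=nodeip-configuration.service
EOF
systemctl daemon-reload
```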

As noted above, we want to make sure we don't do anything that would further
complicate the implementation of a higher-level multi-NIC feature. The current
design should not be a problem for that. For example, if at some point we add
a feature allowing deployers to specify that they want cluster traffic on
eth0, external traffic on eth1, and storage traffic on eth2, that feature
would simply need to populate the KUBELET_NODEIP_HINT file that, in the
current design, the deployer creates directly. By providing a common interface
for configuring host-networked services, this should actually simplify any
such future enhancements.

### Risks and Mitigations

- There is some risk to changing the order of critical system services like
nodeip-configuration and configure-ovs. This will not affect deployments
that do not use OVNKubernetes as their CNI, but since we intend that to be
the default going forward, it is a significant concern.

We intend to mitigate this risk by first merging the order change without
any of the additional changes included in this design. This way, if any races
between services are found once the change is being tested more broadly,
it will be easy to revert.

We will also test the ordering change as much as possible before merging
it, but it's unlikely we can exercise it to the same degree that running
across hundreds of CI jobs per day will.

- Currently all host services are expected to listen on the same IP and
interface. If at some point in the future we need host services listening
on multiple different interfaces, this may not work. However, because we
are centralizing all IP selection logic in nodeip-configuration, it should
be possible to extend that to handle multiple interfaces if necessary.

### Drawbacks

This design only considers host networking, and it's likely that in the future
we will want a broader feature that provides an interface to configure pod
traffic routing as well. However, if/when such a feature is implemented, it
should be able to reuse the same configuration interface for host services
that deployers will use directly once this enhancement is in place.

Additionally, there are already ways to [implement traffic steering for pod
networking](https://youtu.be/EpbUWwjadYM). We may at some point want to
integrate them more closely, but host networking is currently a much bigger
pain point and worth addressing on its own.

## Design Details

### Open Questions [optional]

- Currently this only applies to UPI clusters deployed using platform None.
Do we need something similar for IPI?

- In UPI deployments it is also possible to set the Node IP by manually writing
configuration files for Kubelet and CRIO. Trying to look for all possible
ways a user may have configured those services seems complex and error-prone.
Can we just require them to use this mechanism if they want custom IP
selection?

### Test Plan

**Note:** *Section not required until targeted at a release.*

Consider the following in developing a test plan for this enhancement:
- Will there be e2e and integration tests, in addition to unit tests?
- How will it be tested in isolation vs with other components?
- What additional testing is necessary to support managed OpenShift service-based offerings?

No need to outline all of the test cases, just the general strategy. Anything
that would count as tricky in the implementation and anything particularly
challenging to test should be called out.

All code is expected to have adequate tests (eventually with coverage
expectations).

### Graduation Criteria

NA

#### Dev Preview -> Tech Preview

NA

#### Tech Preview -> GA

NA

#### Removing a deprecated feature

NA

### Upgrade / Downgrade Strategy

NA

### Version Skew Strategy

From version to version the selection process must remain consistent in order
to avoid IPs and interfaces changing. As a result, version skew should not be
a problem.

### Operational Aspects of API Extensions

NA

#### Failure Modes

NA

#### Support Procedures

Describe how to
- detect the failure modes in a support situation, describe possible symptoms (events, metrics,
alerts, which log output in which component)

Examples:
- If the webhook is not running, kube-apiserver logs will show errors like "failed to call admission webhook xyz".
- Operator X will degrade with message "Failed to launch webhook server" and reason "WebhookServerFailed".
- The metric `webhook_admission_duration_seconds("openpolicyagent-admission", "mutating", "put", "false")`
will show >1s latency and alert `WebhookAdmissionLatencyHigh` will fire.

- disable the API extension (e.g. remove MutatingWebhookConfiguration `xyz`, remove APIService `foo`)

- What consequences does it have on the cluster health?

Examples:
- Garbage collection in kube-controller-manager will stop working.
- Quota will be wrongly computed.
- Disabling/removing the CRD is not possible without removing the CR instances. Customer will lose data.
Disabling the conversion webhook will break garbage collection.

- What consequences does it have on existing, running workloads?

Examples:
- New namespaces won't get the finalizer "xyz" and hence might leak resource X
when deleted.
- SDN pod-to-pod routing will stop updating, potentially breaking pod-to-pod
communication after some minutes.

- What consequences does it have for newly created workloads?

Examples:
- New pods in namespace with Istio support will not get sidecars injected, breaking
their networking.

- Does functionality fail gracefully and will work resume when re-enabled without risking
consistency?

Examples:
- The mutating admission webhook "xyz" has FailPolicy=Ignore and hence
will not block the creation or updates on objects when it fails. When the
webhook comes back online, there is a controller reconciling all objects, applying
labels that were not applied during admission webhook downtime.
- Namespaces deletion will not delete all objects in etcd, leading to zombie
objects when another namespace with the same name is created.

## Implementation History

Major milestones in the life cycle of a proposal should be tracked in `Implementation
History`.

## Alternatives

Similar to the `Drawbacks` section, the `Alternatives` section is used to
highlight and record other possible approaches to delivering the value proposed
by an enhancement.

## Infrastructure Needed [optional]

NA
