WARNING WARNING WARNING WARNING WARNING

PLEASE NOTE: This document applies to the HEAD of the source tree

If you are using a released version of Kubernetes, you should refer to the docs that go with that version.

The latest release of this document can be found [here](http://releases.k8s.io/release-1.4/docs/design/podaffinity.md).

Documentation for other releases can be found at releases.k8s.io.

Inter-pod topological affinity and anti-affinity

Introduction

NOTE: It is useful to read about node affinity first.

This document describes a proposal for specifying and implementing inter-pod topological affinity and anti-affinity. By that we mean: rules that specify that certain pods should be placed in the same topological domain (e.g. same node, same rack, same zone, same power domain, etc.) as some other pods, or, conversely, should not be placed in the same topological domain as some other pods.

Here are a few example rules; we explain how to express them using the API described in this document in the "Examples" section below.

  • Affinity
    • Co-locate the pods from a particular service or Job in the same availability zone, without specifying which zone that should be.
    • Co-locate the pods from service S1 with pods from service S2 because S1 uses S2 and thus it is useful to minimize the network latency between them. Co-location might mean same nodes and/or same availability zone.
  • Anti-affinity
    • Spread the pods of a service across nodes and/or availability zones, e.g. to reduce correlated failures.
    • Give a pod "exclusive" access to a node to guarantee resource isolation -- it must never share the node with other pods.
    • Don't schedule the pods of a particular service on the same nodes as pods of another service that are known to interfere with the performance of the pods of the first service.

For both affinity and anti-affinity, there are three variants. Two variants have the property of requiring the affinity/anti-affinity to be satisfied for the pod to be allowed to schedule onto a node; the difference between them is that if the condition ceases to be met later on at runtime, for one of them the system will try to eventually evict the pod, while for the other the system may not try to do so. The third variant simply provides scheduling-time hints that the scheduler will try to satisfy but may not be able to. These three variants are directly analogous to the three variants of node affinity.

Note that this proposal is only about inter-pod topological affinity and anti-affinity. There are other forms of topological affinity and anti-affinity. For example, you can use node affinity to require (prefer) that a set of pods all be scheduled in some specific zone Z. Node affinity is not capable of expressing inter-pod dependencies, and conversely the API we describe in this document is not capable of expressing node affinity rules. For simplicity, we will use the terms "affinity" and "anti-affinity" to mean "inter-pod topological affinity" and "inter-pod topological anti-affinity," respectively, in the remainder of this document.

API

We will add one field to PodSpec

Affinity *Affinity  `json:"affinity,omitempty"`

The Affinity type is defined as follows

type Affinity struct {
    PodAffinity     *PodAffinity  `json:"podAffinity,omitempty"`
    PodAntiAffinity *PodAntiAffinity  `json:"podAntiAffinity,omitempty"`
}

type PodAffinity struct {
    // If the affinity requirements specified by this field are not met at
    // scheduling time, the pod will not be scheduled onto the node.
    // If the affinity requirements specified by this field cease to be met
    // at some point during pod execution (e.g. due to a pod label update), the
    // system will try to eventually evict the pod from its node.
    // When there are multiple elements, the lists of nodes corresponding to each
    // PodAffinityTerm are intersected, i.e. all terms must be satisfied.
    RequiredDuringSchedulingRequiredDuringExecution []PodAffinityTerm  `json:"requiredDuringSchedulingRequiredDuringExecution,omitempty"`
    // If the affinity requirements specified by this field are not met at
    // scheduling time, the pod will not be scheduled onto the node.
    // If the affinity requirements specified by this field cease to be met
    // at some point during pod execution (e.g. due to a pod label update), the
    // system may or may not try to eventually evict the pod from its node.
    // When there are multiple elements, the lists of nodes corresponding to each
    // PodAffinityTerm are intersected, i.e. all terms must be satisfied.
    RequiredDuringSchedulingIgnoredDuringExecution  []PodAffinityTerm  `json:"requiredDuringSchedulingIgnoredDuringExecution,omitempty"`
    // The scheduler will prefer to schedule pods to nodes that satisfy
    // the affinity expressions specified by this field, but it may choose
    // a node that violates one or more of the expressions. The node that is
    // most preferred is the one with the greatest sum of weights, i.e.
    // for each node that meets all of the scheduling requirements (resource
    // request, RequiredDuringScheduling affinity expressions, etc.),
    // compute a sum by iterating through the elements of this field and adding
    // "weight" to the sum if the node matches the corresponding MatchExpressions; the
    // node(s) with the highest sum are the most preferred.
    PreferredDuringSchedulingIgnoredDuringExecution []WeightedPodAffinityTerm  `json:"preferredDuringSchedulingIgnoredDuringExecution,omitempty"`
}

type PodAntiAffinity struct {
    // If the anti-affinity requirements specified by this field are not met at
    // scheduling time, the pod will not be scheduled onto the node.
    // If the anti-affinity requirements specified by this field cease to be met
    // at some point during pod execution (e.g. due to a pod label update), the
    // system will try to eventually evict the pod from its node.
    // When there are multiple elements, the lists of nodes corresponding to each
    // PodAffinityTerm are intersected, i.e. all terms must be satisfied.
    RequiredDuringSchedulingRequiredDuringExecution []PodAffinityTerm  `json:"requiredDuringSchedulingRequiredDuringExecution,omitempty"`
    // If the anti-affinity requirements specified by this field are not met at
    // scheduling time, the pod will not be scheduled onto the node.
    // If the anti-affinity requirements specified by this field cease to be met
    // at some point during pod execution (e.g. due to a pod label update), the
    // system may or may not try to eventually evict the pod from its node.
    // When there are multiple elements, the lists of nodes corresponding to each
    // PodAffinityTerm are intersected, i.e. all terms must be satisfied.
    RequiredDuringSchedulingIgnoredDuringExecution  []PodAffinityTerm  `json:"requiredDuringSchedulingIgnoredDuringExecution,omitempty"`
    // The scheduler will prefer to schedule pods to nodes that satisfy
    // the anti-affinity expressions specified by this field, but it may choose
    // a node that violates one or more of the expressions. The node that is
    // most preferred is the one with the greatest sum of weights, i.e.
    // for each node that meets all of the scheduling requirements (resource
    // request, RequiredDuringScheduling anti-affinity expressions, etc.),
    // compute a sum by iterating through the elements of this field and adding
    // "weight" to the sum if the node matches the corresponding MatchExpressions; the
    // node(s) with the highest sum are the most preferred.
    PreferredDuringSchedulingIgnoredDuringExecution []WeightedPodAffinityTerm  `json:"preferredDuringSchedulingIgnoredDuringExecution,omitempty"`
}

type WeightedPodAffinityTerm struct {
    // weight is in the range 1-100
    Weight int  `json:"weight"`
    PodAffinityTerm PodAffinityTerm  `json:"podAffinityTerm"`
}

type PodAffinityTerm struct {
    LabelSelector *LabelSelector `json:"labelSelector,omitempty"`
    // namespaces specifies which namespaces the LabelSelector applies to (matches against);
    // nil list means "this pod's namespace," empty list means "all namespaces"
    // The json tag here is not "omitempty" since we need to distinguish nil and empty.
    // See https://golang.org/pkg/encoding/json/#Marshal for more details.
    Namespaces []api.Namespace  `json:"namespaces"`
    // empty topology key is interpreted by the scheduler as "all topologies"
    TopologyKey string `json:"topologyKey,omitempty"`
}

Note that the Namespaces field is necessary because normal LabelSelector is scoped to the pod's namespace, but we need to be able to match against all pods globally.

To explain how this API works, let's say that the PodSpec of a pod P has an Affinity that is configured as follows (note that we've omitted and collapsed some fields for simplicity, but this should sufficiently convey the intent of the design):

PodAffinity {
	RequiredDuringScheduling: {{LabelSelector: P1, TopologyKey: "node"}},
	PreferredDuringScheduling: {{LabelSelector: P2, TopologyKey: "zone"}},
}
PodAntiAffinity {
	RequiredDuringScheduling: {{LabelSelector: P3, TopologyKey: "rack"}},
	PreferredDuringScheduling: {{LabelSelector: P4, TopologyKey: "power"}}
}

Then when scheduling pod P, the scheduler:

  • Can only schedule P onto nodes that are running pods that satisfy P1. (Assumes all nodes have a label with key node and value specifying their node name.)
  • Should try to schedule P onto zones that are running pods that satisfy P2. (Assumes all nodes have a label with key zone and value specifying their zone.)
  • Cannot schedule P onto any racks that are running pods that satisfy P3. (Assumes all nodes have a label with key rack and value specifying their rack name.)
  • Should try not to schedule P onto any power domains that are running pods that satisfy P4. (Assumes all nodes have a label with key power and value specifying their power domain.)

When RequiredDuringScheduling has multiple elements, the requirements are ANDed. For PreferredDuringScheduling the weights are added for the terms that are satisfied for each node, and the node(s) with the highest weight(s) are the most preferred.

In reality there are two variants of RequiredDuringScheduling: one suffixed with RequiredDuringExecution and one suffixed with IgnoredDuringExecution. For the first variant, if the affinity/anti-affinity ceases to be met at some point during pod execution (e.g. due to a pod label update), the system will try to eventually evict the pod from its node. In the second variant, the system may or may not try to eventually evict the pod from its node.
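
Spelled out with the actual field names (using the IgnoredDuringExecution variants), the corresponding fragment of P's PodSpec would look roughly like this, where P1 through P4 stand for real label selectors and the weights are arbitrary values from the 1-100 range:

Affinity: &Affinity{
    PodAffinity: &PodAffinity{
        RequiredDuringSchedulingIgnoredDuringExecution: []PodAffinityTerm{
            {LabelSelector: P1, TopologyKey: "node"},
        },
        PreferredDuringSchedulingIgnoredDuringExecution: []WeightedPodAffinityTerm{
            // Weight chosen arbitrarily from the 1-100 range.
            {Weight: 50, PodAffinityTerm: PodAffinityTerm{LabelSelector: P2, TopologyKey: "zone"}},
        },
    },
    PodAntiAffinity: &PodAntiAffinity{
        RequiredDuringSchedulingIgnoredDuringExecution: []PodAffinityTerm{
            {LabelSelector: P3, TopologyKey: "rack"},
        },
        PreferredDuringSchedulingIgnoredDuringExecution: []WeightedPodAffinityTerm{
            {Weight: 50, PodAffinityTerm: PodAffinityTerm{LabelSelector: P4, TopologyKey: "power"}},
        },
    },
}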

A comment on symmetry

One thing that makes affinity and anti-affinity tricky is symmetry.

Imagine a cluster that is running pods from two services, S1 and S2. Imagine that the pods of S1 have a RequiredDuringScheduling anti-affinity rule "do not run me on nodes that are running pods from S2." It is not sufficient just to check that there are no S2 pods on a node when you are scheduling an S1 pod. You also need to ensure that there are no S1 pods on a node when you are scheduling an S2 pod, even though the S2 pod does not have any anti-affinity rules. Otherwise if an S1 pod schedules before an S2 pod, the S1 pod's RequiredDuringScheduling anti-affinity rule can be violated by a later-arriving S2 pod. More specifically, if S1 has the aforementioned RequiredDuringScheduling anti-affinity rule, then:

  • if a node is empty, you can schedule S1 or S2 onto the node
  • if a node is running S1 (S2), you cannot schedule S2 (S1) onto the node

Note that while RequiredDuringScheduling anti-affinity is symmetric, RequiredDuringScheduling affinity is not symmetric. That is, if the pods of S1 have a RequiredDuringScheduling affinity rule "run me on nodes that are running pods from S2," it is not required that there be S1 pods on a node in order to schedule an S2 pod onto that node. More specifically, if S1 has the aforementioned RequiredDuringScheduling affinity rule, then:

  • if a node is empty, you can schedule S2 onto the node
  • if a node is empty, you cannot schedule S1 onto the node
  • if a node is running S2, you can schedule S1 onto the node
  • if a node is running S1+S2 and S1 terminates, S2 continues running
  • if a node is running S1+S2 and S2 terminates, the system terminates S1 (eventually)

However, although RequiredDuringScheduling affinity is not symmetric, there is an implicit PreferredDuringScheduling affinity rule corresponding to every RequiredDuringScheduling affinity rule: if the pods of S1 have a RequiredDuringScheduling affinity rule "run me on nodes that are running pods from S2" then it is not required that there be S1 pods on a node in order to schedule an S2 pod onto that node, but it would be better if there are.

PreferredDuringScheduling is symmetric. If the pods of S1 had a PreferredDuringScheduling anti-affinity rule "try not to run me on nodes that are running pods from S2" then we would prefer to keep an S1 pod that we are scheduling off of nodes that are running S2 pods, and also to keep an S2 pod that we are scheduling off of nodes that are running S1 pods. Likewise if the pods of S1 had a PreferredDuringScheduling affinity rule "try to run me on nodes that are running pods from S2" then we would prefer to place an S1 pod that we are scheduling onto a node that is running an S2 pod, and also to place an S2 pod that we are scheduling onto a node that is running an S1 pod.
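
As a rough sketch of what the symmetry check looks like in code (assuming the types above plus simplified Pod/Node types with Labels, and hypothetical helpers selectorMatches and sameTopologyDomain; the Namespaces field is ignored for brevity), a scheduler considering a node for a new pod must also consult the anti-affinity of the pods already running there:

// violatesExistingAntiAffinity returns true if placing candidate on node would
// violate a RequiredDuringScheduling anti-affinity term of some pod that is
// already running -- the "reverse direction" implied by symmetry. The
// RequiredDuringSchedulingRequiredDuringExecution list would be checked the same way.
func violatesExistingAntiAffinity(candidate *Pod, node *Node, runningPods []*Pod, nodeOf func(*Pod) *Node) bool {
    for _, existing := range runningPods {
        if existing.Spec.Affinity == nil || existing.Spec.Affinity.PodAntiAffinity == nil {
            continue
        }
        for _, term := range existing.Spec.Affinity.PodAntiAffinity.RequiredDuringSchedulingIgnoredDuringExecution {
            // The existing pod objects to sharing a topology domain with pods matching
            // term.LabelSelector; check whether the candidate is such a pod and whether
            // it would land in the same domain.
            if selectorMatches(term.LabelSelector, candidate.Labels) &&
                sameTopologyDomain(node, nodeOf(existing), term.TopologyKey) {
                return true
            }
        }
    }
    return false
}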

Examples

Here are some examples of how you would express various affinity and anti-affinity rules using the API we described.

Affinity

In the examples below, the word "put" is intentionally ambiguous; the rules are the same whether "put" means "must put" (RequiredDuringScheduling) or "try to put" (PreferredDuringScheduling)--all that changes is which field the rule goes into. Also, we only discuss scheduling time and ignore execution time. Finally, some of the examples use "zone" and some use "node," just to make the examples more interesting; any of the examples with "zone" will also work for "node" if you change the TopologyKey, and vice-versa.

  • Put the pod in zone Z: Tricked you! It is not possible to express this using the API described here. For this you should use node affinity.

  • Put the pod in a zone that is running at least one pod from service S: {LabelSelector: <selector that matches S's pods>, TopologyKey: "zone"}

  • Put the pod on a node that is already running a pod that requires a license for software package P: Assuming pods that require a license for software package P have a label {key=license, value=P}: {LabelSelector: "license" In "P", TopologyKey: "node"}

  • Put this pod in the same zone as other pods from its same service: Assuming pods from this pod's service have some label {key=service, value=S}: {LabelSelector: "service" In "S", TopologyKey: "zone"}
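
For instance, the last rule above maps onto the API as roughly the following term (a sketch; the MatchExpressions/LabelSelectorRequirement form is the usual way to spell "service" In "S", but the exact selector type names are assumptions here):

// "Put this pod in the same zone as other pods labeled service=S."
PodAffinityTerm{
    LabelSelector: &LabelSelector{
        MatchExpressions: []LabelSelectorRequirement{
            {Key: "service", Operator: "In", Values: []string{"S"}},
        },
    },
    TopologyKey: "zone",
}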

This last example (co-locating a pod with other pods from its own service) illustrates a small issue with this API when it is used with a scheduler that processes the pending queue one pod at a time, like the current Kubernetes scheduler. The RequiredDuringScheduling rule {LabelSelector: "service" In "S", TopologyKey: "zone"} only "works" once one pod from service S has been scheduled. But if all pods in service S have this RequiredDuringScheduling rule in their PodSpec, then the rule will block the first pod of the service from ever scheduling, since it is only allowed to run in a zone with another pod from the same service, and that in turn means none of the pods of the service will be able to schedule. This problem only applies to RequiredDuringScheduling affinity, not PreferredDuringScheduling affinity or any variant of anti-affinity. There are at least three ways to solve this problem:

  • short-term: have the scheduler use a rule that if the RequiredDuringScheduling affinity requirement matches a pod's own labels, and there are no other such pods anywhere, then disregard the requirement (a sketch of this rule appears after this list). This approach has a corner case when running parallel schedulers that are allowed to schedule pods from the same replicated set (e.g. a single PodTemplate): both schedulers may try to schedule pods from the set at the same time and think there are no other pods from that set scheduled yet (e.g. they are trying to schedule the first two pods from the set), but by the time the second binding is committed, the first one has already been committed, leaving you with two pods running that do not respect their RequiredDuringScheduling affinity. There is no simple way to detect this "conflict" at scheduling time given the current system implementation.
  • longer-term: when a controller creates pods from a PodTemplate, for exactly one of those pods, it should omit any RequiredDuringScheduling affinity rules that select the pods of that PodTemplate.
  • very long-term/speculative: controllers could present the scheduler with a group of pods from the same PodTemplate as a single unit. This is similar to the first approach described above but avoids the corner case. No special logic is needed in the controllers. Moreover, this would allow the scheduler to do proper gang scheduling since it could receive an entire gang simultaneously as a single unit.
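
A minimal sketch of the short-term rule from the first bullet above, assuming a hypothetical selectorMatches helper and ignoring namespace scoping and the parallel-scheduler caveat:

// disregardTerm returns true if a RequiredDuringScheduling affinity term should be
// ignored for this pod: the term selects the pod's own labels and no pod anywhere in
// the cluster currently matches it, so enforcing it would deadlock the first pod.
func disregardTerm(pod *Pod, term PodAffinityTerm, scheduledPods []*Pod) bool {
    if !selectorMatches(term.LabelSelector, pod.Labels) {
        return false // the term does not select the pod itself
    }
    for _, p := range scheduledPods {
        if selectorMatches(term.LabelSelector, p.Labels) {
            return false // a matching pod already exists; enforce the term normally
        }
    }
    return true
}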

Anti-affinity

As with the affinity examples, the examples here can be RequiredDuringScheduling or PreferredDuringScheduling anti-affinity, i.e. "don't" can be interpreted as "must not" or as "try not to" depending on whether the rule appears in RequiredDuringScheduling or PreferredDuringScheduling.

  • Spread the pods of this service S across nodes and zones: {{LabelSelector: <selector that matches S's pods>, TopologyKey: "node"}, {LabelSelector: <selector that matches S's pods>, TopologyKey: "zone"}} (note that if this is specified as a RequiredDuringScheduling anti-affinity, then the first clause is redundant, since the second clause will force the scheduler to not put more than one pod from S in the same zone, and thus by definition it will not put more than one pod from S on the same node, assuming each node is in one zone. This rule is more useful as PreferredDuringScheduling anti-affinity, e.g. one might expect it to be common in Cluster Federation clusters.)

  • Don't co-locate pods of this service with pods from service "evilService": {LabelSelector: selector that matches evilService's pods, TopologyKey: "node"}

  • Don't co-locate pods of this service with any other pods including pods of this service: {LabelSelector: empty, TopologyKey: "node"}

  • Don't co-locate pods of this service with any other pods except other pods of this service: Assuming pods from the service have some label {key=service, value=S}: {LabelSelector: "service" NotIn "S", TopologyKey: "node"} Note that this works because "service" NotIn "S" matches pods with no key "service" as well as pods with key "service" and a corresponding value that is not "S."
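
As a concrete illustration (using the same assumed selector types as earlier), the "exclusive node" rule and the last rule above would be written roughly as:

// "Don't co-locate me with any other pods": an empty selector matches every pod.
PodAffinityTerm{
    LabelSelector: &LabelSelector{},
    TopologyKey:   "node",
}

// "Don't co-locate me with any pods except other pods of my own service S."
PodAffinityTerm{
    LabelSelector: &LabelSelector{
        MatchExpressions: []LabelSelectorRequirement{
            {Key: "service", Operator: "NotIn", Values: []string{"S"}},
        },
    },
    TopologyKey: "node",
}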

Algorithm

An example algorithm a scheduler might use to implement affinity and anti-affinity rules is as follows. There are certainly more efficient ways to do it; this is just intended to demonstrate that the API's semantics are implementable.

Terminology definition: We say a pod P is "feasible" on a node N if P meets all of the scheduler predicates for scheduling P onto N. Note that this algorithm is only concerned with scheduling time, and thus makes no distinction between RequiredDuringExecution and IgnoredDuringExecution.

To make the algorithm slightly more readable, we use the term "HardPodAffinity" as shorthand for "RequiredDuringScheduling pod affinity" and "SoftPodAffinity" as shorthand for "PreferredDuringScheduling pod affinity." Analogously for "HardPodAntiAffinity" and "SoftPodAntiAffinity."

** TODO: Update this algorithm to take weight for SoftPod{Affinity,AntiAffinity} into account; currently it assumes all terms have weight 1. **

Z = the pod you are scheduling
{N} = the set of all nodes in the system  // this algorithm will reduce it to the set of all nodes feasible for Z
// Step 1a: Reduce {N} to the set of nodes satisfying Z's HardPodAffinity in the "forward" direction
X = {Z's PodSpec's HardPodAffinity}
foreach element H of {X}
	P = {all pods in the system that match H.LabelSelector}
	M map[string]int  // topology value -> number of pods running on nodes with that topology value
	foreach pod Q of {P}
		L = {labels of the node on which Q is running, represented as a map from label key to label value}
		M[L[H.TopologyKey]]++
	{N} = {N} intersect {all nodes of N with label [key=H.TopologyKey, value=any K such that M[K]>0]}
// Step 1b: Further reduce {N} to the set of nodes also satisfying Z's HardPodAntiAffinity
// This step is identical to Step 1a except the M[K] > 0 comparison becomes M[K] == 0
X = {Z's PodSpec's HardPodAntiAffinity}
foreach element H of {X}
	P = {all pods in the system that match H.LabelSelector}
	M map[string]int  // topology value -> number of pods running on nodes with that topology value
	foreach pod Q of {P}
		L = {labels of the node on which Q is running, represented as a map from label key to label value}
		M[L[H.TopologyKey]]++
	{N} = {N} intersect {all nodes of N with label [key=H.TopologyKey, value=any K such that M[K]==0]}
// Step 2: Further reduce {N} by enforcing symmetry requirement for other pods' HardPodAntiAffinity
foreach node A of {N}
	foreach pod B that is bound to A
		if any of B's HardPodAntiAffinity are currently satisfied but would be violated if Z runs on A, then remove A from {N}
// At this point, all nodes in {N} are feasible for Z.
// Step 3a: Soft version of Step 1a
Y map[string]int  // node -> number of Z's soft affinity/anti-affinity preferences satisfied by that node
Initialize the keys of Y to all of the nodes in {N}, and the values to 0
X = {Z's PodSpec's SoftPodAffinity}
Repeat Step 1a except replace the last line with "foreach node W of {N} having label [key=H.TopologyKey, value=any K such that M[K]>0], Y[W]++"
// Step 3b: Soft version of Step 1b
X = {Z's PodSpec's SoftPodAntiAffinity}
Repeat Step 1b except replace the last line with "foreach node W of {N} not having label [key=H.TopologyKey, value=any K such that M[K]>0], Y[W]++"
// Step 4: Symmetric soft, plus treat forward direction of hard affinity as a soft
foreach node A of {N}
	foreach pod B that is bound to A
		increment Y[A] by the number of B's SoftPodAffinity, SoftPodAntiAffinity, and HardPodAffinity that are satisfied if Z runs on A but are not satisfied if Z does not run on A
// We're done. {N} contains all of the nodes that satisfy the affinity/anti-affinity rules, and Y is
// a map whose keys are the elements of {N} and whose values are how "good" a choice each node is for Z with
// respect to the explicit and implicit affinity/anti-affinity rules (larger number is better).
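
To make Step 1a concrete, here is a minimal Go sketch of the hard-affinity filter, assuming the API types above and hypothetical helpers hardAffinityTerms, selectorMatches, and nodeOf. Step 1b is identical except that the M[K] > 0 test becomes M[K] == 0, and Steps 3a/3b increment a per-node score instead of filtering (adding the term's Weight once the TODO above is addressed):

// filterByHardAffinity implements Step 1a: keep only the nodes whose topology
// domain (as identified by each term's TopologyKey) already runs at least one
// pod matching that term's LabelSelector.
func filterByHardAffinity(Z *Pod, nodes []*Node, allPods []*Pod, nodeOf func(*Pod) *Node) []*Node {
    for _, term := range hardAffinityTerms(Z) { // Z's RequiredDuringScheduling affinity terms
        // M: topology value -> number of matching pods running in that domain.
        M := map[string]int{}
        for _, p := range allPods {
            if selectorMatches(term.LabelSelector, p.Labels) {
                M[nodeOf(p).Labels[term.TopologyKey]]++
            }
        }
        // Keep only the nodes whose topology value hosts at least one matching pod.
        var kept []*Node
        for _, n := range nodes {
            if M[n.Labels[term.TopologyKey]] > 0 {
                kept = append(kept, n)
            }
        }
        nodes = kept
    }
    return nodes
}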

Special considerations for RequiredDuringScheduling anti-affinity

In this section we discuss three issues with RequiredDuringScheduling anti-affinity: Denial of Service (DoS), co-existing with daemons, and determining which pod(s) to kill. See issue #18265 for additional discussion of these topics.

Denial of Service

Without proper safeguards, a pod using RequiredDuringScheduling anti-affinity can intentionally or unintentionally cause various problems for other pods, due to the symmetry property of anti-affinity.

The most notable danger is the ability for a pod that arrives first to some topology domain, to block all other pods from scheduling there by stating a conflict with all other pods. The standard approach to preventing resource hogging is quota, but simple resource quota cannot prevent this scenario because the pod may request very little resources. Addressing this using quota requires a quota scheme that charges based on "opportunity cost" rather than based simply on requested resources. For example, when handling a pod that expresses RequiredDuringScheduling anti-affinity for all pods using a "node" TopologyKey (i.e. exclusive access to a node), it could charge for the resources of the average or largest node in the cluster. Likewise if a pod expresses RequiredDuringScheduling anti-affinity for all pods using a "cluster" TopologyKey, it could charge for the resources of the entire cluster. If node affinity is used to constrain the pod to a particular topology domain, then the admission-time quota charging should take that into account (e.g. not charge for the average/largest machine if the PodSpec constrains the pod to a specific machine with a known size; instead charge for the size of the actual machine that the pod was constrained to). In all cases once the pod is scheduled, the quota charge should be adjusted down to the actual amount of resources allocated (e.g. the size of the actual machine that was assigned, not the average/largest). If a cluster administrator wants to overcommit quota, for example to allow more than N pods across all users to request exclusive node access in a cluster with N nodes, then a priority/preemption scheme should be added so that the most important pods run when resource demand exceeds supply.
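
As a very rough sketch of the opportunity-cost idea for the "exclusive node" case (all names here are illustrative, not part of the proposed API; a real implementation would live in quota admission):

// quotaChargeAtAdmission returns the amount to charge against the user's quota when
// the pod is admitted. A pod that demands exclusive access to a node is charged as
// if it consumed a whole node, not just its own (possibly tiny) resource request;
// the charge is adjusted down after scheduling to the size of the node actually assigned.
func quotaChargeAtAdmission(pod *Pod) ResourceList {
    if requestsExclusiveNode(pod) { // e.g. empty-selector anti-affinity with TopologyKey "node"
        return largestNodeResources()
    }
    return podResourceRequests(pod)
}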

An alternative approach, which is a bit of a blunt hammer, is to use a capability mechanism to restrict use of RequiredDuringScheduling anti-affinity to trusted users. A more complex capability mechanism might only restrict it when using a non-"node" TopologyKey.

Our initial implementation will use a variant of the capability approach, which requires no configuration: we will simply reject ALL requests, regardless of user, that specify "all namespaces" with non-"node" TopologyKey for RequiredDuringScheduling anti-affinity. This allows the "exclusive node" use case while prohibiting the more dangerous ones.
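
A sketch of that check, using a hypothetical accessor that collects the pod's RequiredDuringScheduling anti-affinity terms (recall that a nil Namespaces list means "this pod's namespace" while an empty, non-nil list means "all namespaces"):

// rejectsDangerousAntiAffinity returns true if the pod should be rejected at
// admission time: it has a RequiredDuringScheduling anti-affinity term that matches
// pods in all namespaces with a topology broader than a single node, which could be
// used to block all other pods from an entire zone or cluster.
func rejectsDangerousAntiAffinity(pod *Pod) bool {
    for _, term := range hardAntiAffinityTerms(pod) { // covers both Required variants
        allNamespaces := term.Namespaces != nil && len(term.Namespaces) == 0
        if allNamespaces && term.TopologyKey != "node" {
            return true
        }
    }
    return false
}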

A weaker variant of the problem described in the previous paragraph is a pod's ability to use anti-affinity to degrade the scheduling quality of another pod, but not completely block it from scheduling. For example, a set of pods S1 could use node affinity to request to schedule onto a set of nodes that some other set of pods S2 prefers to schedule onto. If the pods in S1 have RequiredDuringScheduling or even PreferredDuringScheduling pod anti-affinity for S2, then due to the symmetry property of anti-affinity, they can prevent the pods in S2 from scheduling onto their preferred nodes if they arrive first (for sure in the RequiredDuringScheduling case, and with some probability that depends on the weighting scheme for the PreferredDuringScheduling case). A very sophisticated priority and/or quota scheme could mitigate this, or alternatively we could eliminate the symmetry property of the implementation of PreferredDuringScheduling anti-affinity. Then only RequiredDuringScheduling anti-affinity could affect scheduling quality of another pod, and as we described in the previous paragraph, such pods could be charged quota for the full topology domain, thereby reducing the potential for abuse.

We won't try to address this issue in our initial implementation; we can consider one of the approaches mentioned above if it turns out to be a problem in practice.

Co-existing with daemons

A cluster administrator may wish to allow pods that express anti-affinity against all pods to nonetheless co-exist with system daemon pods, such as those run by DaemonSet. In principle, we would like the specification for RequiredDuringScheduling inter-pod anti-affinity to allow "toleration" of one or more other pods (see #18263 for a more detailed explanation of the toleration concept). There are at least two ways to accomplish this:

  • Scheduler special-cases the namespace(s) where daemons live, in the sense that it ignores pods in those namespaces when it is determining feasibility for pods with anti-affinity. The name(s) of the special namespace(s) could be a scheduler configuration parameter, and default to kube-system. We could allow multiple namespaces to be specified if we want cluster admins to be able to give their own daemons this special power (they would add their namespace to the list in the scheduler configuration). And of course this would be symmetric, so daemons could schedule onto a node that is already running a pod with anti-affinity.

  • We could add an explicit "toleration" concept/field to allow the user to specify namespaces that are excluded when they use RequiredDuringScheduling anti-affinity, and use an admission controller/defaulter to ensure these namespaces are always listed.

Our initial implementation will use the first approach.
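
Under the first approach, the code that gathers existing pods for anti-affinity checks would simply skip the configured namespaces, roughly as follows (a sketch; ignoredNamespaces would come from scheduler configuration and default to kube-system):

// podsConsideredForAntiAffinity filters out pods in namespaces that anti-affinity
// rules are configured to tolerate (e.g. kube-system daemons), so that a pod
// requesting exclusive node access can still share a node with those daemons.
func podsConsideredForAntiAffinity(allPods []*Pod, ignoredNamespaces map[string]bool) []*Pod {
    var out []*Pod
    for _, p := range allPods {
        if ignoredNamespaces[p.Namespace] {
            continue // daemons in special namespaces are not counted
        }
        out = append(out, p)
    }
    return out
}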

Determining which pod(s) to kill (for RequiredDuringSchedulingRequiredDuringExecution)

Because anti-affinity is symmetric, in the case of RequiredDuringSchedulingRequiredDuringExecution anti-affinity, the system must determine which pod(s) to kill when a pod's labels are updated in such a way as to cause them to conflict with one or more other pods' RequiredDuringSchedulingRequiredDuringExecution anti-affinity rules. In the absence of a priority/preemption scheme, our rule will be that the pod with the anti-affinity rule that becomes violated should be the one killed. A pod should only specify constraints that apply to namespaces it trusts not to do malicious things. Once we have priority/preemption, we can change the rule to say that the lowest-priority pod(s) are killed until all RequiredDuringSchedulingRequiredDuringExecution anti-affinity is satisfied.

Special considerations for RequiredDuringScheduling affinity

The DoS potential of RequiredDuringScheduling anti-affinity stemmed from its symmetry: if a pod P requests anti-affinity, P cannot schedule onto a node with conflicting pods, and pods that conflict with P cannot schedule onto the node once P has been scheduled there. The design we have described says that the symmetry property for RequiredDuringScheduling affinity is weaker: if a pod P says it can only schedule onto nodes running pod Q, this does not mean Q can only run on a node that is running P, but the scheduler will try to schedule Q onto a node that is running P (i.e. treats the reverse direction as preferred). This raises the same scheduling quality concern as we mentioned at the end of the Denial of Service section above, and can be addressed in similar ways.

The nature of affinity (as opposed to anti-affinity) means that there is no issue of determining which pod(s) to kill when a pod's labels change: it is obviously the pod with the affinity rule that becomes violated that must be killed. (Killing a pod never "fixes" violation of an affinity rule; it can only "fix" violation of an anti-affinity rule.) However, affinity does have a different question related to killing: how long should the system wait before declaring that RequiredDuringSchedulingRequiredDuringExecution affinity is no longer met at runtime? For example, if a pod P has such an affinity for a pod Q and pod Q is temporarily killed so that it can be updated to a new binary version, should that trigger killing of P? More generally, how long should the system wait before declaring that P's affinity is violated? (Of course affinity is expressed in terms of label selectors, not for a specific pod, but the scenario is easier to describe using a concrete pod.) This is closely related to the concept of forgiveness (see issue #1574). In theory we could make this time duration be configurable by the user on a per-pod basis, but for the first version of this feature we will make it a configurable property of whichever component does the killing and that applies across all pods using the feature. Making it configurable by the user would require a nontrivial change to the API syntax (since the field would only apply to RequiredDuringSchedulingRequiredDuringExecution affinity).

Implementation plan

  1. Add the Affinity field to PodSpec and the PodAffinity and PodAntiAffinity types to the API along with all of their descendant types.
  2. Implement a scheduler predicate that takes RequiredDuringSchedulingIgnoredDuringExecution affinity and anti-affinity into account. Include a workaround for the issue described at the end of the Affinity section of the Examples section (can't schedule first pod).
  3. Implement a scheduler priority function that takes PreferredDuringSchedulingIgnoredDuringExecution affinity and anti-affinity into account.
  4. Implement admission controller that rejects requests that specify "all namespaces" with non-"node" TopologyKey for RequiredDuringScheduling anti-affinity. This admission controller should be enabled by default.
  5. Implement the recommended solution to the "co-existing with daemons" issue.
  6. At this point, the feature can be deployed.
  7. Add the RequiredDuringSchedulingRequiredDuringExecution field to affinity and anti-affinity, and make sure the pieces of the system already implemented for RequiredDuringSchedulingIgnoredDuringExecution also take RequiredDuringSchedulingRequiredDuringExecution into account (e.g. the scheduler predicate, the quota mechanism, the "co-existing with daemons" solution).
  8. Add RequiredDuringSchedulingRequiredDuringExecution for "node" TopologyKey to Kubelet's admission decision.
  9. Implement code in Kubelet or the controllers that evicts a pod that no longer satisfies RequiredDuringSchedulingRequiredDuringExecution. If Kubelet, then only for "node" TopologyKey; if controller, then potentially for all TopologyKeys (see this comment). Do so in a way that addresses the "determining which pod(s) to kill" issue.

We assume Kubelet publishes labels describing the node's membership in all of the relevant scheduling domains (e.g. node name, rack name, availability zone name, etc.). See #9044.

Backward compatibility

Old versions of the scheduler will ignore Affinity.

Users should not start using Affinity until the full implementation has been in Kubelet and the master for enough binary versions that we feel comfortable that we will not need to roll back either Kubelet or master to a version that does not support them. Longer-term we will use a programmatic approach to enforcing this (#4855).

Extensibility

The design described here is the result of careful analysis of use cases, a decade of experience with Borg at Google, and a review of similar features in other open-source container orchestration systems. We believe that it properly balances the goal of expressiveness against the goals of simplicity and efficiency of implementation. However, we recognize that use cases may arise in the future that cannot be expressed using the syntax described here. Although we are not implementing an affinity-specific extensibility mechanism for a variety of reasons (simplicity of the codebase, simplicity of cluster deployment, desire for Kubernetes users to get a consistent experience, etc.), the regular Kubernetes annotation mechanism can be used to add or replace affinity rules. The way this would work is:

  1. Define one or more annotations to describe the new affinity rule(s)
  2. User (or an admission controller) attaches the annotation(s) to pods to request the desired scheduling behavior. If the new rule(s) replace one or more fields of Affinity then the user would omit those fields from Affinity; if they are additional rules, then the user would fill in Affinity as well as the annotation(s).
  3. Scheduler takes the annotation(s) into account when scheduling.

If some particular new syntax becomes popular, we would consider upstreaming it by integrating it into the standard Affinity.

Future work and non-work

One can imagine that in the anti-affinity RequiredDuringScheduling case one might want to associate a number with the rule, for example "do not allow this pod to share a rack with more than three other pods (in total, or from the same service as the pod)." We could allow this to be specified by adding an integer Limit to PodAffinityTerm just for the RequiredDuringScheduling case. However, this flexibility complicates the system and we do not intend to implement it.

It is likely that the specification and implementation of pod anti-affinity can be unified with taints and tolerations, and likewise that the specification and implementation of pod affinity can be unified with node affinity. The basic idea is that pod labels would be "inherited" by the node, and pods would only be able to specify affinity and anti-affinity for a node's labels. Our main motivation for not unifying taints and tolerations with pod anti-affinity is that we foresee taints and tolerations as being a concept that only cluster administrators need to understand (and indeed in some setups taints and tolerations wouldn't even be directly manipulated by a cluster administrator, instead they would only be set by an admission controller that is implementing the administrator's high-level policy about different classes of special machines and the users who belong to the groups allowed to access them). Moreover, the concept of nodes "inheriting" labels from pods seems complicated; it seems conceptually simpler to separate rules involving relatively static properties of nodes from rules involving which other pods are running on the same node or larger topology domain.

Data/storage affinity is related to pod affinity, and is likely to draw on some of the ideas we have used for pod affinity. Today, data/storage affinity is expressed using node affinity, on the assumption that the pod knows which node(s) store(s) the data it wants. But a more flexible approach would allow the pod to name the data rather than the node.

Related issues

The review for this proposal is in #18265.

The topic of affinity/anti-affinity has generated a lot of discussion. The main issue is #367 but #14484/#14485, #9560, #11369, #14543, #11707, #3945, #341, #1965, and #2906 all have additional discussion and use cases.

As the examples in this document have demonstrated, topological affinity is very useful in clusters that are spread across availability zones, e.g. to co-locate pods of a service in the same zone to avoid a wide-area network hop, or to spread pods across zones for failure tolerance. #17059, #13056, #13063, and #4235 are relevant.

Issue #15675 describes connection affinity, which is vaguely related.

This proposal is to satisfy #14816.

Related work

** TODO: cite references **
