Status: Design & Implementation in progress.
Contact @HaiyangDING for questions & suggestions.
In the current Kubernetes design, there is only one default scheduler in a Kubernetes cluster. However, it is common for multiple types of workload, such as traditional batch, DAG batch, streaming and user-facing production services, to run in the same cluster, and they need to be scheduled in different ways. For example, in Omega, batch workloads and service workloads are scheduled by two types of schedulers: the batch workload is scheduled by a scheduler that looks at the current usage of the cluster to improve the resource usage rate, and the service workload is scheduled by another one that considers the reserved resources in the cluster and many other constraints, since its performance must meet higher SLOs. Mesos has done great work to support multiple schedulers by building a two-level scheduling structure.
This proposal describes how Kubernetes is going to support multiple schedulers so that users can run their own scheduler(s) to get customized scheduling behavior as needed. As previously discussed in #11793, #9920 and #11470, the design of the multi-scheduler mechanism should be generic, and it includes adding a scheduler name annotation to separate the pods. Note that the proposal does not address how the scheduler name annotation gets set, although it is reasonable to anticipate that it would be set by a component such as an admission controller or initializer, as this document currently assumes.
Before going into the details of this proposal, here are several ways to extend the scheduler:
- Write your own scheduler and run it alongside the Kubernetes native scheduler. This is the approach detailed in this proposal
- Use the callout approach such as the one implemented in #13580
- Recompile the scheduler with a new policy
- Restart the scheduler with a new scheduler policy config file
- Or maybe, in the future, dynamically link a new policy into the running scheduler
- Separating the pods. Each pod should be scheduled by only one scheduler. As for implementation, a pod should have an additional field that says which scheduler should schedule it. In addition, each scheduler, including the default one, should have its own logic for adding unscheduled pods to its to-be-scheduled pod queue. Details are explained in later sections.
- Dealing with conflicts. Different schedulers are essentially separate processes. When all schedulers try to schedule their pods onto the nodes, there might be conflicts.

  One example of a conflict is resource racing. Suppose there is a `pod1` scheduled by `my-scheduler` that requests 1 CPU, and a `pod2` scheduled by `kube-scheduler` (the Kubernetes native scheduler, acting as the default scheduler) that requests 2 CPUs, while `node-a` has only 2.5 free CPUs. If both schedulers try to put their pods on `node-a`, one of the pods will eventually fail when the Kubelet on `node-a` performs the create action, because of insufficient CPU resources.

  This conflict is complex to deal with in the api-server and etcd. Our current solution is to let the Kubelet do the conflict check; if a conflict happens, the affected pods are put back to the scheduler and wait to be scheduled again. Implementation details are in later sections.
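The resource race above can be illustrated with a minimal sketch. The `Pod` and `Kubelet` classes and the `bind_all` helper are hypothetical stand-ins invented for this illustration, not Kubernetes APIs; the sketch only assumes that the Kubelet admits a pod when its CPU request still fits and otherwise hands it back for rescheduling.

```python
from dataclasses import dataclass, field


@dataclass
class Pod:
    name: str
    cpu_request: float  # CPUs requested by the pod


@dataclass
class Kubelet:
    """Hypothetical stand-in for the Kubelet's admission check on one node."""
    node_name: str
    free_cpu: float
    running: list = field(default_factory=list)

    def admit(self, pod: Pod) -> bool:
        # Accept the pod only if its request still fits on the node.
        if pod.cpu_request > self.free_cpu:
            return False
        self.free_cpu -= pod.cpu_request
        self.running.append(pod.name)
        return True


def bind_all(kubelet: Kubelet, pods: list) -> list:
    """Try to start each pod; return the pods that must be rescheduled."""
    return [pod for pod in pods if not kubelet.admit(pod)]


node_a = Kubelet("node-a", free_cpu=2.5)
pod1 = Pod("pod1", cpu_request=1.0)  # placed by my-scheduler
pod2 = Pod("pod2", cpu_request=2.0)  # placed by kube-scheduler

# Both schedulers saw 2.5 free CPUs and independently chose node-a.
requeue = bind_all(node_a, [pod1, pod2])
print([p.name for p in requeue])  # pods handed back to their scheduler
```

Whichever pod reaches the Kubelet second is the one rejected; in this run `pod1` arrives first, so `pod2` is returned for rescheduling.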
We definitely want the multi-scheduler design to be a generic mechanism. The following lists the changes we want to make in the first step.
- Add an annotation in the pod template: `scheduler.alpha.kubernetes.io/name: scheduler-name`. This is used to separate pods between schedulers. `scheduler-name` should match the `scheduler-name` of one of the schedulers.
- Add a `scheduler-name` to each scheduler. It is either hardcoded or passed as a command-line argument. The Kubernetes native scheduler (now the `kube-scheduler` process) would have the name `kube-scheduler`.
- The `scheduler-name` plays an important part in separating the pods between different schedulers. Pods are statically dispatched to different schedulers based on the `scheduler.alpha.kubernetes.io/name: scheduler-name` annotation, and there should not be any conflicts between different schedulers handling their pods, i.e. one pod must NOT be claimed by more than one scheduler. To be specific, a scheduler can add a pod to its queue if and only if:
  - The pod has no nodeName, AND
  - The `scheduler-name` specified in the pod's annotation `scheduler.alpha.kubernetes.io/name: scheduler-name` matches the `scheduler-name` of the scheduler.

  The only exception is the default scheduler. Any pod that has no `scheduler.alpha.kubernetes.io/name: scheduler-name` annotation is assumed to be handled by the "default scheduler". In the first version of the multi-scheduler feature, the default scheduler would be the Kubernetes built-in scheduler, whose `scheduler-name` is `kube-scheduler`. The Kubernetes built-in scheduler will claim any pod that has no `scheduler.alpha.kubernetes.io/name: scheduler-name` annotation or that has `scheduler.alpha.kubernetes.io/name: kube-scheduler`. In the future, it may be possible to change which scheduler is the default for a given cluster.
- Dealing with conflicts. All schedulers must use predicate functions that are at least as strict as the ones that the Kubelet applies when deciding whether to accept a pod; otherwise the Kubelet and a scheduler may get into an infinite loop where the Kubelet keeps rejecting a pod and the scheduler keeps scheduling it back to the same node. To make it easier for people who write new schedulers to obey this rule, we will create a library containing the predicates the Kubelet uses. (See issue #12744.)
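To see why the predicate rule matters, here is a small sketch under simplifying assumptions: the only predicate is a CPU fit check, and the scheduler always picks the first feasible node by name. The functions `kubelet_fits`, `loose_fits` and `schedule_attempts` are hypothetical names made up for this illustration, and the retry loop is capped so that the demonstration terminates.

```python
def kubelet_fits(cpu_request: float, free_cpu: float) -> bool:
    # Hypothetical Kubelet rule: the whole request must fit in free CPU.
    return cpu_request <= free_cpu


def loose_fits(cpu_request: float, free_cpu: float) -> bool:
    # A looser scheduler predicate that ignores resources entirely.
    return True


def schedule_attempts(predicate, cpu_request, nodes, max_attempts=5):
    """Return (chosen_node, attempts); nodes maps node name -> free CPU."""
    for attempt in range(1, max_attempts + 1):
        candidates = [name for name, free in sorted(nodes.items())
                      if predicate(cpu_request, free)]
        if not candidates:
            return None, attempt
        chosen = candidates[0]  # deterministic choice: first feasible node
        if kubelet_fits(cpu_request, nodes[chosen]):
            return chosen, attempt
        # The Kubelet rejected the pod, so it goes back to the scheduler,
        # which will make the same choice again on the next attempt.
    return None, max_attempts


nodes = {"node-a": 0.5, "node-b": 4.0}
print(schedule_attempts(loose_fits, 2.0, nodes))
print(schedule_attempts(kubelet_fits, 2.0, nodes))
```

With the looser predicate the scheduler keeps choosing `node-a` until the attempt cap is reached; sharing the Kubelet's predicate filters `node-a` out up front and the pod lands on `node-b` on the first attempt.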
In summary, in the initial version of this multi-scheduler design, we will achieve the following:
- If a pod has the annotation `scheduler.alpha.kubernetes.io/name: kube-scheduler`, or the user does not explicitly set this annotation in the template, it will be picked up by the default scheduler
- If the annotation is set and refers to a valid `scheduler-name`, it will be picked up by the scheduler with the specified `scheduler-name`
- If the annotation is set but refers to an invalid `scheduler-name`, the pod will not be picked up by any scheduler. The pod will stay PENDING.
kind: Pod
apiVersion: v1
metadata:
  name: pod-abc
  labels:
    foo: bar
  annotations:
    scheduler.alpha.kubernetes.io/name: my-scheduler
This pod will be scheduled by "my-scheduler" and ignored by "kube-scheduler". If there is no running scheduler named "my-scheduler", the pod will never be scheduled.
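The claim rule described in this proposal can be sketched as a single function. This is an illustration only: the function name `responsible_for` and the plain-dict pod representation are assumptions made here, not part of the Kubernetes code base.

```python
ANNOTATION = "scheduler.alpha.kubernetes.io/name"
DEFAULT_SCHEDULER = "kube-scheduler"


def responsible_for(pod: dict, scheduler_name: str) -> bool:
    """Return True if the scheduler named scheduler_name should queue this pod."""
    # A pod that already has a nodeName is scheduled and is never claimed.
    if pod.get("spec", {}).get("nodeName"):
        return False
    annotations = pod.get("metadata", {}).get("annotations") or {}
    # Pods without the annotation fall to the default scheduler.
    wanted = annotations.get(ANNOTATION, DEFAULT_SCHEDULER)
    return wanted == scheduler_name


pod_abc = {
    "kind": "Pod",
    "apiVersion": "v1",
    "metadata": {
        "name": "pod-abc",
        "labels": {"foo": "bar"},
        "annotations": {ANNOTATION: "my-scheduler"},
    },
}
plain_pod = {"kind": "Pod", "apiVersion": "v1", "metadata": {"name": "plain"}}

print(responsible_for(pod_abc, "my-scheduler"))
print(responsible_for(pod_abc, "kube-scheduler"))
print(responsible_for(plain_pod, "kube-scheduler"))
```

Because each scheduler compares the annotation against its own name only, a pod whose annotation names a scheduler that is not running matches nobody and stays pending, as stated above.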
- Use an admission controller to add and verify the annotation, and to modify it if necessary. For example, the admission controller might add the scheduler annotation based on the namespace of the pod, and/or identify conflicting rules, and/or set a default value for the scheduler annotation, and/or reject pods on which the client has set a scheduler annotation that does not correspond to a running scheduler.
- Dynamically launch scheduler(s) and register them with the admission controller (as an external call). This also requires some work on authorization and authentication to control which schedulers can write the /binding subresource of which pods.
- Optimize the behavior of priority functions in the multi-scheduler scenario. In the case where multiple schedulers have the same predicate and priority functions (for example, when using multiple schedulers for parallelism rather than to customize the scheduling policies), all schedulers would tend to pick the same node as "best" when scheduling identical pods and therefore would be likely to conflict on the Kubelet. To solve this problem, we can pass an optional flag such as `--randomize-node-selection=N` to the scheduler. Setting this flag would cause the scheduler to pick randomly among the top N nodes instead of always picking the one with the highest score.
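A minimal sketch of the randomized selection idea follows. The function `pick_node` and its parameters are hypothetical; the sketch assumes the scheduler has already computed one score per feasible node and that `top_n` plays the role of the N in the proposed flag.

```python
import random


def pick_node(scores: dict, top_n: int = 1, rng=random) -> str:
    """Pick one node at random from the top_n highest-scoring nodes."""
    if not scores:
        raise ValueError("no feasible nodes")
    if top_n < 1:
        raise ValueError("top_n must be at least 1")
    # Rank node names from highest to lowest score.
    ranked = sorted(scores, key=scores.get, reverse=True)
    # Slicing handles top_n larger than the number of nodes.
    return rng.choice(ranked[:top_n])


scores = {"node-a": 10, "node-b": 9, "node-c": 1}
print(pick_node(scores))           # top_n=1 always yields the best node
print(pick_node(scores, top_n=2))  # either of the two best nodes
```

With `top_n=1` the behavior matches the current "highest score wins" rule, so the flag would be optional; larger values trade a slightly worse placement for a lower chance that identical schedulers collide on the same node.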