KEP-4832: Asynchronous preemption in the scheduler #4833
Conversation
sanposhiho commented on Sep 7, 2024
- One-line PR description: Add KEP-4832.
- Issue link: Asynchronous preemption in the scheduler #4832
- Other comments:
/cc @hakuna-matatah
Moving the discussion from kubernetes/kubernetes#125491 (comment). I'm mainly concerned that the gain from making preemption API calls async will be offset by setting nominatedNodeName, which IIUC is also an API call (the reservation has to be visible to the autoscaler, so it cannot live just within the scheduler cache, which I think is assumed in this KEP). Is that a valid concern? If yes, then calling AddNominatedPod (setting nominatedNodeName) async as well could maybe help. I'm not suggesting adding a new extension point here, though.
Setting nominatedNodeName requires an API call, but we won't do that at PostFilter. We will only do that later, in the asynchronous preemption goroutine.
The reservation has to be visible once the preemption (= actual pod deletion) is done, because otherwise CA would make incorrect decisions about the preemption target nodes, which are likely low-utilized just after the preemption.
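For illustration, the API calls made by the async goroutine could look roughly like the following sketch (the function name, ordering, and patch format are assumptions for illustration only, not the actual implementation):

```go
// Hypothetical sketch only: delete the victim Pods, then make the reservation
// visible to other components (e.g. Cluster Autoscaler) by setting the
// preemptor's status.nominatedNodeName via an API call.
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/types"
	"k8s.io/client-go/kubernetes"
)

func runAsyncPreemption(ctx context.Context, cs kubernetes.Interface,
	preemptorNS, preemptorName, nominatedNode string, victims []types.NamespacedName) error {
	// Deleting a victim only sets its deletionTimestamp; the Pod disappears
	// for real once graceful termination finishes.
	for _, v := range victims {
		if err := cs.CoreV1().Pods(v.Namespace).Delete(ctx, v.Name, metav1.DeleteOptions{}); err != nil {
			return fmt.Errorf("deleting victim %s/%s: %w", v.Namespace, v.Name, err)
		}
	}
	// Record the nomination on the preemptor's status subresource so the
	// reservation is visible outside the scheduler cache.
	patch := []byte(fmt.Sprintf(`{"status":{"nominatedNodeName":%q}}`, nominatedNode))
	_, err := cs.CoreV1().Pods(preemptorNS).Patch(ctx, preemptorName,
		types.StrategicMergePatchType, patch, metav1.PatchOptions{}, "status")
	return err
}
```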
@alculquicondor Updated the KEP and made some replies to your comments.
This is such an interesting feature.
A few initial comments from the PRR.
because it works with a lot of communication with kube-apiserver.
Even if the scheduler makes the best scheduling result, the binding API might fail after all.

So, we don't have to pay a special attention to this issue.
I don't fully agree with this. We have to pay attention to it, but it's orthogonal to this KEP.
How would you change the sentence? (To my non-native-speaker eyes, "special attention" delivers the nuance.)
Addressed reviews from @wojtek-t.
I'm generally fine with PRR for Alpha, but you should get SIG approval first.
@alculquicondor Applied your suggestion.
- `goroutines_duration_seconds` (w/ label: `operation`): to observe how many preemption goroutines have failed.
- `goroutines_execution_total` (w/ labels: `operation`, `result`): to observe how long each preemption goroutine takes to complete.
I'm not sure what the difference between the two is. Can you review?
Oops. Sorry, the explanation was the opposite.
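In other words, the corrected intent would be roughly the following (a hypothetical sketch using the Prometheus client; not the actual scheduler metric definitions or names):

```go
// Hypothetical sketch: the histogram observes how long each preemption
// goroutine takes, and the counter observes how many goroutines completed,
// including how many failed (via the "result" label).
package metrics

import "github.com/prometheus/client_golang/prometheus"

var (
	GoroutinesDuration = prometheus.NewHistogramVec(prometheus.HistogramOpts{
		Name: "goroutines_duration_seconds",
		Help: "How long each goroutine takes to complete, by operation.",
	}, []string{"operation"})

	GoroutinesExecutionTotal = prometheus.NewCounterVec(prometheus.CounterOpts{
		Name: "goroutines_execution_total",
		Help: "Number of goroutines executed, by operation and result (e.g. success/error).",
	}, []string{"operation", "result"})
)
```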
@alculquicondor Done.
So this is just a logic change to the preemption plugin rather than the framework, I mean the PostFilter extension point, right? And I have several other questions:
- Since preemption also runs the filter logic, are the API calls really the bottleneck compared to the whole time consumed? That would help us weigh whether the complexity is worth it.
- Will this lead to a behavior change for other PostFilter plugins? It seems not.
I may have missed something; if so, sorry for that.
The scheduler schedules Pods one by one within the scheduling cycle,
and we basically try to reduce the API calls as much as possible to enhance the scheduling cycle throughput.

The binding cycle is the example for this motivation;
The binding cycle is more than avoiding API calls; it's a decoupled stage based on the assumption that the Pod is scheduled, after which you can perform follow-up actions based on your requirements. But yes, async API calls can improve the throughput.
2. The preemption PostFilter plugin starts the goroutine to make API calls inside, and returns a success status (= it does not wait for the goroutine to finish).
3. The preemption plugin blocks the Pod while the preemption routine is in progress, using the PreEnqueue extension point, so that the target Pod won't be retried during this time.

Then, afterwards, the preemption goroutine makes the actual API calls to delete victim Pods and set `Pod.Status.NominatedNodeName`.
Can you describe in more detail when the Pod is requeued? What's the indicator? Watching pod.Status?
I'll change the explanation here a bit (but I don't want to explain too much in the KEP because it's an implementation detail).
So, it'll be like this (see the rough sketch below):
- The preemption plugin has a map or something to record which unschedulable Pod is making the preemption API calls in a goroutine.
- While the API calls are being made in the goroutine, the Pod is gated so that it won't be requeued to activeQ/backoffQ.
- When the goroutine finishes, the preemption plugin no longer gates this Pod.
- A Pod/delete event arrives because of the preemption, triggering the Pod's requeueing.
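To make the gating concrete, here's a rough, purely illustrative sketch (the struct, field, and helper names are made up, not the actual code):

```go
package asyncpreemption

import (
	"context"
	"sync"

	v1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/types"
	"k8s.io/kubernetes/pkg/scheduler/framework"
)

// AsyncPreemption tracks which preemptor Pods have preemption API calls in flight.
type AsyncPreemption struct {
	mu             sync.Mutex
	preemptingPods map[types.UID]struct{}
}

// startPreemption is called from PostFilter: it records the Pod as "preempting",
// kicks off the API calls in a goroutine, and returns without waiting.
func (pl *AsyncPreemption) startPreemption(pod *v1.Pod, makeAPICalls func() error) {
	pl.mu.Lock()
	pl.preemptingPods[pod.UID] = struct{}{}
	pl.mu.Unlock()

	go func() {
		// Ungate the Pod once the goroutine finishes (success or failure);
		// the victims' Pod/delete events then requeue it to activeQ/backoffQ.
		defer func() {
			pl.mu.Lock()
			delete(pl.preemptingPods, pod.UID)
			pl.mu.Unlock()
		}()
		_ = makeAPICalls() // delete victims, set nominatedNodeName, etc.
	}()
}

// PreEnqueue keeps the Pod out of activeQ/backoffQ while its preemption API calls are in flight.
func (pl *AsyncPreemption) PreEnqueue(ctx context.Context, p *v1.Pod) *framework.Status {
	pl.mu.Lock()
	defer pl.mu.Unlock()
	if _, inFlight := pl.preemptingPods[p.UID]; inFlight {
		return framework.NewStatus(framework.UnschedulableAndUnresolvable, "preemption API calls are still in flight")
	}
	return nil
}
```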
Step 3 is where I'm lost. It's async programming, so how do we know that the evictions have finished?
- All victim Pods are deleted by the preemption. The preemption plugin no longer gates this Pod at PreEnqueue.
- Pod/delete events (for the victim Pods) arrive at the scheduling queue, which moves this Pod to the activeQ/backoffQ.
Ok, we may get more details in the implementation.
Another implementation detail: will the pod be considered "inflight" (for the hints feature) while the routine is running? I think it doesn't have to be.
Yes, it doesn't have to be. Rather, it should just ignore all events (by being blocked at the preemption plugin's PreEnqueue) while the preemption routine is running.
then the preemption for pod2 may just select the same preemption targets as pod1,
and when pod1 comes back to the scheduling cycle, it (probably) cannot be scheduled on node1 because of pod2.

But, this isn't an issue because the final result is completely the same as the current scheduler;
This is different because, with this design, we'll delete the Pods asynchronously. During that time window, we may start the next scheduling cycle and snapshot the cluster; in a small cluster, this may race more often.
So, this section explains that it's eventually the same result even when the pod deletion isn't done by the start of the next scheduling cycle, right? Is this explanation missing some edge cases?
Yes, eventually we will get the same result. What I want to highlight is that once two preemptions want to delete the same victims (because of the async eviction), it leads to one wasted scheduling cycle and useless evictions. I think that's serious.
So, did you want to argue that the first scheduling cycle for pod2 is a waste compared to the current scheduler, because if we wait for pod1's preemption to be done before pod2's scheduling, pod2 can go to the node straight away?
I'm not sure it makes a serious performance difference, or could happen that frequently (unless, again, I'm missing something).
Here's a detailed explanation (long story warning!!! as usual 😅)
The scenario we're talking about here is:
- the cluster is super packed.
- pod1's priority < pod2's priority.
- pod1 cannot find a place and triggers the preemption.
- pod2 comes after pod1; pod2 cannot find a place either without any pod deletion.
And the current scheduler behaves as follows:
1. pod1 triggers the preemption.
2. The scheduler just waits for (1) to be done. (Note that, at this point, the victim Pods just have a deletionTimestamp; they're not truly deleted yet.)
3. If the victim Pods are truly deleted before pod2's scheduling cycle starts, pod2 can go straight to node1 where pod1 made space with the preemption. If not, pod2 cannot find a place and goes back to the queue.
The proposed scheduler behaves as follows:
1. pod1 triggers the preemption.
2. pod2's scheduling cycle starts without waiting for (1) to be done, and the scheduler finds that pod2 cannot go anywhere.
3. pod2 triggers the preemption and (let's say) selects the same node as pod1.
First, as written in the current scheduler's third step, note that if the victims are not truly deleted in time, pod2's scheduling cycle is wasted, which is completely the same as in the proposed scheduler.
So, even the current scheduler could waste one scheduling cycle in the first place.
Rather, given that it usually takes some time for Pods to be truly deleted, it's actually more likely to waste the cycle than not.
But okay, let's consider what would happen in the case where the current scheduler doesn't waste a cycle, meaning all victim Pods are luckily truly deleted before pod2's scheduling cycle starts.
In this case, let's say API calls for the preemption take 1 second to be all done.
Then, the time the scheduler takes to finally schedule pod2 is `1 second + (the time until the victim Pods are all truly deleted) + (the time for one scheduling cycle)`.
Given "the time for one scheduling cycle" is very small, we can roughly say it's `1 second + (the time until the victim Pods are all truly deleted)`.
Then, for the proposed scheduler, it'll be `1 second + max{(the time until the victim Pods are all truly deleted), (the backoff time for pod2)} + (the time for _two_ scheduling cycles)`.
Given, again, "the time for one scheduling cycle" is very small, we can roughly say it's `1 second + max{(the time until the victim Pods are all truly deleted), (the backoff time for pod2)}`.
So, as a result, a rough time comparison between the two schedulers would be:
- The current scheduler: `1 second + (the time until the victim Pods are all truly deleted)`
- The new scheduler: `1 second + max{(the time until the victim Pods are all truly deleted), (the backoff time for pod2)}`
So, if (the time until the victim Pods are all truly deleted) > (the backoff time for pod2), there is no time difference.
Also, I believe the time difference would be small even if there is one.
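As a concrete (entirely made-up) illustration: if the preemption API calls take 1 second, the victims take 3 more seconds to be truly deleted, and pod2's backoff is 5 seconds, the current scheduler schedules pod2 after roughly `1s + 3s = 4s` (in the lucky case), while the proposed scheduler takes roughly `1s + max{3s, 5s} = 6s`; the gap is just `5s - 3s = 2s`, and if the backoff were 3 seconds or less, there would be no difference at all.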
To summarize, if you want to observe that the current scheduler is faster for pod2 than the proposed scheduler, all of the following conditions have to be met:
General conditions:
- the cluster is super packed.
- pod1 cannot find a place and triggers the preemption.
- pod2 comes after pod1; pod2 also cannot find a place without any pod deletion.
The current scheduler:
- pod2 has to come to the activeQ before pod1 starts a second scheduling cycle, but after all victim Pods (deleted by pod1's preemption) are truly deleted.
The proposed scheduler:
- pod2 selects the same node as pod1 in its preemption.
- pod2's backoff time is longer than the time for the victim Pods to all be truly deleted.
When you're so fortunate that all the conditions are met, you can finally see the performance degradation, and the difference is `(the backoff time for pod2) - (the time until the victim Pods are all truly deleted)`, which shouldn't be that big a difference.
Okay, looking back at the reply I just made, I found it too long. 😅
So, in short, I guess you missed that, even with the current scheduler, the Pods aren't truly deleted right after the preemption (because the preemption just puts a deletionTimestamp on the victim Pods).
- The current scheduler could also likely waste one scheduling cycle for pod2, like the proposed scheduler.
- Even if it luckily doesn't, the time difference caused by the wasted scheduling cycle in the proposed scheduler would likely be small.
Yes, there are, again, many assumptions ("likely"), but in the first place, the scenario we're talking about isn't something that happens very frequently in many clusters.
Thanks Kensei, that's a lot of typing. Considering the maturity of Kubernetes today, I believe stability is more important than new features, so I'm more concerned about what side effects would be introduced into Kubernetes. What I mentioned above is one side: the same victim could be chosen by two different preemptions. Another is that the nominator info could be stale in the cache, for example if the eviction API calls fail. Both of them may lead to chaos (I hope I'm worrying too much).
However, I think there's no reason to block this from going into the alpha stage, and Aldo already LGTMed. But when graduating to Beta, I hope we can take these risks into consideration, so LGTM as well.
/lgtm
Let's update the risks section to include this case?
> the same victim would be chosen for two different preemptions

I do not consider it a risk, from either a functional or a performance perspective:
as the KEP and my too-long comment above explain, it's safe even if two goroutines select the same Pod(s) as preemption victims.

> another one is the nominator info would be stale in the cache, for example the eviction API calls are failed

Yes, it is a risk in case kube-apiserver (or the network, etc.) is unstable and the API call fails. And we already mention it in the risks section.
I won't argue more on this since we're all aware of the case, but if you're evicting a bunch of Pods and this still doesn't make the preemptor Pod schedulable, or even worse makes the cluster less balanced, I would take that as a risk. We do have similar situations right now (we do not guarantee the preemptor Pod will be schedulable again), but we're exacerbating this.
Yes.
Yes.
No. It's just within our in-tree preemption plugin. Although I initially proposed creating a new extension point (see the alternatives section) so that other people could do something similar, we decided to scope it out, at least for now.
/approve
wonderful
/lgtm
/lgtm @sanposhiho - I suggest filling in an exception request and blaming me for the PRR delay of a couple of hours.
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: alculquicondor, sanposhiho, wojtek-t
The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing `/approve` in a comment.