KEP-4832: Asynchronous preemption in the scheduler #4833
Conversation
sanposhiho commented on Sep 7, 2024
- One-line PR description: Add KEP-4832.
- Issue link: Asynchronous preemption in the scheduler #4832
- Other comments:
/cc @hakuna-matatah
Moving the discussion from kubernetes/kubernetes#125491 (comment). I'm mainly concerned that the gain from making preemption API calls async will be offset by setting nominatedNodeName, which IIUC is also an API call (the reservation has to be visible to the autoscaler, so it cannot live just within the scheduler cache, which I think is assumed in this KEP). Is that a valid concern? If yes, then calling AddNominatedPod (setting nominatedNodeName) async as well could maybe help. I'm not suggesting adding a new extension point here, though.
Setting nominatedNodeName requires an API call, but we won't do that at PostFilter. We will only do that later, in the asynchronous preemption goroutine.
The reservation has to be visible once the preemption (= actual pod deletion) is done, because otherwise CA would make incorrect decisions about the preemption target nodes, which are likely low-utilized just after the preemption.
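For illustration, the API calls made by the async goroutine could look roughly like the following sketch (the function name, ordering, and patch format are assumptions for illustration only, not the actual implementation):

```go
// Hypothetical sketch only: delete the victim Pods, then make the reservation
// visible to other components (e.g. Cluster Autoscaler) by setting the
// preemptor's status.nominatedNodeName via an API call.
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/types"
	"k8s.io/client-go/kubernetes"
)

func runAsyncPreemption(ctx context.Context, cs kubernetes.Interface,
	preemptorNS, preemptorName, nominatedNode string, victims []types.NamespacedName) error {
	// Deleting a victim only sets its deletionTimestamp; the Pod disappears
	// for real once graceful termination finishes.
	for _, v := range victims {
		if err := cs.CoreV1().Pods(v.Namespace).Delete(ctx, v.Name, metav1.DeleteOptions{}); err != nil {
			return fmt.Errorf("deleting victim %s/%s: %w", v.Namespace, v.Name, err)
		}
	}
	// Record the nomination on the preemptor's status subresource so the
	// reservation is visible outside the scheduler cache.
	patch := []byte(fmt.Sprintf(`{"status":{"nominatedNodeName":%q}}`, nominatedNode))
	_, err := cs.CoreV1().Pods(preemptorNS).Patch(ctx, preemptorName,
		types.StrategicMergePatchType, patch, metav1.PatchOptions{}, "status")
	return err
}
```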
@alculquicondor Updated the KEP and made some replies to your comments.
This is such an interesting feature.
A few initial comments from the PRR.
because it works with a lot of communication with kube-apiserver.
Even if the scheduler makes the best scheduling result, the binding API might fail after all.

So, we don't have to pay a special attention to this issue.
I don't fully agree with this. We have to pay attention to it, but it's orthogonal to this KEP.
How would you change the sentence? (To my non-native-speaker eyes, "special attention" delivers the nuance.)
Addressed reviews from @wojtek-t.
I'm generally fine with PRR for Alpha, but you should get SIG approval first.
@alculquicondor Applied your suggestion.
- `goroutines_duration_seconds` (w/ label: `operation`): to observe how many preemption goroutines have failed.
- `goroutines_execution_total` (w/ labels: `operation`, `result`): to observe how long each preemption goroutine takes to complete.
I'm not sure what the difference between the two is. Can you review?
Oops. Sorry, the explanation was the opposite.
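In other words, the corrected intent would be roughly the following (a hypothetical sketch using the Prometheus client; not the actual scheduler metric definitions or names):

```go
// Hypothetical sketch: the histogram observes how long each preemption
// goroutine takes, and the counter observes how many goroutines completed,
// including how many failed (via the "result" label).
package metrics

import "github.com/prometheus/client_golang/prometheus"

var (
	GoroutinesDuration = prometheus.NewHistogramVec(prometheus.HistogramOpts{
		Name: "goroutines_duration_seconds",
		Help: "How long each goroutine takes to complete, by operation.",
	}, []string{"operation"})

	GoroutinesExecutionTotal = prometheus.NewCounterVec(prometheus.CounterOpts{
		Name: "goroutines_execution_total",
		Help: "Number of goroutines executed, by operation and result (e.g. success/error).",
	}, []string{"operation", "result"})
)
```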
@alculquicondor Done.
So this is just a logic change to the preemption plugin rather than the framework, I mean the PostFilter extension point, right? And I have several other questions:
- Since preemption also runs the filter logic, are the API calls really the bottleneck compared to the whole time consumed? That would help us weigh whether the complexity is worth it.
- Will this lead to a behavior change for other PostFilter plugins? It seems not.
I may have missed something; if so, sorry for that.
The scheduler schedules Pods one by one within the scheduling cycle,
and we basically try to reduce the API calls as much as possible to enhance the scheduling cycle throughput.

The binding cycle is the example for this motivation;
The binding cycle is more than avoiding API calls; it's a decoupled stage based on the assumption that the Pod is scheduled, after which you can perform follow-up actions based on your requirements. But yes, async API calls can improve the throughput.
2. The preemption PostFilter plugin starts the goroutine to make API calls inside, and returns a success status (= it does not wait for the goroutine to finish).
3. The preemption plugin blocks the Pod while the preemption routine is in progress, using the PreEnqueue extension point, so that the target Pod won't be retried during this time.

Then, afterwards, the preemption goroutine makes the actual API calls to delete victim Pods and set `Pod.Status.NominatedNodeName`.
Can you describe in more detail when the Pod is requeued? What's the indicator? Watching pod.Status?
I'll change the explanation here a bit (but I don't want to explain too much in the KEP because it's an implementation detail).
So, it'll be like this (see the rough sketch below):
- The preemption plugin has a map or something to record which unschedulable Pod is making the preemption API calls in a goroutine.
- While the API calls are being made in the goroutine, the Pod is gated so that it won't be requeued to activeQ/backoffQ.
- When the goroutine finishes, the preemption plugin no longer gates this Pod.
- A Pod/delete event arrives because of the preemption, triggering the Pod's requeueing.
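To make the gating concrete, here's a rough, purely illustrative sketch (the struct, field, and helper names are made up, not the actual code):

```go
package asyncpreemption

import (
	"context"
	"sync"

	v1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/types"
	"k8s.io/kubernetes/pkg/scheduler/framework"
)

// AsyncPreemption tracks which preemptor Pods have preemption API calls in flight.
type AsyncPreemption struct {
	mu             sync.Mutex
	preemptingPods map[types.UID]struct{}
}

// startPreemption is called from PostFilter: it records the Pod as "preempting",
// kicks off the API calls in a goroutine, and returns without waiting.
func (pl *AsyncPreemption) startPreemption(pod *v1.Pod, makeAPICalls func() error) {
	pl.mu.Lock()
	pl.preemptingPods[pod.UID] = struct{}{}
	pl.mu.Unlock()

	go func() {
		// Ungate the Pod once the goroutine finishes (success or failure);
		// the victims' Pod/delete events then requeue it to activeQ/backoffQ.
		defer func() {
			pl.mu.Lock()
			delete(pl.preemptingPods, pod.UID)
			pl.mu.Unlock()
		}()
		_ = makeAPICalls() // delete victims, set nominatedNodeName, etc.
	}()
}

// PreEnqueue keeps the Pod out of activeQ/backoffQ while its preemption API calls are in flight.
func (pl *AsyncPreemption) PreEnqueue(ctx context.Context, p *v1.Pod) *framework.Status {
	pl.mu.Lock()
	defer pl.mu.Unlock()
	if _, inFlight := pl.preemptingPods[p.UID]; inFlight {
		return framework.NewStatus(framework.UnschedulableAndUnresolvable, "preemption API calls are still in flight")
	}
	return nil
}
```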
Step 3 is where I'm lost. It's async programming, so how do we know that the evictions have finished?
- All victim Pods are deleted by the preemption. The preemption plugin no longer gates this Pod at PreEnqueue.
- Pod/delete events (for the victim Pods) arrive at the scheduling queue, which moves this Pod to the activeQ/backoffQ.
Ok, we may get more details in the implementation.
Another implementation detail: will the pod be considered "inflight" (for the hints feature) while the routine is running? I think it doesn't have to be.
Yes, it doesn't have to be. Rather, it should just ignore all events (by being blocked at the preemption plugin's PreEnqueue) while the preemption routine is running.
then the preemption for pod2 may just select the same preemption targets as pod1,
and when pod1 comes back to the scheduling cycle, it (probably) cannot be scheduled on node1 because of pod2.

But, this isn't an issue because the final result is completely the same as the current scheduler;
This is different because, with this design, we'll delete the Pods asynchronously. During that time window, we may start the next scheduling cycle and snapshot the cluster; in a small cluster, this may race more often.
So, this section explains that it's eventually the same result even when the pod deletion isn't done by the start of the next scheduling cycle, right? Is this explanation missing some edge cases?
Yes, eventually we will get the same result. What I want to highlight is that once two preemptions want to delete the same victims (because of the async eviction), it leads to one wasted scheduling cycle and useless evictions. I think that's serious.
So, did you want to argue that the first scheduling cycle for pod2 is a waste compared to the current scheduler, because if we wait for pod1's preemption to be done before pod2's scheduling, pod2 can go to the node straight away?
I'm not sure it makes a serious performance difference, or could happen that frequently (unless, again, I'm missing something).
Here's a detailed explanation (long story warning!!! as usual 😅)
The scenario we're talking about here is:
- the cluster is super packed.
- pod1's priority < pod2's priority.
- pod1 cannot find a place and triggers the preemption.
- pod2 comes after pod1; pod2 cannot find a place either without any pod deletion.
And the current scheduler behaves as follows:
1. pod1 triggers the preemption.
2. The scheduler just waits for (1) to be done. (Note that, at this point, the victim Pods just have a deletionTimestamp; they're not truly deleted yet.)
3. If the victim Pods are truly deleted before pod2's scheduling cycle starts, pod2 can go straight to node1 where pod1 made space with the preemption. If not, pod2 cannot find a place and goes back to the queue.
The proposed scheduler behaves as follows:
1. pod1 triggers the preemption.
2. pod2's scheduling cycle starts without waiting for (1) to be done, and the scheduler finds that pod2 cannot go anywhere.
3. pod2 triggers the preemption and (let's say) selects the same node as pod1.
First, as written in the current scheduler's third step, note that if the victims are not truly deleted in time, pod2's scheduling cycle is wasted, which is completely the same as in the proposed scheduler.
So, even the current scheduler could waste one scheduling cycle in the first place.
Rather, given that it usually takes some time for Pods to be truly deleted, it's actually more likely to waste the cycle than not.
But okay, let's consider what would happen in the case where the current scheduler doesn't waste a cycle, meaning all victim Pods are luckily truly deleted before pod2's scheduling cycle starts.
In this case, let's say API calls for the preemption take 1 second to be all done.
Then, the time the scheduler takes to finally schedule pod2 is `1 second + (the time until the victim Pods are all truly deleted) + (the time for one scheduling cycle)`.
Given "the time for one scheduling cycle" is very small, we can roughly say it's `1 second + (the time until the victim Pods are all truly deleted)`.
Then, for the proposed scheduler, it'll be `1 second + max{(the time until the victim Pods are all truly deleted), (the backoff time for pod2)} + (the time for _two_ scheduling cycles)`.
Given, again, "the time for one scheduling cycle" is very small, we can roughly say it's `1 second + max{(the time until the victim Pods are all truly deleted), (the backoff time for pod2)}`.
So, as a result, a rough time comparison between the two schedulers would be:
- The current scheduler: `1 second + (the time until the victim Pods are all truly deleted)`
- The new scheduler: `1 second + max{(the time until the victim Pods are all truly deleted), (the backoff time for pod2)}`
So, if (the time until the victim Pods are all truly deleted) > (the backoff time for pod2), there is no time difference.
Also, I believe the time difference would be small even if there is one.
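As a concrete (entirely made-up) illustration: if the preemption API calls take 1 second, the victims take 3 more seconds to be truly deleted, and pod2's backoff is 5 seconds, the current scheduler schedules pod2 after roughly `1s + 3s = 4s` (in the lucky case), while the proposed scheduler takes roughly `1s + max{3s, 5s} = 6s`; the gap is just `5s - 3s = 2s`, and if the backoff were 3 seconds or less, there would be no difference at all.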
To summarize, if you want to observe that the current scheduler is faster for pod2 than the proposed scheduler, all of the following conditions have to be met:
General conditions:
- the cluster is super packed.
- pod1 cannot find a place and triggers the preemption.
- pod2 comes after pod1; pod2 also cannot find a place without any pod deletion.
The current scheduler:
- pod2 has to come to the activeQ before pod1 starts a second scheduling cycle, but after all victim Pods (deleted by pod1's preemption) are truly deleted.
The proposed scheduler:
- pod2 selects the same node as pod1 in its preemption.
- pod2's backoff time is longer than the time for the victim Pods to all be truly deleted.
When you're so fortunate that all the conditions are met, you can finally see the performance degradation, and the difference is `(the backoff time for pod2) - (the time until the victim Pods are all truly deleted)`, which shouldn't be that big a difference.
Okay, looking back at the reply I just made, I found it too long. 😅
So, in short, I guess you missed that, even with the current scheduler, the Pods aren't truly deleted right after the preemption (because the preemption just puts a deletionTimestamp on the victim Pods).
- The current scheduler could also likely waste one scheduling cycle for pod2, like the proposed scheduler.
- Even if it luckily doesn't, the time difference caused by the wasted scheduling cycle in the proposed scheduler would likely be small.
Yes, there are, again, many assumptions ("likely"), but in the first place, the scenario we're talking about isn't something that happens very frequently in many clusters.
Thanks Kensei, that's a lot of typing. Considering the maturity of Kubernetes today, I believe stability is more important than new features, so I'm more concerned about what side effects would be introduced into Kubernetes. What I mentioned above is one side: the same victim could be chosen by two different preemptions. Another is that the nominator info could be stale in the cache, for example if the eviction API calls fail. Both of them may lead to chaos (I hope I'm worrying too much).
However, I think there's no reason to block this from going into the alpha stage, and Aldo already LGTMed. But when graduating to Beta, I hope we can take these risks into consideration, so LGTM as well.
/lgtm
Let's update the risks section to include this case?
> the same victim would be chosen for two different preemptions

I do not consider it a risk, from either a functional or a performance perspective:
as the KEP and my too-long comment above explain, it's safe even if two goroutines select the same Pod(s) as preemption victims.

> another one is the nominator info would be stale in the cache, for example the eviction API calls are failed

Yes, it is a risk in case kube-apiserver (or the network, etc.) is unstable and the API call fails. And we already mention it in the risks section.
I won't argue more on this since we're all aware of the case, but if you're evicting a bunch of Pods and this still doesn't make the preemptor Pod schedulable, or even worse makes the cluster less balanced, I would take that as a risk. We do have similar situations right now (we do not guarantee the preemptor Pod will be schedulable again), but we're exacerbating this.
Yes.
Yes.
No. It's just within our in-tree preemption plugin. Although I initially proposed creating a new extension point (see the alternatives section) so that other people could do something similar, we decided to scope it out, at least for now.
/approve
wonderful
/lgtm
/lgtm @sanposhiho - I suggest filling in an exception request and blaming me for the PRR delay of a couple of hours.
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: alculquicondor, sanposhiho, wojtek-t
The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing `/approve` in a comment.