PodTopologySpread DoNotSchedule-to-ScheduleAnyway fallback mode #3990
/sig autoscaling
@sanposhiho do you have an ETA for the KEP to be up? Is there any way we could help? /cc @a7i
I recently started a draft locally, but either way, as written at the top, we (sig-scheduling) don't plan to have this enhancement in v1.28; it will be v1.29 at the earliest. You can help us improve the design once I've created the KEP PR.
@ahg-g @alculquicondor @Huang-Wei
I'm in favor, but make sure you also have a reviewer from sig-autoscaling
Thanks @alculquicondor. @gjtempleton @mwielgus
Hi, I've been thinking about this for a long time and as a top-level owner of Cluster Autoscaler I'd be happy to get involved. That being said, I'd like to start the discussion with a possibly very controversial question: is scheduling the best layer to address zonal spreading?
Problem statement
Today I would recommend against anyone using PodTopologySpread on a zonal topology in a cloud environment. My understanding is that the goal of this proposal is to address the problems behind that recommendation.
Challenges
I'd love to see this solved, but I don't have any good ideas on how to do it. A timeout-based approach seems very fragile: node provisioning times vary widely between clouds, and even within a single cloud they may differ based on the type of hardware being used (e.g. nodes with GPUs often take more time to start up and initialize). And what happens when the preferred instance types are unavailable (e.g. a stockout) and the autoscaler needs to fall back to a different type? That would add anywhere from a few seconds to 15+ minutes of extra latency, depending on the cloud and the exact reason for the node creation failure. An alternative would be some sort of communication mechanism between Autoscaler and Scheduler, but for that I think we should have an idea of how to support this in Autoscaler: how do we make it aware that it should update the pod at all? Today Autoscaler just imports scheduler code and runs PreFilters/Filters without any understanding of what they actually check. How would it know that a pod is unschedulable because of topology spreading and not some other constraint? I mean this question in the sense of how much of Autoscaler we would have to refactor, not just as a high-level conceptual answer.
Alternatives that may be worth discussing
Finally, even if we solve all of those issues, scheduling pods in a way that respects topology spread constraints still wouldn't guarantee equal spreading between zones. If one zone is temporarily stocked out, the scheduler will not be able to restore the spreading after the instances become available again. That brings me to the controversial question: wouldn't it be better to solve this at the Deployment/ReplicaSet/StatefulSet/etc. level instead? Any such controller could target pods at specific zones (e.g. by setting a different nodeSelector per zone) and could continuously reconcile the number of pods in each zone. This would also address the timeout problem: we could fall back to a different zone after a relatively short timeout, knowing that we can always restore the spreading as soon as capacity becomes available. This is the approach taken by the https://github.com/kubernetes/autoscaler/blob/master/balancer/proposals/balancer.md proposal. I'm not sure Balancer is the best way to implement this either; I'm not as familiar with whatever challenges that approach may be facing. But I think it would be good to start the discussion by agreeing on what problems we're trying to solve and evaluating which component could best solve them, before jumping into any particular implementation.
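For readers following along, here is a minimal example of the kind of zonal spread constraint being discussed: a hard (whenUnsatisfiable: DoNotSchedule) spread across zones, which this issue proposes to relax to ScheduleAnyway when the cluster cannot satisfy it. The pod name, labels, and image are placeholders.

```yaml
# Illustrative only: name, labels, and image are placeholders.
apiVersion: v1
kind: Pod
metadata:
  name: demo
  labels:
    app: demo
spec:
  containers:
    - name: app
      image: registry.k8s.io/pause:3.9
  topologySpreadConstraints:
    - maxSkew: 1
      topologyKey: topology.kubernetes.io/zone
      whenUnsatisfiable: DoNotSchedule   # the hard mode this issue wants to fall back from
      labelSelector:
        matchLabels:
          app: demo
```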
Thanks @MaciekPytel for getting involved! I believe we should continue discussing the detailed design in the KEP PR instead of here, but let me roughly answer your questions. In the issue, we're considering introducing a new fallback mode for the topology spread constraint. The next question is then how the scheduler knows that CA cannot create a Node for the Pod. We discussed two options, and in the issue we more or less concluded to prefer option 2: a new Pod condition. So CA's responsibility is to add that condition to Pods for which it cannot create a Node (a rough sketch of such a condition follows after this comment). That's the current status of our discussion, and I'm going to create the KEP based on it. So, answering your questions:
Yes. Exactly correct.
And yes it's exactly what we concluded (at least in the issue).
I believe that, to keep it simple, CA doesn't need to do anything special for TopologySpread.
Such "rescheduling", "rebalancing" is the responsibility of descheduler, not the scheduler. So, we don't need to concern much about the rebalancing in the scheduler. |
Sounds good to me. Please tag me on the PR, and feel free to ping me on Slack if you want to discuss any CA-related parts.
@alculquicondor @MaciekPytel Can we target this enhancement for v1.29 (depending on your bandwidth)?
We skipped the v1.29 release. Let's hopefully aim for the next one.
This handles one part of the scheduling problem (an autoscaler is unable to launch new capacity), but it doesn't handle the case where an autoscaler launches new capacity that is degraded in some way (e.g. the node is ready, but all pods that schedule to the new node fail due to some other issue affecting the topology domain). Have there been any thoughts on allowing scheduling restrictions to be overridden during gray failures? While thinking about it, I was considering a CRD that a user could create to indicate to the scheduler/autoscaler/anyone else that a particular topology domain is now invalid, shouldn't count for topology spread purposes, and shouldn't receive any new pods. Autoscalers could read the same CRD and avoid attempting to scale up nodes in that domain as well. There are a few other advantages to being able to imperatively indicate to multiple consumers that a topology domain is bad.
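Purely as an illustration of that idea, such a CRD might look something like the sketch below; the API group, kind, and field names are invented for the example and are not a proposed API.

```yaml
# Illustrative sketch only: the group, kind, and field names are invented.
# The idea is a single object that the scheduler, autoscalers, and other
# controllers could all watch to learn that a topology domain should be avoided.
apiVersion: topology.example.com/v1alpha1
kind: TopologyDomainHealth
metadata:
  name: zone-a-gray-failure
spec:
  topologyKey: topology.kubernetes.io/zone
  domain: zone-a                # placeholder zone name
  state: Unavailable            # consumers stop scheduling/scaling into this domain
  reason: GrayFailure
  message: "Pods start but fail health checks in this zone"
```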
ReplicaSet Spread
I quite like this line of thinking. To add to it (though you may be implying this already), it could continue to be part of the PodTemplateSpec, but would result in the ReplicaSet controller applying additional corresponding nodeSelectors to the physical pods it creates. You could even reuse the topology spread constraints API surface and just shift the spread responsibility from the scheduler to the ReplicaSet controller. To be explicit, the idea is sketched below.
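A minimal sketch of that idea, assuming a workload controller (no such behavior exists today) decides how many replicas belong in each zone and pins each Pod it creates via an injected nodeSelector; the names, labels, image, and zone value are placeholders.

```yaml
# Illustrative only: no controller injects this selector automatically today.
# The workload controller would choose the zone per replica and could later
# recreate pods in a different zone to restore even spreading.
apiVersion: v1
kind: Pod
metadata:
  generateName: web-7d9c6-                 # placeholder, as a ReplicaSet-created pod would be named
  labels:
    app: web                                # placeholder label
spec:
  nodeSelector:
    topology.kubernetes.io/zone: zone-a     # injected per replica; "zone-a" is a placeholder
  containers:
    - name: app
      image: registry.k8s.io/pause:3.9
```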
Scheduler / Autoscaler coordination
I've often wished that the scheduler and autoscaler were the same component, as that unlocks the ability to make and enforce decisions in one place, avoiding race conditions like the ones this KEP attempts to address. Of course, there are ways to communicate these decisions between systems, but communication protocols are hard (Kube API Server objects or otherwise). This is most likely a dead end given where Kubernetes is today, but since @MaciekPytel is opening up controversial questions, I figure I might throw this one into the ring ;)
Topology API Object
This would be very useful to achieve use cases like "disable this AZ while we ride out this outage" aws/karpenter-provider-aws#4727
The cluster autoscaler should take all scheduling constraints into consideration when it runs its simulation.
Giving taints to the Nodes in such a domain looks sufficient to me. Or do you have an argument that taints cannot play that role? Topology spread takes taints into consideration (ref).
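For illustration, this is the existing mechanism being referred to, assuming a recent Kubernetes version where nodeTaintsPolicy is available: tainting every node in the degraded domain keeps new pods out, and with nodeTaintsPolicy: Honor those nodes are also excluded from the skew calculation. The taint key, zone value, and label are placeholders.

```yaml
# Illustrative fragment of a Pod spec; the taint key, zone, and label are placeholders.
# First, taint every node in the degraded domain, e.g.:
#   kubectl taint nodes -l topology.kubernetes.io/zone=zone-a \
#     example.com/zone-degraded=true:NoSchedule
topologySpreadConstraints:
  - maxSkew: 1
    topologyKey: topology.kubernetes.io/zone
    whenUnsatisfiable: DoNotSchedule
    nodeTaintsPolicy: Honor      # nodes with taints the pod doesn't tolerate are ignored for spreading
    labelSelector:
      matchLabels:
        app: demo                # placeholder label
```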
That "communication mechanism" is the current design of this KEP. We give a new condition to the Pod, the cluster autoscaler gives So, it's the simplest and works well that "unschedulable Pods" are always the medium of communication between kube-scheduler and the cluster autoscaler. Introducing another CRD or something, it'd introduce something makes things complicated. |
That's not the situation I'm describing. The node can become ready, pods can schedule correctly, but fail to start due to some underlying failure particular to the topology domain which is sufficient to break workloads, but still allows nodes to launch and go Ready. In that case, this proposal doesn't help as the autoscaler can happily continue to create nodes which appear functional, but are not.
For the gray failure situation, today I have to handle it workload by workload (for example, by adding node affinities so each workload avoids the problem domain). There's no common method I can use to inform every interested party that "for right now, topology domain X is bad; change your decision-making accordingly".
I'm not sure this is the right place to discuss your story, then. This enhancement is about how we do a fallback when scheduling keeps failing due to a required topology spread. You are talking about a domain failure, which is invisible to the scheduler (unless we add taints manually). Can you create another issue in k/k to discuss your user story, and elaborate on it there? I'm wondering whether the situations you have in mind can be detected automatically, and whether that detection would be the only necessary improvement.
In my thinking, that is a subset of the larger problem "a topology domain is no longer viable in some way". Solving that one would solve the "inability to schedule", while also handling the "can schedule, but it won't work if it does".
It could be automated by some other decoupled component, e.g. your cloud provider sends a notification that a controller receives and then creates a "zone-A is invalid" object which every interested party consumes.
Please note that the scheduler's responsibility is to schedule pods based on visible status.
So, what you are describing is a lack of visibility to the scheduler: if the scheduler could tell that something in the domain would block pod startup, pods would not go to that domain. But this KEP tries to do a fallback to solve the problems described in the draft KEP; that's it. It doesn't try to strengthen the ability to notice domain failures. That is completely out of scope, which is why I want to distinguish this KEP from the problem you have. So, sorry again, could you please create another issue in k/k with the specific case where this could happen? I still don't get what exact scenario you have in mind. We can discuss what we need to improve based on that in the new issue, not here.
Yes, I think a proposal of having a node autoscaler make its inability to launch a node visible to the scheduler, so the scheduler can apply scheduling rules differently, is similar to, but not as expressive as, directly marking a topology domain as invalid for the scheduler and other consumers.
It's a superset of the "autoscaler can't launch a node" problem, and I think it's a more common issue. You could also solve "autoscaler can't launch a node" by tainting all of the nodes in the problem domain and using a …
To be clear, I'm not arguing for noticing domain failures. I just want a mechanism for users to be able to handle them without updating all of their workloads with new node affinities to avoid the problem domain. My argument is to push the KEP towards solving the larger problem.
It's too vague to discuss here. Could you create a new draft KEP PR as an alternative solution, then? You can associate it with this KEP number and, for now, you don't need to fill in all the sections, only the core parts describing the design. Then we can compare the two draft PRs.
The Kubernetes project currently lacks enough contributors to adequately respond to all issues. This bot triages un-triaged issues according to the following rules:
You can:
Please send feedback to sig-contributor-experience at kubernetes/community. /lifecycle stale
@sanposhiho are you still pushing for this in 1.30? |
I'm working on the investigation on the CA side, but v1.30 is nearly impossible.
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues. This bot triages un-triaged issues according to the following rules:
You can:
Please send feedback to sig-contributor-experience at kubernetes/community. /lifecycle rotten
/remove-lifecycle rotten
The Kubernetes project currently lacks enough contributors to adequately respond to all issues. This bot triages un-triaged issues according to the following rules:
You can:
Please send feedback to sig-contributor-experience at kubernetes/community. /lifecycle stale
/remove-lifecycle stale
The Kubernetes project currently lacks enough contributors to adequately respond to all issues. This bot triages un-triaged issues according to the following rules:
You can:
Please send feedback to sig-contributor-experience at kubernetes/community. /lifecycle stale
/remove-lifecycle stale
Enhancement Description
- KEP (k/enhancements) update PR(s):
- Code (k/k) update PR(s):
- Docs (k/website) update PR(s):
Please keep this description up to date. This will help the Enhancement Team to track the evolution of the enhancement efficiently.
/sig scheduling
/assign