
PodTopologySpread DoNotSchedule-to-ScheduleAnyway fallback mode #3990

Open
sanposhiho opened this issue May 6, 2023 · 29 comments
Assignees
Labels
sig/autoscaling Categorizes an issue or PR as relevant to SIG Autoscaling. sig/scheduling Categorizes an issue or PR as relevant to SIG Scheduling.

Comments

@sanposhiho
Member

sanposhiho commented May 6, 2023

Enhancement Description

Please keep this description up to date. This will help the Enhancement Team to track the evolution of the enhancement efficiently.

/sig scheduling
/assign

@k8s-ci-robot k8s-ci-robot added the sig/scheduling Categorizes an issue or PR as relevant to SIG Scheduling. label May 6, 2023
@sanposhiho
Member Author

/sig autoscaling

@k8s-ci-robot k8s-ci-robot added the sig/autoscaling Categorizes an issue or PR as relevant to SIG Autoscaling. label May 6, 2023
@knelasevero

@sanposhiho do you have ETA for the KEP to be up? Any way we could help?

/cc @a7i

@sanposhiho
Member Author

I recently started a draft locally, but either way, as written at the top, we (sig-scheduling) don't plan to have this enhancement in v1.28; it will be v1.29 at the earliest. You can help us improve the design once I've created the KEP PR.

@sanposhiho
Member Author

sanposhiho commented Jul 20, 2023

@ahg-g @alculquicondor @Huang-Wei
Can we target this towards v1.29?
I'll write the KEP and take on the implementation part, but I'm looking for main reviewer(s) who can work on it in the next release cycle.

@alculquicondor
Member

I'm in favor, but make sure you also have a reviewer from sig-autoscaling

@sanposhiho
Member Author

Thanks @alculquicondor.

@gjtempleton @mwielgus
Hello. Could either of you help us as a reviewer on this enhancement?
The cluster autoscaler is relevant to this enhancement: CA tells the scheduler that it cannot bring any new Nodes in, and the scheduler decides to fall back. (See kubernetes/kubernetes#105977 for more details.)

@MaciekPytel

Hi, I've been thinking about this for a long time and as a top-level owner of Cluster Autoscaler I'd be happy to get involved.

That being said, I'd like to start the discussion with a possibly very controversial question: is scheduling the best layer to address zonal spreading?


Problem statement

Today I would recommend against anyone using pod topology spreading on a zonal topology in a cloud environment (both whenUnsatisfiable variants are sketched after this list):

  • When using whenUnsatisfiable: DoNotSchedule, topology spreading introduces two major problems:
    • In the event of the provider being out of capacity (stock-out), pods would not be able to schedule at all: they can't schedule in the zone with the smallest number of replicas because there is no capacity there, and they can't schedule in other zones because that would violate the topology spread constraint.
    • During a zone failure, topology spreading will prevent pods from affected nodes from falling back to other zones unless every single node in the affected zone is removed. In my experience, partial failures are more common than a complete failure where every single node is gone.
  • The alternative is to set whenUnsatisfiable: ScheduleAnyway, but in an autoscaled cluster this is unlikely to achieve much:
    • Autoscaling only looks at Filters and ignores Scores, so this will not affect which zones new nodes are provisioned in.
    • Autoscaling is only triggered by unschedulable pods, so in practice the scheduler will put the pods wherever there is free capacity and new nodes will not be provisioned at all. Unless cluster utilization is quite low, chances are that "wherever there is free capacity" doesn't result in pods being spread out across zones.
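For concreteness, here is a minimal sketch of the two settings being compared, expressed with the upstream k8s.io/api Go types (the selector, skew, and label values are illustrative, not taken from the discussion):

```go
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

func main() {
	selector := &metav1.LabelSelector{MatchLabels: map[string]string{"app": "web"}}

	// Hard requirement: pods stay pending if spreading across zones cannot be
	// satisfied (e.g. during a zonal stock-out, as described above).
	hard := corev1.TopologySpreadConstraint{
		MaxSkew:           1,
		TopologyKey:       "topology.kubernetes.io/zone",
		WhenUnsatisfiable: corev1.DoNotSchedule,
		LabelSelector:     selector,
	}

	// Soft preference: the scheduler only scores nodes by spreading; an
	// autoscaler that looks at Filters alone will ignore it entirely.
	soft := hard
	soft.WhenUnsatisfiable = corev1.ScheduleAnyway

	fmt.Printf("hard constraint: %+v\n", hard)
	fmt.Printf("soft constraint: %+v\n", soft)
}
```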

My understanding is that the goal of this proposal is to address the problems with whenUnsatisfiable: DoNotSchedule spreading that I described above. Am I correct?


Challenges

I'd love to see this solved, but I don't have any good ideas on how to do it. A timeout-based approach seems very fragile, as node provisioning times vary widely between clouds, and even within a single cloud they may be very different based on the type of hardware being used (e.g. nodes with GPUs often take more time to start up and initialize). And what happens when preferred instance types are unavailable (e.g. a stock-out) and the autoscaler needs to fall back to a different type? That would add anywhere from a few seconds to 15+ minutes of extra latency, depending on the cloud and the exact reason for the node creation failure.
We can say that it's up to the user to set the right timeout, but how many users will be able to set the right timeout in practice? I've been working on Cluster Autoscaler in OSS and GKE for >6 years, and I have no idea what a good timeout value would be on GCP, much less in any other cloud.

An alternative would be some sort of communication mechanism between Autoscaler and Scheduler, but for that I think we should have an idea of how to support this in the Autoscaler: how do we make it aware that it should update the pod at all? Today the Autoscaler just imports scheduler code and runs PreFilters/Filters, without any understanding of what they actually check. How would it know that a pod is unschedulable because of TopologySpreading and not some other constraint? And I mean the question in the sense of how much of the Autoscaler we would have to refactor, not just as a high-level conceptual answer.


Alternatives that may be worth discussing

Finally, even if we solve all of those issues, scheduling pods in a way that respects the topology spread constraint still wouldn't guarantee equal spreading between zones. If one zone is temporarily stocked-out, the scheduler will not be able to restore the spreading after the instances become available again.

That brings me to the controversial question: wouldn't it be better to solve this at the Deployment/ReplicaSet/StatefulSet/etc. level instead? Any such controller could target pods to specific zones (e.g. by setting a nodeSelector for different zones) and could continuously reconcile the number of pods in each zone. This would also address the problem of timeouts: we could fall back to a different zone after a relatively short timeout, knowing that we can always restore the spreading as soon as the capacity becomes available.

This is the approach taken by the https://github.com/kubernetes/autoscaler/blob/master/balancer/proposals/balancer.md proposal. I'm not sure the balancer is the best way to implement this either; I'm not as familiar with whatever challenges that approach may be facing. But I think it would be good to start the discussion by agreeing on what problems we're trying to solve and evaluating which component could best solve them, before jumping into any particular implementation.

@sanposhiho
Member Author

sanposhiho commented Jul 26, 2023

Thanks @MaciekPytel for getting involved!

I believe we should continue discussing the detailed design in the KEP PR instead of here, but let me roughly answer your questions.

In this issue, we're considering adding new whenUnsatisfiable values, DoNotSchedule_BestEffort and PreemptOrSchedule (the names can be officially decided in the KEP).
DoNotSchedule_BestEffort is the one CA is related to.
With DoNotSchedule_BestEffort, the scheduler keeps trying to schedule the Pod as if the TopologySpread were DoNotSchedule. But when the scheduler somehow knows that CA cannot create a Node for this Pod, it falls back to ScheduleAnyway so that the Pod gets scheduled without being blocked by PodTopologySpread.

Next comes the question of how to know that CA cannot create a Node for the Pod. We discussed two options:

  • wait a certain period of time (like a timeout): if a new Node isn't created within a certain period, fall back to ScheduleAnyway. (ref)
  • create a new Pod condition, TriggeredScaleUp, and have CA set it to false if it cannot create a new Node for the Pod. (ref)

In the issue, we kind of concluded that we prefer option 2, the new Pod condition. So, CA's responsibility is to set the new Pod condition TriggeredScaleUp to false. The reason was, as you said, that it's difficult to set an appropriate timeout. (ref)
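To make the condition-based option concrete, here is a hypothetical Go sketch of the scheduler-side check. The shouldFallBack helper is invented for illustration; only the TriggeredScaleUp condition idea comes from this discussion, and it is not part of the upstream API:

```go
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
)

// shouldFallBack sketches option 2 above: the cluster autoscaler would set a
// Pod condition (called TriggeredScaleUp in this discussion) to False when it
// cannot provision a Node for the Pod, and the scheduler would then treat
// DoNotSchedule as ScheduleAnyway for that Pod.
func shouldFallBack(pod *corev1.Pod) bool {
	for _, c := range pod.Status.Conditions {
		// "TriggeredScaleUp" is the name proposed in this thread, not an
		// existing upstream condition type.
		if c.Type == corev1.PodConditionType("TriggeredScaleUp") && c.Status == corev1.ConditionFalse {
			return true
		}
	}
	return false
}

func main() {
	pod := &corev1.Pod{}
	pod.Status.Conditions = append(pod.Status.Conditions, corev1.PodCondition{
		Type:   corev1.PodConditionType("TriggeredScaleUp"),
		Status: corev1.ConditionFalse,
	})
	fmt.Println("fall back to ScheduleAnyway:", shouldFallBack(pod))
}
```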

That's the current status of our discussion, and I'm going to create the KEP based on it. So, answering your questions:

My understanding is that the goal of this proposal is to address the problems with whenUnsatisfiable: DoNotSchedule spreading that I described above. Am I correct?

Yes. Exactly correct.

An alternative would be some sort of communication mechanism between Autoscaler and Scheduler

And yes, it's exactly what we concluded (at least in the issue).

How would it know that a pod is unschedulable because of TopologySpreading and not some other constraint?

I believe that, to keep it simple, CA doesn't need to do anything special for TopologySpread.
It only needs to set the condition when it cannot create a node for an unschedulable Pod, regardless of the reason. And when the scheduler sees that condition, it falls back to ScheduleAnyway in any case. If the unschedulability isn't caused by TopologySpread, the Pod should still be unschedulable even after the fallback.

If one zone is temporarily stocked-out, the scheduler will not be able to restore the spreading after the instances become available again.

Such "rescheduling", "rebalancing" is the responsibility of descheduler, not the scheduler. So, we don't need to concern much about the rebalancing in the scheduler.

@MaciekPytel

I believe we should continue to discuss a detailed design in KEP PR instead of here

Sounds good to me. Please tag me on the PR, and please feel free to ping me on Slack if you want to discuss any CA-related parts.

@sanposhiho
Member Author

@alculquicondor @MaciekPytel Can we target this enhancement for v1.29 (based on your bandwidth)?

@sanposhiho
Member Author

We skipped the v1.29 release; let's hopefully aim for the next one.
We'll discuss it in an upcoming SIG Autoscaling meeting (Oct 23rd).

@tzneal
Contributor

tzneal commented Oct 13, 2023

This handles one part of the scheduling problem (an autoscaler is unable to launch new capacity) but doesn't handle the case where an autoscaler launches new capacity, but it's degraded in some way (e.g. node is ready, but all pods that schedule to the new node fail due to some other sort of issue affecting the topology domain). Has there been any thought on allowing scheduling restrictions to be overridden during gray failures?

When thinking on it, I was considering a CRD that a user could create to indicate to the scheduler/autoscaler/anyone else that a particular topology domain is now invalid and shouldn't count for topology spread purposes, and no pods should be scheduled to that domain. Autoscalers could read the same CRD and avoid attempting to scale up nodes in that domain as well.
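A hypothetical shape for such a CRD, sketched with invented Go type names (DomainHealth is not an existing API; it only illustrates the idea of a shared "this domain is bad" signal):

```go
package main

import (
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// DomainHealth is a hypothetical CRD for the idea above: an operator marks a
// topology domain (e.g. a zone) as unusable, and the scheduler, autoscalers,
// and a descheduler could all consume the same object. All names are invented
// for illustration only.
type DomainHealth struct {
	metav1.TypeMeta   `json:",inline"`
	metav1.ObjectMeta `json:"metadata,omitempty"`

	Spec DomainHealthSpec `json:"spec"`
}

type DomainHealthSpec struct {
	TopologyKey string `json:"topologyKey"` // e.g. topology.kubernetes.io/zone
	Domain      string `json:"domain"`      // e.g. zone-a
	Unusable    bool   `json:"unusable"`    // exclude from spreading and scale-up
	Reason      string `json:"reason,omitempty"`
}

func main() {
	dh := DomainHealth{
		Spec: DomainHealthSpec{
			TopologyKey: "topology.kubernetes.io/zone",
			Domain:      "zone-a",
			Unusable:    true,
			Reason:      "provider-reported AZ degradation",
		},
	}
	fmt.Printf("%+v\n", dh.Spec)
}
```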

There are a few other advantages to being able to imperatively indicate to multiple consumers that a topology domain is bad:

  • maybe I know before the scale-up occurs (e.g. my cloud provider told me that an AZ is having issues) and I can avoid the launch and delay by marking that topology domain
  • future integration with something like descheduler to move my existing pods out of the affected topology domain

@ellistarn

ellistarn commented Oct 13, 2023

Replicaset Spread

That brings me to the controversial question: wouldn't it be better to solve this at the Deployment/ReplicaSet/StatefulSet/etc. level instead? Any such controller could target pods to specific zones (e.g. by setting a nodeSelector for different zones) and could continuously reconcile the number of pods in each zone. This would also address the problem of timeouts: we could fall back to a different zone after a relatively short timeout, knowing that we can always restore the spreading as soon as the capacity becomes available.

I quite like this line of thinking. To add to it (though you may be implying this already), it could continue to be part of the podtemplatespec, but would result in the replicaset controller applying additional corresponding nodeSelectors to the physical pods it creates. You could even re-use the topology spread constraints API surface and just shift the spread responsibility to the replicaset controller instead of the scheduler.

To be explicit (a rough sketch follows this list):

  1. user creates a deployment, w/ topologyspreadconstraints in the podtemplatespec
  2. deployment creates a replicaset w/ the full podtemplatespec
  3. replicaset controller applies additional nodeSelectors to each pod, in an effort to enforce spread
  4. replicaset controller could continually reconcile pods to maintain a balance
  5. kube-scheduler (and autoscalers) simply don't need to think about topologyspreadconstraints
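A rough sketch of step 3, assuming a naive round-robin split. The zoneAssignments helper is invented purely for illustration; a real controller would reconcile against the pods that already exist and would stamp a nodeSelector such as topology.kubernetes.io/zone=<zone> onto each Pod it creates:

```go
package main

import "fmt"

// zoneAssignments decides how many replicas would be pinned to each zone.
// The workload controller, not the scheduler, would enforce the spread by
// adding a per-pod zone nodeSelector based on this split.
func zoneAssignments(replicas int, zones []string) map[string]int {
	counts := make(map[string]int, len(zones))
	for i := 0; i < replicas; i++ {
		counts[zones[i%len(zones)]]++
	}
	return counts
}

func main() {
	zones := []string{"zone-a", "zone-b", "zone-c"}
	for zone, n := range zoneAssignments(10, zones) {
		fmt.Printf("%d pods get nodeSelector topology.kubernetes.io/zone=%s\n", n, zone)
	}
}
```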

Scheduler / Autoscaler coordination

An alternative would be some sort of communication mechanism between Autoscaler and Scheduler,

I've often wished that the scheduler and autoscaler were the same component, as it unlocks the ability to make and enforce decisions in the same place, avoiding race conditions like the ones this KEP attempts to address. Of course, there are ways to communicate these decisions between systems, but communication protocols are hard (Kube API Server objects or otherwise). This is most likely a dead end given where Kubernetes is today, but since @MaciekPytel is opening up controversial questions, I figure I might throw this one into the ring ;)

Topology API Object

When thinking on it, I was considering a CRD that a user could create to indicate to the scheduler that a particular topology domain is now invalid and shouldn't count for topology spread purposes, and no pods should be scheduled to that domain. Autoscalers could read the same CRD and avoid attempting to scale up nodes in that domain as well.

This would be very useful for achieving use cases like "disable this AZ while we ride out this outage" aws/karpenter-provider-aws#4727

@sanposhiho
Member Author

sanposhiho commented Oct 14, 2023

node is ready, but all pods that schedule to the new node fail due to some other sort of issue affecting the topology domain

The cluster autoscaler should take all scheduling constraints into consideration when it does the simulation.
If it does that properly, that kind of scenario (a Node is created, but the unschedulable Pod cannot actually get scheduled to that Node) shouldn't happen.
Or, in some minor scenario (like a higher-priority Pod being created very recently, after CA's reconciliation), the Pod may still be unschedulable after CA's first reconciliation. But that should be resolved in CA's next reconciliation.

the scheduler/autoscaler/anyone else that a particular topology domain is now invalid

Giving taints to the Nodes in such a domain looks sufficient to me. Or do you have an argument that taints cannot play that role? Topology spread takes taints into consideration (ref).

An alternative would be some sort of communication mechanism between Autoscaler and Scheduler,

I've often wished...

That "communication mechanism" is the current design of this KEP. We give a new condition to the Pod, the cluster autoscaler gives TriggeredScaleUp: false to the Pod condition when they cannot get a Node for that unsched Pod (due to stockout etc), and the scheduler does the fallback if the Pod has that condition. Please check out the current draft ↓, and let's bring technical discussion there if any.
#4150


So, it's simplest, and it works well, for "unschedulable Pods" to always be the medium of communication between kube-scheduler and the cluster autoscaler. Introducing another CRD or something similar would make things more complicated.
That's why I'm proposing the current design: introduce a new condition on the Pod, through which the cluster autoscaler can tell the scheduler that it couldn't create any Node for the unschedulable Pod.
So, we can say that putting the PodScheduled: false condition on the Pod is the request from the scheduler to CA, and the new condition on the Pod would be considered the response from CA.

@tzneal
Contributor

tzneal commented Oct 16, 2023

node is ready, but all pods that schedule to the new node fail due to some other sort of issue affecting the topology domain

That's not the situation I'm describing. The node can become ready, pods can schedule correctly, but fail to start due to some underlying failure particular to the topology domain, one which is sufficient to break workloads but still allows nodes to launch and go Ready. In that case, this proposal doesn't help, as the autoscaler can happily continue to create nodes which appear functional but are not.

Giving taints to the Nodes in such a domain looks sufficient to me. Or do you have an argument that taints cannot play that role? Topology spread takes taints into consideration (#3094).

For the gray failure situation, I need to:

  • taint all nodes to inform the scheduler
  • reconfigure any autoscalers to avoid the topology domain
  • inform any other components that might need to be aware

There's no common method I can use to inform every interested party that "for right now topology domain X is bad, change your decision making accordingly".

@sanposhiho
Member Author

sanposhiho commented Oct 17, 2023

I'm not sure this is the right place to talk about your story, then. This enhancement is about how we do the fallback when scheduling keeps failing due to a required topology spread. You are talking about a domain failure, which is invisible to the scheduler (unless we add taints manually).
I interpret your challenge as how to make such a failure visible to the components (scheduler, CA, etc.).

Can you create another issue in k/k to discuss your user story? Then you can elaborate on your story more in that issue.

I'm wondering whether the situations you have in mind can be detected automatically.
If so,

  • giving taints to nodes in the domain
  • not marking new nodes in the domain as Ready

would be the only necessary improvements.

@tzneal
Contributor

tzneal commented Oct 17, 2023

when scheduling keeps failing due to a required topology spread

In my thinking, that is a subset of the larger problem "a topology domain is no longer viable in some way". Solving that one would solve the "inability to schedule", while also handling the "can schedule, but it won't work if it does".

I'm wondering whether the situations you have in mind can be detected automatically.

It could be automated by some other decoupled component, e.g. your cloud provider sends a notification that a controller receives and then creates a "zone-A is invalid" object which every interested party consumes.

@sanposhiho
Member Author

Please note that the scheduler's responsibility is to schedule pods based on visible status.

The node can become ready, pods can schedule correctly, but fail to start due to some underlying failure particular to the topology domain

So, what you are saying is that the scheduler lacks visibility: if the scheduler could understand that something is wrong in the domain that would block pod startup, the pods would not go to that domain.
That visibility improvement might be achieved with only taints, as I said, or with a new CRD, as you said, etc. We don't know yet.

But this KEP tries to do the fallback to solve the problems described in the draft KEP. That's it. It doesn't try to strengthen the ability to notice domain failures; that is completely out of scope.

That's why I want to distinguish between this KEP and the problem you have.

So, sorry again, could you please create another issue in k/k with a specific case where ↓ could happen? I still don't get what exact scenario you have in mind. We can discuss what we need to improve based on that in the new issue, not here.

The node can become ready, pods can schedule correctly, but fail to start due to some underlying failure particular to the topology domain

@tzneal
Contributor

tzneal commented Oct 17, 2023

Please note that the scheduler's responsibility is to schedule pods based on visible status.

Yes, I think a proposal of having a node autoscaler make its inability to launch a node visible to the scheduler, so that it can apply scheduling rules differently, is similar to, but not as expressive as, directly marking a topology domain as invalid for the scheduler and other consumers.

The node can become ready, pods can schedule correctly, but fail to start due to some underlying failure particular to the topology domain

So, what you are saying is that the scheduler lacks visibility: if the scheduler could understand that something is wrong in the domain that would block pod startup, the pods would not go to that domain. That visibility improvement might be achieved with only taints, as I said, or with a new CRD, as you said, etc. We don't know yet.

It's a superset of "autoscaler can't launch a node", and I think it's a more common issue. You could also solve "autoscaler can't launch a node" by tainting all of the nodes in the problem domain and using nodeTaintsPolicy: Honor on the topology spread constraints.
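For reference, a minimal sketch of that combination using the upstream k8s.io/api Go types (the skew and selector values are illustrative):

```go
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

func main() {
	// With nodeTaintsPolicy: Honor, tainted nodes that the pod does not
	// tolerate are excluded from the skew calculation, so tainting every
	// node in a broken domain effectively removes that domain from spreading.
	honor := corev1.NodeInclusionPolicyHonor
	constraint := corev1.TopologySpreadConstraint{
		MaxSkew:           1,
		TopologyKey:       "topology.kubernetes.io/zone",
		WhenUnsatisfiable: corev1.DoNotSchedule,
		LabelSelector:     &metav1.LabelSelector{MatchLabels: map[string]string{"app": "web"}},
		NodeTaintsPolicy:  &honor,
	}
	fmt.Printf("%+v\n", constraint)
}
```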

But this KEP tries to do the fallback to solve the problems described in the draft KEP. That's it. It doesn't try to strengthen the ability to notice domain failures; that is completely out of scope.

That's why I want to distinguish between this KEP and the problem you have.

To be clear, I'm not arguing for noticing domain failures. I just want a mechanism for users to be able to handle them without updating all of their workloads with new node affinities to avoid the problem domain. My argument is to push the KEP towards solving the larger problem.

So, sorry again, could you please create another issue in k/k with a specific case where ↓ could happen? I still don't get what exact scenario you have in mind. We can discuss what we need to improve based on that in the new issue, not here.

The node can become ready, pods can schedule correctly, but fail to start due to some underlying failure particular to the topology domain

@sanposhiho
Member Author

sanposhiho commented Oct 18, 2023

It's too vague to discuss further here. Could you create a new draft KEP PR as an alternative solution, then? You can reference this KEP number, and for now you don't need to fill in all the parts, only the core parts describing the design. Then we can compare the two draft PRs.

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jan 30, 2024
@alculquicondor
Member

@sanposhiho are you still pushing for this in 1.30?

@sanposhiho
Member Author

I'm working on the investigation on the CA side, but v1.30 is nearly impossible.

@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle rotten
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Mar 1, 2024
@sanposhiho
Member Author

/remove-lifecycle rotten

@k8s-ci-robot k8s-ci-robot removed the lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. label Mar 2, 2024
@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label May 31, 2024
@sanposhiho
Member Author

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label May 31, 2024
@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Aug 29, 2024
@sanposhiho
Member Author

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Aug 29, 2024