Feat: capacity aware boosting #36

Open · mikouaj opened this issue Apr 19, 2024 · 4 comments
Labels
enhancement New feature or request

Comments

@mikouaj
Member

mikouaj commented Apr 19, 2024

Community Note

  • Please vote on this issue by adding a 👍 reaction
    to the original issue to help the community and maintainers prioritize this request
  • Please do not leave "+1" or "me too" comments, they generate extra noise for issue followers and do
    not help prioritize the request
  • If you are interested in working on this issue or have submitted a pull request, please leave a comment.

Description

Capacity-aware boosting will make the CPU resource boost conditional: the mutating webhook would verify whether the given Pod, with boosted resources, would be schedulable on the cluster:

  • The outcome of a negative verification may vary depending on the configuration and may include: no boost at all, or a boost only up to the available capacity
  • The verification algorithm should take Cluster Autoscaler operation into consideration

This feature requires simulating the scheduling algorithm, including node selection and resource checks. There is no API for this, and the real scheduling algorithm is complex, so some simplification that produces "good enough" results is needed; a rough sketch follows.
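
For illustration, a minimal Go sketch of such a simplified fit check, assuming the webhook can list nodes and already knows the CPU requested on each one. The function and parameter names are illustrative, and real scheduler predicates (taints, affinity, topology spread) are deliberately ignored:

```go
// Package capacity: a "good enough" schedulability check, not the real scheduler.
package capacity

import (
	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
)

// FitsAnyNode reports whether a Pod with the boosted CPU request would fit
// on at least one node, comparing boostedCPU against each node's free CPU
// (allocatable minus what is already requested by pods on that node).
func FitsAnyNode(boostedCPU resource.Quantity, nodes []corev1.Node, requestedByNode map[string]resource.Quantity) bool {
	for _, node := range nodes {
		free := node.Status.Allocatable[corev1.ResourceCPU] // copy, safe to mutate
		used := requestedByNode[node.Name]                  // zero Quantity if unknown
		free.Sub(used)
		if free.Cmp(boostedCPU) >= 0 {
			return true
		}
	}
	return false
}
```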

References

@yyvess

yyvess commented Jul 19, 2024

In the case where we don't want to impact the scheduler, isn't boosting only the limit sufficient?
Conversely, for a Pod that requires a boost, updating the request will impact the scheduler, but that is what is expected.

To sum up, I'm not sure I understand the use case where you want to impact the scheduler but not scale nodes, or to disable a boost if the Pod cannot be scheduled.

Perhaps it would be great to add an option to increase only the limit by a percentage, and eventually another one to remove the limit during the boost.

@mikouaj
Member Author

mikouaj commented Jul 31, 2024

@yyvess the use case we are trying to solve is as follows:

  1. The Pod resource requests are increased per the StartupCPUBoost config
  2. The scheduler is not able to find a suitable node (no capacity) and the Pod is unschedulable
  3. (autoscaler path) The Cluster Autoscaler kicks in and provisions new nodes to accommodate the boosted Pods
  4. (autoscaler path) The Pods are scheduled on the new nodes
  5. (autoscaler path) The Pods' CPU requests are reverted to their original values
  6. (autoscaler path) After some time, the Cluster Autoscaler considers the nodes underutilized (as the larger CPU requests were reverted) and triggers a scale-in action
  7. (autoscaler path) The Pods are evicted from the nodes and rescheduled elsewhere. We are back at point 1; this may even repeat in a loop.

With this feature we aim to address point 2: giving the user the possibility to decide whether CPU boosting can lead to unschedulable Pods. A sketch of what that decision could look like follows.
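
For illustration, a rough Go sketch of that decision, assuming a hypothetical fallback policy knob on the StartupCPUBoost config. The policy names and the function are made up for this sketch, not part of the current API:

```go
package boost

import "k8s.io/apimachinery/pkg/api/resource"

// FallbackPolicy is a hypothetical config knob deciding what the webhook
// does when the boosted request would not fit on any node.
type FallbackPolicy string

const (
	BoostAnyway     FallbackPolicy = "BoostAnyway"     // keep the boost and rely on the Cluster Autoscaler
	BoostToCapacity FallbackPolicy = "BoostToCapacity" // boost only up to the largest free CPU slot
	NoBoost         FallbackPolicy = "NoBoost"         // keep the original request so the Pod stays schedulable
)

// effectiveCPU picks the CPU request to set on the Pod, given the original
// request, the desired boosted value, and the largest free CPU found on
// any node by the (simplified) fit check.
func effectiveCPU(original, boosted, largestFree resource.Quantity, policy FallbackPolicy) resource.Quantity {
	if largestFree.Cmp(boosted) >= 0 {
		return boosted // the boosted Pod fits as-is, no fallback needed
	}
	switch policy {
	case BoostToCapacity:
		if largestFree.Cmp(original) > 0 {
			return largestFree
		}
		return original
	case NoBoost:
		return original
	default: // BoostAnyway: accept a possibly unschedulable Pod
		return boosted
	}
}
```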

@mikouaj mikouaj added the enhancement New feature or request label Jul 31, 2024
@yyvess

yyvess commented Aug 1, 2024

@mikouaj I understand that point 2 can be an issue.
As you explain, solving it isn't easy.
In the meantime, to avoid this case, you could allow boosting only the limit value (and not touch the request); that should not impact the scheduler and would avoid point 2.
But currently it is not possible to boost only the limit by a percentage (a sketch of that mutation follows).

PS:
It could also be interesting to allow removing the limit value during the boost, so the Pod can use all of the node's CPU while boosted.
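
For illustration, a minimal Go sketch of that limit-only mutation, assuming the webhook has the container and a boost percentage at hand; the helper name is illustrative:

```go
package boost

import (
	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
)

// BoostLimitOnly raises the container's CPU limit by the given percentage
// while leaving the CPU request untouched, so the scheduler's fit decision
// (which only looks at requests) is unaffected.
func BoostLimitOnly(c *corev1.Container, percent int64) {
	limit, ok := c.Resources.Limits[corev1.ResourceCPU]
	if !ok {
		return // no CPU limit set; nothing to raise
	}
	boosted := limit.MilliValue() * (100 + percent) / 100
	c.Resources.Limits[corev1.ResourceCPU] = *resource.NewMilliQuantity(boosted, resource.DecimalSI)
}
```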

@mikouaj
Member Author

mikouaj commented Aug 8, 2024

@yyvess I like the idea of removing the limit value during the boost. It sounds obvious now, but I had never thought about it before, many thanks! I will create a feature to introduce that possibility in a config-driven way; a sketch of the mutation follows.

For the resource requests, boosting them is needed to actually guarantee the resources, although it comes with all of the described challenges. Addressing this may be tough, but I believe it is still doable.
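
For illustration, a minimal Go sketch of that mutation; the function and the idea of gating it behind a StartupCPUBoost config flag are assumptions here, not the implemented API:

```go
package boost

import corev1 "k8s.io/api/core/v1"

// RemoveCPULimit drops the container's CPU limit entirely for the boost
// window, letting the container burst into all spare CPU on the node.
// The original limit would have to be recorded somewhere (e.g. in an
// annotation) so it can be restored when the boost ends.
func RemoveCPULimit(c *corev1.Container) {
	delete(c.Resources.Limits, corev1.ResourceCPU)
}
```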
