
feat(helm): add support for vpa #4210

Open · wants to merge 1 commit into main

Conversation

@t3mi (Contributor) commented Aug 22, 2024:

What this PR does / why we need it:
This PR adds support for the Vertical Pod Autoscaler (VPA) so users do not have to worry about the controller's resource usage and can let the VPA do its job.

How does this PR make you feel:
[gif]

If applicable:

  • this PR contains documentation
  • this PR contains tests
  • this PR contains YAML Samples
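For context, a sketch of the kind of chart values such a change could expose is shown below. Only `controlledResources` and `controlledValues` are visible in the review hunk further down; the other key names here are assumptions, not the PR's actual schema.

```yaml
# Hypothetical values.yaml excerpt; key names other than
# controlledResources/controlledValues are assumptions.
verticalPodAutoscaler:
  enabled: true
  updateMode: Auto                       # Off | Initial | Recreate | Auto
  controlledResources: ["cpu", "memory"] # resources VPA is allowed to manage
  controlledValues: RequestsAndLimits    # or RequestsOnly
```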

@matthchr (Member) left a comment:

I'm curious: Did you see an actual need for VPA-ing the ASO pod? Do you just VPA everything in your clusters?

I'm most interested in whether this is useful to you because you want VPA to scale the pod down (default requests are too high) or to scale it up (default requests are too low, or limits too low).

```yaml
controlledResources:
{{- toYaml . | nindent 8 }}
{{- end }}
{{- if .Values.verticalPodAutoscaler.controlledValues }}
```
@matthchr (Member) commented on the diff:

minor: Can .Values.verticalPodAutoscaler.controlledValues ever be unspecified in the values.yaml? It seems like probably not? Is the if-guard here really needed then?

@t3mi (Contributor, Author) replied:

Yes, it can be unspecified; the default would then be the RequestsAndLimits value.
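For illustration, a minimal sketch of how such a guard typically looks in the template (reconstructed from the review context above; the exact PR diff may differ). If `controlledValues` is left unset in values.yaml, the field is omitted from the rendered manifest and the VPA API's own default, RequestsAndLimits, applies:

```yaml
# Hypothetical template excerpt, assumed from the hunk shown above.
{{- with .Values.verticalPodAutoscaler.controlledResources }}
      controlledResources:
        {{- toYaml . | nindent 8 }}
{{- end }}
{{- if .Values.verticalPodAutoscaler.controlledValues }}
      controlledValues: {{ .Values.verticalPodAutoscaler.controlledValues }}
{{- end }}
```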

@t3mi (Contributor, Author) commented Aug 28, 2024:

> I'm curious: Did you see an actual need for VPA-ing the ASO pod?

Yes. We have multiple clusters with different operator load: some teams use ASO heavily, others create just a few resources to try it out. In the end we need a cost-effective solution for this "middleware" without the toil of picking correct resource values per cluster/team.

> Do you just VPA everything in your clusters?

We're trying to use it for most components when it's supported in the charts.

> I'm most interested in whether this is useful to you because you want VPA to scale the pod down (default requests are too high) or to scale it up (default requests are too low, or limits too low).

It depends on the application. For ASO it would be beneficial in both directions, as it's not that critical and can tolerate a restart.

@matthchr (Member) commented:
Sorry this has been lingering - will review it shortly!

@matthchr (Member) commented:
Synced with @theunrepentantgeek and @super-harsh on this topic and we have concerns that running VPA on ASO as it stands today may cause more harm than good. The reasoning is:

  1. ASO runs a single replica by default, and when that replica is down, webhooks won't work, which degrades the in-cluster user experience. I've got an item to improve this here: Change default pod count to 2 #4215
  2. VPA restarts the pod when it scales, but ASO re-syncs with Azure and re-reconciles resources after a restart. That means VPA-ing ASO may optimize your cluster's memory/CPU usage at the expense of sending more requests to Azure, which may cause subscription throttling or other problems (depending on how many resources you have in the cluster). A restart here or there is fine, but regular restarts could be problematic. We need to understand how frequent VPA restarts can be to better gauge the impact/risk on Azure subscription throttling.
  3. We haven't seen any evidence that the fixed limits we have defined for ASO are especially problematic. If you have specific data on average CPU/memory usage you'd like to share with us, I'd love to see it. In particular, it seems unlikely that VPA would scale ASO up past the default limits, which are quite generous for how much CPU/memory the pod actually uses.

With the above said, our current stance on this PR is to hold off on merging it at least for now. We'll re-evaluate in a few weeks when I get #4215 done and possibly have more time to play around with VPA configurations.

In the meantime, if you want to run VPA with ASO, do so with caution, but you should be able to do it even without Helm chart integration: all you need to do is create the autoscaling.k8s.io/v1 resources yourself.
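For example, a minimal standalone manifest might look like the sketch below; the Deployment name and namespace shown are the assumed defaults of an ASO install and may differ in your cluster:

```yaml
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: aso-controller
  namespace: azureserviceoperator-system           # assumed install namespace
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: azureserviceoperator-controller-manager  # assumed Deployment name
  updatePolicy:
    updateMode: "Initial"  # conservative: apply recommendations only at pod (re)creation
  resourcePolicy:
    containerPolicies:
      - containerName: "*"
        controlledValues: RequestsAndLimits
```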
