Skip to content
This repository has been archived by the owner on Apr 26, 2024. It is now read-only.

Fix Cluster Autoscaler Evictions #100

Closed
wants to merge 2 commits into from
Closed

Conversation

frctnlss
Copy link

The Kubernetes Cluster Autoscaler evicts pods on scale-down events based on a number of criteria. This PR makes the default behavior of new work pools, based on the Kubernetes worker, to prevent eviction.

The option to disable this setting is added for situations like GKE Autopilot which has a max number of running pods with the required annotation. This also means the backoff limit must be increased to accommodate restarts. Prefect will otherwise mark the Flow as Crashed when the backoff limit is reached.

Checklist

  • References any related issue by including "Closes #" or "Closes ".
    • If no issue exists and your change is not a small fix, please create an issue first.
  • Includes tests or only affects documentation.
  • Passes pre-commit checks.
    • Run pre-commit install && pre-commit run --all locally for formatting and linting.
  • Includes screenshots of documentation updates.
    • Run mkdocs serve view documentation locally.
  • Summarizes PR's changes in CHANGELOG.md

@frctnlss frctnlss requested a review from a team as a code owner November 13, 2023 21:12
"metadata": {
"annotations": {
"cluster-autoscaler.kubernetes.io/"
"safe-to-evict": "{{ safe_to_evict }}"
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This line separation was completed due to conflicting rules between black and flake8. Reference the docs here. I am open to alternative options, but the options I saw were: the line break, increasing the line length arbitrarily, or ignoring the E501 rule.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Feel free to add a comment to ignore the rule!

Copy link
Member

@desertaxle desertaxle left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the PR @frctnlss! I would prefer to set the cluster-autoscaler.kubernetes.io/safe-to-evict to False by default and leave out the new safe_to_evict and backoff_limit variables. Those values will still be editable, but I don't think they should be default variables, especially because increasing the backoff limit can have strange side-effects when used with Prefect retries.

@desertaxle
Copy link
Member

Closing this PR because it appears to be stale. Feel free to reopen and make the requested changes if this functionality is still needed.

@desertaxle desertaxle closed this Mar 12, 2024
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants