Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Ignore node_count and initial_node_count when late initializing NodePools #353

Conversation

toastwaffle
Copy link

Having node_count late-initialized with autoscaling can cause the autoscaler and Crossplane to fight over how many nodes there should be (with Crossplane effectively reverting autoscaler changes). If autoscaling is used, node_count should remain unset.

As documented, initial_node_count may (or may not?) change if the node count is manually changed (via gcloud or the Cloud Console). This can cause resources to fail reconciliation because terraform wants to destroy and recreate the node pool. If the node count is changed manually, either Crossplane should correct it back to the value of an explicitly set node_count field, or it should be ignored because autoscaler is in use (in which case node_count should be unset).

Fixes #340

I have:

  • Run make reviewable test to ensure this PR is ready for review (sort of - the linter was OOMing, but the tests pass when run separately).

How has this code been tested

Manually exercised with make run against minikube by creating a Cluster and Node Pool with autoscaling enabled, and manually (in Cloud Console) setting the node count to a high number such that the autoscaler scales it back down (which is how I found the issue with initial_node_count beyond the main issue in #340). I repeated this process a few times, and did not see any errors or attempts to undo the autoscaler's actions.

…ools

Having node_count late-initialized with autoscaling can cause the autoscaler and
Crossplane to fight over how many nodes there should be (with Crossplane
effectively reverting autoscaler changes). If autoscaling is used, node_count
should remain unset.

As documented in [0], initial_node_count may (or may not?) change if the node
count is manually changed (via gcloud or the Cloud Console). This can cause
resources to fail reconciliation because terraform wants to destroy and recreate
the node pool. If the node count is changed manually, either Crossplane should
correct it back to the value of an explicitly set node_count field, or it should
be ignored because autoscaler is in use.

[0]: https://registry.terraform.io/providers/hashicorp/google/latest/docs/resources/container_node_pool#initial_node_count

Signed-off-by: Samuel Littley <[email protected]>
@Upbound-CLA
Copy link

Upbound-CLA commented Aug 9, 2023

CLA assistant check
All committers have signed the CLA.

@turkenf
Copy link
Collaborator

turkenf commented Aug 9, 2023

/test-examples="examples/container/nodepool.yaml"

@toastwaffle
Copy link
Author

@turkenf this goes beyond what was suggested in #340, and I'm not 100% sure that ignoring initial_node_count is actually correct. I'd appreciate your thoughts.

I will hopefully get the CLA signed tomorrow; I need to get it reviewed by my company's legal team.

@turkenf
Copy link
Collaborator

turkenf commented Aug 14, 2023

/test-examples="examples/container/nodepool.yaml"

@turkenf
Copy link
Collaborator

turkenf commented Aug 15, 2023

Hi @toastwaffle, thank you for your contribution.
In v0.35.0, the granular management policies feature was added. This might solve your problem here, can you check this example or docs here?

@toastwaffle
Copy link
Author

I did spot that, but it looked like the feature was marked as Alpha. Do you have a rough idea of how stable the feature is?

Either way, I'll give it a try tomorrow.

@turkenf
Copy link
Collaborator

turkenf commented Aug 15, 2023

I did spot that, but it looked like the feature was marked as Alpha. Do you have a rough idea of how stable the feature is?

@lsviben can better comment here, but AFAIK it is planned to be promoted to beta soon.

@jeanduplessis
Copy link
Collaborator

@toastwaffle, the feature API is considered stable at this point. We plan to mature it to beta in the XP 1.14 release.

@turkenf
Copy link
Collaborator

turkenf commented Sep 25, 2023

@toastwaffle, have you had a chance to try?

@toastwaffle
Copy link
Author

We did try it, and it did solve the node count problem, but we ended up rolling it back pretty quickly due to some errant behaviour (crossplane/upjet#263).

We may have been a little bit trigger happy with the rollback due to lack of understanding on our side, so we are going to try again, but we've been waiting for #373 to be released (James is my coworker) so we don't have to wrangle 2 upgrades in quick succession.

If you think this change isn't worth merging given the granular management policies, I'm happy for this PR to be closed.

@turkenf
Copy link
Collaborator

turkenf commented Sep 27, 2023

We can wait for you to try one more time before closing it. Please share the results with us.

@toastwaffle
Copy link
Author

We successfully upgraded 🎉

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

NodePool container.gcp.upbound.io should not LateInitialize node_count
4 participants