This repository has been archived by the owner on Mar 5, 2024. It is now read-only.

Adding support for GKE preemptibles nodes #30

Open · wants to merge 1 commit into master

Conversation

fallard84

Context: When a GKE node gets preempted, pods that were on the node remain there and will run when the node comes back online. It is also possible that the daemonset pod's status will be ready even when the node is not yet ready (possibly a stale cache).

Solution:

  • Added an option to configure the effect to set for the taint
  • Taint the node in case the pod is ready but the node is not

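A rough sketch of what the configurable effect could look like in the handler config; the field and helper names here are illustrative and may not match the PR exactly:

```go
package nidhogg

import corev1 "k8s.io/api/core/v1"

// Config sketches nidhogg's handler config with a hypothetical TaintEffect
// field added by this change; the real field name in the PR may differ.
type Config struct {
	Daemonsets   []Daemonset        `json:"daemonsets" yaml:"daemonsets"`
	NodeSelector []string           `json:"nodeSelector" yaml:"nodeSelector"`
	TaintEffect  corev1.TaintEffect `json:"taintEffect" yaml:"taintEffect"` // NoSchedule, PreferNoSchedule or NoExecute
}

// Daemonset identifies a daemonset whose pods must be ready before the
// nidhogg taint is removed.
type Daemonset struct {
	Name      string `json:"name" yaml:"name"`
	Namespace string `json:"namespace" yaml:"namespace"`
}

// effectOrDefault keeps today's behaviour (NoSchedule) when no effect is configured.
func effectOrDefault(c Config) corev1.TaintEffect {
	if c.TaintEffect == "" {
		return corev1.TaintEffectNoSchedule
	}
	return c.TaintEffect
}
```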
@Joseph-Irving
Contributor

Why do you need to add a taint if the node is not ready? Kubernetes by default adds a node.kubernetes.io/not-ready taint when a node isn't ready.
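For reference, the built-in taint referred to here looks roughly like this when expressed with the core/v1 types; the node lifecycle controller manages it automatically:

```go
package example

import corev1 "k8s.io/api/core/v1"

// notReadyTaint mirrors the taint Kubernetes itself puts on a NotReady node;
// it is separate from the taints nidhogg manages.
var notReadyTaint = corev1.Taint{
	Key:    "node.kubernetes.io/not-ready",
	Effect: corev1.TaintEffectNoSchedule, // a NoExecute variant is also used for taint-based eviction
}
```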

@fallard84
Author

Because we also want to wait for pods from a daemonset to be ready, not just the node.

@Joseph-Irving
Contributor

Sorry, I still don't follow. If the daemonset pods aren't ready, nidhogg will add the nidhogg taints as it normally does, so why do we need this extra check?

@fallard84
Author

> Sorry, I still don't follow. If the daemonset pods aren't ready, nidhogg will add the nidhogg taints as it normally does, so why do we need this extra check?

Sorry for the confusion. I will explain our use case and what we have experienced using Nidhogg in a bit more detail.

We are using GKE with preemptible nodes. That means our nodes live at most 24h and get replaced continuously. We have a critical networking daemonset deployed, and it absolutely must run before other pods can run. While using the current version of Nidhogg, I have seen the following happen when a node came back after being preempted:

  1. Nodes were sometimes ready before Nidhogg had time to taint them. The taint was always applied, but sometimes slightly too late. That caused pods to start running on the node before the daemonset pod was ready. While troubleshooting this issue, I could see that while the node was being initialized and not yet ready, Nidhogg would check the daemonset pod status and see the pod as ready (even though the daemonset pod hadn't even had time to start yet). I assume this was caused by a stale pod status cache. This is why I added a check that the node must also be ready, so the taint is added earlier in the process (see the sketch after this list). This extra check could also be made optional through config in case it causes issues with other setups.

  2. Pods without the toleration still ended up running, even when Nidhogg had time to taint the node. Upon investigation, I realized that when a node gets preempted, all pods that were on it before the preemption are already scheduled on it even before it becomes ready. The default NoSchedule effect only prevents new pods from being scheduled on the node; it does not prevent already-scheduled pods from starting. Hence the option to use NoExecute as the effect.
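A minimal sketch of the extra check from point 1, assuming a small helper next to nidhogg's existing pod-readiness logic (the names here are illustrative):

```go
package example

import corev1 "k8s.io/api/core/v1"

// nodeIsReady reports whether the node's Ready condition is True. Point 1
// above is about treating a not-yet-ready node as still needing the nidhogg
// taint, even if a stale cache claims the daemonset pod is ready.
func nodeIsReady(node *corev1.Node) bool {
	for _, cond := range node.Status.Conditions {
		if cond.Type == corev1.NodeReady {
			return cond.Status == corev1.ConditionTrue
		}
	}
	return false
}

// shouldTaint combines the existing pod-readiness check with the node check.
func shouldTaint(node *corev1.Node, daemonsetPodsReady bool) bool {
	return !daemonsetPodsReady || !nodeIsReady(node)
}
```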

Hopefully that explains better 😅

@fallard84
Author

@Joseph-Irving Do you have more questions/concerns?

@Joseph-Irving
Contributor

So if I understand this correctly, in GKE when your preemptible nodes get shut down they later start back up again? So the same node comes back up with the pods it previously had running on it? We use spot instances in AWS and they work in a similar way, but when they're terminated that's it, they're gone. A new node will replace them, so there's no weird stale cache thing going on.
I would rather make this ready check an optional code path, as it seems like a fairly niche edge case.
I think being able to configure taint effects is fine; I would just be cautious with them, as NoExecute can be quite disruptive. If you had some kind of cluster-wide outage of your networking daemonset, all your pods would be evicted from all of your nodes, which could potentially be far more disruption than you actually need.
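To make the trade-off concrete, the only mechanical difference on nidhogg's taint is the Effect field, but the semantics differ sharply. A sketch, with a placeholder taint key rather than whatever key nidhogg actually generates:

```go
package example

import corev1 "k8s.io/api/core/v1"

// nidhoggTaint builds nidhogg's taint with a configurable effect. With
// NoSchedule only new pods are kept off the node; with NoExecute, pods
// already bound to the node that lack a matching toleration are evicted as
// well, which is why a cluster-wide daemonset outage would evict workloads
// everywhere.
func nidhoggTaint(key string, effect corev1.TaintEffect) corev1.Taint {
	return corev1.Taint{
		Key:    key, // placeholder: the namespace.name-style key nidhogg already uses
		Effect: effect,
	}
}
```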

@universam1

universam1 commented Jun 10, 2022

I see the point of @fallard84, let me rephrase:

  1. do not consider removing the nidhogg taint before the node status is ready; it might be too early, since the daemonsets haven't been scheduled yet
  2. support NoExecute in order to be disruptive on purpose and cordon unhealthy nodes.

@Joseph-Irving would you be willing to merge this improvement? For our use case this feature is critical!

@jerkern

jerkern commented Oct 18, 2022

This PR seems very useful for being able to customize the taint effect, e.g. I have a use case for PreferNoSchedule rather than plain NoSchedule.

https://cloud.google.com/kubernetes-engine/docs/how-to/node-taints
