Small window where nodes are untainted? #24
Here's one that has essentially done the right thing and added all 4 taints at once, but it flip-flopped around a bit, removing and re-adding taints as it went, which is a little strange too.
What is the status of your pods when this is happening? If your pods are restarting or alternating between passing and failing readiness, then this is expected behaviour.
Awesome. Thanks for your fast response. Ah yes; pod flapping would very likely be the cause of some of the taints being added and removed. What about the first set of logs, where the fluentd taint wasn't added until after the others had been added and removed, and the node became untainted for about 3 seconds?
Could it be possible that the fluentd pod was already ready by the time that the taints were added? There is a delay between your node coming up and Nidhogg applying the taint, so theoretically, if the fluentd pod started up very quickly and passed readiness, it wouldn't have its taint added.
It's just strange because the taint for the fluentd pod is added later. I'll have to do more investigation to see if its readiness check was flapping. It's almost as if Nidhogg could benefit from a configuration setting saying that a service needs to pass X successful checks in a row before a taint is removed, to deal with services that may flap as nodes are coming up.
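Purely as an illustration of the idea, and assuming a config layout along the lines of Nidhogg's per-daemonset list, such a setting might look something like the sketch below (the `minConsecutiveReadyChecks` field is hypothetical and doesn't exist today):

```yaml
# Hypothetical sketch only: minConsecutiveReadyChecks is not a real Nidhogg
# option, it just illustrates the "pass X checks in a row" idea.
daemonsets:
  - name: fluentd                    # one of the flapping daemonsets
    namespace: kube-system           # assumed namespace
    minConsecutiveReadyChecks: 3     # only remove the taint after 3 Ready observations in a row
```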
To somewhat play devil's advocate, it's probably worth making sure the readiness check is working as intended, as just generally having a very flaky check in Kubernetes isn't ideal: a number of decisions may be made based on that info. Nidhogg checks the taint every time there's an update, e.g. a change of state in one of the pods, so if we were to implement the multiple-checks option, it's worth noting that it could pass the threshold of checks very rapidly.
I think this can be somehow mitigated by starting new nodes with existing taints via the kubelet's `--register-with-taints` argument.
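For example, newer kubelets can express this in their config file, while older setups use the equivalent `--register-with-taints` command-line flag. The taint keys below assume Nidhogg's `nidhogg.uswitch.com/<namespace>.<daemonset>` naming; substitute whatever keys it actually applies in your cluster:

```yaml
# KubeletConfiguration snippet: the node registers with these taints already
# in place, so there is no gap before Nidhogg's first reconcile.
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
registerWithTaints:
  - key: nidhogg.uswitch.com/kube-system.fluentd   # assumed taint key format
    effect: NoSchedule
  - key: nidhogg.uswitch.com/kube-system.kiam      # assumed taint key format
    effect: NoSchedule
```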
We're experiencing this as well, or something very similar, but are solving it via a mutating webhook which taints the nodes at create time. It seemed cleaner and easier to update than having to alter kubelet args. In our case, it doesn't appear that the …
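For anyone curious, the registration side of that approach looks roughly like the sketch below; the names, namespace, and path are placeholders, and the webhook server itself returns a JSONPatch that appends the startup taints to `.spec.taints`:

```yaml
# Register a mutating webhook for Node CREATE so new nodes can be tainted
# before any pods get a chance to schedule onto them.
apiVersion: admissionregistration.k8s.io/v1
kind: MutatingWebhookConfiguration
metadata:
  name: taint-new-nodes                   # placeholder name
webhooks:
  - name: taint-new-nodes.example.com     # placeholder
    admissionReviewVersions: ["v1"]
    sideEffects: None
    failurePolicy: Ignore                 # fail open so node registration is never blocked
    rules:
      - apiGroups: [""]
        apiVersions: ["v1"]
        operations: ["CREATE"]
        resources: ["nodes"]
    clientConfig:
      service:
        name: node-tainter                # placeholder service for the webhook server
        namespace: kube-system
        path: /mutate
```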
Thank you very much for Nidhogg!
I've just implemented it for the first time, but I'm seeing some behaviour that I'm not sure is correct, so would like to run it by you and see if there is something that I can do to fix it up.
I have the following YAML configuration:
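Roughly along these lines, with the namespaces in this sketch as stand-ins rather than the exact values:

```yaml
# Nidhogg daemonset list: nodes stay tainted until each of these has a Ready pod.
daemonsets:
  - name: kiam
    namespace: kube-system   # stand-in namespace
  - name: node-local-dns
    namespace: kube-system   # stand-in namespace
  - name: weave-net
    namespace: kube-system   # stand-in namespace
  - name: fluentd
    namespace: kube-system   # stand-in namespace
```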
With the above, I'm waiting on 4 critical daemonsets.
I have been looking at the Nidhogg logs as nodes are added to the cluster by the cluster-autoscaler.
What I'm finding is that initially taints are added by Nidhogg, but not for all 4 of the daemonsets.
Often it is 3 of them, which then clear, and `firstTimeReady` gets set; then a few seconds later the missing 4th taint is added, along with the other 3, and they proceed to get removed again as things become ready.
This appears to give a 2 or 3 second window where pods may be able to schedule onto the node even though it isn't quite ready yet.
An example of logs showing this follows:
The first line has added the taints for `kiam`, `node-local-dns`, and `weave-net`, but there is no taint added for `fluentd` yet.
The next 3 lines show those 3 taints being removed one by one, with the final one marking the node as `taintLess` and setting `firstTimeReady`.
Then the next line (roughly 3 seconds later) adds the `fluentd` taint that was previously missing. It is these 3 seconds that I'm concerned about.
The next 3 lines re-add the previously removed taints, and then the taints are all removed again once the node becomes ready.
It's not always `fluentd` that is left until later; sometimes it is `node-local-dns` instead.
And it's not always 3 taints that are initially added either; I've also seen just 2 of the taints added in the first line.
I haven't had it installed for long, but if it is useful I can collect more details and pass them on.