Add Zookeeper liveness probes #76

Closed
wants to merge 3 commits into from

Conversation

@solsson (Contributor) commented Oct 14, 2017

The logical conclusion of #65 and #74

@elm- (Contributor) commented Oct 16, 2017

This should also apply the fix from #73 (-q 1), at least I think. Some other notes on liveness versus readiness, based on our experience:

If both have the same settings, something is probably wrong or not thought through. As I understand it, readiness is to ensure an instance can service requests, while liveness checks that it's still there. Kubernetes also implements the two behaviours differently: something that is not ready simply stops receiving traffic but is kept running, whereas something that isn't alive gets restarted. A Zookeeper node that responds with not-OK can still be alive but just re-syncing, or have some other problem (disk full, etc.) that a restart would not solve; on the contrary, a restart can even prolong it (for example if the node gets restarted before it's fully resynced, and it resyncs again on each restart).

My suggestion therefore:

  • keep readiness as is, though maybe reduce the failure threshold to 1 so it fails fast and stops getting traffic
  • use a TCP probe for liveness, or if it stays a ruok check, use a higher failure threshold and interval so nodes don't restart themselves due to other issues (see the sketch after this list)
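
A minimal sketch of what this split could look like in the StatefulSet spec. This is an illustration, not the manifest from this PR; the exact ruok command, the port 2181, and all thresholds and intervals are assumptions:

```yaml
# Readiness: strict and fast -- an un-ready pod just stops receiving traffic.
readinessProbe:
  exec:
    command:
    - /bin/sh
    - -c
    - '[ "imok" = "$(echo ruok | nc -q 1 -w 1 localhost 2181)" ]'  # -q 1 per the #73 fix
  periodSeconds: 10
  failureThreshold: 1
# Liveness: lenient -- only restart when the process is truly gone, since a
# restart can prolong e.g. a resync rather than fix it.
livenessProbe:
  tcpSocket:
    port: 2181
  initialDelaySeconds: 30
  periodSeconds: 30
  failureThreshold: 5
```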

In general it's a bit tricky with Zookeeper I guess, because the Zookeeper clients themselves already implement stuff like this =)

@solsson (Contributor, Author) commented Nov 9, 2017

@elm- makes a good point above. The scope distinction is clearer to me now after #81 (comment) and #55 (comment), which is why I just created the automation label.

We have to assume that a production setup has proper alarms, for example on statefulset pod (un)readiness, and that Kafka ops involves manual intervention. An example of such alarms can be found in Yolean/kubernetes-assert#7.
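
For illustration only (this is not the alarm from Yolean/kubernetes-assert#7), such an alarm could be expressed as a Prometheus rule, assuming kube-state-metrics is deployed and the Zookeeper StatefulSet is named `zoo`:

```yaml
# Hypothetical alerting rule: fire when the statefulset has had fewer ready
# replicas than desired for 10 minutes, leaving remediation to an operator.
groups:
- name: zookeeper
  rules:
  - alert: ZookeeperPodsUnready
    expr: kube_statefulset_status_replicas_ready{statefulset="zoo"} < kube_statefulset_replicas{statefulset="zoo"}
    for: 10m
    annotations:
      summary: Zookeeper statefulset has unready pods; manual intervention may be needed
```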

@solsson (Contributor, Author) commented Nov 9, 2017

I will close this, and we can reopen it if the scope of the project changes.

@solsson closed this Nov 9, 2017