Add Zookeeper liveness probes #76

Closed
wants to merge 3 commits into from

Conversation

@solsson (Contributor) commented Oct 14, 2017

The logical conclusion of #65 and #74

@elm- (Contributor) commented Oct 16, 2017

This should also apply the fix from #73 (-q 1), at least I think. Some other notes on liveness versus readiness, based on our experience:

If both have the same settings, something is probably wrong or not thought through. As I understand it, readiness is to ensure an instance can service requests, while liveness checks that it's still there. Kubernetes also implements the two behaviours differently: something that is not ready simply stops receiving traffic but is kept running, whereas something that isn't alive gets restarted. A Zookeeper node that responds with not-OK can still be alive but just re-syncing, or have some other problem (disk full, etc.) that a restart would not solve; on the contrary, a restart can even prolong it (for example if the node gets restarted before it's fully resynced, and it resyncs again on each restart).

My suggestion therefore:

  • keep readiness as is, though maybe reduce the failure threshold to 1 so it fails fast and stops getting traffic
  • use a TCP probe for liveness, or if it stays a ruok check, use a higher failure threshold and interval so nodes don't restart themselves due to other issues (see the sketch after this list)
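
A minimal sketch of what this split could look like in the StatefulSet spec. This is an illustration, not the manifest from this PR; the exact ruok command, the port 2181, and all thresholds and intervals are assumptions:

```yaml
# Readiness: strict and fast -- an un-ready pod just stops receiving traffic.
readinessProbe:
  exec:
    command:
    - /bin/sh
    - -c
    - '[ "imok" = "$(echo ruok | nc -q 1 -w 1 localhost 2181)" ]'  # -q 1 per the #73 fix
  periodSeconds: 10
  failureThreshold: 1
# Liveness: lenient -- only restart when the process is truly gone, since a
# restart can prolong e.g. a resync rather than fix it.
livenessProbe:
  tcpSocket:
    port: 2181
  initialDelaySeconds: 30
  periodSeconds: 30
  failureThreshold: 5
```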

In general it's a bit tricky with Zookeeper I guess, because the Zookeeper clients themselves already implement stuff like this =)

@solsson (Contributor, Author) commented Nov 9, 2017

@elm- makes a good point above. The scope distinction is clearer to me now after #81 (comment) and #55 (comment), which is why I just created the automation label.

We have to assume that a production setup has proper alarms, for example on statefulset pod (un)readiness, and that Kafka ops involves manual intervention. An example of such alarms can be found in Yolean/kubernetes-assert#7.
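
For illustration only (this is not the alarm from Yolean/kubernetes-assert#7), such an alarm could be expressed as a Prometheus rule, assuming kube-state-metrics is deployed and the Zookeeper StatefulSet is named `zoo`:

```yaml
# Hypothetical alerting rule: fire when the statefulset has had fewer ready
# replicas than desired for 10 minutes, leaving remediation to an operator.
groups:
- name: zookeeper
  rules:
  - alert: ZookeeperPodsUnready
    expr: kube_statefulset_status_replicas_ready{statefulset="zoo"} < kube_statefulset_replicas{statefulset="zoo"}
    for: 10m
    annotations:
      summary: Zookeeper statefulset has unready pods; manual intervention may be needed
```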

@solsson (Contributor, Author) commented Nov 9, 2017

I will close this, and we can reopen it if the scope of the project changes.

@solsson closed this Nov 9, 2017