Kubernetes-friendly health checking #6023

Closed
HT154 opened this issue Dec 31, 2018 · 4 comments
Labels
  • Focus:Supervisor (related to the Habitat Supervisor (core/hab-sup) component)
  • Type: Feature (issues that describe a new desired feature)

Comments

HT154 commented Dec 31, 2018

Background

In the Kubernetes realm, liveness and readiness probes are used to determine when containers should be restarted due to issues and when a container is available to serve traffic, respectively. These checks can be performed in three ways:

  • exec’d commands in the container, expecting 0 exit code
  • HTTP GET requests, expecting a status code of at least 200 and below 400 for OK; anything else is a failure
  • connections to a TCP socket, expecting a successful connection
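
To make the three probe styles concrete, here is a minimal Python sketch of the success rule each one applies. It is not how the kubelet is implemented; the function names and timeouts are made up for illustration.

```python
import socket
import subprocess


def exec_probe(command: list[str], timeout: float = 1.0) -> bool:
    """Exec-style probe: success means the command exits with code 0."""
    try:
        return subprocess.run(command, timeout=timeout).returncode == 0
    except (OSError, subprocess.TimeoutExpired):
        # A command that cannot start or does not finish in time counts as a failure.
        return False


def http_status_ok(status_code: int) -> bool:
    """HTTP-style probe: 200 through 399 is success, anything else is a failure."""
    return 200 <= status_code < 400


def tcp_probe(host: str, port: int, timeout: float = 1.0) -> bool:
    """TCP-style probe: success means a connection can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

The HTTP rule is the one that matters for this issue: the probe looks only at the status code and never at the response body.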

Current situation

Habitat’s built-in health-checking mechanism currently only reports status through the supervisor REST API at /services/<name>/<group>/health, in a format that Kubernetes probes can’t use: the API always returns a 200 response, with the health status encoded only in the JSON body.

Proposed Solutions

Either of these could solve the problem on their own, but I'd very much like both to be implemented.

Compatible API endpoints

Add healthz endpoints to the REST API that mirror the standard health endpoints, but return 200 for healthy services and 500 otherwise. It would be beneficial to provide a health endpoint for the supervisor itself at /healthz, and per-service endpoints at /services/<name>/<group>/healthz. The downside to this technique is that both the supervisor and Kubernetes perform checks periodically. If the supervisor checks health every 30 seconds, then it doesn't make sense for Kubernetes to check the REST API more often; but if the periods are offset the wrong way, there could potentially be up to a minute from a service problem -> supervisor health check -> Kubernetes readiness probe.
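
A rough sketch of the proposed mapping and of the worst-case delay arithmetic, in Python. The status names and function names are assumptions for illustration; the proposal only distinguishes healthy from everything else, and the delay bound ignores how long the checks themselves take.

```python
from http import HTTPStatus

# Assumed status name; the proposal only says "healthy" versus "otherwise".
HEALTHY_STATUSES = {"OK"}


def healthz_status_code(health_status: str) -> int:
    """Return the HTTP code a healthz endpoint would send for a health status."""
    if health_status.upper() in HEALTHY_STATUSES:
        return HTTPStatus.OK.value
    return HTTPStatus.INTERNAL_SERVER_ERROR.value


def worst_case_detection_delay(supervisor_interval_s: float,
                               probe_interval_s: float) -> float:
    """Upper bound on the time from a service fault to a probe seeing it.

    Worst case: the fault happens just after a supervisor check, so it is
    recorded one full supervisor interval later, and the probe fired just
    before that record was updated, so it waits one full probe interval more.
    """
    return supervisor_interval_s + probe_interval_s
```

With both periods at 30 seconds, `worst_case_detection_delay(30, 30)` gives 60 seconds, which is the "up to a minute" figure above.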

API endpoint parameters

Add GET parameters to the health endpoints that give the caller control over the return code for each status. Prior art: HashiCorp Vault's health endpoint https://www.vaultproject.io/api/system/health.html. This may be the simplest solution and have the best effort-to-payoff ratio.
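
One way this could look, sketched in Python. The parameter names (`okcode`, `warningcode`, `criticalcode`, `unknowncode`) are hypothetical, loosely modelled on Vault's `...code` query parameters, and the all-200 defaults simply mirror the behaviour described under "Current situation"; none of this is an existing Habitat API.

```python
from urllib.parse import parse_qs, urlsplit

# Assumed defaults: every status answers 200, as described in this issue.
DEFAULT_CODES = {"ok": 200, "warning": 200, "critical": 200, "unknown": 200}


def response_code(request_url: str, health_status: str) -> int:
    """Pick the HTTP code for a status, honouring a '<status>code' override."""
    status = health_status.lower()
    query = parse_qs(urlsplit(request_url).query)
    values = query.get(f"{status}code", [])
    if values:
        try:
            code = int(values[0])
        except ValueError:
            code = 0  # not a number: fall through to the default
        if 100 <= code <= 599:
            return code
    return DEFAULT_CODES[status]
```

A readiness probe could then request something like `/services/<name>/<group>/health?criticalcode=503&unknowncode=503` and rely on the status code alone, while existing clients that omit the parameters see no change.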

Direct check execution

The second path is a little more drastic: add the ability to disable the supervisor’s periodic health checks and provide a CLI interface to directly run a service’s health-check hook, à la hab pkg exec. This method allows Kubernetes to take over all of the scheduling responsibility and avoid the issue above.
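
A sketch of what such a direct check could do, in Python: run the hook once, translate its exit code, and exit in the zero/non-zero form an exec probe understands. The exit-code table (0 OK, 1 warning, 2 critical, 3 unknown) is my understanding of the health-check hook convention and should be treated as an assumption, as should the function names and the choice of which statuses pass.

```python
import subprocess

# Assumed hook convention: 0 = OK, 1 = WARNING, 2 = CRITICAL, 3 = UNKNOWN.
STATUS_BY_EXIT_CODE = {0: "OK", 1: "WARNING", 2: "CRITICAL", 3: "UNKNOWN"}


def run_health_check(hook_command: list[str], timeout_s: float = 10.0) -> str:
    """Run the health-check hook once and return the status it reports."""
    try:
        exit_code = subprocess.run(hook_command, timeout=timeout_s).returncode
    except (OSError, subprocess.TimeoutExpired):
        # A hook that cannot start or hangs tells us nothing about the service.
        return "UNKNOWN"
    return STATUS_BY_EXIT_CODE.get(exit_code, "UNKNOWN")


def probe_exit_code(status: str, passing: tuple[str, ...] = ("OK",)) -> int:
    """Exec probes only see zero or non-zero, so collapse the status to that."""
    return 0 if status in passing else 1
```

Whether WARNING should pass is a policy decision; a liveness probe might reasonably use `passing=("OK", "WARNING")` so that a degraded service is not restarted.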

It might even be possible for the Habitat Operator to configure either of these probes automatically, provided that services with no health check hook return successful statuses.

HT154 commented Feb 7, 2019

Updated the description with another potential solution based on HashiCorp Vault's health check endpoint.

stale bot commented Apr 2, 2020

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. We value your input and contribution. Please leave a comment if this issue still affects you.

stale bot added the Stale label Apr 2, 2020

HT154 commented Apr 2, 2020

Don't close this. The fact that issues like this haven't been addressed is one of the primary reasons my org is moving away from all use of Habitat. The use of the stale bot just adds insult to injury.

christophermaier commented

It looks like this was at least partially addressed in October 2018 in #5725, specifically this change, which maps health check results into HTTP status codes. This means that the health endpoint response code will reflect the actual status, rather than always returning 200.

To address the specific requests from this issue, though, I've spun out #7689 and #7690.

christophermaier added the Focus:Supervisor and Type: Feature labels and removed the A-supervisor label Jul 24, 2020