Kubernetes-friendly health checking #6023

HT154 · 2018-12-31T18:35:13Z

Background

In the Kubernetes realm, liveness and readiness probes are used to determine when containers should be restarted due to issues and when a container is available to serve traffic, respectively. These checks can be performed in three ways:

exec’d commands in the container, expecting an exit code of 0
HTTP GET requests, expecting a status of 200 to 399 for OK and anything else for failure
connections to a TCP socket, expecting a successful connection

Current situation

Habitat’s built-in health-checking mechanism currently only reports status through the supervisor REST API at /services/<name>/<group>/health in a format that the probe requests can’t use. The API currently always returns a 200 response with the health status encoded in the JSON response.
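To illustrate the mismatch, here is a minimal Python sketch of the shim an exec probe would need today: since the HTTP status is always 200, it has to derive an exit code from the response body instead. The `status` field name and the `"OK"` value are assumptions about the response shape made for this example, and `probe_exit_code` is a hypothetical helper, not part of Habitat.

```python
import json


def probe_exit_code(body: str) -> int:
    """Return 0 if the JSON body reports a healthy service, 1 otherwise.

    Assumption: the body is a JSON object with a "status" field whose
    healthy value is "OK". Anything unparseable counts as a failure.
    """
    try:
        payload = json.loads(body)
    except json.JSONDecodeError:
        return 1
    if not isinstance(payload, dict):
        return 1
    return 0 if payload.get("status") == "OK" else 1
```

An HTTP probe cannot do this kind of body inspection, which is why the status code itself has to carry the result.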

Proposed Solutions

Any of these could solve the problem on its own, but I'd very much like to see all of them implemented.

Compatible API endpoints

Add healthz endpoints to the REST API that mirror the standard health endpoints but return 200 for healthy services and 500 otherwise. It would be beneficial to provide a health endpoint for the supervisor itself at /healthz and a per-service endpoint at /services/<name>/<group>/healthz. The downside of this technique is that both the supervisor and Kubernetes perform checks periodically. If the supervisor checks health every 30 seconds, it doesn't make sense for Kubernetes to check the REST API more often; but if the two periods are offset the wrong way, up to a minute could pass between a service problem, the supervisor health check that detects it, and the Kubernetes readiness probe that reports it.
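The "up to a minute" figure follows from adding the two polling periods. A small Python sketch of that arithmetic, assuming both checks run on fixed periods and ignoring probe timeouts and Kubernetes failure thresholds (which would lengthen the delay further); the function name is hypothetical:

```python
def worst_case_detection_delay(supervisor_interval_s: float,
                               probe_period_s: float) -> float:
    """Upper bound on the time from a service failure to a failing probe.

    Worst case: the service breaks right after a supervisor check, so the
    supervisor only notices one full interval later, and that check lands
    right after a Kubernetes probe, which adds one full probe period.
    """
    return supervisor_interval_s + probe_period_s


# Both periods at 30 seconds, as in the scenario described above.
delay = worst_case_detection_delay(30, 30)
```

With both periods at 30 seconds, `delay` is 60 seconds. Shortening only the Kubernetes period lowers the bound but never below the supervisor interval.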

API endpoint parameters

Add GET parameters to the health endpoints that enable control over the return code for each status. Prior art: HashiCorp Vault's health endpoint, https://www.vaultproject.io/api/system/health.html. This may be the simplest solution with the best effort-to-payoff ratio.
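A rough Python sketch of how such overrides could be resolved, loosely modelled on Vault's `standbycode` and `sealedcode` parameters. The parameter names (`okcode`, `warningcode`, and so on), the status names, and the default codes are all assumptions made for this example, not an existing Habitat API:

```python
from urllib.parse import parse_qs

# Assumed defaults for illustration only.
DEFAULT_CODES = {"ok": 200, "warning": 200, "critical": 503, "unknown": 500}


def resolve_status_code(health: str, query_string: str) -> int:
    """Pick the HTTP status for a health result, honoring query overrides.

    A request such as `?warningcode=500` would make WARNING fail a probe.
    Invalid or out-of-range overrides fall back to the default.
    """
    key = health.lower()
    values = parse_qs(query_string).get(f"{key}code")
    if values:
        try:
            code = int(values[0])
        except ValueError:
            code = None
        if code is not None and 100 <= code <= 599:
            return code
    return DEFAULT_CODES.get(key, 500)
```

With this shape, a readiness probe and a liveness probe could hit the same endpoint with different query strings and get different pass/fail behavior for the same underlying status.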

Direct check execution

This path is a little more drastic: add the ability to disable the supervisor’s periodic health checks and provide a CLI command to directly run a service’s health-check hook, à la hab pkg exec. This method allows Kubernetes to take over all of the scheduling responsibility and avoids the double-polling delay described under Compatible API endpoints.
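As a sketch of what an exec probe built on this could look like, the Python wrapper below runs a command and collapses its result into the zero/nonzero exit code Kubernetes expects. The command to run is left to the caller because the proposed CLI does not exist yet; `run_health_hook` and the 10-second timeout are hypothetical choices for this example.

```python
import subprocess
import sys
from typing import Sequence


def run_health_hook(command: Sequence[str], timeout_s: float = 10.0) -> int:
    """Run a health-check command and return 0 on success, 1 on any failure.

    Failures include a nonzero exit code, a timeout, a missing executable,
    and an empty command.
    """
    if not command:
        print("no health-check command given", file=sys.stderr)
        return 1
    try:
        result = subprocess.run(list(command), timeout=timeout_s)
    except (subprocess.TimeoutExpired, OSError) as err:
        print(f"health check could not complete: {err}", file=sys.stderr)
        return 1
    return 0 if result.returncode == 0 else 1
```

Because the check runs only when Kubernetes asks for it, the probe result is never older than the probe itself, which is the point of this approach.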

It might even be possible for the Habitat Operator to configure either of these probes automatically, provided that services with no health check hook return successful statuses.


HT154 · 2019-02-07T17:33:34Z

Updated the description with another potential solution based on HashiCorp Vault's health check endpoint.

stale · 2020-04-02T21:10:49Z

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. We value your input and contribution. Please leave a comment if this issue still affects you.

HT154 · 2020-04-02T22:12:18Z

Don't close this. The fact that issues like this haven't been addressed is one of the primary reasons my org is moving away from all use of Habitat. The use of the stale bot just adds insult to injury.

christophermaier · 2020-05-15T21:50:05Z

It looks like this was at least partially addressed in October 2018 in #5725, specifically this change which maps health check results into HTTP status codes. This means that the health endpoint response code will reflect the actual status, rather than always returning 200.

To address the specific requests from this issue, though, I've spun out #7689 and #7690.

christophermaier added A-supervisor labels Jan 15, 2019

dmccown added this to the 1.0 Supervisor milestone Jan 15, 2019

stale bot added the Stale label Apr 2, 2020

stale bot removed the Stale label Apr 2, 2020

krasnow assigned christophermaier May 15, 2020

This was referenced May 15, 2020

Add healthz endpoints for health checking #7689

Open

Configurable health checking return codes #7690

Open

christophermaier closed this as completed May 15, 2020

christophermaier added the Focus:Supervisor and Type: Feature labels and removed the A-supervisor label Jul 24, 2020
