
Upstream Health Checks #50

Open
AdamMagaluk opened this issue Oct 27, 2016 · 15 comments

@AdamMagaluk
Contributor

To make backend services more robust, nginx should run basic health checks against each pod and take unhealthy pods out of the round robin. https://github.com/30x/nginx-upstream-check would let us do this.

More information from each pod might be needed to enable HTTP checks:

  1. Health resource for the pod. /status
  2. Health method. GET|HEAD
  3. Timeout length. 5s
  4. Interval to check. 30s
  5. Unhealthy Threshold. 2
  6. Healthy Threshold. 5

Should we allow all of these to be configured? Are there any we can smart-default? If nothing is provided, should we fall back to a basic TCP open check?

My proposal for an implementation is to add a separate healthCheckResource annotation to each pod, with a value like GET /, identifying the method and path of the HTTP health check. We could also have four other optional annotations, or combine them into one, to specify the other options like timeout, etc.
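
A rough sketch of what those annotations could look like on a pod (all the annotation names here are illustrative, nothing is implemented yet):

```yaml
metadata:
  annotations:
    healthCheckResource: "GET /status"   # method and path of the HTTP check
    # the four optional tuning annotations (hypothetical names):
    healthCheckTimeout: "5s"
    healthCheckInterval: "30s"
    healthCheckUnhealthyThreshold: "2"
    healthCheckHealthyThreshold: "5"
```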

@jbowen93

A few questions:

  1. Would this require app authors to create a status endpoint in their application?
  2. If they're doing that, why can't we simply use native k8s health checks, which should place the pod in a non-Ready state, and remove pods with that status from the round robin?

@AdamMagaluk
Contributor Author

  1. Yes, most likely, if they don't already have one. Since they may not want to, I think this would all need to be optional.
  2. I would propose we do both. I don't think the k8s-router needs to be aware of that aspect, though. The benefit of having nginx doing health checks as well is that any network-related issues between the router and the pod would be mitigated.

I think we probably need a separate issue opened in Kiln for the Kubernetes health check.

Another side note: if the Kubernetes API returns the info about the pod's health check, we might be able to just use that info instead of creating separate annotations.

@mpnally

mpnally commented Oct 27, 2016

I like failing back to TCP. It's not perfect, but it's a lot better than nothing.


@AdamMagaluk
Contributor Author

My updated proposal, after reviewing the Kubernetes API, is to calculate the nginx upstream health checks from the livenessProbe of the container in the pod. This will let us keep the annotations to a minimum. Kiln/Enrober will have the option to populate either an HTTP livenessProbe or a TCP one; k8s-router will fall back to TCP if neither is specified.

Based on the Kubernetes options, we can map almost one-for-one to the nginx upstream check module.

Mapping from Kubernetes to Nginx:

periodSeconds -> interval: the check request's interval time.
failureThreshold -> fall (fall_count): after fall_count check failures, the server is marked down.
successThreshold -> rise (rise_count): after rise_count check successes, the server is marked up.
timeoutSeconds -> timeout: the check request's timeout.
port -> port: the check port on the backend servers. It can differ from the original server's port; the default is 0, which means the same port as the original backend server.
path -> path: used in the check_http_send command.
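
To make the mapping concrete, here's a sketch (values echo the defaults listed earlier in this issue; the upstream name and server addresses are made up, and the check directive syntax follows the yaoweibin module's docs). A pod with this probe:

```yaml
livenessProbe:
  httpGet:
    path: /status
    port: 8080
  periodSeconds: 30      # -> interval
  timeoutSeconds: 5      # -> timeout
  failureThreshold: 2    # -> fall
  successThreshold: 5    # -> rise
```

would produce an upstream block roughly like:

```nginx
upstream example-app {
  server 10.1.0.4:8080;
  server 10.1.0.5:8080;

  # interval and timeout are in milliseconds in the check module
  check interval=30000 timeout=5000 fall=2 rise=5 type=http port=8080;
  check_http_send "GET /status HTTP/1.0\r\n\r\n";
  check_http_expect_alive http_2xx http_3xx;
}
```

The TCP fallback would be the same check directive with type=tcp and no check_http_send.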

I think the only useful extra annotation on the pod would be DISABLE_UPSTREAM_CHECK, to disable it altogether. This could be useful if a pod needs Kubernetes health checking but, for some reason, doesn't want nginx doing upstream checking.

The only issue I see with this approach is that each Pod could potentially have different Kubernetes liveness probe settings, and because nginx configures the checks per upstream cluster, there may need to be some merging of values. I don't see this happening in normal use of Shipyard, because all the pods should be spun up with the same spec from the replication controller, but it's theoretically possible.

@mpnally

mpnally commented Oct 28, 2016

HAProxy has another useful option: the ability to take an upstream server (HAProxy calls them backend servers) out of the pool if the server gives 50x errors on regular traffic, not just health-check traffic. Does nginx have an equivalent?

@mpnally

mpnally commented Oct 28, 2016

Also, the condition where pods with different health-check values are in the same upstream can happen in Shipyard: multiple applications can claim the same path. This is actually an important capability; it's what allows you to refactor apps. Having said that, I wouldn't worry about it too much.

@AdamMagaluk
Contributor Author

@mpnally No, it doesn't. This isn't available in standard nginx at all; it comes from this module: https://github.com/apigee/nginx_upstream_check_module

@whitlockjc
Contributor

Is apigee/nginx_upstream_check_module OSS? If not, I don't see us using this. As for the configuration required to make this work, it will be way too much to stuff into an annotation. We're definitely on the brink of embedding JSON into an annotation to handle routing configuration. (This is not specific to this change; it just might mean this change pushes us to it quicker.)
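
Purely for illustration (neither this annotation name nor the schema exists today), embedded JSON could look something like:

```yaml
metadata:
  annotations:
    routingConfig: |
      {
        "paths": [{"path": "/", "targetPort": 8080}],
        "healthCheck": {"send": "GET /status", "timeoutSeconds": 5}
      }
```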

@mpnally

mpnally commented Oct 31, 2016

I agree that it would be better to have an OSS-only solution, and we should work towards that. However, not being OSS is not, in my opinion, a deal-breaker for a Shipyard release.


@whitlockjc
Contributor

Well, k8s-router isn't just a Shipyard component but a generic, non-Apigee, OSS router for Kubernetes. And while I don't have the ability to just refuse to use it, I really think we should stick with the current model we've adhered to for so long and keep k8s-router generic and not specific to Apigee or how Apigee uses k8s-router.

@AdamMagaluk
Contributor Author

The apigee/nginx_upstream_check_module isn't published, but it is OSS; it's a lightly modified fork of https://github.com/yaoweibin/nginx_upstream_check_module, which is BSD-licensed.

As for the annotations, I'd agree that if we have to put more stuff in there, a JSON object is probably better. However, since the nginx module's health checks and Kubernetes's are configured almost the same way, we could just piggyback on what's configured in the Kubernetes liveness probe.

@whitlockjc
Contributor

I understand. I'm just saying we've already got a few annotations for routing, and we'll soon be making them more complex for weight and other things, this included, so we might get there sooner than I had expected.

@mpnally

mpnally commented Oct 31, 2016

I agree. We could also see if we can find a better solution than JSON, which is not very human-friendly.

@whitlockjc
Contributor

I think it would be simpler to just use the Pod's livenessProbe (if available) to drive the nginx module's configuration for checking the upstreams.

@mpnally

mpnally commented Oct 31, 2016

Yes, that is the current plan. Adam wrote this up.
