on-cluster health checks - [Epic] #141

Open
18 tasks
philbrookes opened this issue May 28, 2024 · 0 comments

Comments

@philbrookes
Collaborator

philbrookes commented May 28, 2024

Prior Art

https://github.com/Kuadrant/multicluster-gateway-controller/tree/60f13a1f7ad8f2b82e3f344a425285f69fb91223/pkg/dns/health

Terminology

  • Leaf Record: Either a CNAME or IP address taken from the status of a gateway

Tasks

Executing health checks

Consulting health checks

E2E Test cases

  • Unhealthy IP is removed when other leaf records are present
  • Unhealthy IP is preserved when it is the only leaf record
  • Unhealthy CNAME is removed when other leaf records are present
  • Unhealthy CNAME is preserved when it is the only leaf record
  • Unhealthy workload is noted correctly in health check probe CR
  • Healthy workload is noted correctly in health check probe CR
  • Metrics are emitted when unhealthy workload is detected
  • HealthCheckProbe CR is updated correctly when unhealthy endpoint becomes healthy

Black box testing

  • Add black box tests (test it from the user's perspective)

Load Testing

  • Add a test for a gateway with 64 listeners and 2 CNAMEs resolving to 2 IPs (i.e. 128 probes against 2 IPs)
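The probe count in that load test follows from simple multiplication. A minimal sketch, assuming one probe is created per listener per CNAME leaf record (the function name `ProbeCount` is illustrative and not taken from the codebase):

```go
package main

import "fmt"

// ProbeCount returns the number of probes for a gateway, assuming one
// probe per (listener, leaf record) pair. This pairing is an assumption
// inferred from the "128 probes" figure in the load test description.
func ProbeCount(listeners, leafRecords int) int {
	return listeners * leafRecords
}

func main() {
	// 64 listeners, each probed via 2 CNAMEs.
	fmt.Println(ProbeCount(64, 2)) // prints 128
}
```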

Documenting Health Checks

Current State

  • We only have an implementation for AWS DNS health checks, and they only function if the endpoint is an A record, not a CNAME.
  • We do not have a known way of implementing GCP health checks if the clusters are not in Google Cloud.
  • We have no current plan for implementing Azure health checks, which are implemented quite differently from AWS health checks.

Use cases we want to solve

  • As a cluster admin, I want to ensure that NXDomain responses are avoided when all endpoints are unhealthy.
  • As a cluster admin, I want to ensure that unhealthy endpoints will not be included in the DNS response.
  • As a cluster admin, I want to be able to set health checks against CNAME records as well as A records.
  • As a cluster admin, I want to be able to create health checks regardless of my DNS Provider.

Proposed approach

We will implement local health checks: a probe running on the cluster requests the workload on that cluster through the external gateway, to simulate real internet traffic.

This will not require any changes to our API; we can reuse the existing health check specification in the DNS Policy exactly as is.

The results of the probe will be stored on a CR locally (one per probe), and also emitted as metrics.
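As a rough illustration of a single probe request, here is a minimal sketch using only the Go standard library. The names `CheckOnce` and `newStubGateway` are hypothetical, the rule that any 2xx status counts as healthy is an assumption, and setting the `Host` header to the listener hostname is one plausible way to reach a workload through a gateway address. A stub HTTP server stands in for the external gateway so the example can run anywhere.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"time"
)

// CheckOnce sends one GET request to url and reports whether the response
// counts as healthy. Assumption: any 2xx status is healthy; transport
// errors and all other statuses are failures.
func CheckOnce(client *http.Client, url, host string) bool {
	req, err := http.NewRequest(http.MethodGet, url, nil)
	if err != nil {
		return false
	}
	if host != "" {
		// Address the request to the listener hostname while connecting
		// to the gateway address in url.
		req.Host = host
	}
	resp, err := client.Do(req)
	if err != nil {
		return false
	}
	defer resp.Body.Close()
	return resp.StatusCode >= 200 && resp.StatusCode < 300
}

// newStubGateway is a stand-in for the external gateway that always
// answers with the given status code.
func newStubGateway(status int) *httptest.Server {
	return httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		w.WriteHeader(status)
	}))
}

func main() {
	gw := newStubGateway(http.StatusServiceUnavailable)
	defer gw.Close()

	client := gw.Client()
	client.Timeout = 5 * time.Second // bound each probe request

	fmt.Println(CheckOnce(client, gw.URL, "app.example.com")) // prints false
}
```

In a real probe the boolean result would feed the probe CR and the emitted metrics; both are left out here to keep the sketch free of external dependencies.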

When is a probe unhealthy

A probe will write a few pieces of information to its probe CR:

  • When it last checked
  • How many consecutive failures have occurred
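The two fields above could be tracked as follows. This is a sketch only: the `ProbeStatus` type, its field names, and the `Record` method are hypothetical and do not describe the actual CR schema. It assumes a successful check resets the failure streak.

```go
package main

import (
	"fmt"
	"time"
)

// ProbeStatus is a hypothetical shape for the data a probe writes to its CR.
type ProbeStatus struct {
	LastChecked         time.Time // when the probe last ran
	ConsecutiveFailures int       // failures since the last success
}

// Record returns the status after one check. A success resets the failure
// streak (an assumption); a failure extends it. LastChecked always moves
// to checkedAt.
func (s ProbeStatus) Record(healthy bool, checkedAt time.Time) ProbeStatus {
	next := ProbeStatus{LastChecked: checkedAt}
	if !healthy {
		next.ConsecutiveFailures = s.ConsecutiveFailures + 1
	}
	return next
}

func main() {
	s := ProbeStatus{}
	start := time.Now()
	// Two failures, a success, then another failure.
	for i, healthy := range []bool{false, false, true, false} {
		s = s.Record(healthy, start.Add(time.Duration(i)*time.Minute))
		fmt.Println(s.ConsecutiveFailures) // prints 1, 2, 0, 1 on successive lines
	}
}
```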

When is a record unhealthy

The DNS Policy will specify a fault tolerance. If the consecutive failures on the relevant probe CR exceed that number, the corresponding record is considered unhealthy, unless the last checked time is too old (i.e. the probe has stopped updating the probe CR).
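The rule can be sketched as a small predicate. The names are hypothetical, and `maxAge` is an assumed parameter: the proposal does not say how old a last-checked time must be to count as "too old".

```go
package main

import (
	"fmt"
	"time"
)

// ProbeStatus is a hypothetical view of the data stored on a probe CR.
type ProbeStatus struct {
	LastChecked         time.Time
	ConsecutiveFailures int
}

// RecordUnhealthy reports whether the record behind a probe should be
// treated as unhealthy. A stale status (older than maxAge, an assumed
// threshold) is never unhealthy, because the probe may have stopped
// updating its CR.
func RecordUnhealthy(s ProbeStatus, faultTolerance int, maxAge time.Duration, now time.Time) bool {
	if now.Sub(s.LastChecked) > maxAge {
		return false
	}
	// "Above that number": strictly more failures than the tolerance.
	return s.ConsecutiveFailures > faultTolerance
}

func main() {
	now := time.Now()
	status := ProbeStatus{LastChecked: now.Add(-30 * time.Second), ConsecutiveFailures: 4}
	fmt.Println(RecordUnhealthy(status, 3, 5*time.Minute, now)) // true: 4 > 3 and the status is fresh

	status.LastChecked = now.Add(-time.Hour)
	fmt.Println(RecordUnhealthy(status, 3, 5*time.Minute, now)) // false: the status is stale
}
```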

When are unhealthy records removed from the zone

A record is removed from the zone if:

  • There are more records left in the zone
  • AND either:
    • The probe is unhealthy
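One way to read the conditions listed so far, together with the E2E cases that preserve an unhealthy record when it is the only leaf record, is sketched below. The `LeafRecord` type and `PublishableRecords` function are hypothetical. Keeping every record when none is healthy is an assumption, chosen here so that the zone never becomes empty.

```go
package main

import "fmt"

// LeafRecord is a hypothetical pairing of a DNS target (IP or CNAME)
// with the health verdict derived from its probe.
type LeafRecord struct {
	Target    string
	Unhealthy bool
}

// PublishableRecords drops unhealthy records only while at least one
// healthy record remains. If nothing is healthy, all records are kept
// (an assumption) so that lookups do not return NXDomain.
func PublishableRecords(records []LeafRecord) []LeafRecord {
	healthy := make([]LeafRecord, 0, len(records))
	for _, r := range records {
		if !r.Unhealthy {
			healthy = append(healthy, r)
		}
	}
	if len(healthy) == 0 {
		return records
	}
	return healthy
}

func main() {
	mixed := []LeafRecord{
		{Target: "192.0.2.10", Unhealthy: true},
		{Target: "192.0.2.11", Unhealthy: false},
	}
	fmt.Println(len(PublishableRecords(mixed))) // prints 1

	only := []LeafRecord{{Target: "192.0.2.10", Unhealthy: true}}
	fmt.Println(len(PublishableRecords(only))) // prints 1
}
```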

Update our tests to include tests of the health check probes.

Tradeoffs

  • We will not be able to report on the health of the workload from other geographical areas.
  • If the cluster goes away, or the controller dies or is denied access to the zone, the unhealthy records will stay in the DNS response until manual intervention.
  • If all clusters but one are unhealthy, and the last healthy cluster is gracefully deleted, there will temporarily (until the next time an unhealthy cluster reconciles) be an empty zone.
  • If the controller runs under a networking configuration that allows it to access itself when the internet cannot, or vice versa, the health check probe will be inaccurate.
  • If the probes are failing to execute, or failing to update the probe CR, then all endpoints will be considered healthy.

Related Information

initial thoughts on health checks, and potential for cross-cluster health checks in the future: here

@philbrookes philbrookes changed the title on-cluster health checks Feature: on-cluster health checks May 28, 2024
@philbrookes philbrookes self-assigned this May 28, 2024
@maleck13 maleck13 added this to the kuadrant-v1 milestone May 31, 2024
@maleck13 maleck13 changed the title Feature: on-cluster health checks on-cluster health checks May 31, 2024
@philbrookes philbrookes added next and removed next labels Jun 13, 2024
@maleck13 maleck13 changed the title on-cluster health checks on-cluster health checks - [Epic] Jul 25, 2024
@philbrookes philbrookes removed their assignment Aug 1, 2024
@maleck13 maleck13 added the kind/epic Epic label Aug 22, 2024
Projects
Status: In Progress