Scheduler Liveness Probe Failures #767

alias-santi · 2023-08-09T11:55:07Z

Hi All,

I'm raising this as a discussion rather than a bug report, as I'm not sure whether it is specific to our setup or an actual bug.

We recently upgraded our Airflow environment from 2.2.5 to 2.6.2 and the Helm chart from 8.6.0 to 8.7.1.

K8s version is 1.23 running on EKS.

Since the upgrade we've seen a major increase in the number of liveness probe failures on the Scheduler, and we get enough of them to cause pod terminations. Below is a snippet of the error from the liveness probe failure:

Warning Unhealthy 11m (x219 over 13d) kubelet Liveness probe failed:

No message is actually returned from the probe, which is interesting. We don't believe this is a probe timeout, as we've increased the timeouts with no success.

We did notice high CPU consumption at the time, so we increased the requests and limits to help. We saw a small improvement, but something is still not right.

Below is a snippet of our scheduler values:

    scheduler:
      resources:
        requests:
          memory: "750Mi"
          cpu: "700m"
        limits:
          memory: "750Mi"
          cpu: "2000m"
      
      livenessProbe:
        enabled: true
        initialDelaySeconds: 10
        periodSeconds: 60
        timeoutSeconds: 120
        failureThreshold: 3

Below is a screenshot from datadog showing some CPU metrics over the last day:

Below is a screenshot over the last week:

I ran another test: I took the liveness probe script in use, added some logging, and ran it in a loop to see how it behaved over time. With a 15-second sleep between cycles, the is_alive() check returned false after the 3rd invocation; with a 30-second sleep, it returned false on the 2nd invocation. Running the liveness probe as a single instance at random times always seems to work, so I'm not sure exactly what is wrong here.

Maybe the k8s liveness probe is running at the same moment the scheduler is writing its heartbeat? Or maybe the heartbeat itself is not being recorded in time, so the scheduler is marked unhealthy when the liveness probe happens to run?
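To make the second hypothesis concrete, here is a minimal, self-contained sketch of a heartbeat staleness check. It is a model only, not Airflow's or the chart's actual code: `is_alive`, `simulate`, and `HEALTH_CHECK_THRESHOLD` are hypothetical names, and the 30-second value is an assumption based on what I believe is the default of the `[scheduler] scheduler_health_check_threshold` setting.

```python
from datetime import datetime, timedelta, timezone

# Assumption: 30 seconds, mirroring what I believe is the default of
# Airflow's [scheduler] scheduler_health_check_threshold setting.
HEALTH_CHECK_THRESHOLD = timedelta(seconds=30)


def is_alive(latest_heartbeat, now, threshold=HEALTH_CHECK_THRESHOLD):
    """Return True if the last recorded heartbeat is younger than the threshold."""
    return (now - latest_heartbeat) < threshold


def simulate(heartbeat_gap_s, probe_interval_s, probes):
    """Probe a modelled scheduler whose heartbeats land every heartbeat_gap_s seconds.

    Returns one boolean per probe, in order.
    """
    start = datetime(2023, 8, 9, tzinfo=timezone.utc)
    results = []
    for i in range(1, probes + 1):
        elapsed = i * probe_interval_s
        now = start + timedelta(seconds=elapsed)
        # Most recent heartbeat that landed at or before this probe.
        last_beat = (elapsed // heartbeat_gap_s) * heartbeat_gap_s
        latest_heartbeat = start + timedelta(seconds=last_beat)
        results.append(is_alive(latest_heartbeat, now))
    return results


if __name__ == "__main__":
    print(simulate(heartbeat_gap_s=5, probe_interval_s=15, probes=3))
    print(simulate(heartbeat_gap_s=40, probe_interval_s=15, probes=3))
```

In this model a scheduler that heartbeats every 5 seconds passes every probe, while one whose heartbeats are delayed to every 40 seconds (for example under CPU pressure) fails only the probes that happen to land late in a gap. That would look like intermittent failures even though the probe logic itself is fine.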

Any suggestions on whether there is some additional configuration we should consider applying here, or other areas we need to look at specifically? I'm not sure where to look next at this stage, so I thought I'd post this here in case it's a known issue that others have managed to fix before.

Let me know your thoughts. Happy to share more details on our configuration.

Thanks!

rroberts-capula · 2024-10-07T13:58:55Z

Did you manage to fix this?
