Scheduler Liveness Probe Failures #767
alias-santi asked this question in Questions & Answers (unanswered)
Replies: 1 comment
-
Did you manage to fix this?
-
Hi All,
Raising this as a discussion rather than a bug issue, as I'm not sure whether this is localised to our setup or an actual bug, but:
We recently upgraded our Airflow environment from 2.2.5 to 2.6.2 and the helm chart from 8.6.0 to 8.7.1.
K8s version is 1.23 running on EKS.
Since the upgrade we've been experiencing a major increase in the number of liveness probe failures on the Scheduler, and we get enough of them to cause pod terminations etc. Below is a snippet of the error from the liveness probe failure:
Warning Unhealthy 11m (x219 over 13d) kubelet Liveness probe failed:
No message is actually returned from the probe, which is interesting. We don't believe this is related to a probe timeout, as we've increased the timeouts with no success.
We did notice high CPU consumption at the time, so we decided to increase the requests and limits to help with this. We noticed a small improvement, but something is still not right here.
Below is a snippet of our scheduler values:
Below is a screenshot from datadog showing some CPU metrics over the last day:
Below is a screenshot over the last week:
I did another test by taking the liveness probe in use, adding some logging to it and running it in a loop to see how long it would take to run etc. I noticed that the is_alive() check would return false on the 3rd invocation with a sleep of 15 secs between cycles, and on the 2nd invocation with a sleep of 30 secs. Running the liveness probe as a single one-off instance always seems to work, so I'm not sure exactly what might be wrong here.
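For anyone wanting to reproduce the loop test, here is a minimal sketch of the harness, not the actual probe script from the chart. The names `run_probe_loop` and `check_alive` are hypothetical; `check_alive` stands in for whatever performs the scheduler lookup and calls is_alive(), and that wiring is left out because it depends on the Airflow version.

```python
import logging
import time
from typing import Callable, List, Tuple

logging.basicConfig(level=logging.INFO, format="%(asctime)s %(message)s")
log = logging.getLogger("probe_loop")


def run_probe_loop(
    check_alive: Callable[[], bool],
    cycles: int,
    sleep_secs: float,
    sleep: Callable[[float], None] = time.sleep,
) -> List[Tuple[int, bool, float]]:
    """Call check_alive() `cycles` times, sleeping between calls.

    Returns one (invocation number, result, seconds the check took)
    tuple per call, and logs the same information as it goes.
    """
    results = []
    for i in range(1, cycles + 1):
        started = time.monotonic()
        alive = check_alive()
        elapsed = time.monotonic() - started
        log.info("invocation=%d alive=%s took=%.3fs", i, alive, elapsed)
        results.append((i, alive, elapsed))
        if i < cycles:
            # No need to sleep after the final invocation.
            sleep(sleep_secs)
    return results


if __name__ == "__main__":
    # Placeholder check (hypothetical): swap in the real scheduler lookup.
    run_probe_loop(lambda: True, cycles=3, sleep_secs=0.1)
```

Logging how long each call takes alongside its result makes it possible to tell a slow check apart from a check that returns false quickly.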
Maybe it's the fact that the k8s liveness probe is running at the same time the scheduler is checking in with a heartbeat? Or maybe the heartbeat itself is not arriving in time, so the scheduler is being marked as unhealthy at the moment the liveness probe runs?
Any suggestions on whether there is some additional configuration we might need to consider applying here, or other areas we need to look at specifically? I'm not sure where to look next at this stage, so I thought I'd post this here in case it's a known issue that others have managed to rectify before.
Let me know your thoughts. Happy to share more details on our configuration etc.
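One way to reason about the second hypothesis is to model the health check as "the last recorded heartbeat is younger than some threshold" and replay the loop test on a synthetic clock. This is only a sketch under stated assumptions: the 30-second threshold is a guess chosen for illustration, not read from our config, and `heartbeat_is_fresh` and `simulate` are hypothetical helpers, not Airflow APIs.

```python
from datetime import datetime, timedelta, timezone

# Assumed value for illustration only; not taken from any real config.
HEALTH_THRESHOLD_SECS = 30.0


def heartbeat_is_fresh(latest_heartbeat, now, threshold_secs=HEALTH_THRESHOLD_SECS):
    """Return True if the last heartbeat is younger than the threshold."""
    age = (now - latest_heartbeat).total_seconds()
    return age < threshold_secs


def simulate(sleep_secs, cycles, heartbeat_advances):
    """Replay a probe loop on a synthetic clock.

    If heartbeat_advances is False, the heartbeat timestamp stays at its
    initial value, modelling a heartbeat that the check never sees updated.
    """
    start = datetime(2023, 1, 1, tzinfo=timezone.utc)
    results = []
    for i in range(cycles):
        now = start + timedelta(seconds=i * sleep_secs)
        heartbeat = now if heartbeat_advances else start
        results.append(heartbeat_is_fresh(heartbeat, now))
    return results


print(simulate(15, 3, heartbeat_advances=False))  # [True, True, False]
print(simulate(30, 2, heartbeat_advances=False))  # [True, False]
print(simulate(15, 3, heartbeat_advances=True))   # [True, True, True]
```

Under these assumptions, a heartbeat timestamp that never moves forward fails at roughly the 30-second mark in both loop variants, while one that keeps advancing never fails. Whether that is what actually happens in the real probe would need to be confirmed against the scheduler's heartbeat records.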
Thanks!