[Task Manager][Health] Warn Runtime Status for high Drift #166006
Labels: bug, enhancement, Feature:Task Manager, Team:ResponseOps
Summary
👋🏼 howdy, team!
I've noticed across a couple of clusters that Kibana can end up in a degraded status due to `capacity_estimation`, which really traces back to high `runtime` > `drift`, usually `drift_by_type` of `alerting:*` (a.k.a. expensive rules). The bug (or, if you'd rather label it, feature request) I have is that even when `drift` is backed up at `p50` by ~3 minutes, usually alongside `load.p50: 100`, `runtime` still reports `status: OK`. Can we put some logic in there to flip this to `warn`/`error` at some point?
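For reference, the numbers above come from the Task Manager health API. Here is a minimal sketch of pulling the relevant fields; it assumes the documented `GET /api/task_manager/_health` endpoint, and the field paths and the `KIBANA_URL`/credentials placeholders are my own illustration, so verify them against your Kibana version:

```ts
// Sketch only: fetch Task Manager health and print the runtime numbers discussed above.
// Assumes Node 18+ (global fetch) and the documented GET /api/task_manager/_health endpoint.
const KIBANA_URL = process.env.KIBANA_URL ?? 'http://localhost:5601'; // placeholder
const AUTH = Buffer.from('elastic:changeme').toString('base64');      // placeholder creds

async function checkRuntimeHealth(): Promise<void> {
  const res = await fetch(`${KIBANA_URL}/api/task_manager/_health`, {
    headers: { Authorization: `Basic ${AUTH}` },
  });
  const health: any = await res.json();
  const runtime = health?.stats?.runtime;
  console.log('overall status :', health?.status);
  console.log('runtime status :', runtime?.status);            // today: "OK" even when badly drifted
  console.log('drift p50 (ms) :', runtime?.value?.drift?.p50);  // e.g. >180000 when ~3 min behind
  console.log('load p50 (%)   :', runtime?.value?.load?.p50);   // e.g. 100
}

checkRuntimeHealth().catch(console.error);
```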
Example

I've dealt with this situation with a couple of users; the most egregious cases have been air-gapped, so I can't share those examples. However, sharing a low-medium example output in full: Task Manager Health [A]
I wrote an automation to root-cause the problematic plugin, so it reports:
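The automation itself isn't shareable, but the gist is just ranking `drift_by_type` to find the worst offender. A hypothetical sketch of that step (the `topDriftByType` helper and the exact field shapes are my own illustration, not the actual script):

```ts
// Illustrative only: given a parsed Task Manager health response, rank task types
// by drift p50 so the heaviest plugin/rule type (e.g. alerting:*) stands out.
interface DriftStats {
  p50?: number;
  p90?: number;
  p99?: number;
}

function topDriftByType(health: any, topN = 5): Array<[string, number]> {
  const byType: Record<string, DriftStats> =
    health?.stats?.runtime?.value?.drift_by_type ?? {};
  return Object.entries(byType)
    .map(([taskType, drift]): [string, number] => [taskType, drift?.p50 ?? 0])
    .sort((a, b) => b[1] - a[1]) // highest p50 drift first
    .slice(0, topN);
}

// Usage: topDriftByType(health).forEach(([type, p50]) => console.log(type, `${p50}ms`));
```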
My report automation goes on, but pivoting towards applicability for this GitHub issue, see the docs' Evaluate the Runtime section.
In our example(s), the load compared to that doc section's example is instead actually `p50: 100`, and drift is backed up by more than 1 minute. In a recent air-gapped example (not represented below) it was drifted by more than 3 minutes.

So overall, it makes sense that this drift + load cascades into `capacity_estimation` messages, since that's where the docs point. However, for API response interpretation/usability and for diagnostic automations, it doesn't really make sense that `runtime` didn't flag `status: warn` or something more problematic, since the root cause of the problem was something inside `runtime` that cascaded into `capacity_estimation`.
.Request
I don't know the right literal threshold values, but some logic like the sketch below would help:
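A minimal sketch of the kind of check I mean; the threshold constants and the `runtimeStatusFromDrift` name are hypothetical placeholders, not existing Task Manager code or proposed production values:

```ts
// Hypothetical sketch: flip the runtime summary status when drift/load are clearly bad.
type HealthStatus = 'OK' | 'warn' | 'error';

const WARN_DRIFT_P50_MS = 60_000;   // drift p50 > 1 min => warn  (placeholder)
const ERROR_DRIFT_P50_MS = 180_000; // drift p50 > 3 min => error (placeholder)
const HIGH_LOAD_P50 = 90;           // percent                    (placeholder)

function runtimeStatusFromDrift(driftP50Ms: number, loadP50: number): HealthStatus {
  if (driftP50Ms > ERROR_DRIFT_P50_MS && loadP50 >= HIGH_LOAD_P50) return 'error';
  if (driftP50Ms > WARN_DRIFT_P50_MS) return 'warn';
  return 'OK';
}

// e.g. runtimeStatusFromDrift(195_000, 100) => 'error' instead of today's 'OK'
```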
🙏🏼