[Telemetry] track and warn event loop delays thresholds #103615
…ck_event_loop_threshold
Just one nit, but otherwise LGTM. Totally agree that this is going to be a much more helpful, stable, and accurate way of warning the admin of an issue. I do worry that this warning isn't highly actionable, though it would signal to support that they may have some scaling issues with Task Manager.
I do wonder if we should consider linking to our documentation on scaling task manager in production: https://www.elastic.co/guide/en/kibana/master/task-manager-production-considerations.html#_deployment_considerations
src/plugins/kibana_usage_collection/server/collectors/event_loop_delays/track_threshold.ts
💚 Build Succeeded
💚 Backport successful
This backport PR will be merged automatically after passing CI.
…03728) Co-authored-by: Ahmad Bamieh <[email protected]>
Summary
Part of a larger effort to measure platform performance and ease debugging of performance issues (#63848).
In 7.14 we started sending an hourly updated histogram of event loop delays (#101580). This helps us investigate the average delays our customers have, percentiles, etc.
This PR warns users when the event loop delay exceeds a configurable threshold duration, ops.eventLoopDelayThreshold. By default this duration is 350ms. The warning is logged once every 30 seconds for as long as the delay remains above that threshold.
Once we have a representative sample from the reported delays histogram, we can adjust this default to be more meaningful and closer to real-world cases.
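To make the warning cadence concrete, here is a minimal sketch of a threshold check that warns at most once per interval while the delay stays above the limit. It is not the code in this PR: createDelayWarner, its parameters, and the message text are hypothetical, and only the 350ms default and the 30 second cadence come from the description above.

```typescript
// Hypothetical helper, not the PR's implementation.
// Only the 350ms default and the 30s cadence come from the PR description.
const DEFAULT_THRESHOLD_MS = 350;
const DEFAULT_WARN_INTERVAL_MS = 30 * 1000;

function createDelayWarner(
  thresholdMs: number = DEFAULT_THRESHOLD_MS,
  warnIntervalMs: number = DEFAULT_WARN_INTERVAL_MS
) {
  // Timestamp of the last emitted warning; undefined until the first one.
  let lastWarnedAt: number | undefined;

  // Returns a warning message when one should be logged, otherwise undefined.
  return (delayMs: number, nowMs: number = Date.now()): string | undefined => {
    if (delayMs <= thresholdMs) {
      return undefined; // delay is within the threshold
    }
    if (lastWarnedAt !== undefined && nowMs - lastWarnedAt < warnIntervalMs) {
      return undefined; // already warned during the current interval
    }
    lastWarnedAt = nowMs;
    // Illustrative wording only; the real log message may differ.
    return `Event loop delay of ${delayMs}ms exceeded the ${thresholdMs}ms threshold.`;
  };
}

// Usage with explicit timestamps so the behaviour is deterministic.
const warn = createDelayWarner();
const message = warn(420, 0);
if (message !== undefined) {
  console.warn(message);
}
```

Passing the clock in as a parameter keeps the check easy to test; a real collector would more likely run on a timer and read the delay from the histogram.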
metrics.ops already reports collected_at and the entire ECS object for further debuggability around the logs.
Implementation direction
This is another approach to implementing the event loop threshold (original draft PR: #103478)
Instead of using the ops.metrics implementation, I used the event loop delays histogram. The current ops metrics implementation does not really capture event loop delays, as it only captures the delay in the immediate loop when the measurement is made. The perf.monitorEventLoopDelay() here tracks the delays over time and not only at collection. This way we really capture delays and spikes. I also added some telemetry around these spikes to report them back to our cluster along with the full histogram for diagnosis, which is not possible inside core at the moment without a lot of piping. I prefer this approach; let me know what you think.
Notes
I've experimented with using event loop utilization (ELU) but realized it serves a different purpose than the original intention of this PR (draft #103477).
Related: #98673
Closes: #96192
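For readers unfamiliar with the histogram mentioned under "Implementation direction", the following is a rough sketch of how Node's monitorEventLoopDelay() from perf_hooks can be sampled continuously and summarized. The helper names (summarizeDelays, collectAndReset, HistogramLike) and the 10ms resolution are assumptions for illustration and are not taken from this PR.

```typescript
import { monitorEventLoopDelay } from 'perf_hooks';

// Structural subset of Node's IntervalHistogram used by this sketch.
interface HistogramLike {
  min: number;
  max: number;
  mean: number;
  percentile(p: number): number;
}

interface DelaySummaryMs {
  min: number;
  max: number;
  mean: number;
  p99: number;
}

// The histogram reports nanoseconds; convert to milliseconds for logging.
const nsToMs = (ns: number): number => ns / 1e6;

// Hypothetical helper: reduce the histogram to a few values in milliseconds.
function summarizeDelays(histogram: HistogramLike): DelaySummaryMs {
  return {
    min: nsToMs(histogram.min),
    max: nsToMs(histogram.max),
    mean: nsToMs(histogram.mean),
    p99: nsToMs(histogram.percentile(99)),
  };
}

// The monitor samples in the background between collections, so spikes that
// occur between two reads are still recorded.
const histogram = monitorEventLoopDelay({ resolution: 10 }); // resolution in ms (assumed value)
histogram.enable();

// Hypothetical helper: read the current window, then start a fresh one.
function collectAndReset(): DelaySummaryMs {
  const summary = summarizeDelays(histogram);
  histogram.reset();
  return summary;
}
```

A collector could call collectAndReset() on a timer and compare the mean or a percentile against the configured threshold; which statistic to compare is a design choice this sketch leaves open.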