
Attach a long event loop delay "span" to an APM transaction #128647

Open
mshustov opened this issue Mar 28, 2022 · 4 comments
Labels
enhancement New value added to drive a business result performance Team:Core Core services & architecture: plugins, logging, config, saved objects, http, ES client, i18n, etc triage_needed wg:performance Work tracked by the performance workgroup

Comments

@mshustov
Contributor

Since the v7.14 PR, Kibana reports a warning if the mean event loop delay exceeds 350ms. This helps users spot a performance problem, but not investigate it, since the runtime context is absent.

To overcome this, we can borrow a few ideas from [this article](https://www.ashbyhq.com/blog/engineering/detecting-event-loop-blockers).
TL;DR: the server can capture the runtime context of expensive tasks by implementing a custom async hook that tracks a task's duration and attaches it to an APM transaction. This would allow Cloud customers to quickly identify which APM transaction triggers CPU-bound tasks on the Kibana server.
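The async-hook idea from the article can be sketched roughly as follows. This is a minimal illustration, not Kibana code: the `enableBlockDetection` name, the callback shape, and the commented-out APM call are assumptions.

```typescript
import { createHook, AsyncHook } from 'async_hooks';
import { performance } from 'perf_hooks';

// Enable an async hook that measures how long each async callback runs
// synchronously, invoking `onBlock` when it exceeds `thresholdMs`.
export function enableBlockDetection(
  thresholdMs: number,
  onBlock: (elapsedMs: number) => void
): AsyncHook {
  const starts = new Map<number, number>();
  const hook = createHook({
    // `before` fires just before an async resource's callback executes.
    before(asyncId) {
      starts.set(asyncId, performance.now());
    },
    // `after` fires when that callback returns; the delta is the time the
    // callback kept the event loop busy.
    after(asyncId) {
      const start = starts.get(asyncId);
      if (start === undefined) return;
      starts.delete(asyncId);
      const elapsed = performance.now() - start;
      if (elapsed > thresholdMs) {
        // Must remain synchronous: scheduling async work here would create
        // new async resources and re-enter this hook.
        onBlock(elapsed);
        // hypothetical: apm.currentTransaction?.addLabels({ blocked_ms: elapsed })
      }
    },
  });
  hook.enable();
  return hook;
}
```

Since the hook has access to the currently executing async resource, the `onBlock` callback is the place where a span or label could be attached to the active APM transaction, giving the runtime context the warning currently lacks.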

@mshustov mshustov added Team:Core Core services & architecture: plugins, logging, config, saved objects, http, ES client, i18n, etc triage_needed labels Mar 28, 2022
@elasticmachine
Contributor

Pinging @elastic/kibana-core (Team:Core)

@mshustov mshustov added performance wg:performance Work tracked by the performance workgroup labels Mar 28, 2022
@mshustov mshustov added the enhancement New value added to drive a business result label Apr 19, 2022
@gsoldevila
Contributor

gsoldevila commented Jan 11, 2024

Status update

  • https://github.com/elastic/kibana/pull/154022/files added ELU measurements "per request" on May 18, 2023.
    • It logs a WARN whenever a measurement exceeds the defined threshold.
    • Logging is off by default and must be opted into.
    • The problem is that we can't ensure a specific request is causing the delays, only that it is impacted by them.
  • The blocked library uses a simple, timer-based mechanism to detect delays, which Node later integrated as perf_hooks#monitorEventLoopDelay.
    • We already use this feature in our EventLoopDelaysMonitor, which is in turn used by the kibana_usage_collection plugin.
    • This approach has 2 main limitations:
      • We cannot know which call / stack is blocking the event loop.
      • We can't report while the event loop is blocked.
  • blocked-at uses a completely different approach, based on Async Hooks.
    • Brandon gave it a shot in Event loop block detection #129402, but closed the PR. We should ask him about his conclusions, if any.
    • This approach has some problems too:
      • We cannot do anything that implies async work in the after() hook, or it will result in an infinite loop.
      • Async Hooks adds a hook for EVERY async callback, so it can be really costly. Quoting the authors of the library:

There's a performance cost to enabling Async Hooks. It's recommended to detect blocking exists with something without the performance overhead and use blocked-at in testing environment to pinpoint where the slowdown happens. Rule of thumb is you should not be running this in production unless desperate.

  • I explored the AsyncLocalStorage path, which could supposedly help provide context throughout the lifecycle of a request.
    • Only to realise that it does NOT help determine the culprit when using blocked or perf_hooks#monitorEventLoopDelay.
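For reference, the timer-based mechanism mentioned above (perf_hooks#monitorEventLoopDelay, which EventLoopDelaysMonitor builds on) can be sketched like this; the 20ms resolution and the reporting helper are illustrative choices, not Kibana's actual code:

```typescript
import { monitorEventLoopDelay } from 'perf_hooks';

// Sample event loop delay into a histogram (values are in nanoseconds).
// The 20ms resolution is an illustrative choice; Node defaults to 10ms.
const histogram = monitorEventLoopDelay({ resolution: 20 });
histogram.enable();

// Illustrative reporting helper, e.g. to run on a collection interval.
function reportEventLoopDelay(): { meanMs: number; maxMs: number } {
  const meanMs = histogram.mean / 1e6; // nanoseconds -> milliseconds
  const maxMs = histogram.max / 1e6;
  histogram.reset();
  return { meanMs, maxMs };
}
```

This illustrates both limitations listed above: the histogram tells you the loop was delayed and by how much, but not which call blocked it, and nothing can be sampled or reported while the loop is actually blocked.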

@gsoldevila
Contributor

gsoldevila commented Jan 11, 2024

The article mentioned in the description pinpoints 2 main scenarios that can cause event loop delays:

  • either non-optimized code that doesn't scale well,
  • or the lack of result set limits (which can cause an endpoint to return an arbitrarily large amount of data).

Perhaps rather than trying to detect the blocks with a timer-based strategy, we could try to calculate them per request.
We know that most of the flows are [Browser => ] Kibana => ES

  1. We identify the entrypoints of the different use cases.
  2. AsyncLocalStorage could help us keep track of the total request time, then subtract the time spent doing HTTP requests to ES, effectively obtaining the "self" time for each use case.
  3. We can WARN when the self time exceeds a certain threshold.

UPDATE: Unfortunately, this does not guarantee that a "blocked" request is the culprit.
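The per-request "self time" idea in steps 1–3 could be sketched with AsyncLocalStorage like this. The `withRequest` and `trackEsCall` helper names are hypothetical, and "self" time here includes any non-ES awaiting, not only CPU time:

```typescript
import { AsyncLocalStorage } from 'async_hooks';
import { performance } from 'perf_hooks';

// Per-request accounting: total wall time minus time spent awaiting ES.
interface RequestTiming {
  start: number;
  esTime: number; // accumulated time spent in ES HTTP calls
}

const als = new AsyncLocalStorage<RequestTiming>();

// Wrap a request handler so its timing context is available downstream.
async function withRequest<T>(
  handler: () => Promise<T>
): Promise<{ result: T; selfMs: number }> {
  const timing: RequestTiming = { start: performance.now(), esTime: 0 };
  const result = await als.run(timing, handler);
  const totalMs = performance.now() - timing.start;
  const selfMs = totalMs - timing.esTime; // "self" time: total minus ES wait
  return { result, selfMs };
}

// Wrap every ES client call so its duration is subtracted from self time.
async function trackEsCall<T>(call: () => Promise<T>): Promise<T> {
  const timing = als.getStore();
  const start = performance.now();
  try {
    return await call();
  } finally {
    if (timing) timing.esTime += performance.now() - start;
  }
}
```

As the UPDATE above notes, a high self time flags a request that was slow outside of ES, but it cannot prove that request was the one blocking the loop rather than a victim of someone else's block.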

@pgayvallet
Contributor

@gsoldevila is there anything we can do on this issue, or should we just close it as won't do?
