
Attach a long event loop delay "span" to an APM transaction #128647

Open
mshustov opened this issue Mar 28, 2022 · 4 comments
Labels
enhancement New value added to drive a business result performance Team:Core Core services & architecture: plugins, logging, config, saved objects, http, ES client, i18n, etc triage_needed wg:performance Work tracked by the performance workgroup

Comments

@mshustov
Contributor

Since the v7.14 PR, Kibana reports a warning if the mean event loop delay exceeds 350ms. This helps users spot a performance problem, but not investigate it, since the runtime context is absent.

To overcome this, we can borrow a few ideas from [this article](https://www.ashbyhq.com/blog/engineering/detecting-event-loop-blockers).
TL;DR: the server can capture the runtime context of expensive tasks by implementing a custom async hook that tracks a task's duration and attaches it to an APM transaction. This would allow Cloud customers to quickly identify which APM transaction triggers CPU-bound tasks on the Kibana server.
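The async-hook idea from the article can be sketched roughly as follows. This is a minimal illustration, not Kibana code: the `enableBlockDetection` name, the callback shape, and the commented-out APM call are assumptions.

```typescript
import { createHook, AsyncHook } from 'async_hooks';
import { performance } from 'perf_hooks';

// Enable an async hook that measures how long each async callback runs
// synchronously, invoking `onBlock` when it exceeds `thresholdMs`.
export function enableBlockDetection(
  thresholdMs: number,
  onBlock: (elapsedMs: number) => void
): AsyncHook {
  const starts = new Map<number, number>();
  const hook = createHook({
    // `before` fires just before an async resource's callback executes.
    before(asyncId) {
      starts.set(asyncId, performance.now());
    },
    // `after` fires when that callback returns; the delta is the time the
    // callback kept the event loop busy.
    after(asyncId) {
      const start = starts.get(asyncId);
      if (start === undefined) return;
      starts.delete(asyncId);
      const elapsed = performance.now() - start;
      if (elapsed > thresholdMs) {
        // Must remain synchronous: scheduling async work here would create
        // new async resources and re-enter this hook.
        onBlock(elapsed);
        // hypothetical: apm.currentTransaction?.addLabels({ blocked_ms: elapsed })
      }
    },
  });
  hook.enable();
  return hook;
}
```

Since the hook has access to the currently executing async resource, the `onBlock` callback is the place where a span or label could be attached to the active APM transaction, giving the runtime context the warning currently lacks.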

@mshustov mshustov added Team:Core Core services & architecture: plugins, logging, config, saved objects, http, ES client, i18n, etc triage_needed labels Mar 28, 2022
@elasticmachine
Contributor

Pinging @elastic/kibana-core (Team:Core)

@mshustov mshustov added performance wg:performance Work tracked by the performance workgroup labels Mar 28, 2022
@mshustov mshustov added the enhancement New value added to drive a business result label Apr 19, 2022
@gsoldevila
Contributor

gsoldevila commented Jan 11, 2024

Status update

  • https://github.com/elastic/kibana/pull/154022/files added ELU measurements "per request" on May 18, 2023.
    • It logs a WARN whenever a measurement exceeds the defined threshold.
    • Logging is off by default and must be opted into.
    • The problem is that we can't ensure a specific request is causing the delays, only that it is impacted by them.
  • The blocked library uses a simple, timer-based mechanism to detect delays, which Node later integrated as perf_hooks#monitorEventLoopDelay.
    • We already use this feature in our EventLoopDelaysMonitor, which is in turn used by the kibana_usage_collection plugin.
    • This approach has 2 main limitations:
      • We cannot know which call / stack is blocking the event loop.
      • We can't report while the event loop is blocked.
  • blocked-at uses a completely different approach, based on Async Hooks.
    • Brandon gave it a shot in Event loop block detection #129402, but closed the PR. We should ask him about his conclusions, if any.
    • This approach has some problems too:
      • We cannot do anything that implies async work in the after() hook, or it will result in an infinite loop.
      • Async Hooks adds a hook for EVERY async callback, so it can be really costly. Quoting the authors of the library:

There's a performance cost to enabling Async Hooks. It's recommended to detect blocking exists with something without the performance overhead and use blocked-at in testing environment to pinpoint where the slowdown happens. Rule of thumb is you should not be running this in production unless desperate.

  • I explored the AsyncLocalStorage path, which could supposedly help provide context throughout the lifecycle of a request.
    • Only to realise that it does NOT help determine the culprit when using blocked or perf_hooks#monitorEventLoopDelay.
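For reference, the timer-based mechanism mentioned above (perf_hooks#monitorEventLoopDelay, which EventLoopDelaysMonitor builds on) can be sketched like this; the 20ms resolution and the reporting helper are illustrative choices, not Kibana's actual code:

```typescript
import { monitorEventLoopDelay } from 'perf_hooks';

// Sample event loop delay into a histogram (values are in nanoseconds).
// The 20ms resolution is an illustrative choice; Node defaults to 10ms.
const histogram = monitorEventLoopDelay({ resolution: 20 });
histogram.enable();

// Illustrative reporting helper, e.g. to run on a collection interval.
function reportEventLoopDelay(): { meanMs: number; maxMs: number } {
  const meanMs = histogram.mean / 1e6; // nanoseconds -> milliseconds
  const maxMs = histogram.max / 1e6;
  histogram.reset();
  return { meanMs, maxMs };
}
```

This illustrates both limitations listed above: the histogram tells you the loop was delayed and by how much, but not which call blocked it, and nothing can be sampled or reported while the loop is actually blocked.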

@gsoldevila
Contributor

gsoldevila commented Jan 11, 2024

The article mentioned in the description pinpoints 2 main scenarios that can cause event loop delays:

  • either non-optimized code that doesn't scale well,
  • or the lack of result set limits (which can cause an endpoint to return an arbitrarily large amount of data).

Perhaps rather than trying to detect the blocks with a timer-based strategy, we could try to calculate them per request.
We know that most of the flows are [Browser => ] Kibana => ES

  1. We identify the entrypoints of the different use cases.
  2. AsyncLocalStorage could help us keep track of the total request time, then subtract the time spent doing HTTP requests to ES, effectively obtaining the "self" time for each use case.
  3. We can WARN when the self time exceeds a certain threshold.

UPDATE: Unfortunately, this does not guarantee that a "blocked" request is the culprit.
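The per-request "self time" idea in steps 1–3 could be sketched with AsyncLocalStorage like this. The `withRequest` and `trackEsCall` helper names are hypothetical, and "self" time here includes any non-ES awaiting, not only CPU time:

```typescript
import { AsyncLocalStorage } from 'async_hooks';
import { performance } from 'perf_hooks';

// Per-request accounting: total wall time minus time spent awaiting ES.
interface RequestTiming {
  start: number;
  esTime: number; // accumulated time spent in ES HTTP calls
}

const als = new AsyncLocalStorage<RequestTiming>();

// Wrap a request handler so its timing context is available downstream.
async function withRequest<T>(
  handler: () => Promise<T>
): Promise<{ result: T; selfMs: number }> {
  const timing: RequestTiming = { start: performance.now(), esTime: 0 };
  const result = await als.run(timing, handler);
  const totalMs = performance.now() - timing.start;
  const selfMs = totalMs - timing.esTime; // "self" time: total minus ES wait
  return { result, selfMs };
}

// Wrap every ES client call so its duration is subtracted from self time.
async function trackEsCall<T>(call: () => Promise<T>): Promise<T> {
  const timing = als.getStore();
  const start = performance.now();
  try {
    return await call();
  } finally {
    if (timing) timing.esTime += performance.now() - start;
  }
}
```

As the UPDATE above notes, a high self time flags a request that was slow outside of ES, but it cannot prove that request was the one blocking the loop rather than a victim of someone else's block.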

@pgayvallet
Contributor

@gsoldevila is there anything we can do on this issue, or should we just close it as won't do?
