7.16 event loop delay value incorrect #116778

rudolf · 2021-10-29T15:01:46Z

An idle 7.16 cluster with monitoring shows an event loop delay of 3 hours (11005301 ms)! This cannot be accurate because response times are still at most 1000 ms.

This appears to be a bug in the event loop delay collector, not in the monitoring app, because the same data is present in the monitoring-* indices.


elasticmachine · 2021-10-29T15:01:48Z

Pinging @elastic/kibana-core (Team:Core)

TinaHeiligers · 2021-11-11T00:09:39Z

After digging into this it looks like the "bug" is related to the units in which we store the data:

Node's native monitorEventLoopDelay returns the data in nanoseconds:

"Creates an IntervalHistogram object that samples and reports the event loop delay over time. The delays will be reported in nanoseconds."

while Monitoring assumes it to be in milliseconds.
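
To make the mismatch concrete, here is the arithmetic on the value from the original report, as a minimal sketch. The helper names (`nsToMs`, `msToHours`) are made up for illustration and are not Kibana or Node.js APIs.

```typescript
// Hypothetical helpers for illustration only; not part of Kibana or Node.js.
const NS_PER_MS = 1000000;
const MS_PER_HOUR = 3600000;

function nsToMs(ns: number): number {
  return ns / NS_PER_MS;
}

function msToHours(ms: number): number {
  return ms / MS_PER_HOUR;
}

// The raw value shown by Monitoring in the original report.
const reported = 11005301;

// Read as milliseconds, the value comes out at roughly three hours.
const hoursIfMs = msToHours(reported);

// Read as nanoseconds, the same value is about 11 ms.
const msIfNs = nsToMs(reported);

console.log(`as ms: ${hoursIfMs.toFixed(2)} h, as ns: ${msIfNs.toFixed(2)} ms`);
```

Read as nanoseconds, the number is plausible for an idle instance, which fits the unit-mismatch explanation.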

We have a few options here:

1. Handle the conversion at the data storage level: convert the raw data on collection from nanoseconds to milliseconds (and add a comment about why this is needed):

    const lastUpdated = new Date();
    this.loopMonitor.disable();
    const { min, max, mean, exceeds, stddev } = this.loopMonitor;

    // monitorEventLoopDelay reports delays in nanoseconds, but consumers such as
    // Monitoring expect milliseconds, so convert before storing.
    const NS_PER_MS = 1000000;
    const collectedData: IntervalHistogram = {
      min: min / NS_PER_MS,
      max: max / NS_PER_MS,
      mean: mean / NS_PER_MS,
      // exceeds is a count of threshold overruns, not a duration: no conversion.
      exceeds,
      stddev: stddev / NS_PER_MS,
      fromTimestamp: this.fromTimestamp.toISOString(),
      lastUpdatedAt: lastUpdated.toISOString(),
      percentiles: {
        50: this.loopMonitor.percentile(50) / NS_PER_MS,
        75: this.loopMonitor.percentile(75) / NS_PER_MS,
        95: this.loopMonitor.percentile(95) / NS_PER_MS,
        99: this.loopMonitor.percentile(99) / NS_PER_MS,
      },
    };

    this.loopMonitor.enable();
    return collectedData;

Or

2. Handle the ns -> ms conversion on the consumer side: either change the units in the event_loop_delay visualization to ns rather than ms, or convert the data from ns to ms before display.
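
A consumer-side conversion could look roughly like the sketch below. The `IntervalHistogram` shape is inferred from the snippet above, and `histogramNsToMs` and the sample values are hypothetical. Note that `exceeds` is left untouched here: in Node's histogram it is a count of threshold overruns rather than a duration.

```typescript
// Shape inferred from the collector snippet; durations are stored in nanoseconds.
interface IntervalHistogram {
  min: number;
  max: number;
  mean: number;
  exceeds: number;
  stddev: number;
  fromTimestamp: string;
  lastUpdatedAt: string;
  percentiles: { 50: number; 75: number; 95: number; 99: number };
}

const NS_PER_MS = 1000000;
const nsToMs = (ns: number): number => ns / NS_PER_MS;

// Hypothetical consumer-side helper: returns a copy with durations in ms.
function histogramNsToMs(h: IntervalHistogram): IntervalHistogram {
  return {
    ...h,
    min: nsToMs(h.min),
    max: nsToMs(h.max),
    mean: nsToMs(h.mean),
    stddev: nsToMs(h.stddev),
    // exceeds is a count, not a duration, so the spread copy above is kept.
    percentiles: {
      50: nsToMs(h.percentiles[50]),
      75: nsToMs(h.percentiles[75]),
      95: nsToMs(h.percentiles[95]),
      99: nsToMs(h.percentiles[99]),
    },
  };
}

// Made-up stored document, for illustration only.
const stored: IntervalHistogram = {
  min: 9000000,
  max: 13000000,
  mean: 11005301,
  exceeds: 0,
  stddev: 500000,
  fromTimestamp: '2021-11-11T00:00:00.000Z',
  lastUpdatedAt: '2021-11-11T00:00:05.000Z',
  percentiles: { 50: 11000000, 75: 11500000, 95: 12500000, 99: 13000000 },
};

const display = histogramNsToMs(stored);
console.log(display.mean, display.percentiles[95]);
```

The trade-off with this option is that every consumer has to remember to call such a helper, which is the failure mode that caused this issue in the first place.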

We already convert from ns to ms in track_threshold, but I'm not sure whether other consumers also treat the data as ns rather than ms.

@rudolf WDYT?

mshustov · 2021-11-11T11:12:04Z

Nanoseconds aren't human-friendly. It's hard to say whether 1080000ns or 108000ns is fast enough, whereas the difference between 1ms and 0.1ms is easy to spot.
Also, we aren't interested in fluctuations of the event loop delay small enough to justify using nanoseconds.
IMO there is no good reason to change the current default, so I'd go with option 1, even though changing the event_loop_delay_histogram units might affect downstream plugins.

TinaHeiligers · 2021-11-11T16:29:57Z

"there is no good reason to change the current default"

Is there a list of default units that Kibana uses somewhere? If not, it might be a good idea to document that, both for development guidance and for end users.

mshustov · 2021-11-11T18:13:12Z

"If not, it might be a good idea to document that, both for development guidance and for end users."

I don't know, to be honest, but I think there is no one-size-fits-all solution. The units are domain-specific: for performance metrics, we can use nanoseconds; for a backup copying period, we can specify seconds or minutes.
We can start by adding units as comments in the Core public API. Monitoring would probably have caught the problem earlier if Core had documented that event_loop_delay_histogram uses nanoseconds.
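
Documenting units on a public type could be as light as a doc comment per field. A stricter variant encodes the unit in the type itself. The sketch below shows both, assuming hypothetical `Nanoseconds` and `Milliseconds` branded types and a hypothetical `DocumentedHistogram` interface, none of which exist in Kibana.

```typescript
// Branded number types: hypothetical, for illustration only.
type Nanoseconds = number & { readonly __unit: 'ns' };
type Milliseconds = number & { readonly __unit: 'ms' };

// Tags a raw number as nanoseconds, e.g. a value read from Node's histogram.
const ns = (value: number): Nanoseconds => value as Nanoseconds;

function toMilliseconds(value: Nanoseconds): Milliseconds {
  return (value / 1000000) as Milliseconds;
}

/** Subset of a delay histogram with the unit stated on every field. */
interface DocumentedHistogram {
  /** Minimum recorded event loop delay, in milliseconds. */
  min: Milliseconds;
  /** Maximum recorded event loop delay, in milliseconds. */
  max: Milliseconds;
  /** Mean recorded event loop delay, in milliseconds. */
  mean: Milliseconds;
}

// Made-up sample values.
const documented: DocumentedHistogram = {
  min: toMilliseconds(ns(9000000)),
  max: toMilliseconds(ns(13000000)),
  mean: toMilliseconds(ns(11005301)),
};

// Assigning a raw number or a Nanoseconds value is rejected by the compiler:
// const wrong: DocumentedHistogram = { min: 9000000, max: 13000000, mean: ns(11005301) };

console.log(documented.mean);
```

Branded types only help TypeScript callers at compile time. Data already written to an index carries no unit, so the doc comments remain useful either way.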

rudolf · 2021-11-11T20:33:26Z

Can you test what happens on 7.15 or 7.14? I'm not sure when this regression was introduced, but it used to show the correct value. I would guess that we always used to store ms values but that the Node.js v16 upgrade changed the units to ns.

If this is the case, I think we should continue to store these values in ms so that a user can compare event-loop monitoring data from before and after an upgrade from e.g. 7.14 to 7.16.

TinaHeiligers · 2021-11-12T01:36:06Z

"Monitoring would probably have caught the problem earlier if Core had documented that event_loop_delay_histogram uses nanoseconds."

@mshustov There is a comment in the IntervalHistogram interface.
To prevent this from happening again, where is the best place to add the units? In the interface's keys?

"Can you test what happens on 7.15 or 7.14?"

@rudolf I spun up a 7.14 deployment on Cloud, and comparing Event Loop Delay with Client Response Times shows that the delays are about 2 orders of magnitude smaller than the response times.
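
The comparison can be expressed as a quick sanity check. This is only a sketch: both helpers are hypothetical and the sample values are made up, not measurements from the 7.14 deployment.

```typescript
// Hypothetical helper: how many orders of magnitude `smaller` is below `larger`.
function ordersOfMagnitudeBelow(smaller: number, larger: number): number {
  return Math.log10(larger / smaller);
}

// Hypothetical check: a delay larger than the response time hints at a unit mix-up,
// which is the reasoning used in the original report.
function unitsLookWrong(delayMs: number, responseMs: number): boolean {
  return delayMs > responseMs;
}

// Made-up sample values, both in milliseconds.
const eventLoopDelayMs = 10;
const responseTimeMs = 1000;

const gap = ordersOfMagnitudeBelow(eventLoopDelayMs, responseTimeMs);
console.log(`delay is ${gap.toFixed(1)} orders of magnitude below response time`);

// The value from the report, read as ms, against a 1000 ms response time.
console.log(unitsLookWrong(11005301, 1000));
```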

For details of how this bug was introduced, see #118447

rudolf added the bug, Team:Core, and v7.16.0 labels Oct 29, 2021

lukeelmers added the EnableJiraSync label Nov 1, 2021

exalate-issue-sync bot added the loe:small label and removed the loe:medium label Nov 9, 2021

TinaHeiligers self-assigned this Nov 10, 2021

TinaHeiligers mentioned this issue Nov 12, 2021

Handles ns to ms conversion for event loop delay metrics #118447 (merged)

TinaHeiligers closed this as completed in #118447 Nov 15, 2021

exalate-issue-sync bot reopened this Nov 17, 2021

exalate-issue-sync bot closed this as completed Nov 17, 2021
