Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

7.16 event loop delay value incorrect #116778

Closed
rudolf opened this issue Oct 29, 2021 · 7 comments · Fixed by #118447
Closed

7.16 event loop delay value incorrect #116778

rudolf opened this issue Oct 29, 2021 · 7 comments · Fixed by #118447
Assignees
Labels
bug Fixes for quality problems that affect the customer experience impact:high Addressing this issue will have a high level of impact on the quality/strength of our product. loe:small Small Level of Effort Team:Core Core services & architecture: plugins, logging, config, saved objects, http, ES client, i18n, etc v7.16.0

Comments

@rudolf
Copy link
Contributor

rudolf commented Oct 29, 2021

An idle 7.16 cluster with monitoring shows an event loop delay of 3 hours (11005301 ms)! This cannot be accurate because response times are still at most 1000 ms.

This appears to be a bug in the event loop delay collector not the monitoring app because the same data is present in the monitoring-* indices.
Screenshot 2021-10-29 at 17 00 42

@rudolf rudolf added bug Fixes for quality problems that affect the customer experience Team:Core Core services & architecture: plugins, logging, config, saved objects, http, ES client, i18n, etc v7.16.0 labels Oct 29, 2021
@elasticmachine
Copy link
Contributor

Pinging @elastic/kibana-core (Team:Core)

@exalate-issue-sync exalate-issue-sync bot added impact:low Addressing this issue will have a low level of impact on the quality/strength of our product. loe:small Small Level of Effort impact:high Addressing this issue will have a high level of impact on the quality/strength of our product. loe:medium Medium Level of Effort and removed impact:low Addressing this issue will have a low level of impact on the quality/strength of our product. loe:small Small Level of Effort labels Nov 1, 2021
@exalate-issue-sync exalate-issue-sync bot added loe:small Small Level of Effort and removed loe:medium Medium Level of Effort labels Nov 9, 2021
@TinaHeiligers TinaHeiligers self-assigned this Nov 10, 2021
@TinaHeiligers
Copy link
Contributor

TinaHeiligers commented Nov 11, 2021

After digging into this it looks like the "bug" is related to the units in which we store the data:

Node's native monitorEventLoopDelay returns the data in nanoseconds:

"Creates an IntervalHistogram object that samples and reports the event loop delay over time. The delays will be reported in nanoseconds."

while Monitoring assumes it to be in milliseconds.

We have a few options here:

  1. Handle the conversion at data storage level: convert the raw data on collection from nanoseconds to milliseconds (and add a comment about why this is needed):
    const lastUpdated = new Date();
    this.loopMonitor.disable();
    const { min, max, mean, exceeds, stddev } = this.loopMonitor;

    const collectedData: IntervalHistogram = {
      min: min / 1000000,
      max: max / 1000000,
      mean: mean / 1000000,
      exceeds: exceeds / 1000000,
      stddev: stddev / 1000000,
      fromTimestamp: this.fromTimestamp.toISOString(),
      lastUpdatedAt: lastUpdated.toISOString(),
      percentiles: {
        50: this.loopMonitor.percentile(50) / 1000000,
        75: this.loopMonitor.percentile(75) / 1000000,
        95: this.loopMonitor.percentile(95) / 1000000,
        99: this.loopMonitor.percentile(99) / 1000000,
      },
    };

    this.loopMonitor.enable();
    return collectedData;
  }

event_loop_delay_metrics_ns_to_ms_converted_in_collector

Or
2. Handle ns -> ms metrics conversion on consumer side: i.e. Change units in event_loop_delay visualization to ns rather than ms

event_loop_delay_ns_label_change_in_monitoring

OR convert data from ns -> ms before display.

We convert from ns to ms in track_threshold but I'm not sure if there are other consumers who treat the data as ns rather than ms.

@rudolf WDYT?

@mshustov
Copy link
Contributor

mshustov commented Nov 11, 2021

Nanoseconds aren't human-friendly. It's hard to say whether 1080000ns or 108000ns is fast enough or not. However, the difference between 1ms and 0.1ms is easier to spot for a human eye.
Also, we aren't interested in small fluctuations of the event loop delay to justify nanoseconds usage.
IMO there is no good reason to change the current default, so I'd better do with the option 1. Even though, changes in event_loop_delay_histogram units might affect downstream plugins.

@TinaHeiligers
Copy link
Contributor

there is no good reason to change the current default

Is there a list of default units Kibana uses somewhere? If not, maybe it might be a good idea to document that, both for development guidance and for end-users.

@mshustov
Copy link
Contributor

mshustov commented Nov 11, 2021

If not, maybe it might be a good idea to document that, both for development guidance and for end-users.

I don't know to be honest, but I think there is no one-fits-all solution. The units are domain-specific: for performance metrics, we can use nanoseconds; for a backup copying period, we can specify seconds or minutes.
We can start with adding units in form of the comments for the Core public API. Probably, Monitoring caught the problem earlier if Core documented event_loop_delay_histogram uses nanoseconds units.

@rudolf
Copy link
Contributor Author

rudolf commented Nov 11, 2021

Can you test what happens on 7.15 or 7.14? I'm not sure when this regression was introduced, but it used to show the correct value. I would guess that we always used to store ms values but that the Nodejs v16 upgrade changed the units to ns.

If this is the case, I think we should continue to store these values in ms so that a user can compare event-loop monitoring data from before and after an upgrade from e.g. 7.14 to 7.16.

@TinaHeiligers
Copy link
Contributor

TinaHeiligers commented Nov 12, 2021

Probably, Monitoring caught the problem earlier if Core documented event_loop_delay_histogram uses nanoseconds units.

@mshustov There is a comment in the IntervalHistogram interface.
To prevent this happening again, where is the best place to add the units? In the interface's keys?

Can you test what happens on 7.15 or 7.14?

@rudolf I spun up a 7.14 deployment on cloud and the order of magnitudes comparison between Event loop Delay and Client Response Times shows that the delays are about 2 orders of magnitude less than the response times
cloud_staging_7-14

For details of how this bug was introduced, see #118447

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Fixes for quality problems that affect the customer experience impact:high Addressing this issue will have a high level of impact on the quality/strength of our product. loe:small Small Level of Effort Team:Core Core services & architecture: plugins, logging, config, saved objects, http, ES client, i18n, etc v7.16.0
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants