[7.16.0 BC2] Potential memory leak observed - possibly caused by perf_hooks usage #116636
Pinging @elastic/kibana-core (Team:Core)
Pinging @elastic/kibana-alerting-services (Team:Alerting Services)
An update on the first graph at the top: the curve is starting to trend down! So maybe this isn't a leak - maybe node v16 used in 7.16 (v14 in 7.15) is "smarter" about using memory - basically using as much as it possibly can, instead of trying to GC stuff so often. I was thinking that curve would go up and it would eventually OOM. I tried running 7.16 locally, in dev mode (quickest way for me to get started), with the same rule load test, and let it run overnight. Took a heap snapshot once the rules started running, and then again this morning. Nothing substantial in the comparison view! Which I first thought was maybe because dev mode was doing something different, or maybe the graphs we're seeing aren't what I'm used to seeing (for instance, these aren't RSS or heap sizes, but some container-based measurement of the memory used). So ... perhaps nothing to see here. I'll poke around the node / v8 docs to see if there was a change to the memory management in v16.
Once we've enabled monitoring we'll be better able to tell how much is heap vs RSS memory.
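For a quick local check of heap vs RSS, here is a minimal sketch using Node's built-in `process.memoryUsage()` (this is not how the cloud charts above are produced, just a way to eyeball the two numbers side by side):

```js
// Log RSS vs V8 heap every 30 seconds; values are in bytes, converted to MB here.
const MB = 1024 * 1024;

setInterval(() => {
  const { rss, heapTotal, heapUsed, external } = process.memoryUsage();
  console.log(
    `rss=${(rss / MB).toFixed(1)}MB ` +
      `heapTotal=${(heapTotal / MB).toFixed(1)}MB ` +
      `heapUsed=${(heapUsed / MB).toFixed(1)}MB ` +
      `external=${(external / MB).toFixed(1)}MB`
  );
}, 30_000);
```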
For an apples-to-apples comparison, I created 4 clusters on GCP in the same region with the same configuration: 2 of them are on 7.15.1 and 2 are on 7.16.0 BC2. I loaded 150 rules running at a 1s interval on a 7.15.1 deployment and a 7.16.0-BC2 deployment and let them run overnight. Here are the memory charts after 16 hours. To summarize what I'm seeing: On 7.16.0-BC2, memory starts at 45% and has increased to 49% on an empty Kibana after 16 hours. With 150 rules running at a high interval, memory increases to around 60% utilization (a 15% increase) and steadily increases until ~83% utilization, where it looks like some sort of garbage collection kicks in and drops it back down to 60% again. This took around 13 hours.
I've enabled monitoring on the 2 7.16.0 deployments. Monitoring cluster here: https://ymao-monitoring.kb.europe-west1.gcp.cloud.es.io:9243/
I've created the following visualization to show heap used vs RSS: https://ymao-monitoring.kb.europe-west1.gcp.cloud.es.io:9243/app/lens#/edit/c69320d0-38c0-11ec-ad24-a7cfc3f6acda?_g=(filters:!(),refreshInterval:(pause:!f,value:10000),time:(from:now-1d,to:now))
Latest memory charts for the 7.16.0 BC2 deployments (over the last 80 hours)
I'm seeing similar graphs. None of the deployments I created last week to test this have OOM'd (BC1 and BC2, w/rules and w/o). The ones with rules have the sawtooth pattern, never hitting 100%. I suspect the ones without rules will have a sawtooth pattern looking over much longer ranges of time. So ... not looking like a leak to me at this point - a change in GC behavior in v8/node? Delaying big GCs for longer?
I agree that it is likely a change in behavior with Node 16 / v8. I ran some local deployments and used Node's memory usage reporting to compare. For 7.15.1:
For 7.16.0 commit
For 7.16.0 commit
For 7.16.0 BC2
Not the most scientific of tests, but it definitely shows me the difference in how memory is used before vs. after the Node 16 upgrade.
@rudolf @elastic/kibana-core Would you agree that this doesn't seem to be a memory leak, but rather a change in how memory is handled by Node after the v16 upgrade? Is there anything further that alerting can do with regard to this investigation?
Same thoughts as Ying. Alerting provides a good way to force more memory usage, and we could probably tweak it to eat memory faster if we want to repro. But I assume pounding on Kibana with an HTTP load tester would be just as easy. Rudolf mentioned in conversation that this could be an issue for reporting, which launches Chromium instances. If a report is requested when node has more memory allocated (one of the peaks in the graph), there may not be enough memory to launch Chromium, compared to other times when there would be. I'd guess we may want to see if there are some options to at least stabilize the amount of memory used - probably an option that would run GC more often (if that's what's going on here). It also seems like a signal that we probably don't want to be running multiple "apps" in the same container at the same time.
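As an aside, a minimal sketch of the kind of GC knobs alluded to here (these are standard Node/V8 flags; the file names are hypothetical and this is not something Kibana is known to configure):

```js
// Capping old space makes V8 collect earlier instead of growing toward the
// container limit, e.g.:  node --max-old-space-size=800 server.js
// For local experiments only, --expose-gc makes global.gc() available:
//   node --expose-gc gc-check.js
if (typeof global.gc === 'function') {
  const before = process.memoryUsage().heapUsed;
  global.gc(); // force a full collection
  const after = process.memoryUsage().heapUsed;
  console.log(`heapUsed: ${before} -> ${after} bytes after forced GC`);
} else {
  console.log('run with --expose-gc to enable global.gc()');
}
```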
Let's see what @rudolf thinks about it, but personally I agree that it seems like it's just a change in node's memory management / garbage collection. Charts from #116636 (comment) and #116636 (comment) show that the memory goes back to a constant ceiling after every GC on 7.16, which is imho proof enough that the memory management is sane and we don't have a leak here.
There is a slight increase when measuring directly after the GC, though because it only happens once a day this could be just normal variation. So I agree it doesn't look like a memory leak. Let's keep @ymao1's cluster running for maybe another week just to be 100% sure.
OK, some new news. Happened to be poking at one of my clusters, and realized it's been rebooting every 10 hours! I think I was expecting the graphs or something in the cloud UX to tell me about this, but never noticed anything, and didn't happen to be looking at the right log messages. This time I specifically searched for "Booting at" which is one of the first messages logged before all of the usual startup messages (this is on cloud). The times seem to line up with the peaks in my memory graphs. So it seems like it's OOM'ing. This was for my 7.16.0 BC2 deployment that did have 200 rules running, but I actually deleted them yesterday. My 7.16.0 BC2 "do nothing" deployment - created the deployment and haven't done anything with it since creation - rebooted yesterday. It lasted ~4 days. It's a 1GB Kibana deploy. From that deploy, I'm seeing some repeated things in the logs:
visualization telemetry messages, every hour
Also noticed this in the logs, which looks bad:
IIRC, we used to use some "perf hooks" in Node to get some performance numbers in task manager, but don't think we're using those numbers much anymore, not sure. My deployment that was running 200 rules was getting these messages twice per boot, separated by ~3 hours. Guessing it would have printed another if it hadn't rebooted - maybe the third one pushed it over the edge?
If this is a leak in perf_hooks somehow, that probably correlates with not seeing significant object memory increases in the heap dumps I compared over time, because they're likely stored in native memory.
Seeing the same thing :( My 7.16 empty deployment does not have any "Booting at" messages but does have 2 of the same log messages.
We've added kibana/src/core/server/metrics/event_loop_delays/event_loop_delays_monitor.ts (lines 9 to 11 in 7d66002).
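The embedded snippet isn't reproduced above. For context, here is a minimal sketch of the Node perf_hooks API that monitor presumably wraps, `monitorEventLoopDelay` (this is not the actual Kibana implementation):

```js
const { monitorEventLoopDelay } = require('perf_hooks');

// Sample event loop delay with a 10ms timer resolution.
const histogram = monitorEventLoopDelay({ resolution: 10 });
histogram.enable();

setInterval(() => {
  // Histogram values are reported in nanoseconds.
  console.log({
    mean_ms: histogram.mean / 1e6,
    p95_ms: histogram.percentile(95) / 1e6,
    max_ms: histogram.max / 1e6,
  });
  histogram.reset();
}, 5_000);
```

This API uses its own histogram rather than performance marks, so it shouldn't be affected by the mark buffer behaviour discussed below.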
I think @pmuellr is suspecting the task manager usage, which has been there since 7.6.
Task manager should clear marks / measurements to avoid using up the performance buffer: https://w3c.github.io/perf-timing-primer/#performance-mark
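A minimal sketch of that idea (illustrative only, not the actual task manager code; `recordTaskDuration` is a hypothetical sink): clear marks and measures once the duration has been read, so entries don't pile up in the global performance buffer.

```js
const { performance, PerformanceObserver } = require('perf_hooks');

const observer = new PerformanceObserver((list) => {
  for (const entry of list.getEntries()) {
    recordTaskDuration(entry.name, entry.duration);
  }
  // On Node >= 16.7, marks/measures stay on the performance timeline until
  // cleared, so clean up after reading them.
  performance.clearMarks();
  performance.clearMeasures();
});
observer.observe({ entryTypes: ['measure'] });

function runTask(id) {
  performance.mark(`task-${id}-start`);
  // ... do the actual work ...
  performance.mark(`task-${id}-end`);
  performance.measure(`task-${id}`, `task-${id}-start`, `task-${id}-end`);
}

function recordTaskDuration(name, durationMs) {
  console.log(`${name}: ${durationMs.toFixed(1)}ms`);
}

runTask(1);
```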
Node.js 16.7.0 (nodejs/node#39782) included nodejs/node#39297, which changed the behaviour of perf hooks.
So I feel rather confident that this is the root cause, good thinking @pmuellr 🏅
I did a little test last night; I was able to create 12M marks in about 4s in a 1GB old space before an OOM. Seems like a big number, and not sure how many marks task manager is adding, but that's all that VM was doing. Looks like task manager is the only code actually using marks, so I think the other uses of perf_hooks are likely fine.
Keeping this open so we can keep an eye on the BC3 build.
Spun up some BC3 deployments. Will update with charts after letting them run for a day.
(OMG, this comment has been sitting in my browser for DAYS!!! Whoops! So not really relevant anymore, but figured I'd note how I figured out marks were leaking.) Here's a script that will count how many performance "marks" you can make before it OOMs with a 1GB old space. First run is on node v16 (OOMs), second is on node v14 (doesn't OOM).
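The script itself didn't make it into the comment. A rough reconstruction of the idea (not the original script) might look like this; run it with `node --max-old-space-size=1024`:

```js
// Keep creating performance marks and log the count as memory grows.
// On Node >= 16.7 the marks accumulate on the performance timeline and the
// process eventually OOMs; on Node 14 it keeps running.
const { performance } = require('perf_hooks');

let count = 0;
const start = Date.now();

function makeMarks() {
  for (let i = 0; i < 100_000; i++) {
    performance.mark(`mark-${count++}`);
  }
  const heapUsedMB = Math.round(process.memoryUsage().heapUsed / 1024 / 1024);
  console.log(`${Date.now() - start}ms: ${count} marks, heapUsed ${heapUsedMB}MB`);
  setImmediate(makeMarks); // yield to the event loop between batches
}

makeMarks();
```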
@ymao1 It doesn't look like you've enabled monitoring on the BC3 deployments; could you send the monitoring logs to https://ymao-monitoring.kb.europe-west1.gcp.cloud.es.io:9243/app/monitoring#/home?_g=(refreshInterval:(pause:!f,value:10000),time:(from:now-8d,to:now))
@rudolf Oops! Just enabled.
Looking good so far! Last 20 hours: no warning messages in the logs. (Charts attached for reference.) Will let it run over the weekend and update again after.
The last 80 hours: still no warning messages.
I built a 7.16.0 BC1 deployment in our cloud staging env the other day, and loaded it up with my usual alerting load test - 100 index threshold rules running with a 1s interval - the rule is doing very little i/o itself, this is a "framework" test to make sure the framework is holding up. Deployment name is pmuellr 7.16.0 BC1
I noticed some memory growth over time:
I've restarted Kibana twice, to tweak the task manager config to get it to run rules more frequently, in hopes of accelerating the memory growth - expecting an OOM this weekend.
To check to see if this was alerting, or maybe a general Kibana issue, I built another deployment named pmuellr 7.16.0 BC1 do nothing. It is literally doing nothing :-) It looks like it is also slowly leaking, but maybe too soon to tell:
For reference, here is the same setup but for 7.15.1, which I've been doing other testing with, so it has a little more variability over the last few days, but it obviously does not look like the memory metrics are increasing like the other two:
So my current guess is that this may be a leak in some core Kibana service, and with alerting driving more service calls, it's aggravating the memory leak. For example, in play here with alerting are Saved Objects, es client library, the new execution context (async hook) stuff, etc. Also Task Manager.