
[7.16.0 BC2] Potential memory leak observed - possibly caused by perf_hooks usage #116636

Closed
pmuellr opened this issue Oct 28, 2021 · 29 comments · Fixed by #117294
Labels
bug Fixes for quality problems that affect the customer experience · estimate:medium Medium Estimated Level of Effort · Feature:Task Manager · Team:Core Core services & architecture: plugins, logging, config, saved objects, http, ES client, i18n, etc · Team:ResponseOps Label for the ResponseOps team (formerly the Cases and Alerting teams)

Comments

@pmuellr
Member

pmuellr commented Oct 28, 2021

I built a 7.16.0 BC1 deployment in our cloud staging env the other day, and loaded it up with my usual alerting load test - 100 index threshold rules running with a 1s interval - the rule is doing very little i/o itself, this is a "framework" test to make sure the framework is holding up. Deployment name is pmuellr 7.16.0 BC1

I noticed some memory growth over time:

image

I've restarted Kibana twice, to tweak the task manager config to get it to run rules more frequently, in hopes of accelerating the memory growth - expecting an OOM this weekend.

To check to see if this was alerting, or maybe a general Kibana issue, I built another deployment named pmuellr 7.16.0 BC1 do nothing. It is literally doing nothing :-) It looks like it is also slowly leaking, but maybe too soon to tell:

image

For reference, here is the same setup but for 7.15.1, which I've been doing other testing with, so it has a little more variability over the last few days - but clearly the memory metrics are not increasing like they are in the other two:

image

So my current guess is that this may be a leak in some core Kibana service, and alerting, by driving more service calls, is aggravating it. For example, in play here with alerting are Saved Objects, the ES client library, the new execution context (async hooks) stuff, etc. Also Task Manager.

@pmuellr pmuellr added the bug, Team:Core, and Team:ResponseOps labels Oct 28, 2021
@elasticmachine
Contributor

Pinging @elastic/kibana-core (Team:Core)

@elasticmachine
Contributor

Pinging @elastic/kibana-alerting-services (Team:Alerting Services)

@ymao1 ymao1 self-assigned this Oct 28, 2021
@pmuellr
Member Author

pmuellr commented Oct 28, 2021

an update on the first graph at the top:

image

The curve is starting to trend down!

So maybe this isn't a leak - maybe node v16 used in 7.16 (v14 in 7.15) is "smarter" about using memory - basically using as much as it possibly can, instead of trying to GC stuff so often. I was thinking that curve would go up and it would eventually OOM.

I tried running 7.16 locally, in dev mode (quickest way for me to get started), with the same rule load test, and let it run overnight. Took a heap snapshot once the rules started running, and then again this morning. Nothing substantial in the comparison view! At first I thought that was maybe because dev mode was doing something different, or maybe because the graphs we're seeing aren't what I'm used to seeing (for instance, these aren't RSS or heap sizes but some container-based measurement of the memory used).

So ...perhaps nothing to see here. I'll poke around the node / v8 docs to see if there was a change to the memory management in v16

@rudolf
Contributor

rudolf commented Oct 29, 2021

Once we've enabled monitoring we'll be better able to tell how much is heap vs RSS memory.

@ymao1
Contributor

ymao1 commented Oct 29, 2021

For an apples to apples comparison, I created 4 clusters on GCP in the same region with the same configuration

xpack.task_manager.max_workers: 50
xpack.task_manager.poll_interval: 1000

I'm also using es-apm-sys-sim to generate sample data for the clusters with rules (25 hosts, updated every 2 seconds).

2 of them are on 7.15.1 and 2 are on 7.16.0 BC2. I loaded 150 rules running at 1s interval on a 7.15.1 deployment and a 7.16.0-BC2 deployment and let them run overnight. Here are the memory charts after 16 hours:

7.15.1 empty
Screen Shot 2021-10-29 at 7 58 15 AM

7.15.1 with rules
Screen Shot 2021-10-29 at 7 59 22 AM

7.16.0 empty
Screen Shot 2021-10-29 at 8 00 21 AM

7.16.0 with rules
Screen Shot 2021-10-29 at 8 00 56 AM

To summarize what I'm seeing:
On 7.15.1, memory stays steady around 37% utilization for an empty Kibana. With 150 rules running at a 1s interval, memory increases to around 52% utilization but holds steady there, an increase of 15%.

On 7.16.0-BC2, memory starts at 45% and has increased to 49% on an empty Kibana after 16 hours. With 150 rules running at a 1s interval, memory increases to around 60% utilization (a 15% increase) and then climbs steadily until ~83% utilization, where it looks like some sort of garbage collection kicks in and drops it back down to 60%. This took around 13 hours.

@ymao1
Contributor

ymao1 commented Oct 29, 2021

I've enabled monitoring on the 2 7.16.0 deployments. Monitoring cluster here: https://ymao-monitoring.kb.europe-west1.gcp.cloud.es.io:9243/

@rudolf
Contributor

rudolf commented Oct 29, 2021

@ymao1
Contributor

ymao1 commented Nov 1, 2021

Latest memory charts for the 7.16.0 BC2 deployments (over the last 80 hours):

7.16.0 empty
Screen Shot 2021-11-01 at 7 50 04 AM

7.16.0 with rules
Screen Shot 2021-11-01 at 7 51 46 AM

@pmuellr
Member Author

pmuellr commented Nov 1, 2021

I'm seeing similar graphs. None of the deployments I created last week to test this have OOM'd (BC1 and BC2, with and without rules). The ones with rules show the sawtooth pattern, never hitting 100%. I suspect the ones without rules would show a sawtooth pattern too over much longer time ranges.

So ... not looking like a leak to me at this point - a change in GC behavior in v8/node? Delaying big GCs for longer?

@ymao1
Contributor

ymao1 commented Nov 1, 2021

I agree that it is likely a change in behavior with Node16/v8

I ran some local deployments using Node's --expose-gc option to force garbage collection and then print out memory usage at the end of each alerting task runner run. Then I ran 50 rules at 3s intervals for 2 hours and compared the memory usage snapshots from the beginning and end of the 2 hour run:

For 7.15.1:

rss -16.76MB
heapTotal -33MB
heapUsed -0.25MB

For 7.16.0 commit f96311fd9e23f56630a5cf16024c87b63f2efb8f right before NodeJS upgrade:

rss -13.91MB
heapTotal -34MB
heapUsed -1.43MB

For 7.16.0 commit e1503d4658ed02081faecbc794e3f00e461df38a - NodeJS upgrade:

rss +34.83MB
heapTotal +16.25MB
heapUsed +12.37MB

For 7.16.0 BC2

rss +39.96MB
heapTotal +15MB
heapUsed +13.21MB

Not the most scientific of tests, but it definitely shows me the difference in how memory is used before vs. after the Node 16 upgrade.
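
For reference, here's a minimal sketch of the kind of measurement I mean (the helper below is hypothetical, not the actual task runner instrumentation) - start Kibana with node --expose-gc and call something like this at the end of each task run:

// force a full GC so successive samples are comparable, then log memory usage
function forceGcAndLogMemory(label) {
  if (global.gc) global.gc(); // global.gc is only defined when node runs with --expose-gc
  const { rss, heapTotal, heapUsed } = process.memoryUsage();
  const mb = (bytes) => (bytes / (1024 * 1024)).toFixed(2);
  console.log(`${label} rss=${mb(rss)}MB heapTotal=${mb(heapTotal)}MB heapUsed=${mb(heapUsed)}MB`);
}

forceGcAndLogMemory('after task run');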

@ymao1
Contributor

ymao1 commented Nov 1, 2021

@rudolf @elastic/kibana-core Would you agree that this doesn't seem to be a memory leak, but rather a change in how memory is handled by Node after the v16 upgrade? Is there anything further that alerting can do with regard to this investigation?

@pmuellr
Member Author

pmuellr commented Nov 1, 2021

Same thoughts as Ying. Alerting provides a good way to force more memory usage, and we could probably tweak it to make it eat memory faster if we want to repro. But I assume pounding on Kibana with an HTTP load tester would be just as easy.

Rudolf mentioned in conversation that this could be an issue for reporting, which launches Chromium instances. If a report is requested when node has more memory allocated (one of the peaks in the graph), there may not be enough memory left to launch Chromium, compared to other times when there would be. I'd guess we may want to see if there are options to at least stabilize the amount of memory used - probably an option that runs GC more often (if that's what's going on here). It also seems like a signal that we probably don't want to be running multiple "apps" in the same container at the same time.

@pgayvallet
Contributor

Let's see what @rudolf thinks about it, but personally I agree that it seems like just a change in node's memory management / garbage collection. The charts from #116636 (comment) and #116636 (comment) show that memory drops back to a consistent level after every GC on 7.16, which is imho proof enough that the memory management is sane and we don't have a leak here.

@pmuellr pmuellr changed the title [7.16.0 BC1] memory leak observed, aggravated by alerting load tests [7.16.0 BC1] potential memory leak observed, aggravated by alerting load tests Nov 2, 2021
@rudolf
Contributor

rudolf commented Nov 2, 2021

There is a slight increase when measuring directly after the GC, though because it only happens once a day this could be just normal variation. So I agree it doesn't look like a memory leak.

Screenshot 2021-11-02 at 14 23 37

Let's keep @ymao1 's cluster running for maybe another week just to be 100% sure.

@pmuellr
Member Author

pmuellr commented Nov 2, 2021

OK, some new news.

Happened to be poking at one of my clusters, and realized it's been rebooting every 10 hours! I think I was expecting the graphs or something in the cloud UX to tell me about this, but never noticed anything, and didn't happen to be looking at the right log messages. This time I specifically searched for "Booting at" which is one of the first messages logged before all of the usual startup messages (this is on cloud).

The times seem to line up with the peaks in my memory graphs. So it seems like it's OOM'ing. This was for my 7.16.0 BC2 deployment that did have 200 rules running, but I actually deleted them yesterday.

My 7.16.0 BC2 "do nothing" deployment - created the deployment and haven't done anything with it since creation - rebooted yesterday. It lasted ~4 days. It's a 1GB Kibana deploy.

From that deploy, I'm seeing some repeated things in the logs:

  • a "no endpoint metadata transforms found" message from SIEM, every two hours
  • a set of visualization telemetry messages (I think), every hour, all occurring at the same time (below)
visualization telemetry messages, every hour
Error: Bad Request (stack trace below)
Error: Bad Request (stack trace below)
Error: Bad Request (stack trace below)
Error: Bad Request (stack trace below)
Failed to open PIT for types [visualization]
Unable to fetch data from vis_type_table collector
Unable to fetch data from vis_type_timeseries collector
Unable to fetch data from vis_type_vega collector
Unable to fetch data from visualization_types collector

stack trace for "Error: Bad Request"

    at Function.createBadRequestError (/usr/share/kibana/src/core/server/saved_objects/service/lib/errors.js:80:36)
    at SpacesSavedObjectsClient.openPointInTimeForType (/usr/share/kibana/x-pack/plugins/spaces/server/saved_objects/spaces_saved_objects_client.js:257:48)
    at runMicrotasks (<anonymous>)
    at processTicksAndRejections (node:internal/process/task_queues:96:5)
    at PointInTimeFinder.open (/usr/share/kibana/src/core/server/saved_objects/service/lib/point_in_time_finder.js:144:11)
    at PointInTimeFinder.find (/usr/share/kibana/src/core/server/saved_objects/service/lib/point_in_time_finder.js:99:5)
    at getStats (/usr/share/kibana/src/plugins/vis_types/table/server/usage_collector/get_stats.js:57:20)
    at /usr/share/kibana/src/plugins/usage_collection/server/collector/collector_set.js:104:21
    at async Promise.all (index 32)
    at Object.bulkFetch (/usr/share/kibana/src/plugins/usage_collection/server/collector/collector_set.js:91:25)
    at getKibana (/usr/share/kibana/src/plugins/telemetry/server/telemetry_collection/get_kibana.js:68:17)
    at async Promise.all (index 3)
    at /usr/share/kibana/src/plugins/telemetry/server/telemetry_collection/get_local_stats.js:79:76
    at async Promise.all (index 0)
    at getLocalStats (/usr/share/kibana/src/plugins/telemetry/server/telemetry_collection/get_local_stats.js:78:10)
    at async Promise.all (index 0) {
  isBoom: true,
  isServer: false,
  data: null,
  output: {
    statusCode: 400,
    payload: { statusCode: 400, error: 'Bad Request', message: 'Bad Request' },
    headers: {}
  },
  [Symbol(SavedObjectsClientErrorCode)]: 'SavedObjectsClient/badRequest'
}

Also noticed this in the logs, which looks bad:

MaxPerformanceEntryBufferExceededWarning: Possible perf_hooks memory leak detected. 1000001 mark entries added to the global performance entry buffer. Use performance.clearMarks to clear the buffer.
    at enqueue (node:internal/perf/observe:362:15)
    at EventTarget.mark (node:internal/perf/usertiming:95:3)
    at /usr/share/kibana/x-pack/plugins/task_manager/server/polling/task_poller.js:151:27
    at /usr/share/kibana/node_modules/lodash/lodash.js:10065:23
    at MergeMapSubscriber.project (/usr/share/kibana/x-pack/plugins/task_manager/server/polling/task_poller.js:92:5)

IIRC, we used to use some "perf hooks" in Node to get some performance numbers in task manager, but I don't think we're using those numbers much anymore - not sure. My deployment that was running 200 rules was getting these messages twice per boot, separated by ~3 hours. I'm guessing it would have printed another if it hadn't rebooted - maybe the third one pushed it over the edge?

@pmuellr
Member Author

pmuellr commented Nov 2, 2021

If this is a leak in perf_hooks somehow, that probably correlates with not seeing significant object memory increases in the heap dumps I compared over time, since those entries are likely stored in native memory.

@ymao1
Contributor

ymao1 commented Nov 2, 2021

Seeing the same thing :(

My 7.16 empty deployment does not have any "Booting at" messages but does have 2 of the MaxPerformanceEntryBufferExceededWarning over the last 4 days. My 7.15 deployments have neither of these messages in the logs.

@pgayvallet
Contributor

We've added perf_hooks usage in core in #101580, but that was merged for 7.14, so if it were the source of the leak, I would expect it to also surface in 7.15?

import type { IntervalHistogram as PerfIntervalHistogram } from 'perf_hooks';
import { monitorEventLoopDelay } from 'perf_hooks';
import type { IntervalHistogram } from '../types';
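
FWIW, monitorEventLoopDelay returns an IntervalHistogram and (as far as I can tell) doesn't push entries into the global performance entry buffer, so that core usage shouldn't be affected by the Node 16.7 buffering change. A quick standalone sketch of that API (the interval and resolution values here are arbitrary):

const { monitorEventLoopDelay } = require('perf_hooks');

// Sample event loop delay for 5 seconds, then print the p99 in milliseconds.
const histogram = monitorEventLoopDelay({ resolution: 20 });
histogram.enable();

setTimeout(() => {
  histogram.disable();
  // percentile() reports nanoseconds, so convert to milliseconds
  console.log(`event loop delay p99: ${(histogram.percentile(99) / 1e6).toFixed(2)}ms`);
}, 5000);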

@rudolf
Contributor

rudolf commented Nov 3, 2021

@rudolf
Contributor

rudolf commented Nov 3, 2021

Task manager should clear its marks / measures to avoid filling up the performance entry buffer: https://w3c.github.io/perf-timing-primer/#performance-mark
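
Something along these lines - the function and mark names below are made up (not the actual task poller code); the key part is calling clearMarks / clearMeasures each cycle so the global entry buffer can't grow without bound:

const { performance, PerformanceObserver } = require('perf_hooks');

// Log each measure as it is recorded.
const obs = new PerformanceObserver((list) => {
  for (const entry of list.getEntries()) {
    console.log(`${entry.name}: ${entry.duration.toFixed(1)}ms`);
  }
});
obs.observe({ entryTypes: ['measure'] });

function timedPollCycle(work) {
  performance.mark('pollStart');
  work();
  performance.mark('pollEnd');
  performance.measure('pollCycle', 'pollStart', 'pollEnd');

  // On Node 16.7+, marks and measures stay in the global entry buffer until
  // explicitly cleared, so clear them every cycle.
  performance.clearMarks();
  performance.clearMeasures();
}

timedPollCycle(() => { /* poll for work */ });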

@rudolf
Contributor

rudolf commented Nov 3, 2021

Node.js 16.7.0 (nodejs/node#39782) included nodejs/node#39297, which changed the behaviour of perf hooks:

This means the user should invoke Performance.clearMarks and Performance.clearMeasures to clear buffered entries to prevent from those entries been kept alive forever.

So I feel rather confident that this is the root cause, good thinking @pmuellr 🏅

@pmuellr
Member Author

pmuellr commented Nov 3, 2021

I did a little test last night and was able to create 12M marks in about 4s with a 1GB old space before an OOM. That seems like a big number, and I'm not sure how many marks task manager is adding, but that's all that VM was doing.

Looks like task manager is the only code actually using marks, so I think the other uses of perf_hooks are probably safe for now.

@ymao1 ymao1 changed the title [7.16.0 BC1] potential memory leak observed, aggravated by alerting load tests [7.16.0 BC2] Potential memory leak observed - possibly caused by perf_hooks usage Nov 3, 2021
@ymao1
Contributor

ymao1 commented Nov 3, 2021

Keeping this open so we can keep an eye on the BC3 build

@ymao1 ymao1 reopened this Nov 3, 2021
@ymao1
Contributor

ymao1 commented Nov 4, 2021

Spun up some BC3 deployments:

7.16-BC3 empty

7.16-BC3 rules

Will update with charts after letting it run for a day

@pmuellr
Member Author

pmuellr commented Nov 4, 2021

(OMG, this comment has been sitting in my browser for DAYS!!! Whoops! It's not really relevant anymore, but I figured I'd note how I figured out that marks were leaking.)

Here's a script that will count how many performance "marks" you can make before it OOMs w/1GB old space. First run is on node v16 (OOMs), second is on node v14 (doesn't OOM).

perf-mark.js
const { performance } = require('perf_hooks')

const MarksPerChunk = 1000 * 1000 // ONE MILLION MARKS
let marks = 0
setInterval(printMarks, 1000)
setImmediate(add_marks_endlessly)

// make it async-y, to give printMarks a chance to run
function add_marks_endlessly() {
  for (let i=0; i<MarksPerChunk; i++) {
    performance.mark('another')
    marks++
  }
  setImmediate(add_marks_endlessly)
}

function printMarks() {
  console.log(`${new Date().toISOString()} ${marks / MarksPerChunk}M marks`)
}
$ ~/.nvm/versions/node/v16.13.0/bin/node --max_old_space_size=1024 perf-mark.js
(node:34698) MaxPerformanceEntryBufferExceededWarning: Possible perf_hooks memory leak detected. 1000001 mark entries added to the global performance entry buffer. Use performance.clearMarks to clear the buffer.
(Use `node --trace-warnings ...` to show where the warning was created)
2021-11-03T03:04:40.648Z 3M marks
2021-11-03T03:04:41.933Z 6M marks
2021-11-03T03:04:43.493Z 9M marks
2021-11-03T03:04:45.055Z 11M marks
2021-11-03T03:04:48.517Z 12M marks

<--- Last few GCs --->    <<<=== SIGN OF THE OOM!!!
...
(process exits)
$ ~/.nvm/versions/node/v14.17.6/bin/node --max_old_space_size=1024 perf-mark.js
2021-11-04T20:41:58.826Z 1M marks
2021-11-04T20:42:00.043Z 2M marks
2021-11-04T20:42:01.231Z 3M marks
2021-11-04T20:42:02.412Z 4M marks
2021-11-04T20:42:03.614Z 5M marks
2021-11-04T20:42:04.846Z 6M marks
2021-11-04T20:42:06.047Z 7M marks
2021-11-04T20:42:07.254Z 8M marks
2021-11-04T20:42:08.499Z 9M marks
...doesn't OOM ...

@rudolf
Contributor

rudolf commented Nov 4, 2021

@ymao1 It doesn't look like you've enabled monitoring on the BC3 deployments; could you send the monitoring logs to https://ymao-monitoring.kb.europe-west1.gcp.cloud.es.io:9243/app/monitoring#/home?_g=(refreshInterval:(pause:!f,value:10000),time:(from:now-8d,to:now))

@ymao1
Contributor

ymao1 commented Nov 4, 2021

@rudolf Oops! Just enabled.

@ymao1
Contributor

ymao1 commented Nov 5, 2021

Looking good so far! Last 20 hours:

No MaxPerformanceEntryBufferExceededWarning messages in the logs. I do see "Booting at" messages, but they are from the instance reboots when I applied updated configs and turned on Logs & Metrics.

7.16-BC3 empty
Screen Shot 2021-11-05 at 7 59 40 AM

7.16-BC3 rules
Screen Shot 2021-11-05 at 7 59 54 AM

As reference:
7.16.0-BC2 empty
Screen Shot 2021-11-05 at 8 02 48 AM

7.16.0-BC2 with rules
Screen Shot 2021-11-05 at 8 03 02 AM

Will let it run over the weekend and update again after.

@ymao1
Contributor

ymao1 commented Nov 8, 2021

The last 80 hours:

No MaxPerformanceEntryBufferExceededWarning or Booting at messages in the logs.

7.16-BC3 empty
Screen Shot 2021-11-08 at 7 38 59 AM

7.16-BC3 rules
Screen Shot 2021-11-08 at 7 39 25 AM

I am calling the issue fixed! Thanks @pmuellr & @rudolf!

@ymao1 ymao1 closed this as completed Nov 8, 2021
@gmmorris gmmorris added the estimate:medium label Nov 15, 2021
@kobelb kobelb added the needs-team label Jan 31, 2022
@botelastic botelastic bot removed the needs-team label Jan 31, 2022