[7.16.0 BC2] Potential memory leak observed - possibly caused by perf_hooks usage #116636
Pinging @elastic/kibana-core (Team:Core)
Pinging @elastic/kibana-alerting-services (Team:Alerting Services)
An update on the first graph at the top: the curve is starting to trend down! So maybe this isn't a leak - maybe node v16 used in 7.16 (v14 in 7.15) is "smarter" about using memory - basically using as much as it possibly can, instead of trying to GC stuff so often. I was thinking that curve would go up and it would eventually OOM. I tried running 7.16 locally, in dev mode (quickest way for me to get started), with the same rule load test, and let it run overnight. Took a heap snapshot once the rules started running, and then again this morning. Nothing substantial in the comparison view! Which I first thought was maybe because dev mode was doing something different, or maybe the graphs we're seeing aren't what I'm used to seeing (for instance, these aren't RSS or heap sizes, but some container-based measurement of the memory used). So ... perhaps nothing to see here. I'll poke around the node / v8 docs to see if there was a change to the memory management in v16.
Once we've enabled monitoring we'll be better able to tell how much is heap vs RSS memory.
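For a quick local check of heap vs RSS, here is a minimal sketch using Node's built-in `process.memoryUsage()` (this is not how the cloud charts above are produced, just a way to eyeball the two numbers side by side):

```js
// Log RSS vs V8 heap every 30 seconds; values are in bytes, converted to MB here.
const MB = 1024 * 1024;

setInterval(() => {
  const { rss, heapTotal, heapUsed, external } = process.memoryUsage();
  console.log(
    `rss=${(rss / MB).toFixed(1)}MB ` +
      `heapTotal=${(heapTotal / MB).toFixed(1)}MB ` +
      `heapUsed=${(heapUsed / MB).toFixed(1)}MB ` +
      `external=${(external / MB).toFixed(1)}MB`
  );
}, 30_000);
```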
For an apples-to-apples comparison, I created 4 clusters on GCP in the same region with the same configuration: 2 of them are on 7.15.1 and 2 are on 7.16.0 BC2. I loaded 150 rules running at a 1s interval on a 7.15.1 deployment and a 7.16.0-BC2 deployment and let them run overnight. Here are the memory charts after 16 hours. To summarize what I'm seeing: On 7.16.0-BC2, memory starts at 45% and has increased to 49% on an empty Kibana after 16 hours. With 150 rules running at a high interval, memory increases to around 60% utilization (a 15% increase) and steadily increases until ~83% utilization, where it looks like some sort of garbage collection kicks in and drops it back down to 60% again. This took around 13 hours.
I've enabled monitoring on the 2 7.16.0 deployments. Monitoring cluster here: https://ymao-monitoring.kb.europe-west1.gcp.cloud.es.io:9243/
I've created the following visualization to show heap used vs RSS: https://ymao-monitoring.kb.europe-west1.gcp.cloud.es.io:9243/app/lens#/edit/c69320d0-38c0-11ec-ad24-a7cfc3f6acda?_g=(filters:!(),refreshInterval:(pause:!f,value:10000),time:(from:now-1d,to:now))
Latest memory charts for the 7.16.0 BC2 deployments (over the last 80 hours)
I'm seeing similar graphs. None of the deployments I created last week to test this have OOM'd (BC1 and BC2, w/rules and w/o). The ones with rules have the sawtooth pattern, never hitting 100%. I suspect the ones without rules will have a sawtooth pattern looking over much longer ranges of time. So ... not looking like a leak to me at this point - a change in GC behavior in v8/node? Delaying big GCs for longer?
I agree that it is likely a change in behavior with Node 16 / v8. I ran some local deployments and used Node's memory usage reporting to compare. For 7.15.1:
For 7.16.0 commit
For 7.16.0 commit
For 7.16.0 BC2
Not the most scientific of tests, but it definitely shows me the difference in how memory is used before vs. after the Node 16 upgrade.
@rudolf @elastic/kibana-core Would you agree that this doesn't seem to be a memory leak, but rather a change in how memory is handled by Node after the v16 upgrade? Is there anything further that alerting can do with regard to this investigation?
Same thoughts as Ying. Alerting provides a good way to force more memory usage, and we could probably tweak it to eat memory faster if we want to repro. But I assume pounding on Kibana with an HTTP load tester would be just as easy. Rudolf mentioned in conversation that this could be an issue for reporting, which launches Chromium instances. If a report is requested when node has more memory allocated (one of the peaks in the graph), there may not be enough memory to launch Chromium, compared to other times when there would be. I'd guess we may want to see if there are some options to at least stabilize the amount of memory used - probably an option that would run GC more often (if that's what's going on here). It also seems like a signal that we probably don't want to be running multiple "apps" in the same container at the same time.
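As an aside, a minimal sketch of the kind of GC knobs alluded to here (these are standard Node/V8 flags; the file names are hypothetical and this is not something Kibana is known to configure):

```js
// Capping old space makes V8 collect earlier instead of growing toward the
// container limit, e.g.:  node --max-old-space-size=800 server.js
// For local experiments only, --expose-gc makes global.gc() available:
//   node --expose-gc gc-check.js
if (typeof global.gc === 'function') {
  const before = process.memoryUsage().heapUsed;
  global.gc(); // force a full collection
  const after = process.memoryUsage().heapUsed;
  console.log(`heapUsed: ${before} -> ${after} bytes after forced GC`);
} else {
  console.log('run with --expose-gc to enable global.gc()');
}
```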
Let's see what @rudolf thinks about it, but personally I agree that it seems like it's just a change in node's memory management / garbage collection. Charts from #116636 (comment) and #116636 (comment) show that the memory goes back to a constant ceiling after every GC on 7.16, which is imho proof enough that the memory management is sane and we don't have a leak here.
There is a slight increase when measuring directly after the GC, though because it only happens once a day this could be just normal variation. So I agree it doesn't look like a memory leak. Let's keep @ymao1's cluster running for maybe another week just to be 100% sure.
OK, some new news. Happened to be poking at one of my clusters, and realized it's been rebooting every 10 hours! I think I was expecting the graphs or something in the cloud UX to tell me about this, but never noticed anything, and didn't happen to be looking at the right log messages. This time I specifically searched for "Booting at" which is one of the first messages logged before all of the usual startup messages (this is on cloud). The times seem to line up with the peaks in my memory graphs. So it seems like it's OOM'ing. This was for my 7.16.0 BC2 deployment that did have 200 rules running, but I actually deleted them yesterday. My 7.16.0 BC2 "do nothing" deployment - created the deployment and haven't done anything with it since creation - rebooted yesterday. It lasted ~4 days. It's a 1GB Kibana deploy. From that deploy, I'm seeing some repeated things in the logs:
visualization telemetry messages, every hour
Also noticed this in the logs, which looks bad:
IIRC, we used to use some "perf hooks" in Node to get some performance numbers in task manager, but don't think we're using those numbers much anymore, not sure. My deployment that was running 200 rules was getting these messages twice per boot, separated by ~3 hours. Guessing it would have printed another if it hadn't rebooted - maybe the third one pushed it over the edge?
If this is a leak in perf_hooks somehow, that probably correlates with not seeing significant object memory increases in the heap dumps I compared over time, because they're likely stored in native memory.
Seeing the same thing :( My 7.16 empty deployment does not have any "Booting at" messages but does have 2 of the same log messages.
We've added kibana/src/core/server/metrics/event_loop_delays/event_loop_delays_monitor.ts (lines 9 to 11 in 7d66002).
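The embedded snippet isn't reproduced above. For context, here is a minimal sketch of the Node perf_hooks API that monitor presumably wraps, `monitorEventLoopDelay` (this is not the actual Kibana implementation):

```js
const { monitorEventLoopDelay } = require('perf_hooks');

// Sample event loop delay with a 10ms timer resolution.
const histogram = monitorEventLoopDelay({ resolution: 10 });
histogram.enable();

setInterval(() => {
  // Histogram values are reported in nanoseconds.
  console.log({
    mean_ms: histogram.mean / 1e6,
    p95_ms: histogram.percentile(95) / 1e6,
    max_ms: histogram.max / 1e6,
  });
  histogram.reset();
}, 5_000);
```

This API uses its own histogram rather than performance marks, so it shouldn't be affected by the mark buffer behaviour discussed below.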
I think @pmuellr is suspecting the task manager usage, which has been there since 7.6.
Task manager should clear marks / measurements to avoid using up the performance buffer: https://w3c.github.io/perf-timing-primer/#performance-mark
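A minimal sketch of that idea (illustrative only, not the actual task manager code; `recordTaskDuration` is a hypothetical sink): clear marks and measures once the duration has been read, so entries don't pile up in the global performance buffer.

```js
const { performance, PerformanceObserver } = require('perf_hooks');

const observer = new PerformanceObserver((list) => {
  for (const entry of list.getEntries()) {
    recordTaskDuration(entry.name, entry.duration);
  }
  // On Node >= 16.7, marks/measures stay on the performance timeline until
  // cleared, so clean up after reading them.
  performance.clearMarks();
  performance.clearMeasures();
});
observer.observe({ entryTypes: ['measure'] });

function runTask(id) {
  performance.mark(`task-${id}-start`);
  // ... do the actual work ...
  performance.mark(`task-${id}-end`);
  performance.measure(`task-${id}`, `task-${id}-start`, `task-${id}-end`);
}

function recordTaskDuration(name, durationMs) {
  console.log(`${name}: ${durationMs.toFixed(1)}ms`);
}

runTask(1);
```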
Node.js 16.7.0 (nodejs/node#39782) included nodejs/node#39297, which changed the behaviour of perf hooks.
So I feel rather confident that this is the root cause, good thinking @pmuellr 🏅
I did a little test last night; I was able to create 12M marks in about 4s in a 1GB old space before an OOM. Seems like a big number, and not sure how many marks task manager is adding, but that's all that VM was doing. Looks like task manager is the only code actually using marks, so I think the other uses of perf_hooks are likely fine.
Keeping this open so we can keep an eye on the BC3 build.
Spun up some BC3 deployments. Will update with charts after letting them run for a day.
(OMG, this comment has been sitting in my browser for DAYS!!! Whoops! So not really relevant anymore, but figured I'd note how I figured out marks were leaking.) Here's a script that will count how many performance "marks" you can make before it OOMs with a 1GB old space. First run is on node v16 (OOMs), second is on node v14 (doesn't OOM).
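The script itself didn't make it into the comment. A rough reconstruction of the idea (not the original script) might look like this; run it with `node --max-old-space-size=1024`:

```js
// Keep creating performance marks and log the count as memory grows.
// On Node >= 16.7 the marks accumulate on the performance timeline and the
// process eventually OOMs; on Node 14 it keeps running.
const { performance } = require('perf_hooks');

let count = 0;
const start = Date.now();

function makeMarks() {
  for (let i = 0; i < 100_000; i++) {
    performance.mark(`mark-${count++}`);
  }
  const heapUsedMB = Math.round(process.memoryUsage().heapUsed / 1024 / 1024);
  console.log(`${Date.now() - start}ms: ${count} marks, heapUsed ${heapUsedMB}MB`);
  setImmediate(makeMarks); // yield to the event loop between batches
}

makeMarks();
```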
@ymao1 It doesn't look like you've enabled monitoring on the BC3 deployments; could you send the monitoring logs to https://ymao-monitoring.kb.europe-west1.gcp.cloud.es.io:9243/app/monitoring#/home?_g=(refreshInterval:(pause:!f,value:10000),time:(from:now-8d,to:now))
@rudolf Oops! Just enabled.
Looking good so far! Last 20 hours: no warning messages in the logs. (Charts attached for reference.) Will let it run over the weekend and update again after.
The last 80 hours: still no warning messages.
I built a 7.16.0 BC1 deployment in our cloud staging env the other day, and loaded it up with my usual alerting load test - 100 index threshold rules running with a 1s interval - the rule is doing very little i/o itself, this is a "framework" test to make sure the framework is holding up. Deployment name is pmuellr 7.16.0 BC1
I noticed some memory growth over time:
I've restarted Kibana twice, to tweak the task manager config to get it to run rules more frequently, in hopes of accelerating the memory growth - expecting an OOM this weekend.
To check to see if this was alerting, or maybe a general Kibana issue, I built another deployment named pmuellr 7.16.0 BC1 do nothing. It is literally doing nothing :-) It looks like it is also slowly leaking, but maybe too soon to tell:
For reference, here is the same setup but for 7.15.1, which I've been doing other testing with, so it has a little more variability over the last few days, but it obviously does not look like the memory metrics are increasing like the other two:
So my current guess is that this may be a leak in some core Kibana service, and with alerting driving more service calls, it's aggravating the memory leak. For example, in play here with alerting are Saved Objects, es client library, the new execution context (async hook) stuff, etc. Also Task Manager.