possible memory leak: config with hostmetrics, kubeletstats, prometheus receivers + transform/k8sattributes processors #36351
Labels
bug
discussion needed
processor/transform
waiting for author
Component(s)
processor/transform
What happened?
Description
Hello! My organization runs a Helm deployment of the OpenTelemetry Collector, and we are seeing what I would describe as a memory leak in one particular daemonset tasked with ingesting prometheus, kubelet, and host metrics from its node. We have worked around this issue by periodically restarting the workload.
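For reference, here is a minimal sketch of the kind of pipeline involved. This is an illustrative reconstruction, not our exact configuration: the receiver and processor names match what we use, but the scrape job, transform statement, and exporter endpoint are hypothetical placeholders.

```yaml
receivers:
  hostmetrics:
    collection_interval: 30s
    scrapers:
      cpu:
      memory:
      filesystem:
  kubeletstats:
    auth_type: serviceAccount
    endpoint: ${env:K8S_NODE_NAME}:10250
  prometheus:
    config:
      scrape_configs:
        - job_name: node-pods              # hypothetical scrape job
          kubernetes_sd_configs:
            - role: pod

processors:
  k8sattributes:
  transform:
    metric_statements:
      - context: datapoint
        statements:
          # hypothetical example statement
          - set(attributes["cluster"], "example")

exporters:
  otlp:
    endpoint: gateway.example.svc:4317     # hypothetical gateway address

service:
  pipelines:
    metrics:
      receivers: [hostmetrics, kubeletstats, prometheus]
      processors: [k8sattributes, transform]
      exporters: [otlp]
```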
The memory usage builds very gradually; it takes about two weeks to accumulate, at which point CPU usage maxes out in a constant loop of garbage collection and metrics are refused due to this contention.
On August 2nd, we split the configuration into two daemonsets to isolate log forwarding from the metrics ingestion that reaches this condition. The log forwarding configuration does not have this problem.
We observed this issue both before an upgrade from 0.92.0 to 0.107.0 and after a rollback to 0.92.0, confirming that the memory issue was unrelated to the upgrade.

I suspect, but do not know, that this issue comes from our use of the transform processor, which is why I labeled the component that way. The reason I suspect it is that we greatly expanded our usage of that processor around July 13th, and I believe the chart shows the memory issue rising to a problem level faster after that date.
Please see the chart below, going back to May 1st, for a visual of the memory usage of our opentelemetry workloads. The cluster-reciever is a singleton pod for k8s cluster metrics and some high-memory scrapes, logs-agent is the split logs configuration, and collector is a gateway; none of these have issues.

promql query for the chart seen below
Steps to Reproduce
We are able to reproduce this issue in lower environments; however, since it takes at least 14 days to show up, we cannot iterate very quickly. Please find the complete configuration for the metrics-agent daemonset below.

Details
I noticed that other memory leak issues usually require the reporter to post a heap pprof, so I added the pprof extension to our lower environments. Please find a heap dump from the oldest pod so instrumented (12 days old); unfortunately, it is not yet churning garbage collection, though it is getting close.
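For anyone reproducing this, enabling profiling looked roughly like the following sketch. The pprof extension serves the standard Go `net/http/pprof` endpoints, by default on localhost:1777; manifest details are omitted.

```yaml
extensions:
  pprof:
    endpoint: localhost:1777   # default pprof extension endpoint

service:
  extensions: [pprof]
```

Profiles were then captured from inside the pod with, for example, `curl -o heap.pb.gz http://localhost:1777/debug/pprof/heap`.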
Unfortunately, I'm running out of time to look at this issue, and I don't have enough Go experience to understand what I'm looking at in the heap dump. As a workaround, we have implemented an automatic restart on Mondays. I'm hoping you can help.
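The workaround is roughly the following, shown as an illustrative sketch: it assumes a CronJob whose service account is allowed to patch daemonsets, and the names and namespace are hypothetical.

```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: metrics-agent-weekly-restart   # hypothetical name
  namespace: otel                      # hypothetical namespace
spec:
  schedule: "0 6 * * 1"                # Mondays at 06:00
  jobTemplate:
    spec:
      template:
        spec:
          serviceAccountName: restart-sa   # needs patch rights on daemonsets
          restartPolicy: Never
          containers:
            - name: kubectl
              image: bitnami/kubectl
              command:
                - kubectl
                - rollout
                - restart
                - daemonset/metrics-agent
                - -n
                - otel
```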
Thank you so very much!
pprof.otelcol-contrib.samples.cpu.003.pb.gz
pprof.otelcol-contrib.alloc_objects.alloc_space.inuse_objects.inuse_space.012.pb.gz
Expected Result
Garbage collection fully reclaims memory from routine operations
Actual Result
Garbage collection doesn't seem to affect some part of overall memory consumption.
Collector version
v0.92.0
Environment information
Environment
OS: GKE / ContainerOS
Compiler (if manually compiled): using public docker image
OpenTelemetry Collector configuration
Log output
Additional context
Although the metrics-agent is configured to receive logs, metrics, and traces over OTLP, it does not do so in practice at this time. None of our services emit OTLP metrics to the metrics-agent, only to the gateway deployment, which does not have this issue; on the metrics-agent, the OTLP ports are not even exposed. It collects metric signals using hostmetrics, kubeletstats, and prometheus only.