[BUG] Memory Leak in Operator version 2.4.0 #700

Closed
ThomasRasmussen opened this issue Jan 17, 2024 · 18 comments · Fixed by #842
Labels: bug

Comments

@ThomasRasmussen commented Jan 17, 2024

What is the bug?

After upgrading to OpenSearch Operator 2.4.0 (from 2.3.2) we are experiencing ever-increasing memory usage and OOM kills of the operator pod. This appears to be exacerbated further when applying OpenSearch resources (OpensearchUser, OpensearchRole and OpensearchUserRoleBinding).

How can one reproduce the bug?

Deploy operator 2.4.0 and install an OpenSearch cluster. We are currently using OpenSearch version 2.7.0.

What is the expected behavior?

Memory consumption of the Operator pod should be fairly stable.

What is your host/environment?

  • Azure Kubernetes cluster running Kubernetes version 1.26.6
  • OpenSearch k8s Operator 2.4.0

Do you have any screenshots?

Screenshot of memory usage by the operator before and after the upgrade. The upgrade was performed on January 9th, and from that point memory consumption increased until the pod was OOM-killed.

[screenshot]

Do you have any additional context?

We have seen the same memory consumption pattern on 5 different Kubernetes clusters. Before the operator update, memory consumption was fairly stable; once the upgrade was performed it started to increase. The rate at which consumption increases seems to be related to the number of custom OpenSearch resources that have been deployed.

The same behavior appears to have been seen in #612, but no real solution was found or proposed there; we, however, see the memory leak consistently on all clusters.

Memory usage increases more rapidly if we deploy more resources like the following (replacing demo2 with demoXX):

apiVersion: opensearch.opster.io/v1
kind: OpensearchUser
metadata:
  name: demo2
  namespace: opensearch
spec:
  opensearchCluster:
    name: test-opensearch
  passwordFrom:
    name: opensearch-default-userpass
    key: password
  backendRoles:
    - opensearch-demo2-read
    - opensearch-demo2-write
---
apiVersion: opensearch.opster.io/v1
kind: OpensearchRole
metadata:
  name: opensearch-demo2-read
  namespace: opensearch
spec:
  opensearchCluster:
    name: test-opensearch
  clusterPermissions:
    - "cluster:monitor/nodes/info"
    - "cluster:monitor/health"
    - "indices:admin/template/get"
    - "indices:admin/exists"
  indexPermissions:
    - indexPatterns:
        - demo2*
      allowedActions:
        - read_access
---
apiVersion: opensearch.opster.io/v1
kind: OpensearchRole
metadata:
  name: opensearch-demo2-write
  namespace: opensearch
spec:
  opensearchCluster:
    name: test-opensearch
  clusterPermissions:
    - "cluster:monitor/nodes/info"
    - "cluster:monitor/health"
    - "indices:admin/template/get"
    - "indices:admin/exists"
    - "indices:admin/delete"
  indexPermissions:
    - indexPatterns:
        - demo2*
      allowedActions:
        - write_access
---        
apiVersion: opensearch.opster.io/v1
kind: OpensearchUserRoleBinding
metadata:
  name: demo2
  namespace: opensearch
spec:
  opensearchCluster:
    name: test-opensearch
  roles:
  - opensearch-demo2-read
  - opensearch-demo2-write
  users:
  - demo2
@ThomasRasmussen added the "bug" and "untriaged" labels on Jan 17, 2024
@ThomasRasmussen (Author)

I now have some more evidence that this may be related to the number of resources maintained by the operator. I have 12 users/roles/rolebindings defined, and when I remove all 12 resources, the memory usage changes. See the attached screenshot; the resources were removed at around 7:30.

[screenshot]

@kinseii commented Jan 30, 2024

Hi everybody,

We have two OpenSearch clusters, production and development, and we have a similar problem with the OpenSearch Operator controller restarting.

I just discovered that I'm not getting OpenSearch logs from the OpenSearch Operator controller. As soon as I have enough information for debugging, I will share it here.

@mrvdcsg commented Feb 29, 2024

We ran a 2.4 cluster for about 6 months before upgrading to 2.5.1. It looks like 2.5.1 introduced a memory leak that wasn't previously there. The only change was that we started using ISM policies with 2.5.1; otherwise the clusters remained the same: same users, same load, same content, etc.
NAME                                      READY   STATUS    RESTARTS           AGE
opensearch-operator-controller-manager    2/2     Running   2217 (3h42m ago)   15d

[screenshot]

@prudhvigodithi (Member)

[Triage]
Thanks for reporting this, @mrvdcsg @kinseii @ThomasRasmussen. I'm curious: does adding resources to the operator pod stabilize it, or does the memory leak continue?
Adding @swoehrl-mw @jochenkressin @pchmielnik @salyh @bbarani.
We should compare and test with the new changes pushed in v2.4.0: https://github.com/opensearch-project/opensearch-k8s-operator/releases/tag/v2.4.0

@prudhvigodithi removed the "untriaged" label on Mar 5, 2024
@ThomasRasmussen (Author)

@prudhvigodithi No. Adding more resources to the operator pod only delays the OOM killing. The resources are also deployed without any problems, so it's not the actual addition of OpenSearch CRs that causes the memory leak; rather, it indicates that every reconcile loop uses a small amount of memory that is never released.

@mrvdcsg commented Mar 7, 2024

> @prudhvigodithi No. Adding more resources to the operator pod only delays the OOM killing. The resources are also deployed without any problems, so it's not the actual addition of OpenSearch CRs that causes the memory leak; rather, it indicates that every reconcile loop uses a small amount of memory that is never released.

I will second this. Increasing resources delays the OOMKilled failure but doesn't resolve it. I also had the hypothesis that the reconcile loop itself was holding onto memory, as nothing else changed. I tested removing ISM policies, and that resulted in no change. I reverted multiple clusters to 2.4.0, which made the memory leak disappear.
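
As the analysis later in this thread bears out, one concrete way a reconcile loop can retain a little memory on every run is by constructing a throwaway HTTP client each time: with keep-alive enabled, every client pools connections that are then never reused. The sketch below is only a hypothetical illustration of that pattern, not the operator's actual code; the function name, address, and endpoint are assumptions.

```go
package main

import (
	"io"
	"net/http"
)

// reconcile is a hypothetical per-run function. It builds a brand-new HTTP
// client on every invocation (default transport, keep-alive enabled), so the
// connection left behind by each request is pooled inside a client that is
// immediately discarded and never reused. Repeated over many reconcile runs,
// those pooled connections and their buffers add up.
func reconcile(clusterURL string) error {
	client := &http.Client{} // new client per run: its connection pool is never reused
	resp, err := client.Get(clusterURL + "/_cluster/health")
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	_, err = io.Copy(io.Discard, resp.Body) // drain the body so the connection is returned to the pool
	// Possible mitigations: call client.CloseIdleConnections() before returning,
	// reuse a single shared client, or disable keep-alive as the eventual fix does.
	return err
}

func main() {
	_ = reconcile("http://localhost:9200") // illustrative address only
}
```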

@mrvdcsg commented Mar 13, 2024

After reverting to 2.4.0 (from 2.5.1), the memory leak issue completely goes away on all deployments. Unfortunately, we did the upgrade in order to use ISM templates, which we will now have to handle an alternative way.
[screenshot]

@prudhvigodithi (Member) commented Mar 25, 2024

Adding @swoehrl-mw @salyh @jochenkressin @pchmielnik for more insights. Is it the same case when you use the latest version of the operator?

@mrvdcsg commented Mar 26, 2024

> Adding @swoehrl-mw @salyh @jochenkressin @pchmielnik for more insights. Is it the same case when you use the latest version of the operator?

The memory leak is only present in the 2.5.1 operator. I do not experience the issue with 2.4.0.

swoehrl-mw added a commit that referenced this issue May 15, 2024
### Description
There have been reports of memory leaks in the operator (#700). This PR adds an option to enable the [go pprof](https://pkg.go.dev/net/http/pprof) endpoints. With them, users can get heap and allocation profiles that hopefully help in tracking down the leak.

### Issues Resolved
Fixes #626 

### Check List
- [x] Commits are signed per the DCO using --signoff 
- [-] Unittest added for the new/changed functionality and all unit
tests are successful
- [x] Customer-visible features documented
- [x] No linter warnings (`make lint`)

If CRDs are changed:
- [-] CRD YAMLs updated (`make manifests`) and also copied into the helm
chart
- [-] Changes to CRDs documented

By submitting this pull request, I confirm that my contribution is made
under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and
signing off your commits, please check
[here](https://github.com/opensearch-project/OpenSearch/blob/main/CONTRIBUTING.md#developer-certificate-of-origin).

Signed-off-by: Sebastian Woehrl <[email protected]>
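
For reference, enabling the standard net/http/pprof endpoints in a Go service generally looks like the sketch below. This is a generic illustration, not necessarily how the operator wires up the new option; the listen address is an assumption.

```go
package main

import (
	"log"
	"net/http"
	_ "net/http/pprof" // registers the /debug/pprof/* handlers on the default mux
)

func main() {
	// Serve the pprof endpoints on a separate, local-only port.
	// The address is an illustrative assumption, not the operator's default.
	go func() {
		log.Println(http.ListenAndServe("localhost:6060", nil))
	}()

	// ... the rest of the application runs as usual ...

	// With the endpoints enabled, heap and allocation profiles can be fetched
	// with the standard tooling, for example:
	//   go tool pprof http://localhost:6060/debug/pprof/heap
	//   go tool pprof http://localhost:6060/debug/pprof/allocs
	select {} // block forever in this sketch
}
```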
@swoehrl-mw (Collaborator)

Hi all. I did some analysis and testing over the last few days, and I think I have narrowed the memory leak down to HTTP client connections that are kept open. I tested a fix over the weekend and could no longer see the memory leak. The fix is pushed as PR #842.
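
For illustration, turning off connection reuse on a Go HTTP client generally looks like the minimal sketch below; the function name and timeout value are assumptions, not the exact code from PR #842.

```go
package main

import (
	"fmt"
	"net/http"
	"time"
)

// newNoKeepAliveClient builds an HTTP client whose transport does not keep
// idle connections around for reuse. If a fresh client is created on every
// reconcile run, pooled keep-alive connections are never reused and only
// accumulate; disabling keep-alive avoids that build-up.
func newNoKeepAliveClient() *http.Client {
	return &http.Client{
		Transport: &http.Transport{
			DisableKeepAlives: true, // close each connection after its request
		},
		Timeout: 30 * time.Second, // illustrative value
	}
}

func main() {
	client := newNoKeepAliveClient()
	fmt.Printf("client ready with transport %T\n", client.Transport)
}
```

Reusing a single shared client across reconcile runs would be another way to avoid the accumulation; disabling keep-alive is the smaller change when a new client is created per run.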

swoehrl-mw added a commit that referenced this issue Jun 13, 2024
### Description
The operator pod is suffering from memory leaks. After some analysis I
think I have narrowed it down to connections for the http client being
kept for reuse but never being used due to a new client being created in
every reconcile run.
This PR disables the connection keepalive/reuse and (at least in my
experiments) prevents the memory leak.

### Issues Resolved
Fixes #700 

### Check List
- [x] Commits are signed per the DCO using --signoff 
- [-] Unittest added for the new/changed functionality and all unit
tests are successful
- [-] Customer-visible features documented
- [x] No linter warnings (`make lint`)

If CRDs are changed:
- [-] CRD YAMLs updated (`make manifests`) and also copied into the helm
chart
- [-] Changes to CRDs documented

By submitting this pull request, I confirm that my contribution is made
under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and
signing off your commits, please check
[here](https://github.com/opensearch-project/OpenSearch/blob/main/CONTRIBUTING.md#developer-certificate-of-origin).

Signed-off-by: Sebastian Woehrl <[email protected]>
@mrvdcsg commented Jun 13, 2024

@swoehrl-mw I did a helm build from main and a fresh local deploy, and memory usage went from 30MB right after launch to 260MB in 4 hours on an empty test cluster with no activity. Observations of all of our operator v2.4 clusters show garbage collection keeping memory between 90MB and 100MB on heavily utilized clusters (we have many instances). Thank you for your work closing HTTP client connections, but I'm still seeing indications of a leak. Are there additional steps you took to test this, other than building the helm chart locally from main and installing from that chart?

The steps I followed are below:

cd charts/opensearch-operator
helm package .
helm install opensearch-operator opensearch-operator-2.6.0.tgz 

@mrvdcsg commented Jun 13, 2024

I noticed something in opensearch-operator/pkg/reconcilers/configuration.go: it appears to create a new ConfigMap on every iteration of the reconciler, but I couldn't find any management or cleanup of old ConfigMap artifacts. Could this be a root cause of the leak?

@swoehrl-mw (Collaborator)

@mrvdcsg Did you also build your own operator Docker image from main? The fix is in the Go source code, so a new helm chart build will not solve anything by itself.
If you don't want to build your own image, you can use mine (ghcr.io/maibornwolff/opensearch-operator:2.6.1-rc.1, which is 2.6.0 plus some bug fixes, including the leak fix) with the 2.6.0 helm chart, overriding the manager image.

@mrvdcsg commented Jun 14, 2024

@swoehrl-mw Thank you for your response. I didn't build a fresh image, so thank you for providing yours. I'll spin this up and test it on our end. Thank you for all your hard work on this!

@mrvdcsg commented Jun 14, 2024

Thank you for all your help. The memory leak appears to be gone!
[screenshot]

@prudhvigodithi (Member)

Great. @swoehrl-mw, should we go with a v2.6.1 release?

@swoehrl-mw (Collaborator)

@prudhvigodithi

> should we go with a v2.6.1 release?

Yes, we should get a bugfix release out soon. But since the main branch already has commits with new features, the 2.6.1 release needs to be cut from a cherry-picked branch so that it only contains bug fixes.

@prudhvigodithi (Member)

Thanks @swoehrl-mw. I have created an issue for Release 2.6.1 (#845); we can move the discussion there. Can you please help with the cherry-pick and identify the right commits that should be part of the release, or push a v2.6.1 tag so the automation does the release?
Thanks
@salyh @getsaurabh02

swoehrl-mw added a commit that referenced this issue Jun 18, 2024 (cherry picked from commit 5a2d2ae)

swoehrl-mw added a commit that referenced this issue Jun 18, 2024 (cherry picked from commit 56c9c8f)

swoehrl-mw added a commit to MaibornWolff/opensearch-operator that referenced this issue Jul 2, 2024 (cherry picked from commit 5a2d2ae)

swoehrl-mw added a commit to MaibornWolff/opensearch-operator that referenced this issue Jul 2, 2024 (cherry picked from commit 56c9c8f)