[BUG] Memory Leak in Operator version 2.4.0 #700

Closed
ThomasRasmussen opened this issue Jan 17, 2024 · 18 comments · Fixed by #842
Labels: bug

Comments

@ThomasRasmussen commented Jan 17, 2024

What is the bug?

After upgrading to OpenSearch Operator 2.4.0 (from 2.3.2) we are experiencing ever-increasing memory usage and OOM kills of the operator pod. This appears to be exacerbated further when applying OpenSearch resources (OpensearchUser, OpensearchRole and OpensearchUserRoleBinding).

How can one reproduce the bug?

Deploy operator 2.4.0 and install an OpenSearch cluster. We are currently using OpenSearch version 2.7.0.

What is the expected behavior?

Memory consumption of the Operator pod should be fairly stable.

What is your host/environment?

  • Azure Kubernetes cluster running Kubernetes version 1.26.6
  • OpenSearch k8s Operator 2.4.0

Do you have any screenshots?

Screenshot of memory usage by the operator before and after the upgrade. The upgrade was performed on January 9th, and from that point memory consumption increased until the pod was OOM-killed.

[screenshot]

Do you have any additional context?

We have seen the same memory consumption pattern on 5 different Kubernetes clusters. Before the operator update, memory consumption was fairly stable; once the upgrade was performed it started to increase. The rate at which consumption increases seems to be related to the number of custom OpenSearch resources that have been deployed.

The same behavior appears to have been seen in #612, but no real solution was found or proposed there; we, however, see the memory leak consistently on all clusters.

Memory usage increases more rapidly if we deploy more resources like the following (replacing demo2 with demoXX):

apiVersion: opensearch.opster.io/v1
kind: OpensearchUser
metadata:
  name: demo2
  namespace: opensearch
spec:
  opensearchCluster:
    name: test-opensearch
  passwordFrom:
    name: opensearch-default-userpass
    key: password
  backendRoles:
    - opensearch-demo2-read
    - opensearch-demo2-write
---
apiVersion: opensearch.opster.io/v1
kind: OpensearchRole
metadata:
  name: opensearch-demo2-read
  namespace: opensearch
spec:
  opensearchCluster:
    name: test-opensearch
  clusterPermissions:
    - "cluster:monitor/nodes/info"
    - "cluster:monitor/health"
    - "indices:admin/template/get"
    - "indices:admin/exists"
  indexPermissions:
    - indexPatterns:
        - demo2*
      allowedActions:
        - read_access
---
apiVersion: opensearch.opster.io/v1
kind: OpensearchRole
metadata:
  name: opensearch-demo2-write
  namespace: opensearch
spec:
  opensearchCluster:
    name: test-opensearch
  clusterPermissions:
    - "cluster:monitor/nodes/info"
    - "cluster:monitor/health"
    - "indices:admin/template/get"
    - "indices:admin/exists"
    - "indices:admin/delete"
  indexPermissions:
    - indexPatterns:
        - demo2*
      allowedActions:
        - write_access
---        
apiVersion: opensearch.opster.io/v1
kind: OpensearchUserRoleBinding
metadata:
  name: demo2
  namespace: opensearch
spec:
  opensearchCluster:
    name: test-opensearch
  roles:
  - opensearch-demo2-read
  - opensearch-demo2-write
  users:
  - demo2
@ThomasRasmussen added the "bug" and "untriaged" labels on Jan 17, 2024
@ThomasRasmussen (Author)

I now have some more evidence that this may be related to the number of resources maintained by the operator. I have 12 users/roles/rolebindings defined, and when I remove all 12 resources, the memory usage changes. See the attached screenshot; the resources were removed at around 7:30.

[screenshot]

@kinseii commented Jan 30, 2024

Hi everybody,

We have two OpenSearch clusters, production and development, and we have a similar problem with the OpenSearch Operator controller restarting.

I just discovered that I'm not getting OpenSearch logs from the OpenSearch Operator controller. As soon as I have enough information for debugging, I will share it here.

@mrvdcsg commented Feb 29, 2024

We ran a 2.4 cluster for about 6 months before upgrading to 2.5.1. It looks like 2.5.1 introduced a memory leak that wasn't previously there. The only change was that we started using ISM policies with 2.5.1; otherwise the clusters remained the same: same users, same load, same content, etc.
NAME                                      READY   STATUS    RESTARTS           AGE
opensearch-operator-controller-manager    2/2     Running   2217 (3h42m ago)   15d

[screenshot]

@prudhvigodithi (Member)

[Triage]
Thanks for reporting this, @mrvdcsg @kinseii @ThomasRasmussen. I'm curious: does adding resources to the operator pod stabilize it, or does the memory leak continue?
Adding @swoehrl-mw @jochenkressin @pchmielnik @salyh @bbarani.
We should compare and test with the new changes pushed in v2.4.0: https://github.com/opensearch-project/opensearch-k8s-operator/releases/tag/v2.4.0

@prudhvigodithi removed the "untriaged" label on Mar 5, 2024
@ThomasRasmussen (Author)

@prudhvigodithi No. Adding more resources to the operator pod only delays the OOM killing. The resources are also deployed without any problems, so it's not the actual addition of OpenSearch CRs that causes the memory leak; rather, it indicates that every reconcile loop uses a small amount of memory that is never released.

@mrvdcsg commented Mar 7, 2024

> @prudhvigodithi No. Adding more resources to the operator pod only delays the OOM killing. The resources are also deployed without any problems, so it's not the actual addition of OpenSearch CRs that causes the memory leak; rather, it indicates that every reconcile loop uses a small amount of memory that is never released.

I will second this. Increasing resources delays the OOMKilled failure but doesn't resolve it. I also had the hypothesis that the reconcile loop itself was holding onto memory, as nothing else changed. I tested removing ISM policies, and that resulted in no change. I reverted multiple clusters to 2.4.0, which made the memory leak disappear.
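
As the analysis later in this thread bears out, one concrete way a reconcile loop can retain a little memory on every run is by constructing a throwaway HTTP client each time: with keep-alive enabled, every client pools connections that are then never reused. The sketch below is only a hypothetical illustration of that pattern, not the operator's actual code; the function name, address, and endpoint are assumptions.

```go
package main

import (
	"io"
	"net/http"
)

// reconcile is a hypothetical per-run function. It builds a brand-new HTTP
// client on every invocation (default transport, keep-alive enabled), so the
// connection left behind by each request is pooled inside a client that is
// immediately discarded and never reused. Repeated over many reconcile runs,
// those pooled connections and their buffers add up.
func reconcile(clusterURL string) error {
	client := &http.Client{} // new client per run: its connection pool is never reused
	resp, err := client.Get(clusterURL + "/_cluster/health")
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	_, err = io.Copy(io.Discard, resp.Body) // drain the body so the connection is returned to the pool
	// Possible mitigations: call client.CloseIdleConnections() before returning,
	// reuse a single shared client, or disable keep-alive as the eventual fix does.
	return err
}

func main() {
	_ = reconcile("http://localhost:9200") // illustrative address only
}
```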

@mrvdcsg commented Mar 13, 2024

After reverting to 2.4.0 (from 2.5.1), the memory leak issue completely goes away on all deployments. Unfortunately, we did the upgrade in order to use ISM templates, which we will now have to handle an alternative way.
[screenshot]

@prudhvigodithi (Member) commented Mar 25, 2024

Adding @swoehrl-mw @salyh @jochenkressin @pchmielnik for more insights. Is it the same case when you use the latest version of the operator?

@mrvdcsg commented Mar 26, 2024

> Adding @swoehrl-mw @salyh @jochenkressin @pchmielnik for more insights. Is it the same case when you use the latest version of the operator?

The memory leak is only present in the 2.5.1 operator. I do not experience the issue with 2.4.0.

swoehrl-mw added a commit that referenced this issue May 15, 2024
### Description
There have been reports of memory leaks in the operator (#700). This PR adds an option to enable the [go pprof](https://pkg.go.dev/net/http/pprof) endpoints. With them, users can get heap and allocation profiles that hopefully help in tracking down the leak.

### Issues Resolved
Fixes #626 

### Check List
- [x] Commits are signed per the DCO using --signoff 
- [-] Unittest added for the new/changed functionality and all unit
tests are successful
- [x] Customer-visible features documented
- [x] No linter warnings (`make lint`)

If CRDs are changed:
- [-] CRD YAMLs updated (`make manifests`) and also copied into the helm
chart
- [-] Changes to CRDs documented

By submitting this pull request, I confirm that my contribution is made
under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and
signing off your commits, please check
[here](https://github.com/opensearch-project/OpenSearch/blob/main/CONTRIBUTING.md#developer-certificate-of-origin).

Signed-off-by: Sebastian Woehrl <[email protected]>
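
For reference, enabling the standard net/http/pprof endpoints in a Go service generally looks like the sketch below. This is a generic illustration, not necessarily how the operator wires up the new option; the listen address is an assumption.

```go
package main

import (
	"log"
	"net/http"
	_ "net/http/pprof" // registers the /debug/pprof/* handlers on the default mux
)

func main() {
	// Serve the pprof endpoints on a separate, local-only port.
	// The address is an illustrative assumption, not the operator's default.
	go func() {
		log.Println(http.ListenAndServe("localhost:6060", nil))
	}()

	// ... the rest of the application runs as usual ...

	// With the endpoints enabled, heap and allocation profiles can be fetched
	// with the standard tooling, for example:
	//   go tool pprof http://localhost:6060/debug/pprof/heap
	//   go tool pprof http://localhost:6060/debug/pprof/allocs
	select {} // block forever in this sketch
}
```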
@swoehrl-mw (Collaborator)

Hi all. I did some analysis and testing over the last few days, and I think I have narrowed the memory leak down to HTTP client connections that are kept open. I tested a fix over the weekend and could no longer see the memory leak. The fix is pushed as PR #842.
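
For illustration, turning off connection reuse on a Go HTTP client generally looks like the minimal sketch below; the function name and timeout value are assumptions, not the exact code from PR #842.

```go
package main

import (
	"fmt"
	"net/http"
	"time"
)

// newNoKeepAliveClient builds an HTTP client whose transport does not keep
// idle connections around for reuse. If a fresh client is created on every
// reconcile run, pooled keep-alive connections are never reused and only
// accumulate; disabling keep-alive avoids that build-up.
func newNoKeepAliveClient() *http.Client {
	return &http.Client{
		Transport: &http.Transport{
			DisableKeepAlives: true, // close each connection after its request
		},
		Timeout: 30 * time.Second, // illustrative value
	}
}

func main() {
	client := newNoKeepAliveClient()
	fmt.Printf("client ready with transport %T\n", client.Transport)
}
```

Reusing a single shared client across reconcile runs would be another way to avoid the accumulation; disabling keep-alive is the smaller change when a new client is created per run.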

swoehrl-mw added a commit that referenced this issue Jun 13, 2024
### Description
The operator pod is suffering from memory leaks. After some analysis I
think I have narrowed it down to connections for the http client being
kept for reuse but never being used due to a new client being created in
every reconcile run.
This PR disables the connection keepalive/reuse and (at least in my
experiments) prevents the memory leak.

### Issues Resolved
Fixes #700 

### Check List
- [x] Commits are signed per the DCO using --signoff 
- [-] Unittest added for the new/changed functionality and all unit
tests are successful
- [-] Customer-visible features documented
- [x] No linter warnings (`make lint`)

If CRDs are changed:
- [-] CRD YAMLs updated (`make manifests`) and also copied into the helm
chart
- [-] Changes to CRDs documented

By submitting this pull request, I confirm that my contribution is made
under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and
signing off your commits, please check
[here](https://github.com/opensearch-project/OpenSearch/blob/main/CONTRIBUTING.md#developer-certificate-of-origin).

Signed-off-by: Sebastian Woehrl <[email protected]>
@mrvdcsg commented Jun 13, 2024

@swoehrl-mw I did a helm build from main and a fresh local deploy, and memory usage went from 30MB right after launch to 260MB in 4 hours on an empty test cluster with no activity. Observations of all of our operator v2.4 clusters show garbage collection keeping memory between 90MB and 100MB on heavily utilized clusters (we have many instances). Thank you for your work closing HTTP client connections, but I'm still seeing indications of a leak. Are there additional steps you took to test this, other than building the helm chart locally from main and installing from that chart?

The steps I followed are below:

cd charts/opensearch-operator
helm package .
helm install opensearch-operator opensearch-operator-2.6.0.tgz 

@mrvdcsg commented Jun 13, 2024

I noticed something in opensearch-operator/pkg/reconcilers/configuration.go: it appears to create a new ConfigMap on every iteration of the reconciler, but I couldn't find any management or cleanup of old ConfigMap artifacts. Could this be a root cause of the leak?

@swoehrl-mw (Collaborator)

@mrvdcsg Did you also build your own operator Docker image from main? The fix is in the Go source code, so a new helm chart build will not solve anything by itself.
If you don't want to build your own image, you can use mine (ghcr.io/maibornwolff/opensearch-operator:2.6.1-rc.1, which is 2.6.0 plus some bug fixes, including the leak fix) with the 2.6.0 helm chart, overriding the manager image.

@mrvdcsg commented Jun 14, 2024

@swoehrl-mw Thank you for your response. I didn't build a fresh image, so thank you for providing yours. I'll spin this up and test it on our end. Thank you for all your hard work on this!

@mrvdcsg commented Jun 14, 2024

Thank you for all your help. The memory leak appears to be gone!
[screenshot]

@prudhvigodithi (Member)

Great. @swoehrl-mw, should we go with a v2.6.1 release?

@swoehrl-mw (Collaborator)

@prudhvigodithi

> should we go with a v2.6.1 release?

Yes, we should get a bugfix release out soon. But since the main branch already has commits with new features, the 2.6.1 release needs to be cut from a cherry-picked branch so that it only contains bug fixes.

@prudhvigodithi (Member)

Thanks @swoehrl-mw. I have created an issue for Release 2.6.1 (#845); we can move the discussion there. Can you please help with the cherry-pick and identify the right commits that should be part of the release, or push a v2.6.1 tag so the automation does the release?
Thanks
@salyh @getsaurabh02

swoehrl-mw added a commit that referenced this issue Jun 18, 2024 (cherry picked from commit 5a2d2ae)

swoehrl-mw added a commit that referenced this issue Jun 18, 2024 (cherry picked from commit 56c9c8f)

swoehrl-mw added a commit to MaibornWolff/opensearch-operator that referenced this issue Jul 2, 2024 (cherry picked from commit 5a2d2ae)

swoehrl-mw added a commit to MaibornWolff/opensearch-operator that referenced this issue Jul 2, 2024 (cherry picked from commit 56c9c8f)