[BUG] Memory Leak in Operator version 2.4.0 #700
Comments
Hi everybody, we have two OpenSearch clusters, production and development, and we have a similar problem with the OpenSearch Operator controller restarting. I just discovered that I'm not getting logs from the opensearch operator controller. As soon as I have enough information for debugging I will share it here.
We ran a 2.4 cluster for about 6 months before upgrading to 2.5.1. It looks like 2.5.1 introduced a memory leak that wasn't previously there. The only change was that we started using ISMPolicies with 2.5.1; otherwise the clusters remained the same. Same users, same load, same content, etc.
[Triage] @prudhvigodithi
I will second this. Increasing resources delays the OOMKilled failure but doesn't resolve it. I also had the hypothesis that the reconcile loop itself was holding onto memory, as nothing else changed. I tested removing ISMPolicies and that resulted in no change. I reverted multiple clusters to 2.4.0, which made the memory leak disappear.
Adding @swoehrl-mw @salyh @jochenkressin @pchmielnik for more insights. Is it the same case when you use the latest version of the operator?
The memory leak is only present in the 2.5.1 operator. I do not experience the issue with 2.4.0.
### Description

There have been reports of memory leaks in the operator (#700). This PR adds an option to enable the [go pprof](https://pkg.go.dev/net/http/pprof) endpoints. With them users can get heap and allocation profiles that hopefully help in tracking down the leak.

### Issues Resolved

Fixes #626

### Check List

- [x] Commits are signed per the DCO using --signoff
- [-] Unittest added for the new/changed functionality and all unit tests are successful
- [x] Customer-visible features documented
- [x] No linter warnings (`make lint`)

If CRDs are changed:

- [-] CRD YAMLs updated (`make manifests`) and also copied into the helm chart
- [-] Changes to CRDs documented

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license. For more information on following Developer Certificate of Origin and signing off your commits, please check [here](https://github.com/opensearch-project/OpenSearch/blob/main/CONTRIBUTING.md#developer-certificate-of-origin).

Signed-off-by: Sebastian Woehrl <[email protected]>
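For readers unfamiliar with pprof, the snippet below is a minimal, self-contained sketch of what enabling such endpoints typically looks like in a Go binary. The port and wiring here are assumptions for illustration, not the operator's actual code:

```go
package main

import (
	"log"
	"net/http"
	_ "net/http/pprof" // blank import registers /debug/pprof/* handlers on http.DefaultServeMux
)

func main() {
	// Hypothetical side port for profiling; the operator's opt-in flag
	// presumably wires up something equivalent internally.
	log.Println(http.ListenAndServe("localhost:6060", nil))
}
```

A heap profile can then be pulled with `go tool pprof http://localhost:6060/debug/pprof/heap` and inspected with `top` or `web` to see which allocations are retained between reconcile runs.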
Hi all. I did some analysis and testing over the last few days, and I think I have narrowed the memory leak down to HTTP client connections being kept open. I tested a fix over the weekend and could no longer see the memory leak. The fix is pushed as PR #842.
### Description

The operator pod is suffering from memory leaks. After some analysis I think I have narrowed it down to connections for the http client being kept for reuse but never being used, due to a new client being created in every reconcile run. This PR disables the connection keepalive/reuse and (at least in my experiments) prevents the memory leak.

### Issues Resolved

Fixes #700

### Check List

- [x] Commits are signed per the DCO using --signoff
- [-] Unittest added for the new/changed functionality and all unit tests are successful
- [-] Customer-visible features documented
- [x] No linter warnings (`make lint`)

If CRDs are changed:

- [-] CRD YAMLs updated (`make manifests`) and also copied into the helm chart
- [-] Changes to CRDs documented

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license. For more information on following Developer Certificate of Origin and signing off your commits, please check [here](https://github.com/opensearch-project/OpenSearch/blob/main/CONTRIBUTING.md#developer-certificate-of-origin).

Signed-off-by: Sebastian Woehrl <[email protected]>
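As a rough illustration of the kind of change described (a sketch under stated assumptions, not the operator's actual code; the function name and timeout are invented): when a fresh `http.Client` is built on every reconcile run, each client's transport parks idle connections for reuse that never happens, so connections and their buffers accumulate. Disabling keep-alives makes the transport close each connection after its request:

```go
package main

import (
	"fmt"
	"net/http"
	"time"
)

// newOpenSearchClient is a hypothetical constructor illustrating the fix:
// with keep-alives disabled, the transport closes each connection after
// its request instead of keeping it in an idle pool that is never reused.
func newOpenSearchClient() *http.Client {
	return &http.Client{
		Transport: &http.Transport{
			DisableKeepAlives: true, // do not keep idle connections around for reuse
		},
		Timeout: 30 * time.Second,
	}
}

func main() {
	// Each reconcile run would build (or be handed) a client like this.
	c := newOpenSearchClient()
	fmt.Printf("keep-alives disabled: %v\n", c.Transport.(*http.Transport).DisableKeepAlives)
}
```

An alternative design would be a single long-lived client shared across reconcile runs, which preserves connection reuse; disabling keep-alives is the smaller change when per-run clients are kept.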
@swoehrl-mw I did a helm build from main and a fresh local deploy, and memory usage went from 30MB right after launch to 260MB in 4 hours with an empty test cluster and no activity. Observations of all of our Operator v2.4 clusters show garbage collection keeping memory between 90MB and 100MB on heavily utilized clusters (we have many instances). Thank you for your work closing http client connections, but I'm still seeing indications of a leak. Are there additional steps you took to test this other than building the helm chart locally from main and installing from that chart? The steps I took are below:
I noticed that opensearch-operator/pkg/reconcilers/configuration.go appears to create a new ConfigMap on every iteration of the reconciler, but I couldn't find any management or cleanup of old ConfigMap artifacts. Could this be a root cause of the leak?
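For context on what non-leaky ConfigMap handling usually looks like in a controller-runtime based operator, here is a minimal sketch of the `controllerutil.CreateOrUpdate` idiom. Whether the operator uses this pattern, and all names below, are assumptions:

```go
package reconcilers

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"sigs.k8s.io/controller-runtime/pkg/client"
	"sigs.k8s.io/controller-runtime/pkg/controller/controllerutil"
)

// reconcileConfigMap is a hypothetical helper: CreateOrUpdate fetches the
// existing object and creates it only if absent, so repeated reconcile
// runs converge on a single object instead of producing new artifacts.
func reconcileConfigMap(ctx context.Context, c client.Client, name, ns string, data map[string]string) error {
	cm := &corev1.ConfigMap{ObjectMeta: metav1.ObjectMeta{Name: name, Namespace: ns}}
	_, err := controllerutil.CreateOrUpdate(ctx, c, cm, func() error {
		cm.Data = data // desired state, applied on both create and update
		return nil
	})
	return err
}
```

Note that objects rebuilt in memory each loop are normally reclaimed by the garbage collector, so this alone would not explain pod memory growth unless references are retained.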
@mrvdcsg Did you also build your own operator Docker image from main? The fix is in the Go source code, so a new helm chart build alone will not solve anything.
@swoehrl-mw Thank you for your response. I didn't build a fresh image, so thank you for providing yours. I'll spin this up and test it on our end. Thank you for all your hard work on this!
Great, @swoehrl-mw, should we go with a v2.6.1 release?
Yes, we should get a bugfix release out soon. But since the main branch already has commits that are new features, the 2.6.1 release needs to be from a cherry-picked branch to only contain bug fixes.
Thanks @swoehrl-mw. I have created an issue for the 2.6.1 release (#845); we can move the discussion there. Can you please help with the cherry-pick and identify the right commits that should be part of the release, or push a tag?
The two fixes above were backported for the 2.6.1 release: the pprof option was cherry-picked from commit 5a2d2ae, and the http client fix (#842) from commit 56c9c8f. The backport PR descriptions are identical to the originals quoted above.
### What is the bug?
After upgrading to OpenSearch Operator 2.4.0 (from 2.3.2) we are experiencing ever-increasing memory usage and OOM kills of the operator pod. This appears to be exacerbated further when applying OpenSearch resources (OpensearchUser, OpensearchRole and OpensearchRoleBindings).
### How can one reproduce the bug?
Deploy the operator 2.4.0 and install an OpenSearch cluster. We are currently using OpenSearch version 2.7.0.
### What is the expected behavior?
Memory consumption of the Operator pod should be fairly stable.
### What is your host/environment?
### Do you have any screenshots?
Screenshot of memory usage by the operator before and after the upgrade. The upgrade was performed on January 9th; from that point memory consumption increased until the pod was OOM-killed.
### Do you have any additional context?
We have seen the same memory consumption pattern on 5 different Kubernetes clusters: before the operator update, memory consumption was pretty stable; once the upgrade was performed, memory consumption started to increase. The rate at which consumption increases seems to be related to the number of custom OpenSearch resources that have been deployed.
It appears that the same behavior was seen in #612, but no real solution was found or proposed there; we see the memory leak consistently on all clusters.
Memory increases more rapidly if we deploy more resources like this (replacing demo2 with demoXX).