[SURE-3805] Make leaderelection poll period configurable (was: Fleet lock api calls account for 33% of all api calls) #1491
@moio commented:

Background: Fleet and Rancher use a leader election mechanism to ensure there is one pod alive per essential component: rancher, fleet-controller, fleet-agent (in the local cluster) and fleet-agent (in downstream clusters). How frequently the leader mechanism checks the lock is a trade-off between downtime after a pod failure and API pressure: more frequent checks reduce downtime at the expense of more API calls.

My take: the "retry period" is currently hardcoded to 2 seconds for all of the essential components mentioned above, and I believe it is both too frequent and should be customizable.

Tech details: both Rancher and Fleet use Wrangler's leader election. That, in turn, delegates to client-go's leaderelection, which uses config maps as lock objects. Those are checked periodically, with the period currently hardcoded to 2 seconds. According to git blame, going all the way back to Norman 2.0 there was no compelling reason to choose 2 seconds other than that it was the default value in Kubernetes' client-go code, and it still is to this day. From my personal perspective it is set far too low for Fleet components, especially downstream ones, as one does not typically expect very strict SLAs for deploying applications (a lengthy process in itself), and it is possibly too low for a default Rancher install as well. In any case, users should have a way to influence that trade-off if the need arises.

Concluding: Fleet engineers, please check my reasoning and assumptions above. Would you agree we need a patch in Wrangler to expose a configuration knob, making it tweakable from Fleet, with different defaults for Fleet components (and possibly user-overridable)?
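For context, here is a minimal sketch of how client-go's leader election is typically wired up, with the three timing knobs spelled out. The namespace, lock name, and identity below are illustrative placeholders, not Fleet's actual values, and the sketch uses the Lease-based lock of current client-go rather than the config-map lock described above:

```go
package election

import (
	"context"
	"os"
	"time"

	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
	"k8s.io/client-go/tools/leaderelection"
	"k8s.io/client-go/tools/leaderelection/resourcelock"
)

// RunWithLeaderElection blocks, invoking run only while this pod holds the lock.
func RunWithLeaderElection(ctx context.Context, cfg *rest.Config, run func(context.Context)) error {
	client, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		return err
	}
	identity, _ := os.Hostname()
	lock, err := resourcelock.New(
		resourcelock.LeasesResourceLock, // the Wrangler versions discussed here used config-map-backed locks
		"cattle-fleet-system",           // illustrative namespace
		"fleet-controller-lock",         // illustrative lock name
		client.CoreV1(),
		client.CoordinationV1(),
		resourcelock.ResourceLockConfig{Identity: identity},
	)
	if err != nil {
		return err
	}
	leaderelection.RunOrDie(ctx, leaderelection.LeaderElectionConfig{
		Lock:          lock,
		LeaseDuration: 15 * time.Second, // how long a lease is considered valid
		RenewDeadline: 10 * time.Second, // the leader must renew within this window
		RetryPeriod:   2 * time.Second,  // the hardcoded poll period this issue is about
		Callbacks: leaderelection.LeaderCallbacks{
			OnStartedLeading: run,
			OnStoppedLeading: func() {}, // real controllers typically exit here to restart cleanly
		},
	})
	return nil
}
```

RetryPeriod is the interval at which both the leader (renewing its lease) and every standby candidate (trying to acquire it) hit the API server, so with several Fleet components each polling every 2 seconds the call volume adds up quickly.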
rancher/wrangler#305 will make this configurable
Internal reference to the follow-up ticket to make leaderelection configurable from the Rancher UI: https://jira.suse.com/browse/SURE-6723
With rancher/wrangler#305 merged, this should now be solvable once #1717 is completed.
https://github.com/rancher/wrangler/releases/tag/v2.1.2 contains rancher/wrangler#305, so now this depends solely on #1717 |
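As an illustration of the kind of knob that change enables, a hedged sketch follows: read the poll period from an environment variable and fall back to client-go's 2-second default. The variable name and helper are hypothetical, chosen for this example; check the merged rancher/wrangler#305 for the actual mechanism:

```go
package leader

import (
	"os"
	"time"
)

// RetryPeriod returns the leader-election poll period. The environment
// variable name CATTLE_ELECTION_RETRY_PERIOD is an assumption for this
// sketch, not necessarily what rancher/wrangler#305 ships.
func RetryPeriod() time.Duration {
	if v := os.Getenv("CATTLE_ELECTION_RETRY_PERIOD"); v != "" {
		if d, err := time.ParseDuration(v); err == nil && d > 0 {
			return d
		}
	}
	return 2 * time.Second // client-go's long-standing default
}
```

A deployment could then set, say, `CATTLE_ELECTION_RETRY_PERIOD=30s` on downstream fleet-agents, where slower failover is an acceptable price for far fewer API calls.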
Internal reference: SURE-3805
Issue description:
For the top 10 calls to the kube-apiserver, Fleet locks account for 33% of all calls. Here's the top 10:
The first, third, and fourth URIs in the list are for locks related to Fleet. The log file used for this top 10 had 69,128 requests over a 4+ hour period ("2021-12-13T09:30:07.389612Z" to "2021-12-13T13:54:50.179191Z"), i.e. roughly 4.35 requests per second overall; a third of that works out to the quoted rate of about 1.45 Fleet lock requests per second. Example JSON request:
Business impact:
This is likely impacting system performance.
Troubleshooting steps:
Ran "cat kube-apiserver-audit-log-2021-12-13T15-31-33.603.json | jq .requestURI | sort | uniq -c | sort -nr | head -10" to get top 10 requests.
Repro steps:
Install Rancher
Enable the kube-apiserver audit logs (how to do this depends on the distro; see the sketch after these steps)
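As a distro-agnostic illustration of that second step, enabling auditing generally means handing the kube-apiserver an audit policy file; a minimal sketch, with placeholder file paths:

```yaml
# audit-policy.yaml: record request metadata only, which is enough to count request URIs
apiVersion: audit.k8s.io/v1
kind: Policy
rules:
  - level: Metadata
```

The policy is then wired in with the standard kube-apiserver flags, e.g. `--audit-policy-file=/etc/kubernetes/audit-policy.yaml --audit-log-path=/var/log/kube-apiserver-audit.log`; how those flags get set (RKE cluster.yml, RKE2 config, kubeadm manifests, ...) is the distro-specific part.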
Workaround:
Is a workaround available and implemented? No.
What is the workaround: (none)
Actual behavior:
Fleet lock API calls account for 1/3rd of all API calls, about 1.45 requests/second.
Expected behavior:
The rate of Fleet lock API calls would be significantly lower.
Files, logs, traces:
Full API audit logs are available on request.