[SURE-3805] Make leaderelection poll period configurable (was: Fleet lock api calls account for 33% of all api calls) #1491

Closed
kkaempf opened this issue Apr 26, 2023 · 5 comments

kkaempf commented Apr 26, 2023

Internal reference: SURE-3805

Issue description:

Fleet lock calls account for about 33% of all calls to the kube-apiserver in the analyzed audit log. Here are the top 10 request URIs by number of requests:

   9399 "/api/v1/namespaces/cattle-fleet-local-system/configmaps/fleet-agent-lock"
   9374 "/api/v1/namespaces/kube-system/configmaps/cattle-controllers?timeout=15m0s"
   8227 "/api/v1/namespaces/cattle-fleet-system/configmaps/fleet-agent-lock"
   5387 "/api/v1/namespaces/cattle-fleet-system/configmaps/fleet-controller-lock"
   4933 "/api/v1/namespaces/kube-system/configmaps/k3s"
   4736 "/api/v1/namespaces/cattle-fleet-system/configmaps/gitjob"
   4220 "/api/v1/namespaces/ingress-nginx/configmaps/ingress-controller-leader-nginx"
   1588 "/api/v1/namespaces/kube-system/configmaps/coredns-autoscaler"
   1588 "/api/v1/namespaces/default/services/kubernetes"
   1588 "/api/v1/namespaces/default/endpoints/kubernetes"

The first, third, and fourth URIs in the list are for locks related to Fleet. The log file used for this top 10 had 69,128 requests over a 4+ hour period, from "2021-12-13T09:30:07.389612Z" to "2021-12-13T13:54:50.179191Z". The three Fleet lock URIs account for 23,013 of those requests, which amounts to a rate of about 1.45 requests per second. Example JSON request:

{
  "kind": "Event",
  "apiVersion": "audit.k8s.io/v1",
  "level": "RequestResponse",
  "auditID": "6d513d96-104c-47ad-be40-9c55b4261981",
  "stage": "ResponseComplete",
  "requestURI": "/api/v1/namespaces/cattle-fleet-local-system/configmaps/fleet-agent-lock",
  "verb": "get",
  "user": {
    "username": "system:serviceaccount:cattle-fleet-local-system:fleet-agent",
    "uid": "9d6c91c0-8e7f-4208-b449-b3ffe59781da",
    "groups": [
      "system:serviceaccounts",
      "system:serviceaccounts:cattle-fleet-local-system",
      "system:authenticated"
    ]
  },
  "sourceIPs": [
    "172.16.0.17"
  ],
  "userAgent": "fleetagent/v0.0.0 (linux/amd64) kubernetes/$Format",
  "objectRef": {
    "resource": "configmaps",
    "namespace": "cattle-fleet-local-system",
    "name": "fleet-agent-lock",
    "apiVersion": "v1"
  },
  "responseStatus": {
    "metadata": {},
    "code": 200
  },
  "responseObject": {
    "kind": "ConfigMap",
    "apiVersion": "v1",
    "metadata": {
      "name": "fleet-agent-lock",
      "namespace": "cattle-fleet-local-system",
      "uid": "eee5e20e-af37-4228-a9aa-49e1135cc045",
      "resourceVersion": "29043327",
      "creationTimestamp": "2021-10-15T03:15:50Z",
      "annotations": {
        "control-plane.alpha.kubernetes.io/leader": "{\"holderIdentity\":\"fleet-agent-59b74595c-tcmkd\",\"leaseDurationSeconds\":45,\"acquireTime\":\"2021-12-09T17:09:15Z\",\"renewTime\":\"2021-12-13T09:30:06Z\",\"leaderTransitions\":6}"
      },
      "managedFields": [
        {
          "manager": "fleetagent",
          "operation": "Update",
          "apiVersion": "v1",
          "time": "2021-10-15T03:15:50Z",
          "fieldsType": "FieldsV1",
          "fieldsV1": {
            "f:metadata": {
              "f:annotations": {
                ".": {},
                "f:control-plane.alpha.kubernetes.io/leader": {}
              }
            }
          }
        }
      ]
    }
  },
  "requestReceivedTimestamp": "2021-12-13T09:30:08.837117Z",
  "stageTimestamp": "2021-12-13T09:30:08.842463Z",
  "annotations": {
    "authentication.k8s.io/legacy-token": "system:serviceaccount:cattle-fleet-local-system:fleet-agent",
    "authorization.k8s.io/decision": "allow",
    "authorization.k8s.io/reason": "RBAC: allowed by ClusterRoleBinding \"cattle-fleet-local-system-fleet-agent-role-binding\" of ClusterRole \"cattle-fleet-local-system-fleet-agent-role\" to ServiceAccount \"fleet-agent/cattle-fleet-local-system\""
  }
}
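
The leader election state in the event above is stored as a JSON string inside the `control-plane.alpha.kubernetes.io/leader` annotation of the ConfigMap. As a minimal sketch (the helper name `parse_leader_record` is made up for illustration, and the ConfigMap is trimmed to the relevant fields), it can be decoded like this:

```python
import json
from datetime import datetime, timezone

LEADER_ANNOTATION = "control-plane.alpha.kubernetes.io/leader"


def parse_leader_record(configmap: dict) -> dict:
    """Decode the JSON string stored in the leader-election annotation.

    Hypothetical helper for illustration; not part of Fleet or client-go.
    """
    raw = configmap["metadata"]["annotations"][LEADER_ANNOTATION]
    record = json.loads(raw)
    # Timestamps in the sample are UTC with a trailing "Z" and no fraction.
    for key in ("acquireTime", "renewTime"):
        parsed = datetime.strptime(record[key], "%Y-%m-%dT%H:%M:%SZ")
        record[key] = parsed.replace(tzinfo=timezone.utc)
    return record


# Trimmed copy of the responseObject from the audit event above.
configmap = {
    "kind": "ConfigMap",
    "metadata": {
        "name": "fleet-agent-lock",
        "annotations": {
            LEADER_ANNOTATION: (
                '{"holderIdentity":"fleet-agent-59b74595c-tcmkd",'
                '"leaseDurationSeconds":45,'
                '"acquireTime":"2021-12-09T17:09:15Z",'
                '"renewTime":"2021-12-13T09:30:06Z",'
                '"leaderTransitions":6}'
            )
        },
    },
}

record = parse_leader_record(configmap)
print(record["holderIdentity"], record["leaseDurationSeconds"], record["renewTime"].isoformat())
```

In this event the lock was last renewed at 09:30:06Z and the request was received at 09:30:08Z, so the renew time is only a couple of seconds old.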

Business impact:

This is likely impacting system performance.

Troubleshooting steps:

Ran "cat kube-apiserver-audit-log-2021-12-13T15-31-33.603.json | jq .requestURI | sort | uniq -c | sort -nr | head -10" to get top 10 requests.
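
For anyone who wants to reproduce the numbers without jq, here is a rough Python equivalent of that pipeline, plus the arithmetic behind the 33% and 1.45 requests/second figures. It assumes the audit log holds one JSON event per line; `top_request_uris` and the inline sample are illustrative only, and the counts and timestamps are the ones quoted in this issue.

```python
import json
from collections import Counter
from datetime import datetime


def top_request_uris(lines, n=10):
    """Count requestURI values in JSON-lines audit events, most frequent first."""
    counts = Counter()
    for line in lines:
        line = line.strip()
        if line:
            counts[json.loads(line)["requestURI"]] += 1
    return counts.most_common(n)


# Tiny inline sample standing in for the real audit log file.
sample = [
    '{"requestURI": "/api/v1/namespaces/cattle-fleet-system/configmaps/fleet-agent-lock"}',
    '{"requestURI": "/api/v1/namespaces/kube-system/configmaps/k3s"}',
    '{"requestURI": "/api/v1/namespaces/cattle-fleet-system/configmaps/fleet-agent-lock"}',
]
print(top_request_uris(sample))

# Figures quoted in this issue.
fleet_lock_requests = 9399 + 8227 + 5387  # the three Fleet lock URIs
total_requests = 69128
fmt = "%Y-%m-%dT%H:%M:%S.%fZ"
start = datetime.strptime("2021-12-13T09:30:07.389612Z", fmt)
end = datetime.strptime("2021-12-13T13:54:50.179191Z", fmt)
window_s = (end - start).total_seconds()

fleet_share = fleet_lock_requests / total_requests
fleet_rate = fleet_lock_requests / window_s
print(f"{fleet_share:.1%} of requests, {fleet_rate:.2f} requests/second")
```

For a real log, pass an open file object to `top_request_uris` instead of the sample list.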

Repro steps:

Install Rancher
Enable the kube-apiserver audit logs (doing this depends on the distro)

Workaround:

Is workaround available and implemented? no
What is the workaround:

Actual behavior:

Fleet lock API calls account for about one third of all API calls, roughly 1.45 requests/second.

Expected behavior:

There should be significantly fewer Fleet lock API calls.

Files, logs, traces:

Full API audit logs are available on request.

kkaempf commented Apr 26, 2023

@moio commented:

"Background: Fleet and Rancher use a leader election mechanism to ensure there is one pod alive per essential component: rancher, fleet-controller, fleet-agent (in the local cluster) and fleet-agent (in downstream clusters). How frequently the leader election mechanism checks is a trade-off between downtime after pod failure and API pressure. More frequent checks reduce downtime at the expense of more API calls.

My take: the "retry period" is currently hardcoded to 2 seconds for all essential components mentioned above. I believe that is too frequent, and that the value should be customizable.

Tech details: both Rancher and Fleet use Wrangler's leader election. That, in turn, delegates to client-go's leaderelection, which uses config maps as lock objects. Those are periodically checked - the period currently being hardcoded to 2 seconds.

According to git blame, going all the way back to norman 2.0, there was no compelling reason to choose the value of 2 seconds other than the fact that it was the default value in Kubernetes' client-go code, and it still is to this day. From my personal perspective it is set way too low for Fleet components, especially downstream, as one does not typically expect very strict SLAs for deploying applications, which is a lengthy process in itself. It is possibly too low for a default Rancher install as well. In any case users should have a way to influence that trade-off if there is a need.

Concluding: Fleet engineers: please check my reasoning and assumptions above. Would you agree we need a patch in Wrangler to expose a configuration knob and make it tweakable from Fleet, with different defaults for Fleet components (and possibly user-overridable)?"
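
To put rough numbers on the API-pressure side of that trade-off, here is a back-of-the-envelope sketch. It assumes each polling process checks its lock once per retry period, which is a simplification: a single check may translate into more than one API request, and client-go may add jitter. The function name and the candidate retry periods are made up for illustration, and failover time is not modelled.

```python
def lock_checks(retry_period_s: float, window_s: float, pollers: int = 1) -> float:
    """Approximate number of lock checks in a time window.

    Simplifying assumption: one check per retry period per polling process.
    """
    return pollers * window_s / retry_period_s


# Length of the audit log window from the issue description, in seconds.
window_s = 15882.79

# 2 s is the currently hardcoded value; the others are arbitrary candidates.
for retry_period_s in (2, 5, 10, 30):
    checks = lock_checks(retry_period_s, window_s)
    rate = 1 / retry_period_s
    print(f"retry period {retry_period_s:>2}s -> ~{checks:,.0f} checks, {rate:.2f} checks/second per poller")
```

The point is only that the number of checks scales inversely with the retry period, so going from 2 to 10 seconds would cut lock polling to roughly a fifth under this model.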

@kkaempf kkaempf added this to the 2023-Q3-v2.7x milestone Apr 28, 2023
@manno manno moved this from 🆕 New to 📋 Backlog in Fleet Apr 28, 2023
@manno manno moved this from 📋 Backlog to Icebox🧊 in Fleet May 11, 2023

moio commented Jun 21, 2023

rancher/wrangler#305 will make this configurable

@moio moio self-assigned this Jun 21, 2023
@moio moio changed the title from "Make leaderelection poll period configurable (was: Fleet lock api calls account for 33% of all api calls)" to "[SURE-3805] Make leaderelection poll period configurable (was: Fleet lock api calls account for 33% of all api calls)" Jun 22, 2023
@kkaempf kkaempf modified the milestones: 2023-Q3-v2.7x, 2023-Q4-v2.8x Jul 24, 2023

moio commented Aug 1, 2023

Internal reference to the follow-up ticket to make leaderelection configurable from the Rancher UI: https://jira.suse.com/browse/SURE-6723

@manno manno moved this from Icebox🧊 to Blocked in Fleet Aug 2, 2023

moio commented Sep 5, 2023

With rancher/wrangler#305 merged, this should now be solvable once #1717 is completed.


moio commented Oct 24, 2023

https://github.com/rancher/wrangler/releases/tag/v2.1.2 contains rancher/wrangler#305, so now this depends solely on #1717

@aruiz14 aruiz14 moved this from Blocked to 👀 In review in Fleet Dec 5, 2023
@aruiz14 aruiz14 closed this as completed Jan 10, 2024
@aruiz14 aruiz14 moved this from 👀 In review to ✅ Done in Fleet Jan 10, 2024