[SURE-3805] Make leaderelection poll period configurable (was: Fleet lock api calls account for 33% of all api calls) #1491

kkaempf · 2023-04-26T10:13:32Z

Internal reference: SURE-3805

Issue description:

Fleet lock requests account for about 33% of all calls to the kube-apiserver. Here are the top 10 request URIs by count:

   9399 "/api/v1/namespaces/cattle-fleet-local-system/configmaps/fleet-agent-lock"
   9374 "/api/v1/namespaces/kube-system/configmaps/cattle-controllers?timeout=15m0s"
   8227 "/api/v1/namespaces/cattle-fleet-system/configmaps/fleet-agent-lock"
   5387 "/api/v1/namespaces/cattle-fleet-system/configmaps/fleet-controller-lock"
   4933 "/api/v1/namespaces/kube-system/configmaps/k3s"
   4736 "/api/v1/namespaces/cattle-fleet-system/configmaps/gitjob"
   4220 "/api/v1/namespaces/ingress-nginx/configmaps/ingress-controller-leader-nginx"
   1588 "/api/v1/namespaces/kube-system/configmaps/coredns-autoscaler"
   1588 "/api/v1/namespaces/default/services/kubernetes"
   1588 "/api/v1/namespaces/default/endpoints/kubernetes"

The first, third, and fourth URIs in the list are Fleet locks. The log file used for this top 10 had 69,128 requests over a 4+ hour period, from "2021-12-13T09:30:07.389612Z" to "2021-12-13T13:54:50.179191Z". The three Fleet lock URIs total 23,013 of those requests, which amounts to a rate of 1.45 requests per second. Example JSON request:

{
  "kind": "Event",
  "apiVersion": "audit.k8s.io/v1",
  "level": "RequestResponse",
  "auditID": "6d513d96-104c-47ad-be40-9c55b4261981",
  "stage": "ResponseComplete",
  "requestURI": "/api/v1/namespaces/cattle-fleet-local-system/configmaps/fleet-agent-lock",
  "verb": "get",
  "user": {
    "username": "system:serviceaccount:cattle-fleet-local-system:fleet-agent",
    "uid": "9d6c91c0-8e7f-4208-b449-b3ffe59781da",
    "groups": [
      "system:serviceaccounts",
      "system:serviceaccounts:cattle-fleet-local-system",
      "system:authenticated"
    ]
  },
  "sourceIPs": [
    "172.16.0.17"
  ],
  "userAgent": "fleetagent/v0.0.0 (linux/amd64) kubernetes/$Format",
  "objectRef": {
    "resource": "configmaps",
    "namespace": "cattle-fleet-local-system",
    "name": "fleet-agent-lock",
    "apiVersion": "v1"
  },
  "responseStatus": {
    "metadata": {},
    "code": 200
  },
  "responseObject": {
    "kind": "ConfigMap",
    "apiVersion": "v1",
    "metadata": {
      "name": "fleet-agent-lock",
      "namespace": "cattle-fleet-local-system",
      "uid": "eee5e20e-af37-4228-a9aa-49e1135cc045",
      "resourceVersion": "29043327",
      "creationTimestamp": "2021-10-15T03:15:50Z",
      "annotations": {
        "control-plane.alpha.kubernetes.io/leader": "{\"holderIdentity\":\"fleet-agent-59b74595c-tcmkd\",\"leaseDurationSeconds\":45,\"acquireTime\":\"2021-12-09T17:09:15Z\",\"renewTime\":\"2021-12-13T09:30:06Z\",\"leaderTransitions\":6}"
      },
      "managedFields": [
        {
          "manager": "fleetagent",
          "operation": "Update",
          "apiVersion": "v1",
          "time": "2021-10-15T03:15:50Z",
          "fieldsType": "FieldsV1",
          "fieldsV1": {
            "f:metadata": {
              "f:annotations": {
                ".": {},
                "f:control-plane.alpha.kubernetes.io/leader": {}
              }
            }
          }
        }
      ]
    }
  },
  "requestReceivedTimestamp": "2021-12-13T09:30:08.837117Z",
  "stageTimestamp": "2021-12-13T09:30:08.842463Z",
  "annotations": {
    "authentication.k8s.io/legacy-token": "system:serviceaccount:cattle-fleet-local-system:fleet-agent",
    "authorization.k8s.io/decision": "allow",
    "authorization.k8s.io/reason": "RBAC: allowed by ClusterRoleBinding \"cattle-fleet-local-system-fleet-agent-role-binding\" of ClusterRole \"cattle-fleet-local-system-fleet-agent-role\" to ServiceAccount \"fleet-agent/cattle-fleet-local-system\""
  }
}

Business impact:

This is likely impacting system performance.

Troubleshooting steps:

Repro steps:

Install Rancher
Enable the kube-apiserver audit logs (doing this depends on the distro)

Workaround:

Is workaround available and implemented? no
What is the workaround:

Actual behavior:

Fleet lock API calls account for about one third of all API calls, roughly 1.45 requests/second.

Expected behavior:

Fleet lock API calls should be significantly fewer.

Files, logs, traces:

Full API audit logs available on request.


kkaempf · 2023-04-26T10:14:59Z

@moio commented:

"Background: Fleet and Rancher use a leader election mechanism to ensure there is one pod alive per essential component: rancher, fleet-controller, fleet-agent (in the local cluster) and fleet-agent (in downstream clusters). How frequently the leader mechanism checks is a trade-off between downtime after pod failure and API pressure. More frequent checks reduce downtime at the expense of more frequent API calls.

My take: the "retry period" is currently hardcoded to 2 seconds for all essential components mentioned above. I believe that is too frequent, and that the value should be customizable.

Tech details: both Rancher and Fleet use Wrangler's leader election. That, in turn, delegates to client-go's leaderelection, which uses config maps as lock objects. Those are periodically checked - the period currently being hardcoded to 2 seconds.

According to git blame, all the way back to norman 2.0, there was no compelling reason to choose the value of 2 seconds other than the fact that it was the default value in Kubernetes' client-go code, and it still is to this day. From my personal perspective it is set way too low for Fleet components, especially downstream, since one does not typically expect very strict SLAs for deploying applications, which is a lengthy process in itself. It is possibly too low for a default Rancher install as well. In any case, users should have a way to influence that trade-off if there is a need.

Concluding: Fleet engineers: please check my reasoning and assumptions above. Would you agree we need a patch in Wrangler to expose a configuration knob and make it tweakable from Fleet, with different defaults for Fleet components (and possibly user-overridable)?"

moio · 2023-06-21T08:09:29Z

rancher/wrangler#305 will make this configurable

moio · 2023-08-01T12:18:52Z

Internal reference to the follow-up ticket to make leaderelection configurable from the Rancher UI: https://jira.suse.com/browse/SURE-6723

moio · 2023-09-05T07:24:14Z

With rancher/wrangler#305 merged, this should now be solvable once #1717 is completed.

moio · 2023-10-24T07:21:42Z

https://github.com/rancher/wrangler/releases/tag/v2.1.2 contains rancher/wrangler#305, so now this depends solely on #1717

kkaempf added area/performance kind/bug JIRA Must shout labels Apr 26, 2023

kkaempf added this to Fleet Apr 26, 2023

github-project-automation bot moved this to 🆕 New in Fleet Apr 26, 2023

github-actions bot added area/fleet labels Apr 26, 2023

kkaempf added this to the 2023-Q3-v2.7x milestone Apr 28, 2023

manno moved this from 🆕 New to 📋 Backlog in Fleet Apr 28, 2023

manno moved this from 📋 Backlog to Icebox🧊 in Fleet May 11, 2023

moio mentioned this issue Jun 21, 2023

leaderelection: configure all timeouts via environment variables rancher/wrangler#305

Merged

moio self-assigned this Jun 21, 2023

moio changed the title ~~Make leaderelection poll period configurable (was: Fleet lock api calls account for 33% of all api calls)~~ [SURE-3805] Make leaderelection poll period configurable (was: Fleet lock api calls account for 33% of all api calls) Jun 22, 2023

kkaempf modified the milestones: 2023-Q3-v2.7x, 2023-Q4-v2.8x Jul 24, 2023

manno moved this from Icebox🧊 to Blocked in Fleet Aug 2, 2023

kkaempf modified the milestones: v2.8.0, 2024-Q1-2.8x Oct 4, 2023

manno mentioned this issue Oct 5, 2023

Convert Fleet-Controller to StatefulSet #1837

Closed

manno modified the milestones: 2024-Q1-2.8x, v2.9.0 Nov 27, 2023

manno mentioned this issue Nov 27, 2023

Convert fleet-agent to controller-runtime #1772

Merged

11 tasks

aruiz14 mentioned this issue Nov 30, 2023

Configurable leader election via chart values #1981

Merged

moio assigned aruiz14 and unassigned moio Dec 1, 2023

aruiz14 moved this from Blocked to 👀 In review in Fleet Dec 5, 2023

aruiz14 closed this as completed Jan 10, 2024

aruiz14 moved this from 👀 In review to ✅ Done in Fleet Jan 10, 2024
