Persistent memory leak of k3s control plane instance #10922
Comments
Process view ~10 hours ago: This shows that both k3s & containerd are growing. These views are tracking the systemd cgroup slices, not the workloads, so there should be no contamination of these stats by workload behavior (separately, as noted above, I minimized the number of pods running on this node and checked the stats of those pods -- all were within reason).
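For anyone who wants to reproduce the same slice-level view without below: on a cgroup v2 host, the k3s unit's memory can be sampled directly from the cgroup filesystem. A minimal sketch follows; the k3s.service path and the assumption that the embedded containerd lives in that same slice are mine (based on a default script install), not something stated in this issue.

```python
#!/usr/bin/env python3
"""Minimal sketch: sample memory usage of the k3s systemd cgroup on a cgroup v2 host.

Assumptions (not taken from this issue): the unit is k3s.service, cgroup v2 is
mounted at /sys/fs/cgroup, and the embedded containerd runs inside the same
slice. Adjust CGROUP_DIR for a different layout.
"""
import time
from pathlib import Path

CGROUP_DIR = Path("/sys/fs/cgroup/system.slice/k3s.service")  # assumed path

def memory_stat() -> dict[str, int]:
    """Parse memory.stat into a {counter: bytes} dict."""
    stat = {}
    for line in (CGROUP_DIR / "memory.stat").read_text().splitlines():
        key, value = line.split()
        stat[key] = int(value)
    return stat

if __name__ == "__main__":
    while True:
        current = int((CGROUP_DIR / "memory.current").read_text())
        stat = memory_stat()
        # A leak in k3s/containerd themselves shows up as 'anon' climbing
        # while 'file' (page cache) stays roughly flat.
        print(f"{time.strftime('%H:%M:%S')} "
              f"current={current >> 20} MiB "
              f"anon={stat.get('anon', 0) >> 20} MiB "
              f"file={stat.get('file', 0) >> 20} MiB")
        time.sleep(60)
```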
Environmental Info:
K3s Version: v1.29.8
Node(s) CPU architecture, OS, and Version:
Linux venus-node-3 6.10.6-200.fc40.x86_64 #1 SMP PREEMPT_DYNAMIC Mon Aug 19 14:09:30 UTC 2024 x86_64 GNU/Linux
cmdline:
Cluster Configuration:
Describe the bug:
I have a control plane node that runs out of memory after 1-2 days. I've experimented a bit, and this happens even when the node is cordoned and only minimal pods are running.
Steps To Reproduce:
I have a simple drop-in:
Expected behavior:
Actual behavior:
Memory usage of k3s & containerd grows over a 1-2 day period until it consumes all memory on the host. This happened on v1.29.6 as well; upgrading to v1.29.8 made no change.
Additional context / logs:
I also have logs from below (a resource monitor similar to atop, if you aren't familiar with it) showing the cgroup- and process-level stats over a 24h+ period. You can see the RSS grow uncontrollably over that window:
~10 hours ago:
~15 minutes ago:
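As a rough stand-in for the below process view, for anyone who wants a comparable trace without installing it: sample VmRSS for the k3s and containerd processes from /proc on an interval. The process-name matching and the 5-minute interval in this sketch are assumptions, not something taken from this issue.

```python
#!/usr/bin/env python3
"""Minimal sketch: periodically log RSS of k3s/containerd processes from /proc,
as a lighter-weight stand-in for the below process view.

Assumptions (not taken from this issue): Linux /proc layout, and that the
interesting processes have 'k3s' or 'containerd' in their comm name.
"""
import time
from pathlib import Path

WATCHED = ("k3s", "containerd")  # assumed process-name substrings

def rss_by_name() -> dict[str, int]:
    """Sum VmRSS (KiB) across processes whose comm contains a watched name."""
    totals = dict.fromkeys(WATCHED, 0)
    for proc in Path("/proc").iterdir():
        if not proc.name.isdigit():
            continue
        try:
            comm = (proc / "comm").read_text().strip()
            for line in (proc / "status").read_text().splitlines():
                if line.startswith("VmRSS:"):
                    rss_kib = int(line.split()[1])
                    break
            else:
                continue  # kernel threads have no VmRSS
        except OSError:
            continue  # process exited between listing and reading
        # Note: substring matching also picks up containerd-shim processes;
        # filter more strictly if you only want the daemons themselves.
        for name in WATCHED:
            if name in comm:
                totals[name] += rss_kib
    return totals

if __name__ == "__main__":
    # One sample every 5 minutes is plenty to see a multi-day growth trend.
    while True:
        totals = rss_by_name()
        summary = " ".join(f"{name}={kib // 1024} MiB" for name, kib in totals.items())
        print(time.strftime("%Y-%m-%d %H:%M:%S"), summary, flush=True)
        time.sleep(300)
```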
Grafana stats: