liveness probe for etcd may cause database crash #2759

Closed
wjentner opened this issue Sep 21, 2022 · 14 comments
Labels
area/etcd priority/awaiting-more-evidence Lowest priority. Possibly useful, but not yet enough support to actually get it done.

Comments

@wjentner

wjentner commented Sep 21, 2022

Is this a BUG REPORT or FEATURE REQUEST?

BUG REPORT

Versions

kubeadm version (use kubeadm version):

kubeadm version: &version.Info{Major:"1", Minor:"24", GitVersion:"v1.24.1", GitCommit:"3ddd0f45aa91e2f30c70734b175631bec5b5825a", GitTreeState:"clean", BuildDate:"2022-05-24T12:24:38Z", GoVersion:"go1.18.2", Compiler:"gc", Platform:"linux/amd64"}

Environment:

  • Kubernetes version (use kubectl version):
Server Version: version.Info{Major:"1", Minor:"22", GitVersion:"v1.22.14", GitCommit:"bccf857df03c5a99a35e34020b3b63055f0c12ec", GitTreeState:"clean", BuildDate:"2022-09-14T22:36:04Z", GoVersion:"go1.16.15", Compiler:"gc", Platform:"linux/amd64"}
  • Cloud provider or hardware configuration:
    bare metal
  • OS (e.g. from /etc/os-release):
PRETTY_NAME="Debian GNU/Linux 11 (bullseye)"
NAME="Debian GNU/Linux"
VERSION_ID="11"
VERSION="11 (bullseye)"
VERSION_CODENAME=bullseye
ID=debian
HOME_URL="https://www.debian.org/"
SUPPORT_URL="https://www.debian.org/support"
BUG_REPORT_URL="https://bugs.debian.org/"
  • Kernel (e.g. uname -a):
Linux k8s-test 5.10.0-15-amd64 #1 SMP Debian 5.10.120-1 (2022-06-09) x86_64 GNU/Linux
  • Container runtime (CRI) (e.g. containerd, cri-o):
    containerd 1.6.8
  • Container networking plugin (CNI) (e.g. Calico, Cilium):
    calico
  • Others:

What happened?

This was first reported in the etcd repository: etcd-io/etcd#14497

Kubeadm creates a manifest for etcd that uses the /health endpoint of etcd in the liveness probe.
When the etcd database exceeds a certain size, the NOSPACE alarm is raised, putting etcd into a maintenance mode that allows only read and delete operations until the size is reduced and etcdctl alarm disarm is issued.
Once the alarm is raised, the /health check no longer returns a 200 response, so the kubelet restarts the container and the etcd member goes into a CrashLoopBackOff.
Because this happens almost simultaneously on all members, the continuous crashloops eventually cause a fatal error where etcd is no longer able to start up by itself.
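
For reference, clearing the alarm follows the sequence from the etcd maintenance docs; a rough sketch is below ($ENDPOINTS is a placeholder for the member client URLs):

# find the current revision (assumes a reachable endpoint)
rev=$(etcdctl --endpoints=$ENDPOINTS endpoint status --write-out=json | egrep -o '"revision":[0-9]*' | egrep -o '[0-9].*')
# compact away all old revisions
etcdctl --endpoints=$ENDPOINTS compact $rev
# defragment to reclaim the freed space
etcdctl --endpoints=$ENDPOINTS defrag
# disarm the NOSPACE alarm so writes are accepted again
etcdctl --endpoints=$ENDPOINTS alarm disarm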

The etcd maintainers mention that the /health endpoint should not be used for liveness probes.

In our case, this caused all etcd members to run into this error:

etcd1: {"level":"warn","ts":"2022-09-16T17:53:40.235Z","caller":"snap/db.go:88","msg":"failed to find [SNAPSHOT-INDEX].snap.db","snapshot-index":893213589,"snapshot-file-path":"/var/lib/etcd/member/snap/00000000353d5b95.snap.db","error":"snap: snapshot file doesn't exist"}

etcd2: {"level":"warn","ts":"2022-09-16T17:47:06.327Z","caller":"snap/db.go:88","msg":"failed to find [SNAPSHOT-INDEX].snap.db","snapshot-index":893216556,"snapshot-file-path":"/var/lib/etcd/member/snap/00000000353d672c.snap.db","error":"snap: snapshot file doesn't exist"}

etcd5: {"level":"warn","ts":"2022-09-16T17:46:30.424Z","caller":"snap/db.go:88","msg":"failed to find [SNAPSHOT-INDEX].snap.db","snapshot-index":893201552,"snapshot-file-path":"/var/lib/etcd/member/snap/00000000353d2c90.snap.db","error":"snap: snapshot file doesn't exist"}

As you can see, all members are on different indices and are not able to recover from any snapshot, which was likely caused by the continuous restarts.

What you expected to happen?

etcd should not crash-loop and cause the database to end up in an unrecoverable state.

How to reproduce it (as minimally and precisely as possible)?

An easy way to trigger this behavior is to follow the etcd docs: https://etcd.io/docs/v3.5/op-guide/maintenance/#space-quota

  1. Add the flag --quota-backend-bytes=16777216 (16 MB) to etcd.
  2. Fill the DB: $ while [ 1 ]; do dd if=/dev/urandom bs=1024 count=1024 | ETCDCTL_API=3 etcdctl put key || break; done
  3. Watch the status of etcd with etcdctl endpoint --cluster status -w table; etcdctl alarm list should report the NOSPACE alarm.
  4. Check the /health endpoint, which no longer returns 200.
  5. Observe that the etcd pod goes into a CrashLoopBackOff state (see the condensed script below).

Note: The error may only occur with multiple etcd nodes.
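
The steps above, condensed into a rough shell sketch (assuming a single local member with the default client endpoint 127.0.0.1:2379; in the kubeadm static pod the probe instead targets the metrics port 2381):

#!/bin/bash
# 1. start etcd with a deliberately tiny backend quota (16 MB)
etcd --quota-backend-bytes=16777216 &
# 2. fill the database until the quota is exceeded
while [ 1 ]; do dd if=/dev/urandom bs=1024 count=1024 | ETCDCTL_API=3 etcdctl put key || break; done
# 3. the NOSPACE alarm should now be raised
etcdctl endpoint status -w table
etcdctl alarm list
# 4. the /health endpoint no longer returns 200, which is what trips the liveness probe
curl -i http://127.0.0.1:2379/health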

Anything else we need to know?

Without the --quota-backend-bytes flag, the alarm is raised at a DB size of around 2 GB. We have a moderately small cluster that has been running for almost three years, and it reached this size only recently.

@neolit123
Member

@ahrtr @serathius
is this something we have to handle?

latest probe in kubeadm is here:
https://github.com/kubernetes/kubernetes/blob/master/cmd/kubeadm/app/phases/etcd/local.go#L206
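
For context, the liveness probe in the kubeadm-generated etcd static pod manifest looks roughly like this (paraphrased; exact values vary by kubeadm version, and 2381 is the --listen-metrics-urls port kubeadm configures):

livenessProbe:
  failureThreshold: 8
  httpGet:
    host: 127.0.0.1
    path: /health
    port: 2381
    scheme: HTTP
  initialDelaySeconds: 10
  periodSeconds: 10
  timeoutSeconds: 15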

@neolit123 neolit123 added priority/awaiting-more-evidence Lowest priority. Possibly useful, but not yet enough support to actually get it done. area/etcd labels Sep 21, 2022
@neolit123 neolit123 added this to the v1.26 milestone Sep 21, 2022
@ahrtr
Member

ahrtr commented Sep 21, 2022

Sorry for that. It's a known issue, which was fixed in 3.5.5.

FYI. etcd-io/etcd#14419

@wjentner
Author

What is the recommended solution?
kubeadm 1.24.6 installs etcd 3.5.3.
It also does not contain the updated liveness probe from the main branch.

Should I force the etcd version 3.5.5?

@ahrtr
Member

ahrtr commented Sep 22, 2022

Kubernetes bumped etcd to 3.5.5 on the master branch in kubernetes/kubernetes#112489. I think etcd 3.5.5 should be backported to the previous stable releases as well. cc @dims @neolit123

@pacoxu
Member

pacoxu commented Sep 22, 2022

etcd-io/etcd#14382 (comment)
The issue seems to be critical.

#!/bin/bash
# start etcd with a low snapshot count so a snapshot is triggered after only a few raft entries
./bin/etcd --snapshot-count=5 &
# run a few health checks
./bin/etcdctl endpoint health
./bin/etcdctl endpoint health
./bin/etcdctl endpoint health
./bin/etcdctl endpoint health
./bin/etcdctl endpoint health
./bin/etcdctl endpoint health
# kill the member uncleanly (SIGKILL) ...
kill -9 %1
# ... and restart it with the same flags; on the affected 3.5.x versions this second start fails
./bin/etcd --snapshot-count=5

@neolit123
Member

neolit123 commented Sep 22, 2022

I think etcd 3.5.5 should be bumped into previous stable releases as well.

from what i've seen these backports to bump etcd in older k8s releases do not get merged. but if someone wants to try, go ahead.
your solution is to tell kubeadm what etcd image version to use. remember to manually handle upgrades
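
Pinning the etcd image version can be done in the ClusterConfiguration, for example (a sketch; the exact tag available in the image registry may differ):

apiVersion: kubeadm.k8s.io/v1beta3
kind: ClusterConfiguration
etcd:
  local:
    imageTag: "3.5.5-0"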


@BenTheElder
Member

from what i've seen these backports to bump etcd in older k8s releases do not get merged. but if someone wants to try, go ahead.

Though, in this case it should only be a patch bump and may be fine? We have been on 3.5.X since at least k8s 1.23

@neolit123
Member

etcd patch backports did not get attention and approval either. +1 from me if someone wants to try.

@wjentner
Author

@wjentner FYI. https://github.com/ahrtr/etcd-issues/blob/d134cb8d07425bf3bf530e6bb509c6e6bc6e7c67/etcd/etcd-db-editor/main.go#L16-L28

Please let me know whether it works or not.

I have manually recovered our DB using the snap/db file.

As a current workaround, I have set --quota-backend-bytes to 8 GB using the extraArgs in kubeadm so that the alarm is not raised.
Additionally, I have forced etcd version 3.5.5.
There seems to be no way to overwrite the livenessProbe other than manually editing the manifest directly.

I'm quite surprised that more people have not hit this problem yet, since the default DB size at which the alarm is raised is around 2 GB.
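
The workaround described above maps to roughly the following ClusterConfiguration fragment (a sketch; combined here with the imageTag override shown earlier to force etcd 3.5.5):

apiVersion: kubeadm.k8s.io/v1beta3
kind: ClusterConfiguration
etcd:
  local:
    imageTag: "3.5.5-0"
    extraArgs:
      quota-backend-bytes: "8589934592"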

@neolit123
Member

There seems to be no way to overwrite the livenessProbe other than manually editing the manifest directly.

check the kubeadm --patches functionality
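
A sketch of such a patch, assuming kubeadm's <target>[suffix][+patchtype].<extension> file-naming convention and a patches directory passed via --patches; the probe path shown mirrors the one newer kubeadm versions generate, and 2381 is the metrics port from the manifest:

# /etc/kubernetes/patches/etcd+strategic.yaml (hypothetical path)
spec:
  containers:
  - name: etcd
    livenessProbe:
      httpGet:
        host: 127.0.0.1
        path: /health?exclude=NOSPACE&serializable=true
        port: 2381
        scheme: HTTP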

@ahrtr
Member

ahrtr commented Sep 23, 2022

Thanks for the feedback.

I have manually recovered our DB using the snap/db file.

I am curious how you did it.

@wjentner
Author

@ahrtr I basically followed the disaster recovery docs: https://etcd.io/docs/v3.5/op-guide/recovery/

  1. Stopped etcd on all nodes.
  2. On etcd1: etcdutl snapshot restore /var/lib/etcd/member/snap/db --name etcd1 --initial-cluster etcd1=http://host1:2380,etcd2=http://host2:2380,etcd3=http://host3:2380 --initial-cluster-token new-cluster --initial-advertise-peer-urls http://host1:2380 --skip-hash-check
  3. Started etcd1 with matching flags plus --quota-backend-bytes=8589934592, --initial-cluster-state=existing, --initial-cluster-token new-cluster.
  4. Made sure etcd1 started successfully (apart from complaints that the other etcd members were not answering).
  5. Created a new snapshot with etcdctl snapshot save.
  6. Copied that snapshot to etcd2 and etcd3.
  7. Used the command from (2) with adjusted names and without the --skip-hash-check flag.
  8. Started etcd2 and etcd3 with the same flags as in (3) (condensed into a command sketch below).
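
A condensed command sketch of the steps above (host names and values are the ones from this thread; the --data-dir locations are placeholders, and in a kubeadm cluster the peer URLs would normally be https with the kubeadm-managed certificates):

# on etcd1: rebuild the member from its existing backend file
etcdutl snapshot restore /var/lib/etcd/member/snap/db \
  --name etcd1 \
  --initial-cluster etcd1=http://host1:2380,etcd2=http://host2:2380,etcd3=http://host3:2380 \
  --initial-cluster-token new-cluster \
  --initial-advertise-peer-urls http://host1:2380 \
  --skip-hash-check \
  --data-dir /var/lib/etcd-restored
# start etcd1 from the restored data dir with matching flags
# (plus the usual --listen-peer-urls/--listen-client-urls and advertise flags for the host)
etcd --name etcd1 --data-dir /var/lib/etcd-restored \
  --initial-cluster etcd1=http://host1:2380,etcd2=http://host2:2380,etcd3=http://host3:2380 \
  --initial-cluster-token new-cluster \
  --initial-cluster-state existing \
  --quota-backend-bytes=8589934592 &
# take a fresh snapshot from the recovered member and copy it to the others
etcdctl --endpoints=http://host1:2379 snapshot save snapshot.db
scp snapshot.db host2:/tmp/ && scp snapshot.db host3:/tmp/
# on etcd2 and etcd3: restore from the copied snapshot (no --skip-hash-check needed),
# then start them with the same flags as etcd1, adjusting --name and the peer URL
etcdutl snapshot restore /tmp/snapshot.db \
  --name etcd2 \
  --initial-cluster etcd1=http://host1:2380,etcd2=http://host2:2380,etcd3=http://host3:2380 \
  --initial-cluster-token new-cluster \
  --initial-advertise-peer-urls http://host2:2380 \
  --data-dir /var/lib/etcd-restored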

The cluster synced successfully, and all members were healthy afterward.

@ahrtr
Member

ahrtr commented Sep 23, 2022

Thanks @wjentner for the feedback, which makes sense.

I just checked the source code; the key reason your steps work is that etcdutl updates the consistent index using the commitId. FYI. v3_snapshot.go#L272
