Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ETCD fails to start after performing alarm list operation and then power off/on #14382

Closed
tobgu opened this issue Aug 25, 2022 · 17 comments · Fixed by #14419
Closed

ETCD fails to start after performing alarm list operation and then power off/on #14382

tobgu opened this issue Aug 25, 2022 · 17 comments · Fixed by #14419
Labels

Comments

@tobgu
Copy link

tobgu commented Aug 25, 2022

What happened?

With a cluster of three nodes setup using https://github.com/bitnami/charts/blob/master/bitnami/etcd/README.md it was noticed that the ETCD nodes failed to start after a forced reboot of the underlying worker nodes. A graceful shutdown will not result in this issue.

The logs indicated a mismatch in the raft log index between the v2 *.snap files and the v3 db file where the index of the snap files was higher than that of the v3 db file causing ETCD to look for a snap.db file that did not exist (see logs).

The index of the snap file was derived from the file name (eg. 0000000000000017-0000000000124f8c.snap) while the consistent_index of the v3 db was extracted using bbolt, bbolt get db meta consistent_index | hexdump => 0xb4903.

So far the issue looked very much like what is described in #11949. The "fix" described in that issue to get the cluster up and running again also worked, to remove/move the *.snap files.

Worth mentioning: This cluster had not had any writes to it for serveral weeks ahead of the reboot. The data in it is mostly read. Doing a proper write to the cluster will set the consistent_index of the v3 DB to an up-to-date value of the raft index.

After some investigation into why this index difference the between the snapshots and the v3 store occurred it was found that the health check executed regularly by Kubernetes was the reason for the version drift.

The health and readiness check regularly executes etcdctl endpoint health to determine if the cluster is healthy or not. In ETCD 3.4 this was a simple GET on the health key but since #12150 it also includes checking the alarm list to verify that it is empty. For some reason listing the alarms also triggers a write/apply (see attached logs). And for some reason this apply is only applied to the V2 store, not the V3 store. This cause the index in the V2 store to drift from the V3 store until a proper write is performed. I have not dug into the reason for why the write is performed and why it is missing from the V3 store.

The behaviour is only present in this form in 3.5 since the health check in 3.4 does not include listing the alarms.

The problem is easy to reproduce locally. See description.

What did you expect to happen?

I would always expect ETCD to be able to start properly regardless of how the shutdown was done.

How can we reproduce it (as minimally and precisely as possible)?

Locally:

# Use very low snapshot count to more quickly trigger the issue
$  ./etcd --snapshot-count=5 --log-level=debug

# In a different terminal, repeat six times to trigger creation of a snapshot file (you'll see a `triggering snapshot` log message in the server log when this happens)
$ ./etcdctl endpoint health
$ ./etcdctl endpoint health
$ ./etcdctl endpoint health
$ ./etcdctl endpoint health
$ ./etcdctl endpoint health
$ ./etcdctl endpoint health

# Kill etcd server process
$ kill -9 <etcd PID>

# Try to start ETCD again, this will fail with the attached log message
$ ./etcd --snapshot-count=5 --log-level=debug

Anything else we need to know?

No response

Etcd version (please run commands below)

$ etcd --version
etcd Version: 3.5.4
Git SHA: 08407ff76
Go Version: go1.16.15
Go OS/Arch: darwin/amd64

$ etcdctl version
etcdctl version: 3.5.4
API version: 3.5

Etcd configuration (command line flags or environment variables)

See steps to reproduce

Etcd debug information (please run commands blow, feel free to obfuscate the IP address or FQDN in the output)

See steps to reproduce

Relevant log output

# Log printed upon startup failure
{"level":"info","ts":"2022-07-23T18:21:35.360Z","caller":"etcdserver/backend.go:81","msg":"opened backend db","path":"/bitnami/etcd/data/member/snap/db","took":"174.548µs"}
{"level":"info","ts":"2022-07-23T18:21:36.259Z","caller":"etcdserver/server.go:508","msg":"recovered v2 store from snapshot","snapshot-index":1200012,"snapshot-size":"25 kB"}
{"level":"warn","ts":"2022-07-23T18:21:36.260Z","caller":"snap/db.go:88","msg":"failed to find [SNAPSHOT-INDEX].snap.db","snapshot-index":1200012,"snapshot-file-path":"/bitnami/etcd/data/member/snap/0000000000124f8c.snap.db","error":"snap: snapshot file doesn't exist"}
{"level":"panic","ts":"2022-07-23T18:21:36.261Z","caller":"etcdserver/server.go:515","msg":"failed to recover v3 backend from snapshot","error":"failed to find database snapshot file (snap: snapshot file doesn't exist)","stacktrace":"go.etcd.io/etcd/server/v3/etcdserver.NewServer\n\t/go/src/go.etcd.io/etcd/release/etcd/server/etcdserver/server.go:515\ngo.etcd.io/etcd/server/v3/embed.StartEtcd\n\t/go/src/go.etcd.io/etcd/release/etcd/server/embed/etcd.go:245\ngo.etcd.io/etcd/server/v3/etcdmain.startEtcd\n\t/go/src/go.etcd.io/etcd/release/etcd/server/etcdmain/etcd.go:228\ngo.etcd.io/etcd/server/v3/etcdmain.startEtcdOrProxyV2\n\t/go/src/go.etcd.io/etcd/release/etcd/server/etcdmain/etcd.go:123\ngo.etcd.io/etcd/server/v3/etcdmain.Main\n\t/go/src/go.etcd.io/etcd/release/etcd/server/etcdmain/main.go:40\nmain.main\n\t/go/src/go.etcd.io/etcd/release/etcd/server/main.go:32\nruntime.main\n\t/go/gos/go1.16.15/src/runtime/proc.go:225"}
panic: failed to recover v3 backend from snapshot
goroutine 1 [running]:
go.uber.org/zap/zapcore.(*CheckedEntry).Write(0xc000506300, 0xc000532f40, 0x1, 0x1)
/go/pkg/mod/go.uber.org/[email protected]/zapcore/entry.go:234 +0x58d
go.uber.org/zap.(*Logger).Panic(0xc000010fa0, 0x1234726, 0x2a, 0xc000532f40, 0x1, 0x1)
/go/pkg/mod/go.uber.org/[email protected]/logger.go:227 +0x85
go.etcd.io/etcd/server/v3/etcdserver.NewServer(0xc0000460ca, 0xd, 0x0, 0x0, 0x0, 0x0, 0xc0005150e0, 0x2, 0x2, 0xc00053c480, ...)
/go/src/go.etcd.io/etcd/release/etcd/server/etcdserver/server.go:515 +0x1656
go.etcd.io/etcd/server/v3/embed.StartEtcd(0xc000104000, 0xc000104600, 0x0, 0x0)
/go/src/go.etcd.io/etcd/release/etcd/server/embed/etcd.go:245 +0xef8
go.etcd.io/etcd/server/v3/etcdmain.startEtcd(0xc000104000, 0x12089be, 0x6, 0xc0004b7201, 0x2)
/go/src/go.etcd.io/etcd/release/etcd/server/etcdmain/etcd.go:228 +0x32
go.etcd.io/etcd/server/v3/etcdmain.startEtcdOrProxyV2(0xc00003a1f0, 0x1, 0x1)
/go/src/go.etcd.io/etcd/release/etcd/server/etcdmain/etcd.go:123 +0x257a
go.etcd.io/etcd/server/v3/etcdmain.Main(0xc00003a1f0, 0x1, 0x1)
/go/src/go.etcd.io/etcd/release/etcd/server/etcdmain/main.go:40 +0x13f


# Server debug logs when running `etcdctl endpoint health`
{"level":"debug","ts":"2022-08-25T15:21:32.425+0200","caller":"v3rpc/interceptor.go:182","msg":"request stats","start time":"2022-08-25T15:21:32.425+0200","time spent":"114.504µs","remote":"127.0.0.1:51728","response type":"/etcdserverpb.KV/Range","request count":0,"request size":8,"response count":0,"response size":28,"request content":"key:\"health\" "}
{"level":"debug","ts":"2022-08-25T15:21:32.426+0200","caller":"etcdserver/server.go:2122","msg":"Applying entries","num-entries":1}
{"level":"debug","ts":"2022-08-25T15:21:32.426+0200","caller":"etcdserver/server.go:2125","msg":"Applying entry","index":70,"term":5,"type":"EntryNormal"}
{"level":"debug","ts":"2022-08-25T15:21:32.426+0200","caller":"etcdserver/server.go:2184","msg":"apply entry normal","consistent-index":60,"entry-index":70,"should-applyV3":true}
{"level":"debug","ts":"2022-08-25T15:21:32.426+0200","caller":"etcdserver/server.go:2211","msg":"applyEntryNormal","raftReq":"header:<ID:7587864799024601626 > alarm:<> "}
{"level":"debug","ts":"2022-08-25T15:21:32.426+0200","caller":"v3rpc/interceptor.go:182","msg":"request stats","start time":"2022-08-25T15:21:32.426+0200","time spent":"241.967µs","remote":"127.0.0.1:51728","response type":"/etcdserverpb.Maintenance/Alarm","request count":-1,"request size":-1,"response count":-1,"response size":-1,"request content":""}
{"level":"info","ts":"2022-08-25T15:21:32.428+0200","caller":"zapgrpc/zapgrpc.go:174","msg":"[transport] transport: loopyWriter.run returning. connection error: desc = \"transport is closing\""}
@tobgu tobgu added the type/bug label Aug 25, 2022
@serathius
Copy link
Member

serathius commented Aug 25, 2022

Easily reproducible with script:

#!/bin/bash
./bin/etcd --snapshot-count=5 &
./bin/etcdctl endpoint health
./bin/etcdctl endpoint health
./bin/etcdctl endpoint health
./bin/etcdctl endpoint health
./bin/etcdctl endpoint health
./bin/etcdctl endpoint health
kill -9 %1
./bin/etcd --snapshot-count=5

@serathius
Copy link
Member

serathius commented Aug 25, 2022

Really great analysis and repro! @tobgu would you be willing to send PR with a fix?

@tobgu
Copy link
Author

tobgu commented Aug 25, 2022

I would have to spend time trying to understand the V2/V3 sync logic for that and why it is failing in this case. If nobody else knows of a quick solution/fix to this I can do that but I was hoping that someone well versed in this area would pick it up. ;-)

@serathius
Copy link
Member

Heh, only if I could say that I understand it better. From what I investigated problem is due to consistent index (CI) not being saved when applying Alarms entry. I managed to confirm that this behavior was introduced between v3.5.2 and v3.5.3. Meaning that this is a bug for in fix for data inconsistency issue :(.

As the result snapshot index is greater than CI, which during etcd bootstrap is interpreted that etcd crushed during process of receiving snapshot from leader. Etcd panics as it assumes that there should be a v3 snapshot received from leader.

@serathius
Copy link
Member

serathius commented Aug 25, 2022

Ok, looks like v3.5.3 assumes that performing applyV3 without an error means that consistency index was commited. However that's not true in case of Alarms, as it doesn't touch db, meaning it doesn't open any transactions, which would lead to executing backend hooks.

Backend hooks bad, again.

@serathius
Copy link
Member

Some more details, issue is triggered when etcd crashes after snapshot that was followed by only Alarms entries.
It causes a transitory bad state of db, would prevent etcd to recover after crash. It's transitory as db will be fixed after any WAL entry that is not Alarms.

@serathius
Copy link
Member

serathius commented Aug 25, 2022

Impact: Low as the issue should be extremely rare.

It requires:

  • Snapshot triggered on Alarm Request entry
  • Hard crash of etcd before any non-alarm entries are added

Still it brings up the issue of how incomprehensible etcd apply code is. I think this will be third times we try to fix just this part of the code in v3.5.X

@tobgu
Copy link
Author

tobgu commented Aug 25, 2022

I'm not sure I agree with the conclusion that this is an extremely rare case.

Given how the liveness and readiness probes are setup in the Bitnami Helm chart referred to above all that is needed for this to happen is to let time pass without any writes to the DB. Once enough time has passed for a snapshot to have been written (a couple of days) and the machine restarts, ETCD will not come up again. The restart could happen because of power outages, OS updates, fat fingers, what not...

My current workaround is to not use the default probes in the chart but rather hit the /health HTTP endpoint instead (which doesn't seem to suffer from the same problem).

@serathius
Copy link
Member

Given how the liveness and readiness probes are setup in the Bitnami Helm chart referred to above all that is needed for this to happen is to let time pass without any writes to the DB. Once enough time has passed for a snapshot to have been written (a couple of days) and the machine restarts, ETCD will not come up again. The restart could happen because of power outages, OS updates, fat fingers, what not...

Can you link to how Bitnami does healthcheck?

@serathius
Copy link
Member

cc @ahrtr @ptabor for their opinion

@tobgu
Copy link
Author

tobgu commented Aug 26, 2022

Given how the liveness and readiness probes are setup in the Bitnami Helm chart referred to above all that is needed for this to happen is to let time pass without any writes to the DB. Once enough time has passed for a snapshot to have been written (a couple of days) and the machine restarts, ETCD will not come up again. The restart could happen because of power outages, OS updates, fat fingers, what not...

Can you link to how Bitnami does healthcheck?

Sure!

This is how the default liveness- and readiness probes and are setup (the ones I've now replaced with http probes against /health): https://github.com/bitnami/charts/blob/master/bitnami/etcd/templates/statefulset.yaml#L265-L289

And this is the shell script that is called by the above probes: https://github.com/bitnami/bitnami-docker-etcd/blob/master/3.5/debian-11/rootfs/opt/bitnami/scripts/etcd/healthcheck.sh

@ahrtr
Copy link
Member

ahrtr commented Aug 29, 2022

Thanks @tobgu for raising this issue ( a real issue)!

It turns out to be a regression introduced in 3.5.4 in #13854 (#13908). The alarm list is the only exception that doesn't move consistent_index forward. The reproduction steps are as simple as,

etcd --snapshot-count=5 &
for i in {1..6}; do etcdctl  alarm list; done 
kill -9 <etcd_pid>
etcd

Short term solution

Solution 1

Lock batch_tx in (*AlarmStore) Get, so that it calls the txPostLockInsideApplyHook, and accordingly move consistent_index forward.

Just need to add code something like below into (*AlarmStore) Get. It's the simplest change, but it looks ugly, because it doesn't make sense for alarmList to acquire the batchTx lock at all.

	tx := s.be.BatchTx()
	tx.LockInsideApply()
	defer tx.Unlock()

Solution 2

Change server.go#L1853-L1855 to something like below. I don't know why I did not do this previously. Will think about this more and get back if I recall something new.

			newIndex := s.consistIndex.ConsistentIndex()
			if newIndex < e.Index {
				s.consistIndex.SetConsistentIndex(e.Index, e.Term)
			}

Solution 3

Change the alarmList to use linearizableReadLoop, so that it doesn't go through the raft & applying workflow at all. Accordingly it will not advance the applyIndex at all, and the snapshot Index will not be advanced.

Long term solution

Get rid of the OnPreCommitUnsafe added in 3.5.0 and the txPostLockInsideApplyHook/LockInsideApply/LockOutsideApply added in main & 3.5.3 & 3.5.4.

I will take care of the long-term solution.

For short-term, solution 2 above looks the best for now. Anyone feel free to deliver a PR.

@ahrtr
Copy link
Member

ahrtr commented Aug 31, 2022

Is anyone working on this issue?

@tobgu
Copy link
Author

tobgu commented Sep 1, 2022

I'm not working on this. If no one else picks it up I might be able to find some time for it in a couple of weeks. Right now my priorities do not allow it.
I'm happy with the current workaround (a change in the health check used) as a short term solution in terms of stability but would of course want this to be fixed as a proper long term solution in order to not have to worry about triggering the behaviour experienced in some other way.

ahrtr added a commit to ahrtr/etcd that referenced this issue Sep 6, 2022
Issues:
1. etcd-io#14402 fixed in 3.4 only;
2. etcd-io#14382 fixed in both 3.5 and main.

Signed-off-by: Benjamin Wang <[email protected]>
@ahrtr ahrtr changed the title ETCD fails to start after power off/on ETCD fails to start after performing alarm list operation and then power off/on Sep 6, 2022
@ahrtr
Copy link
Member

ahrtr commented Sep 22, 2022

In case anyone is interested, this is the workaround solution https://github.com/ahrtr/etcd-issues/blob/d134cb8d07425bf3bf530e6bb509c6e6bc6e7c67/etcd/etcd-db-editor/main.go#L16-L28

@wcmjunior
Copy link

In case anyone is interested, this is the workaround solution https://github.com/ahrtr/etcd-issues/blob/d134cb8d07425bf3bf530e6bb509c6e6bc6e7c67/etcd/etcd-db-editor/main.go#L16-L28

This solution fixed our problem in a production etcd configuration. Just commenting here so more people can follow. Thanks!

@kycfeel
Copy link

kycfeel commented Dec 4, 2022

In case anyone is interested, this is the workaround solution https://github.com/ahrtr/etcd-issues/blob/d134cb8d07425bf3bf530e6bb509c6e6bc6e7c67/etcd/etcd-db-editor/main.go#L16-L28

@ahrtr You are a life saver!!!!! This solved my issue like magic. Thank you so much.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Development

Successfully merging a pull request may close this issue.

5 participants