Etcd start failed after power off and restart #11949
My etcd version is 3.4.9.

# cd /var/lib/etcd/
# tree
.
└── member
├── snap
│ ├── 0000000000000003-0000000000033466.snap
│ ├── 0000000000000003-0000000000035b77.snap
│ ├── 0000000000000003-0000000000038288.snap
│ ├── 0000000000000003-000000000003a999.snap
│ ├── 0000000000000003-000000000003d0aa.snap
│ └── db
└── wal
├── 0000000000000000-0000000000000000.wal
├── 0000000000000001-0000000000014a00.wal
├── 0000000000000002-000000000002a59b.wal
└── 0.tmp

The etcd start log is:

{"level":"info","ts":"2020-06-26T13:11:02.536+0800","caller":"etcdmain/etcd.go:134","msg":"server has been already initialized","data-dir":"/var/lib/etcd","dir-type":"member"}
{"level":"info","ts":"2020-06-26T13:11:02.536+0800","caller":"embed/etcd.go:117","msg":"configuring peer listeners","listen-peer-urls":["https://192.168.145.10:2380"]}
{"level":"info","ts":"2020-06-26T13:11:02.536+0800","caller":"embed/etcd.go:465","msg":"starting with peer TLS","tls-info":"cert = /etc/kubernetes/pki/etcd/peer.crt, key = /etc/kubernetes/pki/etcd/peer.key, trusted-ca = /etc/kubernetes/pki/etcd/ca.crt, client-cert-auth = true, crl-file = ","cipher-suites":[]}
{"level":"info","ts":"2020-06-26T13:11:02.537+0800","caller":"embed/etcd.go:127","msg":"configuring client listeners","listen-client-urls":["https://127.0.0.1:2379","https://192.168.145.10:2379"]}
{"level":"info","ts":"2020-06-26T13:11:02.537+0800","caller":"embed/etcd.go:299","msg":"starting an etcd server","etcd-version":"3.4.9","git-sha":"54ba95891","go-version":"go1.12.17","go-os":"linux","go-arch":"amd64","max-cpu-set":2,"max-cpu-available":2,"member-initialized":true,"name":"192.168.145.10","data-dir":"/var/lib/etcd","wal-dir":"","wal-dir-dedicated":"","member-dir":"/var/lib/etcd/member","force-new-cluster":true,"heartbeat-interval":"100ms","election-timeout":"1s","initial-election-tick-advance":true,"snapshot-count":10000,"snapshot-catchup-entries":5000,"initial-advertise-peer-urls":["https://192.168.145.10:2380"],"listen-peer-urls":["https://192.168.145.10:2380"],"advertise-client-urls":["https://192.168.145.10:2379"],"listen-client-urls":["https://127.0.0.1:2379","https://192.168.145.10:2379"],"listen-metrics-urls":["http://127.0.0.1:2381"],"cors":["*"],"host-whitelist":["*"],"initial-cluster":"","initial-cluster-state":"new","initial-cluster-token":"","quota-size-bytes":2147483648,"pre-vote":false,"initial-corrupt-check":false,"corrupt-check-time-interval":"0s","auto-compaction-mode":"periodic","auto-compaction-retention":"0s","auto-compaction-interval":"0s","discovery-url":"","discovery-proxy":""}
{"level":"info","ts":"2020-06-26T13:11:02.538+0800","caller":"etcdserver/backend.go:79","msg":"opened backend db","path":"/var/lib/etcd/member/snap/db","took":"571.607µs"}
{"level":"info","ts":"2020-06-26T13:11:02.913+0800","caller":"etcdserver/server.go:451","msg":"recovered v2 store from snapshot","snapshot-index":250026,"snapshot-size":"7.5 kB"}
{"level":"warn","ts":"2020-06-26T13:11:02.915+0800","caller":"snap/db.go:92","msg":"failed to find [SNAPSHOT-INDEX].snap.db","snapshot-index":250026,"snapshot-file-path":"/var/lib/etcd/member/snap/000000000003d0aa.snap.db","error":"snap: snapshot file doesn't exist"}
{"level":"panic","ts":"2020-06-26T13:11:02.915+0800","caller":"etcdserver/server.go:462","msg":"failed to recover v3 backend from snapshot","error":"failed to find database snapshot file (snap: snapshot file doesn't exist)","stacktrace":"go.etcd.io/etcd/etcdserver.NewServer\n\t/tmp/etcd-release-3.4.9/etcd/release/etcd/etcdserver/server.go:462\ngo.etcd.io/etcd/embed.StartEtcd\n\t/tmp/etcd-release-3.4.9/etcd/release/etcd/embed/etcd.go:211\ngo.etcd.io/etcd/etcdmain.startEtcd\n\t/tmp/etcd-release-3.4.9/etcd/release/etcd/etcdmain/etcd.go:302\ngo.etcd.io/etcd/etcdmain.startEtcdOrProxyV2\n\t/tmp/etcd-release-3.4.9/etcd/release/etcd/etcdmain/etcd.go:144\ngo.etcd.io/etcd/etcdmain.Main\n\t/tmp/etcd-release-3.4.9/etcd/release/etcd/etcdmain/main.go:46\nmain.main\n\t/tmp/etcd-release-3.4.9/etcd/release/etcd/main.go:28\nruntime.main\n\t/usr/local/go/src/runtime/proc.go:200"}
panic: failed to recover v3 backend from snapshot
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x20 pc=0xc0494e]
goroutine 1 [running]:
go.etcd.io/etcd/etcdserver.NewServer.func1(0xc0003a0f50, 0xc00039ef48)
/tmp/etcd-release-3.4.9/etcd/release/etcd/etcdserver/server.go:334 +0x3e
panic(0xee36c0, 0xc000046200)
/usr/local/go/src/runtime/panic.go:522 +0x1b5
go.uber.org/zap/zapcore.(*CheckedEntry).Write(0xc000164420, 0xc0001527c0, 0x1, 0x1)
/home/ANT.AMAZON.COM/leegyuho/go/pkg/mod/go.uber.org/[email protected]/zapcore/entry.go:229 +0x546
go.uber.org/zap.(*Logger).Panic(0xc0002fc180, 0x10bd441, 0x2a, 0xc0001527c0, 0x1, 0x1)
/home/ANT.AMAZON.COM/leegyuho/go/pkg/mod/go.uber.org/[email protected]/logger.go:225 +0x7f
go.etcd.io/etcd/etcdserver.NewServer(0x7ffdcc63c732, 0xe, 0x0, 0x0, 0x0, 0x0, 0xc0001bd000, 0x1, 0x1, 0xc0001bd180, ...)
/tmp/etcd-release-3.4.9/etcd/release/etcd/etcdserver/server.go:462 +0x3d67
go.etcd.io/etcd/embed.StartEtcd(0xc00016c580, 0xc00016cb00, 0x0, 0x0)
/tmp/etcd-release-3.4.9/etcd/release/etcd/embed/etcd.go:211 +0x9d0
go.etcd.io/etcd/etcdmain.startEtcd(0xc00016c580, 0x1092f33, 0x6, 0xc0001bd801, 0x2)
/tmp/etcd-release-3.4.9/etcd/release/etcd/etcdmain/etcd.go:302 +0x40
go.etcd.io/etcd/etcdmain.startEtcdOrProxyV2()
/tmp/etcd-release-3.4.9/etcd/release/etcd/etcdmain/etcd.go:144 +0x2f71
go.etcd.io/etcd/etcdmain.Main()
/tmp/etcd-release-3.4.9/etcd/release/etcd/etcdmain/main.go:46 +0x38
main.main()
/tmp/etcd-release-3.4.9/etcd/release/etcd/main.go:28 +0x20

Why is the snapshot file named "0000000000000003-000000000003d0aa.snap", but recovery looks for "/var/lib/etcd/member/snap/000000000003d0aa.snap.db"?
I also encountered the same problem. I ran Kubernetes 1.18.3 in a virtual machine; after a few accidental power cuts and unexpected shutdowns, the "***.snap.db" file was lost, so my entire Kubernetes cluster could not start. I hope this can be solved as soon as possible, and that the Kubernetes team is informed as soon as possible.
Related kubernetes issue: kubernetes/kubernetes#88574.
@vekergu snap.db files are different from .snap files. When a member attempts a "snapshot recovery", the leader sends it a full copy of the database, which the member stores as a snap.db file. If a member is restarted during a "snapshot recovery", it notices that it needs that file and attempts to load it. I don't think etcd is confusing the files; I think the .snap.db is legitimately missing. We fixed one issue related to the order of operations on startup in 3.4.8 (xref: #11888), but this appears to be a separate problem (note the reports of it occurring on 3.4.2 and 3.4.9).
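For anyone confused by the two names in the logs above: a Raft snapshot file carries both term and index (<term>-<index>.snap), while the database copy received from the leader is named by index alone (<index>.snap.db). Below is a small sketch of that naming, using the term and index from the logs above; it mirrors etcd's snap-file naming convention but is not the actual etcd source.

```go
package main

import "fmt"

func main() {
	// Values taken from the logs above: Raft term 3, snapshot index 0x3d0aa (250026).
	term, index := uint64(3), uint64(0x3d0aa)

	// Regular Raft snapshot files are named <term>-<index>.snap ...
	snapName := fmt.Sprintf("%016x-%016x.snap", term, index)
	// ... while the full database copy sent by the leader is stored as <index>.snap.db.
	snapDBName := fmt.Sprintf("%016x.snap.db", index)

	fmt.Println(snapName)   // 0000000000000003-000000000003d0aa.snap
	fmt.Println(snapDBName) // 000000000003d0aa.snap.db
}
```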
In my case it happens on a single control-plane node.
Did the log output in your case explicitly say that the snap.db file is missing? I didn't see that.
Sorry, I forgot to share my logs, but I get exactly the same error:
This issue has been automatically marked as stale because it has not had recent activity. It will be closed after 21 days if no further activity occurs. Thank you for your contributions.
Same problem.
Same problem!!! Has anyone fixed this issue? There are many threads on this, but no real solutions.
The same error occurred in a one-node cluster. It seems that the invariant (snapshot.Metadata.Index <= db.consistentIndex) was violated after the etcd node powered off. It is so weird: etcd commits the metadata (consistent index) before creating a snapshot. I am trying to add some logs to reproduce it, but I have not been successful yet. The safety of fsync depends on whether the file system forces the disk cache to be flushed to the hardware. @mamiapatrick what file system are you using?
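One way to check whether that invariant really is violated on a broken member is to compare the index encoded in the newest *.snap filename with the consistent index stored inside member/snap/db. Below is a hedged sketch; it assumes etcd's internal bbolt layout (bucket "meta", key "consistent_index" as an 8-byte big-endian integer), which is an implementation detail and may change between versions. Only run it against a copy of the db file, with etcd stopped.

```go
package main

import (
	"encoding/binary"
	"fmt"
	"log"

	bolt "go.etcd.io/bbolt"
)

func main() {
	// Open a copy of the member's backend database read-only.
	db, err := bolt.Open("/var/lib/etcd/member/snap/db", 0400, &bolt.Options{ReadOnly: true})
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	err = db.View(func(tx *bolt.Tx) error {
		b := tx.Bucket([]byte("meta"))
		if b == nil {
			return fmt.Errorf("meta bucket not found")
		}
		v := b.Get([]byte("consistent_index"))
		if len(v) != 8 {
			return fmt.Errorf("unexpected consistent_index value: %x", v)
		}
		// Compare this with the index part of the newest *.snap filename
		// (e.g. 0000000000000003-000000000003d0aa.snap -> 0x3d0aa = 250026).
		fmt.Printf("db consistent index: %d\n", binary.BigEndian.Uint64(v))
		return nil
	})
	if err != nil {
		log.Fatal(err)
	}
}
```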
@tangcong thanks for your help. I have really struggled with this, and despite various research and several threads it seems the issue has not yet been resolved by anyone. For your question:
Hope to read you soon!
@mamiapatrick thanks. I am trying to inject pod-failure and container-kill errors into an etcd cluster, but it seems very hard to reproduce. Do you have any advice on how to reproduce it?
@tangcong
Two broken files remain in the wal and snap folders; removing them does nothing, as they are automatically skipped. Is it possible to restore or remove snap or WAL files to get back to a previous state? The issue seems to be that the index number references a particular file. The snap folder has 4 files and the wal folder has 5. Is the "56" in the file names the link between snap and WAL files? It appears a snap with the f067 reference is missing.

wal folder content:
drwx------ 4 root root 4.0K Jan 6 13:57 ..

snap folder contents:
-rw-r--r-- 1 root root 10K Jan 6 13:57 000000000000000a-000000000065c956.snap

Log from the etcd container:
This issue has been automatically marked as stale because it has not had recent activity. It will be closed after 21 days if no further activity occurs. Thank you for your contributions.
Bumping this issue, as it doesn't seem to be solved and no workaround is documented. Impacted by this too.
I feel a lot more comfortable about how consistent_index is updated now that #12855 has been submitted.
@ptabor is that feature likely to facilitate recovery of etcd data from previous versions of etcd? I guess that PR is available in v3.5.0-beta.3?
The PR is in v3.5.0-beta.3. There is no code that would 'ignore' the mismatch between the WAL and the db. For recovery from such a situation I would consider taking the 'db' file alone from the
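As a hedged sketch of that kind of recovery (not an official procedure): the member's backend file can sometimes be fed to etcdctl snapshot restore. --skip-hash-check is required because the file was not produced by etcdctl snapshot save, and anything written after the db's consistent index may be lost. The member name and URLs below are taken from the logs earlier in this thread and need to be adjusted.

```sh
# Work on a copy, with etcd stopped and the original data dir backed up.
cp /var/lib/etcd/member/snap/db /tmp/db-copy

ETCDCTL_API=3 etcdctl snapshot restore /tmp/db-copy \
  --skip-hash-check=true \
  --name 192.168.145.10 \
  --initial-cluster 192.168.145.10=https://192.168.145.10:2380 \
  --initial-advertise-peer-urls https://192.168.145.10:2380 \
  --data-dir /var/lib/etcd-restored

# Point etcd (or the static pod manifest) at /var/lib/etcd-restored and start it.
```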
Thanks for pointing me in the direction of the
Trying that I get

produces a 200 MB db file. Starting etcd works, but the db looks empty (testing with
I also tried with

Again a 200 MB db file.
Are you using the
I get
My mistake. --with-v3 is in:
How about:
@ptabor with
I get
Somehow I was able to recover. I commented here in case someone is interested.
I have the same issue. However, 2 of 3 nodes are showing the above message and the entire cluster is down. (One node doesn't show the etcd message below, so I'm thinking that service is recoverable?) Services needed for cert renewal on the Kubernetes cluster cannot start because etcd has corrupt DB files. I do not have an explicitly taken etcd backup. The messages are the same:
However, I'm wondering if it's possible to take the
I didn't get clarity on how this can be done from the commands provided thus far. Most how-tos assume an explicit backup was taken using
No real solutions! I also encountered the same problem. My etcd version is 3.4.13-0.
In the case of OpenShift, the restore procedure worked every time for me. Just don't forget to take backups every day and regularly verify them. It doesn't solve this problem, but backups of such critical components are essential.
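For a kubeadm-style cluster, a periodic backup can be as simple as the sketch below; the certificate paths are the usual kubeadm locations and are an assumption, so adjust them for your setup.

```sh
ETCDCTL_API=3 etcdctl snapshot save /backup/etcd-$(date +%F).db \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/healthcheck-client.crt \
  --key=/etc/kubernetes/pki/etcd/healthcheck-client.key

# Verify the snapshot is readable (prints hash, revision, total keys, size).
ETCDCTL_API=3 etcdctl snapshot status /backup/etcd-$(date +%F).db -w table
```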
I built a k8s cluster with kubeadm containing one control-plane node and one worker node. The cluster runs a single etcd 3.5.1 server on the control-plane node. Unfortunately, I have also hit the etcd error "failed to recover v3 backend from snapshot" several times. I studied the etcd source files and found the condition that triggers this kind of error (etcd/server/storage/backend.go, lines 99 to 109 at 986a2b5).
I.e. the condition snapshot.Metadata.Index <= db.consistentIndex is violated. It means the consistent index in the snap/db file is older than the one in one of the snap/*.snap files. But in general this should be impossible(?), because the following code ensures(?) that the consistent index is written to the snap/db file before a .snap file is created (etcd/server/etcdserver/server.go, line 2019 at 986a2b5).
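A simplified sketch of the startup decision described above (paraphrased, not the actual etcd source): if the Raft snapshot index is not newer than the db's consistent index, the existing db is kept; otherwise etcd expects a <snapshot-index>.snap.db file received from the leader, and fails when it is missing.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// recoverBackendPath sketches the decision etcd makes when recovering the v3
// backend from a Raft snapshot on startup.
func recoverBackendPath(snapDir string, snapshotIndex, consistentIndex uint64) (string, error) {
	if snapshotIndex <= consistentIndex {
		// The existing snap/db already covers the snapshot: keep using it.
		return filepath.Join(snapDir, "db"), nil
	}
	// Otherwise a full database copy from the leader is expected on disk.
	p := filepath.Join(snapDir, fmt.Sprintf("%016x.snap.db", snapshotIndex))
	if _, err := os.Stat(p); err != nil {
		return "", fmt.Errorf("failed to find database snapshot file: %w", err)
	}
	return p, nil
}

func main() {
	// With the values from the logs above (snapshot index 250026, stale db),
	// this reproduces the "snapshot file doesn't exist" situation.
	if _, err := recoverBackendPath("/var/lib/etcd/member/snap", 250026, 200000); err != nil {
		fmt.Println(err)
	}
}
```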
Thus, the steps to reproduce a violation of the condition snapshot.Metadata.Index <= db.consistentIndex are unknown and difficult to find...
Nevertheless, I propose a workaround to get etcd back to work:
or
Then your etcd might start without crashing.
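As a rough sketch of the variant that later comments report using (moving *.snap and *.wal aside), assuming a kubeadm static-pod etcd with its data in /var/lib/etcd. Note that, as mentioned below, all etcd data is lost this way.

```sh
# Stop the kubelet so the etcd static pod is not restarted while we work.
systemctl stop kubelet

# Move the Raft snapshot and WAL files aside rather than deleting them outright.
mkdir -p /root/etcd-salvage
mv /var/lib/etcd/member/snap/*.snap /root/etcd-salvage/
mv /var/lib/etcd/member/wal/*.wal   /root/etcd-salvage/

systemctl start kubelet
```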
Removing *.wal and *.snap worked for me, thanks!
It's working, and my location is
None of the above solutions worked for me; here is my solution: https://etcd.io/docs/v3.5/tutorials/how-to-deal-with-membership/
Using a workaround like removing *.snap and *.wal worked for me for some time. Recently it doesn't work anymore, and it gets stuck connecting to the server, along with a maintenance alarm issue.
Removing *.snap and *.wal can get the etcd pods to start successfully. However, you will lose all data that way. I accidentally found another way to fix this: copy db to the most recent snap.db, then restart kubelet and the container runtime (containerd, CRI-O, or Docker).
So the most recent snap here is 000000000000000a-000000000418fe9e.snap.
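A sketch of that copy for the example above; the target name is the index half of the newest .snap filename plus the .snap.db suffix, and the paths and service names should be adjusted to your setup.

```sh
# Newest snapshot: 000000000000000a-000000000418fe9e.snap
cp /var/lib/etcd/member/snap/db \
   /var/lib/etcd/member/snap/000000000418fe9e.snap.db

# Restart the runtime and kubelet so the etcd static pod is recreated.
systemctl restart containerd
systemctl restart kubelet
```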
The etcd pod and the Kubernetes control-plane pods should then start normally.
In my case, I have an etcd cluster for Kubernetes which has 3 members. All machines were powered off yesterday; however, all of the etcd members failed to start after power was restored.
My etcd version is 3.4.2. The etcd member01's error is "failed to find database snapshot file (snap: snapshot file doesn't exist)", and the errors of member02 and member03 are the same: "freepages: failed to get all reachable pages". The full logs are as follows:
member01:
member02:
Finally, we recovered most of the data from our backup file. However, we would like to know why this happened. Is there any way to avoid this issue? Thanks!!!