After deleting 3 nodes, restarting another node causes a panic #13466
Comments
In func (c *RaftCluster) RemoveMember, if the node being removed has already been removed, log a warning instead of treating it as an error.
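A minimal, self-contained sketch of that idea, using a toy struct rather than etcd's actual RaftCluster type; the field and message names are illustrative only, not the merged diff:

```go
package main

import "log"

// Toy model of the membership bookkeeping, not etcd's actual types: the point
// is that removing an id twice should downgrade to a warning, not an error.
type cluster struct {
	members map[uint64]string // id -> peer URL
	removed map[uint64]bool   // ids that have already been removed
}

func (c *cluster) removeMember(id uint64) {
	_, ok := c.members[id]
	delete(c.members, id)
	c.removed[id] = true

	if ok {
		log.Printf("removed member %x", id)
		return
	}
	// Replayed removal of an already-removed member: warn and continue.
	log.Printf("warn: skipped removing already removed member %x", id)
}

func main() {
	c := &cluster{
		members: map[uint64]string{0xabc: "http://10.0.0.1:2380"},
		removed: map[uint64]bool{},
	}
	c.removeMember(0xabc) // normal removal
	c.removeMember(0xabc) // replayed conf change: only a warning now
}
```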
I did not reproduce this issue with etcd 3.5.0. What is the etcd version in your environment? Please provide the detailed steps/commands you used to produce the issue.
I use the etcd 3.5 version. The detailed process is a bit complicated and involves a PR that we modified; I will try to describe it clearly. At that point, 2 new nodes are added in data center B so that the new leader can form a new cluster. When I retry the entire process step by step: before deleting the 3 nodes in data center B, there is no snapshot on the 3 nodes in data center A, so the WAL in data center A contains all of the conf changes. When a node in data center A is restarted at that point, all WAL entries are replayed and this error is not reported.
This step is indeed a bit complicated. The question to me is why the 3 nodes in data center B are missing from peers; it stands to reason that they should still be there before the 3 nodes in data center B are deleted.
Did you see this issue with the official etcd 3.5.0 release, or did you build the etcd binary with your PR on release-3.5? I would suggest you provide the detailed steps & commands on how you produced this issue, to avoid any misunderstanding.
@ahrtr
The overall logic of the error is basically clear now. The key is that a snapshot is taken before the nodes are deleted.
During replay, removeMember() finds that the node has already been removed and records a warning log, but removePeer() finds that the node is unknown and panics, which causes the restarting node to panic.
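A small, self-contained model of that transport-side path (not etcd's actual rafthttp code; ids and peer sets are made up) showing why replaying the remove entry for a peer the restarted node never re-added ends in a panic:

```go
package main

import "fmt"

// Toy model of the transport's peer map: replaying a ConfChangeRemoveNode
// for a peer that is not in the map hits the "unknown remote peer" branch.
type transport struct {
	peers map[uint64]struct{}
}

func (t *transport) removePeer(id uint64) {
	if _, ok := t.peers[id]; !ok {
		panic(fmt.Sprintf("unexpected removal of unknown remote peer %x", id))
	}
	delete(t.peers, id)
}

func main() {
	// After restart the transport only knows the members recovered from the
	// db file, which no longer contains the removed node.
	t := &transport{peers: map[uint64]struct{}{0x1: {}, 0x2: {}}}

	// Replaying the old ConfChangeRemoveNode entry from the WAL then panics.
	t.removePeer(0x3)
}
```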
Yes, I reproduced this issue, and it is indeed a real issue! The root cause is that the data is recovered from the db file, because the snapshot.Metadata.Index is less than the consistentIndex (see backend.go#L104), while the WAL files are replayed against the latest snapshot file. The default value for --snapshot-count is 100000, so in the worst case etcd may replay 99999 unnecessary entries on startup. It seems there is a performance improvement we can make. cc @ptabor @hexfusion @serathius
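A self-contained illustration of that recovery condition with hypothetical index values (not etcd's actual code):

```go
package main

import "fmt"

// Toy illustration of the recovery decision described above: the db file is
// kept when it is already newer than the latest snapshot, while the WAL is
// still replayed from that older snapshot.
func main() {
	const (
		snapshotIndex   = 100000 // index recorded in the latest snapshot file
		consistentIndex = 199999 // index already persisted in the db file
	)

	// Mirrors the comparison referenced above (backend.go#L104): the db is
	// newer than the snapshot, so it is reused as-is.
	reuseDB := snapshotIndex <= consistentIndex
	fmt.Println("reuse db file:", reuseDB)

	// The WAL is replayed starting after snapshotIndex, so with the default
	// --snapshot-count of 100000 the worst case is ~99999 redundant entries.
	fmt.Println("entries replayed again:", consistentIndex-snapshotIndex)
}
```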
@yangxuanjia That isn't true. The backend (the parameter oldbe of function recoverSnapshotBackend) contains all the members, but the removed member is recorded in the members_removed bucket.
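For anyone who wants to check this locally, a hedged sketch that lists the members and members_removed buckets from a copy of a member's db file using bbolt (the path is a placeholder, and the file should be a copy so it is not locked by a running etcd):

```go
package main

import (
	"fmt"
	"log"

	bolt "go.etcd.io/bbolt"
)

// Inspect the membership buckets in a copy of the member's db file.
func main() {
	db, err := bolt.Open("member/snap/db", 0o400, &bolt.Options{ReadOnly: true})
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	err = db.View(func(tx *bolt.Tx) error {
		for _, name := range []string{"members", "members_removed"} {
			b := tx.Bucket([]byte(name))
			if b == nil {
				continue
			}
			fmt.Println("bucket:", name)
			// Keys are member IDs; values are JSON-encoded member records.
			if err := b.ForEach(func(k, v []byte) error {
				fmt.Printf("  %s -> %s\n", k, v)
				return nil
			}); err != nil {
				return err
			}
		}
		return nil
	})
	if err != nil {
		log.Fatal(err)
	}
}
```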
Actually this issue can't be reproduced on the latest code in release-3.5, because it loads the member info from the v2store first, see cluster.go#L259-L265. So RaftCluster.members has 3 members, including the removed member. But this issue can easily be reproduced on 3.5.0 and the latest code in the main branch, because both of them load the member info from the db file first, see 3.5.0/cluster.go and main/cluster.go, so RaftCluster.members has only 2 members, not including the removed member. Basically the change in the PR is OK, but please re-organize the warning message per the comment.
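A toy comparison of the two recovery orders described above (hypothetical member sets, not real data); the panic only needs the recovered member set to be missing the removed node when the old ConfChangeRemoveNode entry is replayed:

```go
package main

import "fmt"

// Compare replaying a remove conf change against the two recovered member sets.
func main() {
	fromV2Store := []string{"memberA", "memberB", "removedC"} // v2store still lists the removed member
	fromDBFile := []string{"memberA", "memberB"}              // db-first recovery has already dropped it

	replayRemove := func(members []string, target string) {
		for _, m := range members {
			if m == target {
				fmt.Println("removePeer: peer known, removal ok ->", target)
				return
			}
		}
		fmt.Println("removePeer: unknown peer, would panic ->", target)
	}

	replayRemove(fromV2Store, "removedC") // latest release-3.5: no panic
	replayRemove(fromDBFile, "removedC")  // 3.5.0 / main at the time: panic
}
```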
FYI, I summarized this issue on the following page: https://github.com/ahrtr/etcd-issues/tree/master/issues/13466
@serathius has been working on a similar and bigger issue. The fix has already been included in 3.5.1, which is why this issue could not be reproduced on 3.5.1. @serathius, are you working on the fix for 3.6? Will the v2store still be deprecated in 3.6?
Please check StoreV2 deprecation plan in #12913 |
This issue has been automatically marked as stale because it has not had recent activity. It will be closed after 21 days if no further activity occurs. Thank you for your contributions. |
The PR has already been merged, so closing this issue. |
3 nodes, add another 3 nodes, delete those 3 nodes, then restart one of the remaining nodes: it panics on startup.
But if I take a snapshot first, the restart is OK.
I know that when a node starts, it recovers the snapshot and then replays the WAL.
The WAL contains the 3 confChangeRemoveNode conf changes, so startup will apply them again.
But if a snapshot is taken before restarting the node, it will not re-apply the remove conf changes, so it starts fine.
Still, I think there is a bug.
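A hedged sketch of the membership half of those steps, assuming an already-running 3-node cluster reachable at the endpoint below and that the 3 extra members are started/stopped out of band; the endpoints and peer URLs are placeholders, not taken from the original report:

```go
package main

import (
	"context"
	"fmt"
	"log"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
)

func main() {
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"http://10.0.0.1:2379"},
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		log.Fatal(err)
	}
	defer cli.Close()

	ctx := context.Background()

	// Step 1: add 3 new members (each must then be started as a process).
	added := make([]uint64, 0, 3)
	for _, peerURL := range []string{
		"http://10.0.1.1:2380", "http://10.0.1.2:2380", "http://10.0.1.3:2380",
	} {
		resp, err := cli.MemberAdd(ctx, []string{peerURL})
		if err != nil {
			log.Fatal(err)
		}
		added = append(added, resp.Member.ID)
	}

	// Step 2: remove the same 3 members again, without a snapshot in between.
	for _, id := range added {
		if _, err := cli.MemberRemove(ctx, id); err != nil {
			log.Fatal(err)
		}
		fmt.Printf("removed member %x\n", id)
	}

	// Step 3 (out of band): restart one of the original members; on startup it
	// replays the 3 ConfChangeRemoveNode entries from its WAL, producing the
	// panic shown in the logs below.
}
```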
voters=(3660590167739457287 4986318435359829113 6910373247063932141 8510221444241105505 8763831318964108269 10050466727402696366)"}
voters=(3660590167739457287 4986318435359829113 6910373247063932141 8510221444241105505 10050466727402696366)"}
{"level":"warn","ts":"2021-11-05T15:45:59.551+0800","caller":"membership/cluster.go:427","msg":"skipped removing already removed member","cluster-id":"1109e69692ba9883","local-member-id":"761a61507fe72261","removed-remote-peer-id":"799f6226de4c7bed"}
{"level":"panic","ts":"2021-11-05T15:45:59.551+0800","caller":"rafthttp/transport.go:346","msg":"unexpected removal of unknown remote peer","remote-peer-id":"799f6226de4c7bed","stacktrace":"go.etcd.io/etcd/server/v3/etcdserver/api/rafthttp.(*Transport).removePeer\n\t/export/working/src/github.com/go.etcd.io/etcd/server/etcdserver/api/rafthttp/transport.go:346\ngo.etcd.io/etcd/server/v3/etcdserver/api/rafthttp.(*Transport).RemovePeer\n\t/export/working/src/github.com/go.etcd.io/etcd/server/etcdserver/api/rafthttp/transport.go:329\ngo.etcd.io/etcd/server/v3/etcdserver.(*EtcdServer).applyConfChange\n\t/export/working/src/github.com/go.etcd.io/etcd/server/etcdserver/server.go:2301\ngo.etcd.io/etcd/server/v3/etcdserver.(*EtcdServer).apply\n\t/export/working/src/github.com/go.etcd.io/etcd/server/etcdserver/server.go:2133\ngo.etcd.io/etcd/server/v3/etcdserver.(*EtcdServer).applyEntries\n\t/export/working/src/github.com/go.etcd.io/etcd/server/etcdserver/server.go:1357\ngo.etcd.io/etcd/server/v3/etcdserver.(*EtcdServer).applyAll\n\t/export/working/src/github.com/go.etcd.io/etcd/server/etcdserver/server.go:1179\ngo.etcd.io/etcd/server/v3/etcdserver.(*EtcdServer).run.func8\n\t/export/working/src/github.com/go.etcd.io/etcd/server/etcdserver/server.go:1111\ngo.etcd.io/etcd/pkg/v3/schedule.(*fifo).run\n\t/export/working/src/github.com/go.etcd.io/etcd/pkg/schedule/schedule.go:157"}