
After deleting 3 nodes, restarting another node causes a panic #13466

Closed
yangxuanjia opened this issue Nov 9, 2021 · 16 comments
yangxuanjia (Contributor) commented Nov 9, 2021

Starting from a 3-node cluster, I added 3 more nodes, then deleted those 3 nodes, and restarted one of the remaining nodes; it panics on startup.
But if I take a snapshot first, the restart is OK.

I know that when a node starts, it recovers the snapshot and then replays the WAL.
The WAL contains the 3 ConfChangeRemoveNode conf changes, so startup applies them again.
But if I take a snapshot and then restart the node, the remove conf changes are not replayed, so it starts fine.

But I think there is a bug.

Snapshot:
term=7 index=70007 nodes=[32cd075ce865df07 4532f632c49aac79 5fe693db009bbced 761a61507fe72261 799f6226de4c7bed 8b7a6e3d29b86eae] confstate={"voters":[3660590167739457287,4986318435359829113,6910373247063932141,8510221444241105505,8763831318964108269,10050466727402696366],"auto_leave":false}
Start dumping log entries from snapshot.
WAL metadata:
nodeID=761a61507fe72261 clusterID=1109e69692ba9883 term=9 commitIndex=78639 vote=5fe693db009bbced
WAL entries:
lastIndex=78639
term         index      type    data
   8         77940      conf    method=ConfChangeRemoveNode id=799f6226de4c7bed
   8         77941      conf    method=ConfChangeRemoveNode id=8b7a6e3d29b86eae
   8         77945      conf    method=ConfChangeRemoveNode id=4532f632c49aac79

Entry types (ConfigChange) count is : 3
"msg":"newRaft 761a61507fe72261 [peers: [32cd075ce865df07,4532f632c49aac79,5fe693db009bbced,761a61507fe72261,799f6226de4c7bed,8b7a6e3d29b86eae]

voters=(3660590167739457287 4986318435359829113 6910373247063932141 8510221444241105505 8763831318964108269 10050466727402696366)"}
voters=(3660590167739457287 4986318435359829113 6910373247063932141 8510221444241105505 10050466727402696366)"}

voters=(3660590167739457287 4986318435359829113 6910373247063932141 8510221444241105505 8763831318964108269 10050466727402696366)"}
voters=(3660590167739457287 4986318435359829113 6910373247063932141 8510221444241105505 10050466727402696366)"}



{"level":"warn","ts":"2021-11-05T15:45:59.551+0800","caller":"membership/cluster.go:427","msg":"skipped removing already removed member","cluster-id":"1109e69692ba9883","local-member-id":"761a61507fe72261","removed-remote-peer-id":"799f6226de4c7bed"}
{"level":"panic","ts":"2021-11-05T15:45:59.551+0800","caller":"rafthttp/transport.go:346","msg":"unexpected removal of unknown remote peer","remote-peer-id":"799f6226de4c7bed","stacktrace":"go.etcd.io/etcd/server/v3/etcdserver/api/rafthttp.(*Transport).removePeer\n\t/export/working/src/github.com/go.etcd.io/etcd/server/etcdserver/api/rafthttp/transport.go:346\ngo.etcd.io/etcd/server/v3/etcdserver/api/rafthttp.(*Transport).RemovePeer\n\t/export/working/src/github.com/go.etcd.io/etcd/server/etcdserver/api/rafthttp/transport.go:329\ngo.etcd.io/etcd/server/v3/etcdserver.(*EtcdServer).applyConfChange\n\t/export/working/src/github.com/go.etcd.io/etcd/server/etcdserver/server.go:2301\ngo.etcd.io/etcd/server/v3/etcdserver.(*EtcdServer).apply\n\t/export/working/src/github.com/go.etcd.io/etcd/server/etcdserver/server.go:2133\ngo.etcd.io/etcd/server/v3/etcdserver.(*EtcdServer).applyEntries\n\t/export/working/src/github.com/go.etcd.io/etcd/server/etcdserver/server.go:1357\ngo.etcd.io/etcd/server/v3/etcdserver.(*EtcdServer).applyAll\n\t/export/working/src/github.com/go.etcd.io/etcd/server/etcdserver/server.go:1179\ngo.etcd.io/etcd/server/v3/etcdserver.(*EtcdServer).run.func8\n\t/export/working/src/github.com/go.etcd.io/etcd/server/etcdserver/server.go:1111\ngo.etcd.io/etcd/pkg/v3/schedule.(*fifo).run\n\t/export/working/src/github.com/go.etcd.io/etcd/pkg/schedule/schedule.go:157"}

yangxuanjia (Contributor, Author) commented:
In func (c *RaftCluster) RemoveMember, removing an already-removed node only logs a warning.
But in func (t *Transport) removePeer(id types.ID), removing an already-removed node panics.
The same situation is handled in two different ways; is that the intended logic?

func (c *RaftCluster) RemoveMember(id types.ID, shouldApplyV3 ShouldApplyV3) {
	c.Lock()
	defer c.Unlock()
	if c.v2store != nil {
		mustDeleteMemberFromStore(c.lg, c.v2store, id)
	}
	if c.be != nil && shouldApplyV3 {
		mustDeleteMemberFromBackend(c.be, id)
	}
	c.lg.Error("11111111111111 c.members")
	for _, m := range c.members {
		c.lg.Error("id:", zap.String("ID", m.ID.String()))
	}
	c.lg.Error("222222222222222 c.members")

	m, ok := c.members[id]
	delete(c.members, id)
	c.removed[id] = true

	if ok {
		c.lg.Info(
			"removed member",
			zap.String("cluster-id", c.cid.String()),
			zap.String("local-member-id", c.localID.String()),
			zap.String("removed-remote-peer-id", id.String()),
			zap.Strings("removed-remote-peer-urls", m.PeerURLs),
		)
	} else {
		c.lg.Warn(
			"skipped removing already removed member",
			zap.String("cluster-id", c.cid.String()),
			zap.String("local-member-id", c.localID.String()),
			zap.String("removed-remote-peer-id", id.String()),
		)
	}
}

func (t *Transport) removePeer(id types.ID) {
	// Debug logging added by the reporter to dump the transport's current peer list.
	t.Logger.Error("333333333333333333 t.peers")
	for id := range t.peers {
		t.Logger.Error("id:", zap.String("ID", id.String()))
	}
	t.Logger.Error("44444444444444444 t.peers")

	if peer, ok := t.peers[id]; ok {
		peer.stop()
	} else {
		if t.Logger != nil {
			t.Logger.Panic("unexpected removal of unknown remote peer", zap.String("remote-peer-id", id.String()))
		}
	}
	delete(t.peers, id)
	delete(t.LeaderStats.Followers, id.String())
	t.pipelineProber.Remove(id.String())
	t.streamProber.Remove(id.String())

	if t.Logger != nil {
		t.Logger.Info(
			"removed remote peer",
			zap.String("local-member-id", t.ID.String()),
			zap.String("removed-remote-peer-id", id.String()),
		)
	}
}
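
For comparison, here is a minimal sketch (illustration only, not the actual patch in any linked PR) of a removePeer that tolerates an already-removed peer the same way RaftCluster.RemoveMember does, using only the Transport fields visible in the snippet above:

// Sketch only: warn and continue instead of panicking when the peer is already gone,
// mirroring the "skipped removing already removed member" behaviour of RemoveMember.
func (t *Transport) removePeer(id types.ID) {
	if peer, ok := t.peers[id]; ok {
		peer.stop()
	} else if t.Logger != nil {
		t.Logger.Warn("skipped removal of unknown remote peer", zap.String("remote-peer-id", id.String()))
	}
	delete(t.peers, id)
	delete(t.LeaderStats.Followers, id.String())
	t.pipelineProber.Remove(id.String())
	t.streamProber.Remove(id.String())

	if t.Logger != nil {
		t.Logger.Info(
			"removed remote peer",
			zap.String("local-member-id", t.ID.String()),
			zap.String("removed-remote-peer-id", id.String()),
		)
	}
}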

ahrtr (Member) commented Nov 10, 2021

I did not reproduce this issue with etcd 3.5.0. What's the etcd version in your environment? Please also provide the detailed steps/commands for how you reproduced the issue.

yangxuanjia (Contributor, Author) commented Nov 10, 2021

I use etcd 3.5. The detailed process is a bit complicated and involves a PR we modified; I will try to describe it clearly.
First, create a three-node cluster in data center A, then add a learner in data center B. The learner is then forcibly promoted to leader. This step involves PR #13213, and its purpose is cross-data-center disaster recovery.

Then add 2 new nodes in data center B so that the new leader forms a new cluster.
When data center A becomes available again, its previous 3 nodes are added to this new cluster.
Then transfer the leader role to a node in data center A, and finally delete the 3 nodes in data center B.
If the 3 nodes in data center A took a snapshot before the 3 nodes in data center B were deleted, then after the deletion the WALs of the 3 data-center-A nodes contain 3 removeMember raft entries.
At this point, restarting a node in data center A reports this error.
I think the cause is that the data-center-B peers are missing from the transport's peer list, so when the removeMember raft entries are replayed from the WAL on restart, removePeer cannot find them and panics.

When I retried the entire process step by step without a snapshot on the 3 data-center-A nodes before deleting the 3 data-center-B nodes, the data-center-A WALs contained all the conf-change entries; restarting a data-center-A node then replays the whole WAL and the error does not occur.

yangxuanjia (Contributor, Author) commented Nov 10, 2021

The steps are indeed a bit complicated. My question is why the 3 data-center-B nodes are missing from the transport's peers. Since the data-center-A node took its snapshot before the 3 data-center-B nodes were deleted, the snapshot should still contain the information for those 3 nodes.
Deleting the 3 data-center-B nodes and then restarting a data-center-A node should replay the 3 removeMember entries normally.
Why is there no information about these 3 nodes in peers, leading to the panic?

ahrtr (Member) commented Nov 14, 2021

Did you see this issue with the official etcd 3.5.0 release, or did you build the etcd binary with your PR on top of release-3.5?

I would suggest you provide the detailed steps & commands for how you reproduced this issue, so as to avoid any misunderstanding.

yangxuanjia (Contributor, Author) commented Nov 15, 2021

@ahrtr
I can reproduce this problem reliably with etcd 3.5. The process is divided into the following steps (a client sketch for step 4 follows the list):

  1. Establish a 3-node cluster: A, B, C
  2. Write a few keys
  3. Trigger an automatic snapshot on each node
     (to force a snapshot quickly, use the code in Do snapshot now #13474)
  4. removeMember node C
  5. Restart node A -----> panic
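
As a concrete illustration of step 4, here is a minimal sketch using the etcd v3 Go client; the endpoint and member ID below are placeholders, not values taken from this report:

package main

import (
	"context"
	"log"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
)

func main() {
	// Placeholder endpoint; point this at any member of the test cluster.
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"http://127.0.0.1:2379"},
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		log.Fatal(err)
	}
	defer cli.Close()

	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()

	// Placeholder member ID for node C; look it up with cli.MemberList first.
	var memberID uint64 = 0xfd422379fda50e48
	if _, err := cli.MemberRemove(ctx, memberID); err != nil {
		log.Fatal(err)
	}
}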

yangxuanjia (Contributor, Author) commented Nov 15, 2021

func recoverSnapshotBackend(cfg config.ServerConfig, oldbe backend.Backend, snapshot raftpb.Snapshot, beExist bool, hooks backend.Hooks) (backend.Backend, error) {
	consistentIndex := uint64(0)
	if beExist {
		consistentIndex, _ = cindex.ReadConsistentIndex(oldbe.BatchTx())
	}

	// In this scenario the code enters this branch and returns oldbe!
	if snapshot.Metadata.Index <= consistentIndex {
		return oldbe, nil
	}
	oldbe.Close()
	return openSnapshotBackend(cfg, snap.New(cfg.Logger, cfg.SnapDir()), snapshot, hooks)
}

The overall logic of the error is now basically clear. The key is that a snapshot is taken before the node is deleted.
The biggest problem is in the code above. If the index of the snapshot is less than or equal to the consistentIndex, startup keeps the old backend. The old backend has the latest member information (including the deleted node information), but WAL replay still starts from the snapshot index, so the removeMember raft conf changes are replayed.
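
A compact sketch of that mismatch, assuming (for illustration only) that the backend's consistent index is close to the WAL commit index from the dump above; the values are not read from a real db file:

package main

import "fmt"

func main() {
	snapshotIndex := uint64(70007)   // index of the latest on-disk snapshot
	consistentIndex := uint64(78639) // assumed consistent index stored in the reused backend

	// Mirrors the branch in recoverSnapshotBackend above: the old backend is kept...
	if snapshotIndex <= consistentIndex {
		fmt.Println("reuse old backend: it already reflects the membership removals")
	}
	// ...but WAL replay still starts right after the snapshot index, so the
	// ConfChangeRemoveNode entries between the two indexes are applied a second time.
	fmt.Printf("replay restarts at index %d and re-applies entries up to %d\n",
		snapshotIndex+1, consistentIndex)
}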

yangxuanjia (Contributor, Author) commented:
During replay, RemoveMember() finds that the node has already been removed and records a warning, while removePeer() finds that the peer is unknown and panics, which makes the restarting node crash.

ahrtr (Member) commented Nov 15, 2021

Yes, I reproduced this issue, and it is indeed a real issue!

The root cause of this issue is that the data is recovered from the db file because snapshot.Metadata.Index is less than the consistentIndex (see backend.go#L104), while the WAL files are replayed against the latest snapshot file.

The default value for --snapshot-count is 100000, so in the worst case etcd may unnecessarily replay 99999 entries on startup. It seems there is a performance improvement we can make here.

cc @ptabor @hexfusion @serathius

The old backend has the latest member information (including the deleted node information)

@yangxuanjia That isn't accurate. The backend (the oldbe parameter of the recoverSnapshotBackend function) contains all the members, but the removed member is kept in the members_removed bucket.

$ go run cmd/bbolt/main.go keys ~/tmp/etcd/infra1.etcd/member/snap/db members
8211f1d0f64f3269
91bc3c398fb3c146

$ go run cmd/bbolt/main.go keys ~/tmp/etcd/infra1.etcd/member/snap/db members_removed
fd422379fda50e48
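
For anyone who prefers inspecting the db file from Go instead of the bbolt CLI, here is a small read-only sketch; the db path is a placeholder and the bucket names are the ones shown above:

package main

import (
	"fmt"
	"log"

	bolt "go.etcd.io/bbolt"
)

func main() {
	// Placeholder path; point this at a copy of member/snap/db while etcd is stopped.
	db, err := bolt.Open("/tmp/etcd/infra1.etcd/member/snap/db", 0400, &bolt.Options{ReadOnly: true})
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	err = db.View(func(tx *bolt.Tx) error {
		for _, name := range []string{"members", "members_removed"} {
			b := tx.Bucket([]byte(name))
			if b == nil {
				continue
			}
			fmt.Println(name + ":")
			if err := b.ForEach(func(k, v []byte) error {
				fmt.Printf("  %s\n", k)
				return nil
			}); err != nil {
				return err
			}
		}
		return nil
	})
	if err != nil {
		log.Fatal(err)
	}
}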

yangxuanjia (Contributor, Author) commented:

@ahrtr
Do you think PR #13479 is a temporary solution?

ahrtr (Member) commented Nov 19, 2021

Actually, this issue can't be reproduced on the latest code in release-3.5, because it loads the member info from v2store first (see cluster.go#L259-L265), so RaftCluster.members has 3 members, including the removed member.

But this issue can be easily reproduced on 3.5.0 and on the latest code in the main branch, because both load the member info from the db file first (see 3.5.0/cluster.go and main/cluster.go), so RaftCluster.members has only 2 members, not including the removed member.

Basically the change in the PR is OK, but please re-organize the warning message per the review comment.

ahrtr (Member) commented Nov 19, 2021

FYI, I summarized this issue on the following page: https://github.com/ahrtr/etcd-issues/tree/master/issues/13466

ahrtr (Member) commented Nov 22, 2021

@serathius has been working on a similar and bigger issue. The fix has already been included in 3.5.1, which is why this issue cannot be reproduced on 3.5.1.

@serathius, are you working on the fix for 3.6? Will the v2store still be deprecated in 3.6?

serathius (Member) commented:

Please check the StoreV2 deprecation plan in #12913.

stale bot commented Feb 21, 2022

This issue has been automatically marked as stale because it has not had recent activity. It will be closed after 21 days if no further activity occurs. Thank you for your contributions.

The stale bot added the stale label on Feb 21, 2022
ahrtr (Member) commented Feb 21, 2022

The PR has already been merged, so closing this issue.
