
After deleting 3 nodes, restarting another node causes a panic #13466

Closed
yangxuanjia opened this issue Nov 9, 2021 · 16 comments
yangxuanjia (Contributor) commented Nov 9, 2021

Starting from a 3-node cluster, I added 3 more nodes, then deleted those 3 nodes, and restarted one of the remaining nodes; it panics on startup.
But if I take a snapshot first, the restart is OK.

I know that when a node starts, it recovers the snapshot and then replays the WAL.
The WAL contains the 3 ConfChangeRemoveNode conf changes, so startup applies them again.
But if I take a snapshot and then restart the node, the remove conf changes are not replayed, so it starts fine.

But I think there is a bug.

Snapshot:
term=7 index=70007 nodes=[32cd075ce865df07 4532f632c49aac79 5fe693db009bbced 761a61507fe72261 799f6226de4c7bed 8b7a6e3d29b86eae] confstate={"voters":[3660590167739457287,4986318435359829113,6910373247063932141,8510221444241105505,8763831318964108269,10050466727402696366],"auto_leave":false}
Start dumping log entries from snapshot.
WAL metadata:
nodeID=761a61507fe72261 clusterID=1109e69692ba9883 term=9 commitIndex=78639 vote=5fe693db009bbced
WAL entries:
lastIndex=78639
term         index      type    data
   8         77940      conf    method=ConfChangeRemoveNode id=799f6226de4c7bed
   8         77941      conf    method=ConfChangeRemoveNode id=8b7a6e3d29b86eae
   8         77945      conf    method=ConfChangeRemoveNode id=4532f632c49aac79

Entry types (ConfigChange) count is : 3
"msg":"newRaft 761a61507fe72261 [peers: [32cd075ce865df07,4532f632c49aac79,5fe693db009bbced,761a61507fe72261,799f6226de4c7bed,8b7a6e3d29b86eae]

voters=(3660590167739457287 4986318435359829113 6910373247063932141 8510221444241105505 8763831318964108269 10050466727402696366)"}
voters=(3660590167739457287 4986318435359829113 6910373247063932141 8510221444241105505 10050466727402696366)"}

voters=(3660590167739457287 4986318435359829113 6910373247063932141 8510221444241105505 8763831318964108269 10050466727402696366)"}
voters=(3660590167739457287 4986318435359829113 6910373247063932141 8510221444241105505 10050466727402696366)"}



{"level":"warn","ts":"2021-11-05T15:45:59.551+0800","caller":"membership/cluster.go:427","msg":"skipped removing already removed member","cluster-id":"1109e69692ba9883","local-member-id":"761a61507fe72261","removed-remote-peer-id":"799f6226de4c7bed"}
{"level":"panic","ts":"2021-11-05T15:45:59.551+0800","caller":"rafthttp/transport.go:346","msg":"unexpected removal of unknown remote peer","remote-peer-id":"799f6226de4c7bed","stacktrace":"go.etcd.io/etcd/server/v3/etcdserver/api/rafthttp.(*Transport).removePeer\n\t/export/working/src/github.com/go.etcd.io/etcd/server/etcdserver/api/rafthttp/transport.go:346\ngo.etcd.io/etcd/server/v3/etcdserver/api/rafthttp.(*Transport).RemovePeer\n\t/export/working/src/github.com/go.etcd.io/etcd/server/etcdserver/api/rafthttp/transport.go:329\ngo.etcd.io/etcd/server/v3/etcdserver.(*EtcdServer).applyConfChange\n\t/export/working/src/github.com/go.etcd.io/etcd/server/etcdserver/server.go:2301\ngo.etcd.io/etcd/server/v3/etcdserver.(*EtcdServer).apply\n\t/export/working/src/github.com/go.etcd.io/etcd/server/etcdserver/server.go:2133\ngo.etcd.io/etcd/server/v3/etcdserver.(*EtcdServer).applyEntries\n\t/export/working/src/github.com/go.etcd.io/etcd/server/etcdserver/server.go:1357\ngo.etcd.io/etcd/server/v3/etcdserver.(*EtcdServer).applyAll\n\t/export/working/src/github.com/go.etcd.io/etcd/server/etcdserver/server.go:1179\ngo.etcd.io/etcd/server/v3/etcdserver.(*EtcdServer).run.func8\n\t/export/working/src/github.com/go.etcd.io/etcd/server/etcdserver/server.go:1111\ngo.etcd.io/etcd/pkg/v3/schedule.(*fifo).run\n\t/export/working/src/github.com/go.etcd.io/etcd/pkg/schedule/schedule.go:157"}

yangxuanjia (Contributor, Author) commented:
In func (c *RaftCluster) RemoveMember, removing an already-removed node only logs a warning.
But in func (t *Transport) removePeer(id types.ID), removing an already-removed node panics.
The same situation is handled in two different ways; is that the intended logic?

func (c *RaftCluster) RemoveMember(id types.ID, shouldApplyV3 ShouldApplyV3) {
	c.Lock()
	defer c.Unlock()
	if c.v2store != nil {
		mustDeleteMemberFromStore(c.lg, c.v2store, id)
	}
	if c.be != nil && shouldApplyV3 {
		mustDeleteMemberFromBackend(c.be, id)
	}
	c.lg.Error("11111111111111 c.members")
	for _, m := range c.members {
		c.lg.Error("id:", zap.String("ID", m.ID.String()))
	}
	c.lg.Error("222222222222222 c.members")

	m, ok := c.members[id]
	delete(c.members, id)
	c.removed[id] = true

	if ok {
		c.lg.Info(
			"removed member",
			zap.String("cluster-id", c.cid.String()),
			zap.String("local-member-id", c.localID.String()),
			zap.String("removed-remote-peer-id", id.String()),
			zap.Strings("removed-remote-peer-urls", m.PeerURLs),
		)
	} else {
		c.lg.Warn(
			"skipped removing already removed member",
			zap.String("cluster-id", c.cid.String()),
			zap.String("local-member-id", c.localID.String()),
			zap.String("removed-remote-peer-id", id.String()),
		)
	}
}

func (t *Transport) removePeer(id types.ID) {
	// Debug logging added by the reporter to dump the transport's current peer list.
	t.Logger.Error("333333333333333333 t.peers")
	for id := range t.peers {
		t.Logger.Error("id:", zap.String("ID", id.String()))
	}
	t.Logger.Error("44444444444444444 t.peers")

	if peer, ok := t.peers[id]; ok {
		peer.stop()
	} else {
		if t.Logger != nil {
			t.Logger.Panic("unexpected removal of unknown remote peer", zap.String("remote-peer-id", id.String()))
		}
	}
	delete(t.peers, id)
	delete(t.LeaderStats.Followers, id.String())
	t.pipelineProber.Remove(id.String())
	t.streamProber.Remove(id.String())

	if t.Logger != nil {
		t.Logger.Info(
			"removed remote peer",
			zap.String("local-member-id", t.ID.String()),
			zap.String("removed-remote-peer-id", id.String()),
		)
	}
}
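
For comparison, here is a minimal sketch (illustration only, not the actual patch in any linked PR) of a removePeer that tolerates an already-removed peer the same way RaftCluster.RemoveMember does, using only the Transport fields visible in the snippet above:

// Sketch only: warn and continue instead of panicking when the peer is already gone,
// mirroring the "skipped removing already removed member" behaviour of RemoveMember.
func (t *Transport) removePeer(id types.ID) {
	if peer, ok := t.peers[id]; ok {
		peer.stop()
	} else if t.Logger != nil {
		t.Logger.Warn("skipped removal of unknown remote peer", zap.String("remote-peer-id", id.String()))
	}
	delete(t.peers, id)
	delete(t.LeaderStats.Followers, id.String())
	t.pipelineProber.Remove(id.String())
	t.streamProber.Remove(id.String())

	if t.Logger != nil {
		t.Logger.Info(
			"removed remote peer",
			zap.String("local-member-id", t.ID.String()),
			zap.String("removed-remote-peer-id", id.String()),
		)
	}
}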

ahrtr (Member) commented Nov 10, 2021

I did not reproduce this issue with etcd 3.5.0. What's the etcd version in your environment? Please also provide the detailed steps/commands for how you reproduced the issue.

yangxuanjia (Contributor, Author) commented Nov 10, 2021

I use etcd 3.5. The detailed process is a bit complicated and involves a PR we modified; I will try to describe it clearly.
First, create a three-node cluster in data center A, then add a learner in data center B. The learner is then forcibly promoted to leader. This step involves PR #13213, and its purpose is cross-data-center disaster recovery.

Then add 2 new nodes in data center B so that the new leader forms a new cluster.
When data center A becomes available again, its previous 3 nodes are added to this new cluster.
Then transfer the leader role to a node in data center A, and finally delete the 3 nodes in data center B.
If the 3 nodes in data center A took a snapshot before the 3 nodes in data center B were deleted, then after the deletion the WALs of the 3 data-center-A nodes contain 3 removeMember raft entries.
At this point, restarting a node in data center A reports this error.
I think the cause is that the data-center-B peers are missing from the transport's peer list, so when the removeMember raft entries are replayed from the WAL on restart, removePeer cannot find them and panics.

When I retried the entire process step by step without a snapshot on the 3 data-center-A nodes before deleting the 3 data-center-B nodes, the data-center-A WALs contained all the conf-change entries; restarting a data-center-A node then replays the whole WAL and the error does not occur.

yangxuanjia (Contributor, Author) commented Nov 10, 2021

The steps are indeed a bit complicated. My question is why the 3 data-center-B nodes are missing from the transport's peers. Since the data-center-A node took its snapshot before the 3 data-center-B nodes were deleted, the snapshot should still contain the information for those 3 nodes.
Deleting the 3 data-center-B nodes and then restarting a data-center-A node should replay the 3 removeMember entries normally.
Why is there no information about these 3 nodes in peers, leading to the panic?

ahrtr (Member) commented Nov 14, 2021

Did you see this issue with the official etcd 3.5.0 release, or did you build the etcd binary with your PR on top of release-3.5?

I would suggest you provide the detailed steps & commands for how you reproduced this issue, so as to avoid any misunderstanding.

yangxuanjia (Contributor, Author) commented Nov 15, 2021

@ahrtr
I can reproduce this problem reliably with etcd 3.5. The process is divided into the following steps (a client sketch for step 4 follows the list):

  1. Establish a 3-node cluster: A, B, C
  2. Write a few keys
  3. Trigger an automatic snapshot on each node
     (to force a snapshot quickly, use the code in Do snapshot now #13474)
  4. removeMember node C
  5. Restart node A -----> panic
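
As a concrete illustration of step 4, here is a minimal sketch using the etcd v3 Go client; the endpoint and member ID below are placeholders, not values taken from this report:

package main

import (
	"context"
	"log"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
)

func main() {
	// Placeholder endpoint; point this at any member of the test cluster.
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"http://127.0.0.1:2379"},
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		log.Fatal(err)
	}
	defer cli.Close()

	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()

	// Placeholder member ID for node C; look it up with cli.MemberList first.
	var memberID uint64 = 0xfd422379fda50e48
	if _, err := cli.MemberRemove(ctx, memberID); err != nil {
		log.Fatal(err)
	}
}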

yangxuanjia (Contributor, Author) commented Nov 15, 2021

func recoverSnapshotBackend(cfg config.ServerConfig, oldbe backend.Backend, snapshot raftpb.Snapshot, beExist bool, hooks backend.Hooks) (backend.Backend, error) {
	consistentIndex := uint64(0)
	if beExist {
		consistentIndex, _ = cindex.ReadConsistentIndex(oldbe.BatchTx())
	}

	// In this scenario the code enters this branch and returns oldbe!
	if snapshot.Metadata.Index <= consistentIndex {
		return oldbe, nil
	}
	oldbe.Close()
	return openSnapshotBackend(cfg, snap.New(cfg.Logger, cfg.SnapDir()), snapshot, hooks)
}

The overall logic of the error is now basically clear. The key is that a snapshot is taken before the node is deleted.
The biggest problem is in the code above. If the index of the snapshot is less than or equal to the consistentIndex, startup keeps the old backend. The old backend has the latest member information (including the deleted node information), but WAL replay still starts from the snapshot index, so the removeMember raft conf changes are replayed.
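
A compact sketch of that mismatch, assuming (for illustration only) that the backend's consistent index is close to the WAL commit index from the dump above; the values are not read from a real db file:

package main

import "fmt"

func main() {
	snapshotIndex := uint64(70007)   // index of the latest on-disk snapshot
	consistentIndex := uint64(78639) // assumed consistent index stored in the reused backend

	// Mirrors the branch in recoverSnapshotBackend above: the old backend is kept...
	if snapshotIndex <= consistentIndex {
		fmt.Println("reuse old backend: it already reflects the membership removals")
	}
	// ...but WAL replay still starts right after the snapshot index, so the
	// ConfChangeRemoveNode entries between the two indexes are applied a second time.
	fmt.Printf("replay restarts at index %d and re-applies entries up to %d\n",
		snapshotIndex+1, consistentIndex)
}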

yangxuanjia (Contributor, Author) commented:
During replay, RemoveMember() finds that the node has already been removed and records a warning, while removePeer() finds that the peer is unknown and panics, which makes the restarting node crash.

ahrtr (Member) commented Nov 15, 2021

Yes, I reproduced this issue, and it is indeed a real issue!

The root cause of this issue is that the data is recovered from the db file because snapshot.Metadata.Index is less than the consistentIndex (see backend.go#L104), while the WAL files are replayed against the latest snapshot file.

The default value for --snapshot-count is 100000, so in the worst case etcd may unnecessarily replay 99999 entries on startup. It seems there is a performance improvement we can make here.

cc @ptabor @hexfusion @serathius

The old backend has the latest member information (including the deleted node information)

@yangxuanjia That isn't accurate. The backend (the oldbe parameter of the recoverSnapshotBackend function) contains all the members, but the removed member is kept in the members_removed bucket.

$ go run cmd/bbolt/main.go keys ~/tmp/etcd/infra1.etcd/member/snap/db members
8211f1d0f64f3269
91bc3c398fb3c146

$ go run cmd/bbolt/main.go keys ~/tmp/etcd/infra1.etcd/member/snap/db members_removed
fd422379fda50e48
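
For anyone who prefers inspecting the db file from Go instead of the bbolt CLI, here is a small read-only sketch; the db path is a placeholder and the bucket names are the ones shown above:

package main

import (
	"fmt"
	"log"

	bolt "go.etcd.io/bbolt"
)

func main() {
	// Placeholder path; point this at a copy of member/snap/db while etcd is stopped.
	db, err := bolt.Open("/tmp/etcd/infra1.etcd/member/snap/db", 0400, &bolt.Options{ReadOnly: true})
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	err = db.View(func(tx *bolt.Tx) error {
		for _, name := range []string{"members", "members_removed"} {
			b := tx.Bucket([]byte(name))
			if b == nil {
				continue
			}
			fmt.Println(name + ":")
			if err := b.ForEach(func(k, v []byte) error {
				fmt.Printf("  %s\n", k)
				return nil
			}); err != nil {
				return err
			}
		}
		return nil
	})
	if err != nil {
		log.Fatal(err)
	}
}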

yangxuanjia (Contributor, Author) commented:

@ahrtr
Do you think PR #13479 is a temporary solution?

ahrtr (Member) commented Nov 19, 2021

Actually, this issue can't be reproduced on the latest code in release-3.5, because it loads the member info from v2store first (see cluster.go#L259-L265), so RaftCluster.members has 3 members, including the removed member.

But this issue can be easily reproduced on 3.5.0 and on the latest code in the main branch, because both load the member info from the db file first (see 3.5.0/cluster.go and main/cluster.go), so RaftCluster.members has only 2 members, not including the removed member.

Basically the change in the PR is OK, but please re-organize the warning message per the review comment.

ahrtr (Member) commented Nov 19, 2021

FYI, I summarized this issue on the following page: https://github.com/ahrtr/etcd-issues/tree/master/issues/13466

ahrtr (Member) commented Nov 22, 2021

@serathius has been working on a similar and bigger issue. The fix has already been included in 3.5.1, which is why this issue cannot be reproduced on 3.5.1.

@serathius, are you working on the fix for 3.6? Will the v2store still be deprecated in 3.6?

serathius (Member) commented:

Please check the StoreV2 deprecation plan in #12913.

stale bot commented Feb 21, 2022

This issue has been automatically marked as stale because it has not had recent activity. It will be closed after 21 days if no further activity occurs. Thank you for your contributions.

The stale bot added the stale label on Feb 21, 2022
ahrtr (Member) commented Feb 21, 2022

The PR has already been merged, so closing this issue.
