Stop writing v2store and validate confChange against v3store instead of v2store #17019
#16084 is superseded by this PR.
// TODO: this must be switched to backend as well.
membersMap, removedMap := membersFromStore(c.lg, c.v2store)
func (c *RaftCluster) ValidateConfigurationChange(cc raftpb.ConfChange, shouldApplyV3 ShouldApplyV3) error {
// It makes no sense to validate a confChange if we don't apply it.
Not true, we are still applying confChange as long as we bootstrap from snapshot and not consistent index.
Right, this is a longstanding design problem to me.
Current status summary
- In older etcd versions (< 3.4?), we load all data from v2store and replay WAL entries on top of the snapshot. Everything works well.
- Starting from 3.5, we load normal data from v3store but membership data still from v2store, and replay WAL entries on top of the v2 snapshot. It works well for conf changes. It is a little ugly for normal key/value data, but it still works.
- Starting from 3.6 (the main branch for now), we load all data from v3store on startup, but we still replay WAL entries on top of the v2 snapshot.
  - For normal key/value data, etcd skips already applied entries; it works, although it is ugly.
  - For conf changes, the principle of "skipping already applied entries" doesn't work anymore.
    - etcd has already loaded membership data from v3store, so etcd itself can just skip already applied entries.
    - But raft doesn't know the latest membership data. We still depend on the raft loop to re-apply the conf change (with an old entry index < consistent-index) so as to notify raft of all the membership data. So etcd has to call raft's ApplyConfChange even when etcd itself skips the entry.
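The 3.6 replay behaviour summarized above can be sketched roughly as follows. This is a minimal, self-contained illustration and not etcd's actual code: the `entry`, `fakeRaft`, and `server` types are hypothetical stand-ins, and only the "index greater than consistent index" gate mirrors the `shouldApplyV3` idea discussed in this thread.

```go
package main

import "fmt"

// entry is a hypothetical, simplified WAL entry.
type entry struct {
	index      uint64
	confChange bool   // true for a membership change, false for a key/value write
	nodeID     uint64 // member added by a conf change (unused for key/value writes)
}

// fakeRaft stands in for the raft state machine, which only learns the
// membership through applied conf changes.
type fakeRaft struct{ voters map[uint64]bool }

func (r *fakeRaft) applyConfChange(id uint64) { r.voters[id] = true }

// server stands in for etcd; members and consistentIndex represent state
// already loaded from v3store on startup.
type server struct {
	consistentIndex uint64
	members         map[uint64]bool
	kvApplied       []uint64
	raft            *fakeRaft
}

// apply replays one entry on top of the snapshot.
func (s *server) apply(e entry) {
	shouldApplyV3 := e.index > s.consistentIndex
	if e.confChange {
		if shouldApplyV3 {
			s.members[e.nodeID] = true
		}
		// raft must be notified even when the store skips the entry.
		s.raft.applyConfChange(e.nodeID)
	} else if shouldApplyV3 {
		s.kvApplied = append(s.kvApplied, e.index)
	}
	if shouldApplyV3 {
		s.consistentIndex = e.index
	}
}

// replay builds a server whose store already reflects entries up to index 3
// and replays WAL entries 2..4 on top of it.
func replay() *server {
	s := &server{
		consistentIndex: 3,
		members:         map[uint64]bool{1: true, 2: true},
		raft:            &fakeRaft{voters: map[uint64]bool{1: true}},
	}
	for _, e := range []entry{
		{index: 2, confChange: true, nodeID: 2}, // already in the store
		{index: 3},                              // already in the store
		{index: 4},                              // new key/value write
	} {
		s.apply(e)
	}
	return s
}

func main() {
	s := replay()
	fmt.Println(len(s.raft.voters), s.kvApplied, s.consistentIndex)
}
```

In this sketch the store skips entries 2 and 3, yet raft still learns about member 2 from the replayed conf change, which is the asymmetry described in the last two bullets.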
Solutions
Short-term solutions
- Validate the conf change against v2store if it is already in v3store (shouldApplyV3 == false)
- apply conf change (read from v3store) to raft directly when restarting etcd members to bypass the raft loop? (need investigation)
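A rough sketch of the first short-term option, under stated assumptions: the `membership` and `cluster` types, the error values, and `validate` are hypothetical and are not the real `RaftCluster` API. The sketch only shows why the store used for validation should depend on `shouldApplyV3`.

```go
package main

import (
	"errors"
	"fmt"
)

// Hypothetical error values for this sketch.
var (
	errIDRemoved  = errors.New("ID removed")
	errIDExists   = errors.New("ID exists")
	errIDNotFound = errors.New("ID not found")
)

type changeType int

const (
	addNode changeType = iota
	removeNode
)

// membership is a hypothetical view of one store's membership data.
type membership struct {
	members map[uint64]bool
	removed map[uint64]bool
}

// cluster holds both views; they correspond to v2store and v3store in the
// discussion above.
type cluster struct {
	v2, v3 membership
}

// validate checks a conf change against v3store when the entry will be
// applied to it, and against v2store when v3store already contains the
// change (shouldApplyV3 == false).
func (c *cluster) validate(t changeType, id uint64, shouldApplyV3 bool) error {
	m := c.v2
	if shouldApplyV3 {
		m = c.v3
	}
	if m.removed[id] {
		return errIDRemoved
	}
	switch t {
	case addNode:
		if m.members[id] {
			return errIDExists
		}
	case removeNode:
		if !m.members[id] {
			return errIDNotFound
		}
	}
	return nil
}

// newCluster models a restart: v3store already contains member 2, while the
// older v2 snapshot does not.
func newCluster() *cluster {
	return &cluster{
		v2: membership{members: map[uint64]bool{1: true}, removed: map[uint64]bool{}},
		v3: membership{members: map[uint64]bool{1: true, 2: true}, removed: map[uint64]bool{}},
	}
}

func main() {
	c := newCluster()
	fmt.Println(c.validate(addNode, 2, false)) // replayed entry, checked against v2store
	fmt.Println(c.validate(addNode, 2, true))  // same entry, checked against v3store
}
```

Checking the replayed "add member 2" entry against the v3 view would reject it as a duplicate, because v3store already reflects that change; checking it against the v2 view accepts it.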
Long-term solutions
- Bootstrap etcd from consistent_index.
  - It works for key/value data.
  - For conf change, it still needs to apply the conf change to raft directly when etcd restarts, to bypass the raft loop.
- Generate v3 snapshot periodically (similar to what etcd did on v2 snapshot), bootstrap etcd from the v3 snapshot, and then replay WAL entries on top of it.
  - need to evaluate the performance of generate v3snapshot on huge db file (e.g. 8GB)
  - The disk space usage will be multiple times bigger, depending on how many snapshot files to keep.
  - Then why do we need the latest bbolt db file in such case?
- Generate v3 snapshot periodically (similar to what etcd did on v2 snapshot), but bootstrap etcd from consistent_index. The v3 snapshot files are just for corruption recovery, especially in the single-node edge case.
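The "bootstrap from consistent_index" idea could look roughly like the sketch below. All names here (`store`, `raftState`, `bootstrap`) are hypothetical and only illustrate the shape of the approach: membership is handed to raft directly from the store, and only WAL entries past the consistent index are replayed.

```go
package main

import "fmt"

// entry is a hypothetical, simplified WAL entry.
type entry struct {
	index uint64
	data  string
}

// store stands in for the v3 backend: it records how far it has applied
// (consistentIndex) and the current membership.
type store struct {
	consistentIndex uint64
	members         []uint64
}

// raftState stands in for raft's view of the cluster membership.
type raftState struct{ voters []uint64 }

// bootstrap seeds raft's membership directly from the store (bypassing the
// raft loop) and returns only the WAL entries that still need to be applied.
func bootstrap(st store, wal []entry) (raftState, []entry) {
	r := raftState{voters: append([]uint64(nil), st.members...)}
	var pending []entry
	for _, e := range wal {
		if e.index > st.consistentIndex {
			pending = append(pending, e)
		}
	}
	return r, pending
}

func main() {
	st := store{consistentIndex: 10, members: []uint64{1, 2, 3}}
	wal := []entry{{9, "put a"}, {10, "put b"}, {11, "put c"}}
	r, pending := bootstrap(st, wal)
	fmt.Println(r.voters, pending)
}
```

With this shape, no already applied conf change has to pass through the apply path again, which is what makes the approach attractive for membership data.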
> Validate the conf change against v2store if it is already in v3store (shouldApplyV3 == false)
I know, #17017
> apply conf change (read from v3store) to raft directly when restarting etcd members to bypass the raft loop? (need investigation)
I have a draft implementation for that: #17022
> need to evaluate the performance of generate v3snapshot on huge db file (e.g. 8GB)
v3 db snapshot is not a thing; it's an incompatible concept. I recommend reading the raft paper on in-memory snapshottable storage (etcd v2) vs. persistent storage (etcd v3).
v3store will be a source of truth for membership, but it still is not for
Let me add a test for this code too.
Done, PTAL #17021
We already generate v2 snapshot using v3store to be compatible with 3.5. It makes no sense to write any data into v2store, which isn't used at all. Signed-off-by: Benjamin Wang <[email protected]>
@@ -517,52 +517,6 @@ func TestNodeToMemberBad(t *testing.T) {
}
}

func TestClusterAddMember(t *testing.T) {
Please migrate such tests to v3 instead of just removing them.
Part of #12913