Update to generate v2 snapshot from v3 state #16418

Merged: 1 commit, Aug 22, 2023

Conversation


@geetasg geetasg commented Aug 14, 2023

Please read https://github.com/etcd-io/etcd/blob/main/CONTRIBUTING.md#contribution-flow.
Related to #12913

The e2e test is TestDowngradeUpgradeClusterOf3WithSnapshot.

@ahrtr
Member

ahrtr commented Aug 15, 2023

Please take a look at the e2e failures.

e.g.,
FAIL: TestV2DeprecationSnapshotMatches

@geetasg
Author

geetasg commented Aug 15, 2023

Investigating the test failure showed that the test code change had not been checked in. The current diff now includes the test change. Some clarifications on the test change are listed below:

  • Why did the test fail?
    The test originally compared the indices and the events on the v2store. Because the snapshot is now created from a brand-new v2store, these won't match. However, the membership data stored in the snapshot, including valid and removed members, should still match. The test needs to be updated to verify the membership instead.

  • Why is this diff removing the version-replacement code from the test, such as bytes.Replace(data, []byte("3.5.0"), []byte("X.X.X"), -1)?
    The new verification method constructs a RaftCluster object. The version string must be valid for the clusterVersionFromStore method to succeed. Otherwise, the following failure is seen:

testing.tRunner.func1.2({0xe467c0, 0xc0007734d0})
        /usr/local/go/src/testing/testing.go:1526 +0x24e
testing.tRunner.func1()
        /usr/local/go/src/testing/testing.go:1529 +0x39f
panic({0xe467c0, 0xc0007734d0})
        /usr/local/go/src/runtime/panic.go:884 +0x213
github.com/coreos/go-semver/semver.Must(...)
        /root/go/pkg/mod/github.com/coreos/[email protected]/semver/semver.go:65
go.etcd.io/etcd/server/v3/etcdserver/api/membership.clusterVersionFromStore(0x40fe4a?, {0x10f6cb8, 0xc0002ec2a0})
        /home/ec2-user/geetasg/etcd/server/etcdserver/api/membership/storev2.go:222 +0x34e
go.etcd.io/etcd/server/v3/etcdserver/api/membership.(*RaftCluster).Recover(0xc0002ec460, 0xff2648)
        /home/ec2-user/geetasg/etcd/server/etcdserver/api/membership/cluster.go:267 +0x169
go.etcd.io/etcd/tests/v3/e2e.assertMembershipEqual({0x10f71e8?, 0xc0007024e0}, {0x10f6cb8?, 0xc0002ec2a0}, {0x10f6cb8?, 0xc0002ec380})
  • Why is this diff updating the way member IDs are replaced by the test?
    Before this diff, the member IDs in the data were replaced as follows:
data = bytes.Replace(data, []byte(fmt.Sprintf("%x", mid)), []byte(fmt.Sprintf("member%d", i+1)), -1)

With the change, the test replaces each member ID with an integer:

data = bytes.Replace(data, []byte(fmt.Sprintf("%x", mid)), []byte(fmt.Sprintf("%d", i+1)), -1)

The reason for this update is that the new verification creates a RaftCluster object from the snapshot. It parses each member ID and expects a valid numeric ID, so a placeholder such as member1 is rejected. Without the updated replacement logic shown above, the following failure is seen:

        /root/go/pkg/mod/go.uber.org/[email protected]/zapcore/entry.go:262 +0x3ec
go.uber.org/zap.(*Logger).Panic(0xc00043d41b?, {0xf9cbf1?, 0xc0007ad860?}, {0xc000bd05c0, 0x1, 0x1})
        /root/go/pkg/mod/go.uber.org/[email protected]/logger.go:258 +0x59
go.etcd.io/etcd/server/v3/etcdserver/api/membership.MustParseMemberIDFromKey(0xc0007ad860?, {0xc00043d410?, 0x60?})
        /home/ec2-user/geetasg/etcd/server/etcdserver/api/membership/store.go:54 +0x16f
go.etcd.io/etcd/server/v3/etcdserver/api/membership.nodeToMember(0xe2ba80?, 0xc000bfa780)
        /home/ec2-user/geetasg/etcd/server/etcdserver/api/membership/storev2.go:169 +0x45
go.etcd.io/etcd/server/v3/etcdserver/api/membership.membersFromStore(0xa?, {0x10f6cb8, 0xc00069cd20})
        /home/ec2-user/geetasg/etcd/server/etcdserver/api/membership/cluster.go:674 +0x305
go.etcd.io/etcd/server/v3/etcdserver/api/membership.(*RaftCluster).Recover(0xc00069cee0, 0xff2648)
        /home/ec2-user/geetasg/etcd/server/etcdserver/api/membership/cluster.go:268 +0x1ac
go.etcd.io/etcd/tests/v3/e2e.assertMembershipEqual({0x10f71e8?, 0xc0001c5520}, {0x10f6cb8?, 0xc00069cd20}, {0x10f6cb8?, 0xc00069cd90})
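
To illustrate the member ID point, here is a minimal standalone sketch, not the actual test code. It assumes member IDs are parsed as hexadecimal uint64 values; parseMemberID and patchMemberIDs are hypothetical helpers, and the key paths and ID values are made up for the example.

```go
package main

import (
	"bytes"
	"fmt"
	"strconv"
)

// parseMemberID is a hypothetical stand-in for the ID parsing done while
// recovering membership. Assumption: IDs are hexadecimal uint64 values.
func parseMemberID(s string) (uint64, error) {
	return strconv.ParseUint(s, 16, 64)
}

// patchMemberIDs replaces each hex-encoded member ID in data with a small
// integer (1, 2, ...) so that snapshots from different runs are comparable
// while the IDs remain parseable.
func patchMemberIDs(data []byte, ids []uint64) []byte {
	for i, mid := range ids {
		data = bytes.Replace(data, []byte(fmt.Sprintf("%x", mid)), []byte(fmt.Sprintf("%d", i+1)), -1)
	}
	return data
}

func main() {
	// Illustrative IDs and key layout only.
	ids := []uint64{0x8e9e05c52164694d, 0x91bc3c398fb3c146}
	data := []byte("/0/members/8e9e05c52164694d /0/members/91bc3c398fb3c146")
	fmt.Println(string(patchMemberIDs(data, ids)))

	// The old placeholder cannot be parsed as an ID; the new one can.
	_, errOld := parseMemberID("member1")
	_, errNew := parseMemberID("1")
	fmt.Println("member1 rejected:", errOld != nil, "| 1 accepted:", errNew == nil)
}
```

The design choice is simply to keep the patched data in the same shape the parser expects, so the normalized snapshot can still be loaded into a cluster object.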

@geetasg geetasg marked this pull request as draft August 15, 2023 17:40
@geetasg geetasg marked this pull request as ready for review August 15, 2023 22:04
@ahrtr
Member

ahrtr commented Aug 16, 2023

Overall looks good to me, with just a minor comment.

@geetasg
Author

geetasg commented Aug 16, 2023

Updated as per review feedback.

Member

@ahrtr ahrtr left a comment

LGTM

Thank you!

@@ -250,7 +251,23 @@ func assertSnapshotsMatch(t testing.TB, firstDataDir, secondDataDir string, patc
 	if err != nil {
 		t.Fatal(err)
 	}
-	assert.Equal(t, openSnap(patch(firstSnapshot.Data)), openSnap(patch(secondSnapshot.Data)))
+	assertMembershipEqual(t, openSnap(patch(firstSnapshot.Data)), openSnap(patch(secondSnapshot.Data)))
Member

Why change this? It doesn't seem related to the logic change; it looks like additional debug info. Could you split unrelated changes into a separate PR?

Author

Please refer to the comment above, #16418 (comment). The events and indices on the v2store won't match between the two snapshots anymore.

Member

> The events on v2store and indices on v2store won't match between the two snapshots anymore

Not sure I understand. It's not clear from the code changes. Please create a separate PR that updates the test. A PR should be organized to clearly show what is changing. If we need to change validation, let's do that separately from the business logic.

Author

@geetasg geetasg Aug 18, 2023

Adding more clarification on the test change:
The original test checked that the v2 snapshot content created by the older and newer versions matched with respect to the indices and events on the v2store. With the new logic, the indices and events on the v2store will not match, which is why we saw a test failure. The test is being updated to verify that the membership matches instead. Please let me know if you have more questions. I would like to confirm that we are on the same page about this change in the test.
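
As a rough illustration of the difference between the two checks, here is a minimal standalone sketch. The snapshot and member types are hypothetical stand-ins rather than etcd types, and membershipEqual is not the assertMembershipEqual helper from the test; it only shows the idea of ignoring store bookkeeping and comparing membership alone.

```go
package main

import (
	"fmt"
	"reflect"
)

// member and snapshot are hypothetical, simplified types for illustration.
type member struct {
	ID   uint64
	Name string
}

type snapshot struct {
	CurrentIndex uint64            // store bookkeeping; differs when the store is rebuilt
	Members      map[uint64]member // valid members
	Removed      map[uint64]bool   // IDs of removed members
}

// membershipEqual compares only valid and removed members and ignores
// bookkeeping such as indices.
func membershipEqual(a, b snapshot) bool {
	return reflect.DeepEqual(a.Members, b.Members) && reflect.DeepEqual(a.Removed, b.Removed)
}

func main() {
	fromOldStore := snapshot{
		CurrentIndex: 42,
		Members:      map[uint64]member{1: {ID: 1, Name: "infra1"}},
		Removed:      map[uint64]bool{2: true},
	}
	fromFreshStore := snapshot{
		CurrentIndex: 3, // a brand-new store starts its own index sequence
		Members:      map[uint64]member{1: {ID: 1, Name: "infra1"}},
		Removed:      map[uint64]bool{2: true},
	}

	fmt.Println("whole snapshot equal:", reflect.DeepEqual(fromOldStore, fromFreshStore))
	fmt.Println("membership equal:", membershipEqual(fromOldStore, fromFreshStore))
}
```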

Created a separate PR for the test changes as per the review feedback above (#16441). The test change needs to be merged before this PR can pass. I am converting this one to a draft and removing the test change, so the test will start failing for this PR. I will rebase this one once the test change is merged.

@serathius @ahrtr

Member

I don't understand why we have to create a separate PR to change the validation. We only care about the membership data in the snapshot, so it's reasonable to compare only the membership data.

Anyway, approved the other PR #16441.

Author

Thanks, both. A separate PR was used for the test code, as per review feedback.

@geetasg geetasg marked this pull request as draft August 17, 2023 17:01
@geetasg geetasg force-pushed the pr7 branch 2 times, most recently from 3590ce6 to 29ceb28 Compare August 17, 2023 17:36
@geetasg geetasg marked this pull request as ready for review August 17, 2023 18:32
@geetasg geetasg marked this pull request as draft August 18, 2023 16:29
@ahrtr
Member

ahrtr commented Aug 21, 2023

Please squash the commits (Note: rebase PR instead of merging main into dev branch next time).

@geetasg
Author

geetasg commented Aug 21, 2023

updated commits

@geetasg geetasg marked this pull request as ready for review August 21, 2023 21:14
@@ -2075,16 +2074,12 @@ func (s *EtcdServer) snapshot(snapi uint64, confState raftpb.ConfState) {
// So KV().Commit() cannot run in parallel with toApply. It has to be called outside
// the go routine created below.
s.KV().Commit()
d := GetMembershipInfoInV2Format(s.Logger(), s.cluster)
Member

An overall note about the code changes: please avoid unnecessary reordering of function calls in a PR, as it adds review strain. For example, here the order of s.KV().Commit() and GetMembershipInfoInV2Format(s.Logger(), s.cluster) changed. I don't think it matters here, but it takes a minute or two to think it through.

Author

Created #16460 to keep the order the same, which simplifies reasoning and avoids the unnecessary review strain.

@serathius
Member

serathius commented Aug 22, 2023

LGTM, but I would like to double-check something. This change mostly depends on TestDowngradeUpgradeClusterOf* for testing. We downgrade the cluster and expect it to work with and without a snapshot.

However, I noticed that we are not testing the contents of the membership correctly. We confirm that the members before are equal to the members after, but we don't make any membership changes, so there is no way for the membership to differ.

Will take a look into updating the test.

@serathius
Member

PTAL #16457

@serathius
Member

Validated that this PR works with #16457, so we can proceed.
Still, please add your suggestions about more testing for V3 -> V2 snapshot generation. We need to be 100% sure this is correct, and the only way to do that is proper e2e testing.

@serathius serathius merged commit f3cc759 into etcd-io:main Aug 22, 2023
@geetasg geetasg mentioned this pull request Sep 10, 2023