Etcd writing error in tests about missing confstate #18978

Open
4 tasks
serathius opened this issue Dec 1, 2024 · 6 comments · May be fixed by #19040 or #19060

Comments

@serathius
Member

Bug report criteria

What happened?

Periodic failures in https://testgrid.k8s.io/sig-etcd-periodics#ci-etcd-e2e-amd64
TestNoErrorLogsDuringNormalOperations/three_node_cluster_with_auto_tls_(peers) is failing with:

Messages:   	error level log message found: {"level":"error","ts":"2024-11-16T20:58:00.500222Z","caller":"version/monitor.go:120","msg":"failed to update storage version","cluster-version":"3.6.0","error":"cannot detect storage schema version: missing confstate information","stacktrace":"go.etcd.io/etcd/server/v3/etcdserver/version.(*Monitor).UpdateStorageVersionIfNeeded\n\tgo.etcd.io/etcd/server/v3/etcdserver/version/monitor.go:120\ngo.etcd.io/etcd/server/v3/etcdserver.(*EtcdServer).monitorStorageVersion\n\tgo.etcd.io/etcd/server/v3/etcdserver/server.go:2286\ngo.etcd.io/etcd/server/v3/etcdserver.(*EtcdServer).GoAttach.func1\n\tgo.etcd.io/etcd/server/v3/etcdserver/server.go:2467"}

Based on testgrid history, the failures started no later than 16 Nov.
Example:
https://prow.k8s.io/view/gs/kubernetes-ci-logs/logs/ci-etcd-e2e-amd64/1857887040968855552
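For readers unfamiliar with this kind of check, here is a minimal sketch of how an error-level line like the one above could be detected in JSON-formatted logs. It is not the actual test code: `findErrorLogs` and `logEntry` are hypothetical names, and the first sample line in `main` is made up for illustration.

```go
package main

import (
	"bufio"
	"encoding/json"
	"fmt"
	"strings"
)

// logEntry holds the fields of a JSON log line that this sketch inspects.
type logEntry struct {
	Level string `json:"level"`
	Msg   string `json:"msg"`
	Error string `json:"error"`
}

// findErrorLogs is a hypothetical helper: it returns every entry logged at
// "error" level. Lines that are not valid JSON are skipped.
func findErrorLogs(logs string) []logEntry {
	var found []logEntry
	scanner := bufio.NewScanner(strings.NewReader(logs))
	for scanner.Scan() {
		var e logEntry
		if err := json.Unmarshal(scanner.Bytes(), &e); err != nil {
			continue // not a JSON log line
		}
		if e.Level == "error" {
			found = append(found, e)
		}
	}
	return found
}

func main() {
	// Sample input: one invented info line, then a shortened copy of the
	// error from this report.
	logs := `{"level":"info","msg":"started"}
{"level":"error","msg":"failed to update storage version","error":"cannot detect storage schema version: missing confstate information"}`
	for _, e := range findErrorLogs(logs) {
		fmt.Printf("%s: %s\n", e.Msg, e.Error)
	}
}
```

A test in this style would fail as soon as `findErrorLogs` returns a non-empty slice, which matches the "error level log message found" failure shown above.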

What did you expect to happen?

Etcd should not log error-level messages during startup.

How can we reproduce it (as minimally and precisely as possible)?

Rerun the test.

Anything else we need to know?

No response

@serathius
Member Author

The flakes possibly started after we added a new test in #18819.

@serathius
Member Author

cc @ghouscht @ahrtr

@ghouscht
Contributor

ghouscht commented Dec 2, 2024

I'll have a look later this week, thanks for notifying.

/assign

@ghouscht
Contributor

Just a heads-up: I have not had time to look into this so far because I was sick. I hope to get to it this week.

@ahrtr
Member

ahrtr commented Dec 12, 2024

@ghouscht Please follow #19040 (comment) to update the test case. Please let me know if you don't have the bandwidth, so that others can take over. I have recently seen some failures caused by this, so I really want to get it resolved ASAP.

@ghouscht
Contributor

Sorry for the delay. I finally found some time to file a PR as you suggested: #19060

As far as I can see in testgrid, the same failure occurred in:

  • TestNoErrorLogsDuringNormalOperations/three_node_cluster
  • TestNoErrorLogsDuringNormalOperations/three_node_cluster_with_auto_tls_(peers)

But not:

  • TestNoErrorLogsDuringNormalOperations/single_node_cluster
  • TestNoErrorLogsDuringNormalOperations/three_node_cluster_with_auto_tls_(all)
  • TestNoErrorLogsDuringNormalOperations/three_node_cluster_with_auto_tls_(client)

However, I think the failure can happen in any of the three-node cluster cases, so the PR excludes the error from all three-node cases but not from the single-node cluster case. I hope that is OK.
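The exclusion described above can be sketched roughly as follows. This is only an illustration under assumptions, not the code in #19060: `unexpectedErrors`, `expectedMultiNodeError`, and the `clusterSize` parameter are hypothetical names chosen for this example.

```go
package main

import (
	"fmt"
	"strings"
)

// expectedMultiNodeError is the error text from this issue that the sketch
// tolerates in multi-node clusters.
const expectedMultiNodeError = "cannot detect storage schema version: missing confstate information"

// unexpectedErrors is a hypothetical helper: given error-level log lines, it
// drops the known error for clusters with more than one member and returns
// the rest. Single-node clusters get no exclusion.
func unexpectedErrors(errorLines []string, clusterSize int) []string {
	var unexpected []string
	for _, line := range errorLines {
		if clusterSize > 1 && strings.Contains(line, expectedMultiNodeError) {
			continue // known error, tolerated in multi-node cases only
		}
		unexpected = append(unexpected, line)
	}
	return unexpected
}

func main() {
	lines := []string{
		`{"level":"error","msg":"failed to update storage version","error":"cannot detect storage schema version: missing confstate information"}`,
	}
	fmt.Println(len(unexpectedErrors(lines, 3))) // prints 0: excluded for a three-node cluster
	fmt.Println(len(unexpectedErrors(lines, 1))) // prints 1: still reported for a single node
}
```

Keying the exclusion on cluster size rather than on individual test names covers the three-node cases that have not failed yet, while the single-node case keeps the strict check.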
