Etcd writing error in tests about missing confstate #18978

serathius · 2024-12-01T17:58:50Z

Bug report criteria

This bug report is not security related, security issues should be disclosed privately via etcd maintainers.
This is not a support request or question, support requests or questions should be raised in the etcd discussion forums.
You have read the etcd bug reporting guidelines.
Existing open issues along with etcd frequently asked questions have been checked and this is not a duplicate.

What happened?

Periodic failures in https://testgrid.k8s.io/sig-etcd-periodics#ci-etcd-e2e-amd64
TestNoErrorLogsDuringNormalOperations/three_node_cluster_with_auto_tls_(peers) is failing with:

Messages:   	error level log message found: {"level":"error","ts":"2024-11-16T20:58:00.500222Z","caller":"version/monitor.go:120","msg":"failed to update storage version","cluster-version":"3.6.0","error":"cannot detect storage schema version: missing confstate information","stacktrace":"go.etcd.io/etcd/server/v3/etcdserver/version.(*Monitor).UpdateStorageVersionIfNeeded\n\tgo.etcd.io/etcd/server/v3/etcdserver/version/monitor.go:120\ngo.etcd.io/etcd/server/v3/etcdserver.(*EtcdServer).monitorStorageVersion\n\tgo.etcd.io/etcd/server/v3/etcdserver/server.go:2286\ngo.etcd.io/etcd/server/v3/etcdserver.(*EtcdServer).GoAttach.func1\n\tgo.etcd.io/etcd/server/v3/etcdserver/server.go:2467"}

Based on the testgrid history, this started no later than 16 Nov.
Example:
https://prow.k8s.io/view/gs/kubernetes-ci-logs/logs/ci-etcd-e2e-amd64/1857887040968855552

What did you expect to happen?

Etcd should not log errors during startup.

How can we reproduce it (as minimally and precisely as possible)?

Rerun the test.

Anything else we need to know?

No response

Etcd version (please run commands below)

$ etcd --version
# paste output here

$ etcdctl version
# paste output here

Etcd configuration (command line flags or environment variables)

paste your configuration here

Etcd debug information (please run commands below, feel free to obfuscate the IP address or FQDN in the output)

$ etcdctl member list -w table
# paste output here

$ etcdctl --endpoints=<member list> endpoint status -w table
# paste output here

Relevant log output

No response


serathius · 2024-12-01T18:00:09Z

The flakes possibly started after we added a new test in #18819.

serathius · 2024-12-01T18:00:22Z

cc @ghouscht @ahrtr

ghouscht · 2024-12-02T13:54:23Z

I'll have a look later this week, thanks for the heads-up.

/assign

ghouscht · 2024-12-11T11:41:38Z

Just a heads-up: I haven't had time to look into this yet because I've been sick. I hope to get to it this week.

ahrtr · 2024-12-12T14:53:54Z

@ghouscht Please follow #19040 (comment) to update the test case. Please let me know if you don't have the bandwidth, so that others can take over. I've recently seen several failures caused by this, so I really want to get it resolved asap.

ghouscht · 2024-12-13T14:59:56Z

Sorry for the delay, finally found some time to file a PR as you suggested: #19060

As far as I can see in testgrid, the same failure occurred in:

TestNoErrorLogsDuringNormalOperations/three_node_cluster
TestNoErrorLogsDuringNormalOperations/three_node_cluster_with_auto_tls_(peers)

But not:

TestNoErrorLogsDuringNormalOperations/single_node_cluster
TestNoErrorLogsDuringNormalOperations/three_node_cluster_with_auto_tls_(all)
TestNoErrorLogsDuringNormalOperations/three_node_cluster_with_auto_tls_(client)

However, I think the failure can occur in any of the three-node cluster cases, so the PR excludes this error for all three-node cases but not for the single-node case. I hope that is ok.

serathius added the type/bug label Dec 1, 2024

serathius added the help wanted label Dec 1, 2024

k8s-ci-robot assigned ghouscht Dec 2, 2024

jmhbnz added the area/testing label Dec 5, 2024

ahrtr linked a pull request Dec 11, 2024 that will close this issue

Delay the monitorStorageVersion goroutine until the server is fully ready #19040 (Open)

ghouscht linked a pull request Dec 13, 2024 that will close this issue

fix(e2e): ignore error log about failed storage update #19060 (Open)
