[Flake] Etcd timeout -> leader election failure -> webhook down #1743

lentzi90 · 2024-05-22T05:51:11Z

This issue exists mainly to document and track the test failures. The problem is not in BMO itself but is a performance problem in the CI system.

Which jobs are flaking

Possibly all jobs that run on Jenkins workers in Xerces.
It has been observed at least in the BMO e2e tests.

Reason for failure (if possible):

Occasionally we see tests fail with a "failed to call webhook" error (see logs below) even though the webhook was working just before and no changes were made to it. Checking the BMO logs reveals that the issue is with etcd: BMO is unable to renew its lease or perform leader election. As a result, it stops and then restarts, which is why the webhook refuses connections in the meantime.

Test logs:

[2024-05-21T02:06:02.489Z] • [FAILED] [7.212 seconds]
[2024-05-21T02:06:02.489Z] Inspection [It] should inspect a newly created BMH [required, inspection]
[2024-05-21T02:06:02.489Z] /home/metal3ci/workspace/metal3-bmo-e2e-test-periodic-release-0.6/test/e2e/inspection_test.go:85
[2024-05-21T02:06:02.489Z] 
[2024-05-21T02:06:02.489Z]   Timeline >>
[2024-05-21T02:06:02.489Z]   INFO: Creating namespace inspection-wcmx49
[2024-05-21T02:06:02.489Z]   INFO: Creating event watcher for namespace "inspection-wcmx49"
[2024-05-21T02:06:02.489Z]   STEP: Creating a secret with BMH credentials @ 05/21/24 02:05:55.385
[2024-05-21T02:06:02.489Z]   STEP: creating a BMH @ 05/21/24 02:05:55.761
[2024-05-21T02:06:02.489Z]   [FAILED] in [It] - /home/metal3ci/workspace/metal3-bmo-e2e-test-periodic-release-0.6/test/e2e/inspection_test.go:110 @ 05/21/24 02:05:55.79
[2024-05-21T02:06:02.489Z]   INFO: Deleting namespace inspection-wcmx49
[2024-05-21T02:06:02.489Z]   << Timeline
[2024-05-21T02:06:02.489Z] 
[2024-05-21T02:06:02.489Z]   [FAILED] Unexpected error:
[2024-05-21T02:06:02.489Z]       <*errors.StatusError | 0xc000383180>: 
[2024-05-21T02:06:02.489Z]       Internal error occurred: failed calling webhook "baremetalhost.metal3.io": failed to call webhook: Post "https://baremetal-operator-webhook-service.baremetal-operator-system.svc:443/validate-metal3-io-v1alpha1-baremetalhost?timeout=10s": dial tcp 10.99.193.108:443: connect: connection refused
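
When triaging failed runs, it can help to check mechanically whether a failure carries this signature. The following is a minimal sketch in Go; the helper name isWebhookRefused is hypothetical and the two matched substrings are simply taken from the error above, so treat it as a heuristic rather than a reliable classifier.

```go
package main

import (
	"fmt"
	"strings"
)

// isWebhookRefused is a hypothetical helper that reports whether an error
// message looks like the flake described in this issue: the API server
// could not reach the webhook because the connection was refused.
func isWebhookRefused(msg string) bool {
	return strings.Contains(msg, "failed calling webhook") &&
		strings.Contains(msg, "connection refused")
}

func main() {
	msg := `Internal error occurred: failed calling webhook "baremetalhost.metal3.io": ` +
		`failed to call webhook: Post "...": dial tcp 10.99.193.108:443: connect: connection refused`
	fmt.Println(isWebhookRefused(msg)) // true
}
```

A match only says the webhook was unreachable. The BMO logs are still needed to confirm that a leader election failure was the cause.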

BMO logs:

E0521 02:09:39.324811       1 leaderelection.go:369] Failed to update lock: etcdserver: request timed out
E0521 02:09:42.298968       1 leaderelection.go:332] error retrieving resource lock baremetal-operator-system/baremetal-operator: Get "https://10.96.0.1:443/apis/coordination.k8s.io/v1/namespaces/baremetal-operator-system/leases/baremetal-operator": context deadline exceeded
I0521 02:09:42.299218       1 leaderelection.go:285] failed to renew lease baremetal-operator-system/baremetal-operator: timed out waiting for the condition
{"level":"info","ts":1716257382.3178487,"msg":"Stopping and waiting for non leader election runnables"}
{"level":"info","ts":1716257382.3179276,"msg":"Stopping and waiting for leader election runnables"}
{"level":"info","ts":1716257382.3179662,"msg":"Stopping and waiting for caches"}
{"level":"info","ts":1716257382.3181348,"msg":"Stopping and waiting for webhooks"}
{"level":"info","ts":1716257382.3182237,"msg":"Stopping and waiting for HTTP servers"}
{"level":"info","ts":1716257382.3182437,"msg":"Wait completed, proceeding to shutdown the manager"}

Anything else you would like to add:

We could possibly work around this, or at least improve the situation, by disabling leader election. I don't think this is a good idea though, since we may just be pushing the issue further down the line and making it even harder to understand why tests fail.
The only real solution is to ensure that the CI environment is performant enough to avoid these flakes.

/kind flake

Rozzii · 2024-05-29T13:50:08Z

/triage accepted

metal3-io-bot · 2024-08-27T14:42:58Z

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues will close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

/lifecycle stale

Rozzii · 2024-09-11T14:29:23Z

/remove-lifecycle stale

Rozzii · 2024-09-11T14:30:07Z

@Sunnatillo

metal3-io-bot added the kind/flake and needs-triage labels on May 22, 2024

metal3-io-bot added the triage/accepted label and removed the needs-triage label on May 29, 2024

metal3-io-bot added the lifecycle/stale label on Aug 27, 2024

metal3-io-bot removed the lifecycle/stale label on Sep 11, 2024
