[Flake] Etcd timeout -> leader election failure -> webhook down #1743
Labels
kind/flake
Categorizes issue or PR as related to a flaky test.
triage/accepted
Indicates an issue is ready to be actively worked on.
This issue is mostly to document and keep track of the test failures. The problem is not in BMO itself but rather a performance issue in the CI system.
Which jobs are flaking
Possibly all jobs running on Jenkins workers in Xerces.
It has been observed in BMO e2e tests at least.
Reason for failure (if possible):
Occasionally we see tests fail with a "failed to call webhook" error (see logs below), even though the webhook was working just before and no changes were made to it. Checking the BMO logs reveals that the underlying issue is with etcd: BMO is unable to renew its lease or perform leader election, so it stops and then restarts. While BMO is restarting, the webhook refuses connections.
Test logs:
BMO logs:
Anything else you would like to add:
We could possibly work around this, or at least improve the situation, by disabling leader election. I don't think this is a good idea though, since we may just be pushing the issue further down the line and making it even harder to understand why tests fail.
The only real solution is to ensure that the CI environment is performant enough to avoid these flakes.
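
To illustrate why a slow etcd turns into a restart, here is a minimal, simplified sketch of the renew loop described above. It is not BMO or client-go code. The names `Attempt` and `lease_renewed` are hypothetical, and the default values of 10 s and 2 s are assumed to match controller-runtime's default renew deadline and retry period; the settings BMO actually uses in CI are not stated in this issue.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class Attempt:
    """One simulated lease-renewal call against the API server (backed by etcd)."""
    latency: float  # seconds the call took before returning
    ok: bool        # True if the call succeeded, False if it timed out or errored


def lease_renewed(
    attempts: Iterable[Attempt],
    renew_deadline: float = 10.0,  # assumed default, see lead-in
    retry_period: float = 2.0,     # assumed default, see lead-in
) -> bool:
    """Return True if some attempt succeeds before renew_deadline elapses.

    A False result models the leader giving up its lease, after which
    the manager process stops and has to be restarted.
    """
    elapsed = 0.0
    for attempt in attempts:
        elapsed += attempt.latency
        if elapsed > renew_deadline:
            return False  # deadline passed while waiting on etcd
        if attempt.ok:
            return True   # lease renewed in time
        elapsed += retry_period  # wait before the next try
    return False  # ran out of attempts without a successful renewal


# Healthy cluster: the first call returns quickly.
healthy = lease_renewed([Attempt(0.05, True)])

# Slow etcd: two calls of 5 s each plus a 2 s pause exceed the 10 s deadline.
slow_etcd = lease_renewed([Attempt(5.0, False), Attempt(5.0, False)])

# Brief hiccup: one failed call, then a fast successful retry.
hiccup = lease_renewed([Attempt(3.0, False), Attempt(0.1, True)])
```

In this model only the sustained slowdown loses the lease, which matches the observation that the failures are occasional and tied to CI performance rather than to a change in the webhook.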
/kind flake