flakes in clusterctl upgrade tests #11133
I'll have a look at the "Failed to create kind cluster" issue, as I already noticed something similar on my own kind setup and I think it's not isolated: kubernetes-sigs/kind#3554 - I guess it's something to fix upstream. EDIT: It seems to be an issue with inotify limits.
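For anyone reproducing this locally: kind's known-issues page calls out the fs.inotify.max_user_watches / fs.inotify.max_user_instances sysctls as a common cause of this kind of cluster-creation failure. As a minimal sketch for checking a host (assuming a Linux machine with the standard procfs layout; this is not part of the test suite):

```go
// Minimal sketch: print the host inotify limits that kind nodes inherit
// from the host kernel, read from the standard Linux procfs locations.
package main

import (
	"fmt"
	"os"
	"strings"
)

func main() {
	for _, name := range []string{"max_user_instances", "max_user_watches"} {
		data, err := os.ReadFile("/proc/sys/fs/inotify/" + name)
		if err != nil {
			fmt.Fprintf(os.Stderr, "read %s: %v\n", name, err)
			continue
		}
		fmt.Printf("fs.inotify.%s = %s\n", name, strings.TrimSpace(string(data)))
	}
}
```

If the values are low, raising them via sysctl (as the tune-sysctls daemonset linked below is meant to do) is the usual fix.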
That sounds very suspicious. Maybe a good start here would be to collect data about the actual values in use :-)
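A hedged sketch of how such data could be collected on a node (assuming procfs access and enough privileges to read other processes' fd tables; this is illustrative, not part of the repo): count open file descriptors whose link target is anon_inode:inotify, which correspond to in-use inotify instances.

```go
// Sketch: rough per-process count of in-use inotify instances, by walking
// /proc/<pid>/fd and counting descriptors that link to "anon_inode:inotify".
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

func main() {
	pids, _ := filepath.Glob("/proc/[0-9]*")
	total := 0
	for _, pid := range pids {
		fds, err := os.ReadDir(filepath.Join(pid, "fd"))
		if err != nil {
			continue // process exited, or permission denied
		}
		count := 0
		for _, fd := range fds {
			target, err := os.Readlink(filepath.Join(pid, "fd", fd.Name()))
			if err == nil && target == "anon_inode:inotify" {
				count++
			}
		}
		if count > 0 {
			fmt.Printf("%s: %d inotify instance(s)\n", pid, count)
			total += count
		}
	}
	fmt.Printf("total in-use inotify instances: %d\n", total)
}
```

Comparing the total against fs.inotify.max_user_instances (which applies per user) before and during a test run would show whether the limit is actually being approached.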
I don't know if we're running https://github.com/kubernetes/k8s.io/blob/3f2c06a3c547765e21dce65d0adcb1144a93b518/infra/aws/terraform/prow-build-cluster/resources/kube-system/tune-sysctls_daemonset.yaml#L4 there or not. Also, perhaps something else on the cluster is using a lot of them.
I can confirm the daemonset runs on the EKS cluster.
Thanks folks for confirming that the daemonset is correctly setting the sysctl parameters - so the error might be elsewhere. I noticed something else while reading the logs of a failing test:
While on a non-failing setup:
We can see that the [...]
It's possible? This part shouldn't really take long though... I suspect that would be a noisy-neighbor problem on the EKS cluster (I/O?). That doesn't explain the inotify-exhaustion-like failures, though.
We recently increased concurrency in our tests. With that we were able to reduce Job durations from 2h to 1h. We thought it was a nice way to save us time and the community money. Maybe we have to roll that back.
Do you remember when this change was applied? Those Kind failures seem to have started at the end of August.
That makes sense. Ordinarily this part shouldn't take long; it doesn't need to fetch anything over the network and it should be pretty fast. But in a resource-starved environment it might take too long. In that environment I would also expect Kubernetes to be unstable though; api-server/etcd will be timing out if we make it that far.
carrying over some updates from the split-off issue -- we have seen great improvements in the flakiness of the e2e tests after reverting the concurrency increase:
the updated plan/guidance for the rest of this release cycle regarding these e2e flakes is here: #11209 (comment)
summarized by @chrischdi 🙇
According to the aggregated failures of the last two weeks, we still have some flakiness in our clusterctl upgrade tests.
- 36 failures: "Timed out waiting for all Machines to exist" - split off into: clusterctl upgrade "Timed out waiting for all Machines to exist" #11209
- 16 failures: "Failed to create kind cluster"
- 14 failures: "Internal error occurred: failed calling webhook [...] connect: connection refused"
- 7 failures: "x509: certificate signed by unknown authority"
- 5 failures: "Timed out waiting for Machine Deployment clusterctl-upgrade/clusterctl-upgrade-workload-... to have 2 replicas"
- 2 failures: "Timed out waiting for Cluster clusterctl-upgrade/clusterctl-upgrade-workload-... to provision"
Link to check whether messages have changed or we have new flakes in the clusterctl upgrade tests: here
/kind flake