flakes in clusterctl upgrade tests #11133
I'll have a look at the "Failed to create kind cluster" issue, as I already noticed something similar on my own kind setup and I think it's not isolated: kubernetes-sigs/kind#3554 - I guess it's something to fix upstream. EDIT: It seems to be an issue with inotify limits.
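For anyone reproducing this locally: kind's known-issues page calls out the fs.inotify.max_user_watches / fs.inotify.max_user_instances sysctls as a common cause of this kind of cluster-creation failure. As a minimal sketch for checking a host (assuming a Linux machine with the standard procfs layout; this is not part of the test suite):

```go
// Minimal sketch: print the host inotify limits that kind nodes inherit
// from the host kernel, read from the standard Linux procfs locations.
package main

import (
	"fmt"
	"os"
	"strings"
)

func main() {
	for _, name := range []string{"max_user_instances", "max_user_watches"} {
		data, err := os.ReadFile("/proc/sys/fs/inotify/" + name)
		if err != nil {
			fmt.Fprintf(os.Stderr, "read %s: %v\n", name, err)
			continue
		}
		fmt.Printf("fs.inotify.%s = %s\n", name, strings.TrimSpace(string(data)))
	}
}
```

If the values are low, raising them via sysctl (as the tune-sysctls daemonset linked below is meant to do) is the usual fix.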
That sounds very suspicious. Maybe a good start here would be to collect data about the actual values in use :-)
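A hedged sketch of how such data could be collected on a node (assuming procfs access and enough privileges to read other processes' fd tables; this is illustrative, not part of the repo): count open file descriptors whose link target is anon_inode:inotify, which correspond to in-use inotify instances.

```go
// Sketch: rough per-process count of in-use inotify instances, by walking
// /proc/<pid>/fd and counting descriptors that link to "anon_inode:inotify".
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

func main() {
	pids, _ := filepath.Glob("/proc/[0-9]*")
	total := 0
	for _, pid := range pids {
		fds, err := os.ReadDir(filepath.Join(pid, "fd"))
		if err != nil {
			continue // process exited, or permission denied
		}
		count := 0
		for _, fd := range fds {
			target, err := os.Readlink(filepath.Join(pid, "fd", fd.Name()))
			if err == nil && target == "anon_inode:inotify" {
				count++
			}
		}
		if count > 0 {
			fmt.Printf("%s: %d inotify instance(s)\n", pid, count)
			total += count
		}
	}
	fmt.Printf("total in-use inotify instances: %d\n", total)
}
```

Comparing the total against fs.inotify.max_user_instances (which applies per user) before and during a test run would show whether the limit is actually being approached.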
I don't know if we're running https://github.com/kubernetes/k8s.io/blob/3f2c06a3c547765e21dce65d0adcb1144a93b518/infra/aws/terraform/prow-build-cluster/resources/kube-system/tune-sysctls_daemonset.yaml#L4 there or not. Also, perhaps something else on the cluster is using a lot of them.
I can confirm the daemonset runs on the EKS cluster.
Thanks folks for confirming that the daemonset is correctly setting the sysctl parameters - so the error might be elsewhere. I noticed something else while reading the logs of a failing test:
While on a non-failing setup:
We can see that the [...]
It's possible? This part shouldn't really take long though... I suspect that would be a noisy-neighbor problem on the EKS cluster (I/O?). That doesn't explain the inotify-exhaustion-like failures, though.
We recently increased concurrency in our tests. With that we were able to reduce Job durations from 2h to 1h. We thought it was a nice way to save us time and the community money. Maybe we have to roll that back.
Do you remember when this change was applied? Those Kind failures seem to have started at the end of August.
That makes sense. Ordinarily this part shouldn't take long; it doesn't need to fetch anything over the network and it should be pretty fast. But in a resource-starved environment it might take too long. In that environment I would also expect Kubernetes to be unstable though; api-server/etcd will be timing out if we make it that far.
carrying over some updates from the split-off issue -- we have seen great improvements in the flakiness of the e2e tests after reverting the concurrency increase:
the updated plan/guidance for the rest of this release cycle regarding these e2e flakes is here: #11209 (comment)
summarized by @chrischdi 🙇
According to the aggregated failures of the last two weeks, we still have some flakiness in our clusterctl upgrade tests.
- 36 failures: "Timed out waiting for all Machines to exist" - split off into: clusterctl upgrade "Timed out waiting for all Machines to exist" #11209
- 16 failures: "Failed to create kind cluster"
- 14 failures: "Internal error occurred: failed calling webhook [...] connect: connection refused"
- 7 failures: "x509: certificate signed by unknown authority"
- 5 failures: "Timed out waiting for Machine Deployment clusterctl-upgrade/clusterctl-upgrade-workload-... to have 2 replicas"
- 2 failures: "Timed out waiting for Cluster clusterctl-upgrade/clusterctl-upgrade-workload-... to provision"
Link to check whether messages have changed or we have new flakes in the clusterctl upgrade tests: here
/kind flake