ClusterAPI Machine stuck in "Pending" indefinitely #1222
There's no way we can guess this. As always in CAPI, it makes sense to inspect all states of all resources.
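For reference, a minimal way to do that inspection (a sketch, assuming the cluster named cluster-0 from the output below, the default namespace, and the default Sidero/Talos CRD names):

# Show every CAPI object for the cluster together with its conditions.
clusterctl describe cluster cluster-0 --show-conditions all

# Dump the raw objects so the status and conditions can be read directly.
kubectl get cluster,machines,taloscontrolplanes,talosconfigs,metalclusters,metalmachines -A -o wide
kubectl describe machines -A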
cabpt-controller-manager-5687d76d6f-55xg6 manager 2023-10-16T10:36:15Z INFO Starting Controller {"controller": "talosconfig", "controllerGroup": "bootstrap.cluster.x-k8s.io", "controllerKind": "TalosConfig"}
cabpt-controller-manager-5687d76d6f-55xg6 manager W1016 10:36:15.626507 1 reflector.go:533] /.cache/mod/k8s.io/[email protected]/tools/cache/reflector.go:231: failed to list *v1beta1.MachinePool: machinepools.cluster.x-k8s.io is forbidden: User "system:serviceaccount:cabpt-system:default" cannot list resource "machinepools" in API group "cluster.x-k8s.io" at the cluster scope

After looking through the logs, this seems like it might be an RBAC issue, maybe?
NAME                                                         READY  SEVERITY  REASON                          SINCE  MESSAGE
Cluster/cluster-0                                            False  Error     BootstrapTemplateCloningFailed  16m    Failed to create bootstrap configuration: Internal error occurred: failed calling webhook "vtaloscon ...
├─ClusterInfrastructure - MetalCluster/cluster-0
└─ControlPlane - TalosControlPlane/cluster-0-cp              False  Error     BootstrapTemplateCloningFailed  16m    Failed to create bootstrap configuration: Internal error occurred: failed calling webhook "vtaloscon ...
  └─Machine/cluster-0-cp-szn6c                               False  Info      WaitingForInfrastructure        15m    0 of 2 completed
    ├─BootstrapConfig - TalosConfig/cluster-0-cp-f5jj9
    └─MachineInfrastructure - MetalMachine/cluster-0-cp-r8df7
Looks like it's the failure to call a webhook; the MachinePools error is probably a different issue. Either way, as it works in Sidero integration tests, something is up with your setup (?)
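One way to check whether the bootstrap provider's validating webhook is actually reachable (a sketch; the namespace and deployment names are taken from the logs above, the rest are assumptions):

# Look for the TalosConfig validating/mutating webhook configurations.
kubectl get validatingwebhookconfigurations
kubectl get mutatingwebhookconfigurations

# Check that the cabpt webhook service has endpoints and the controller pod is healthy.
kubectl -n cabpt-system get pods,svc,endpoints
kubectl -n cabpt-system logs deploy/cabpt-controller-manager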
Mhm, it's a fresh setup, and (what I think to be) the same setup worked with 0.5.8 and 0.6.0 (with the firewall configured to block ports 67 and 68) 🤔 I'm running clusterctl version 1.5.2 and Kubernetes version 1.27.6.

I was just looking at the cabpt-controller-manager because it appears to crash every few minutes:

E1016 11:00:17.100532 1 reflector.go:148] /.cache/mod/k8s.io/[email protected]/tools/cache/reflector.go:231: Failed to watch *v1beta1.MachinePool: failed to list *v1beta1.MachinePool: machinepools.cluster.x-k8s.io is forbidden: User "system:serviceaccount:cabpt-system:default" cannot list resource "machinepools" in API group "cluster.x-k8s.io" at the cluster scope
2023-10-16T11:01:08Z ERROR Could not wait for Cache to sync {"controller": "talosconfig", "controllerGroup": "bootstrap.cluster.x-k8s.io", "controllerKind": "TalosConfig", "error": "failed to wait for talosconfig caches to sync: timed out waiting for cache to be synced for Kind *v1alpha3.TalosConfig"}
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.1
/.cache/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:202
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2
/.cache/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:207
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start
/.cache/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:233
sigs.k8s.io/controller-runtime/pkg/manager.(*runnableGroup).reconcile.func1
/.cache/mod/sigs.k8s.io/[email protected]/pkg/manager/runnable_group.go:219
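One quick way to confirm whether this really is missing RBAC (a sketch, using the service account named in the error and the cabpt-manager-role ClusterRole mentioned below):

# Ask the API server whether the cabpt service account may list MachinePools cluster-wide.
kubectl auth can-i list machinepools.cluster.x-k8s.io \
  --as=system:serviceaccount:cabpt-system:default --all-namespaces

# Inspect the rules currently granted to the bootstrap provider.
kubectl describe clusterrole cabpt-manager-role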
Info: I added this rule to the cabpt-manager-role ClusterRole, and now it appears to work:

rules:
- apiGroups:
  - '*'
  resources:
  - '*'
  verbs:
  - '*'

To me it seems like an RBAC issue, though it's unclear yet why that is. One possibility might be that ClusterAPI changed its apiGroups for some resources, but I guess that would be noted as a breaking change. Just from looking at

Note: I also have the ClusterAPI Provider for Azure installed; maybe there is a conflict in apiGroups? 🤔
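If the missing permission really is just MachinePool list/watch, a narrower rule than the wildcard above should be enough (a sketch; whether the provider should ship this rule itself is not confirmed here):

# Append a read-only MachinePool rule to the existing ClusterRole instead of granting '*'.
kubectl patch clusterrole cabpt-manager-role --type=json -p='[
  {"op": "add", "path": "/rules/-", "value": {
    "apiGroups": ["cluster.x-k8s.io"],
    "resources": ["machinepools"],
    "verbs": ["get", "list", "watch"]
  }}
]'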
After a full reinstallation of Sidero Metal with CAPI etc., I have the problem that my machine doesn't boot into Talos.
The first PXE boot worked perfectly, discovery etc. worked, and the BMC entry is present in the Server (and works, tested with ipmitool). However, now that I have applied the cluster manifests, my machine doesn't boot (neither when I boot it manually nor via IPMI). It seems like Sidero doesn't use the (available) server since it's not allocated (see output below).
Any ideas?
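A few commands that might show whether the Server ever got allocated (a sketch, assuming the default Sidero CRD names; printed columns may vary between versions):

# Servers and ServerClasses registered with Sidero, including allocation status columns.
kubectl get servers -o wide
kubectl get serverclasses

# A ServerBinding ties an allocated Server to a MetalMachine; none present means no allocation happened.
kubectl get serverbindings
kubectl get metalmachines -A -o wide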
I also found this log line: