TalosControlPlane unable to scale down replaced node #193

Open
omniproc opened this issue Apr 5, 2024 · 8 comments

Comments

@omniproc

omniproc commented Apr 5, 2024

When doing a rolling update under certain conditions, the update will never finish.
Steps to reproduce:

  1. Create a new CAPI Management cluster that makes use of CACPPT
  2. Use a single-node control-plane setup
  3. Start an update of the control-plane machine by updating the TalosControlPlane resource

What happens:

  • TalosControlPlane starts a rolling update by creating a new Machine
  • The new Machine is created by whatever Infrastructure provider is used
  • etcd content is copied over from the old control-plane node to the new control-plane node
  • KubeAPI and other services are stopped on the old control-plane node and started on the new one. This includes the CACPPT operator
  • At this stage the rolling update is finished, but the TalosControlPlane resource is unable to scale down to 1 and never deletes the old Machine of the old control-plane node

How to solve the problem:

  • Manually delete the Machine of the old control-plane node. The infrastructure provider in use will then handle the deletion of the node, and the TalosControlPlane resource will scale down to 1 and become ready again.

What should happen:

  • When the CACPPT operator is restarted on the new control-plane node, it should take over where the operator that was stopped on the old control-plane node left off and delete the Machine resource of the old control-plane node.

Note: this issue only happens if two conditions are met:

  1. You run a rolling update of the CAPI management cluster (workload clusters are not impacted because the CACPPT operator only runs on the management cluster)
  2. The control plane consists of a single node. Control planes with multiple nodes configured for high availability have not been tested, but I expect the problem won't show up there, since the CACPPT operator should never exit abruptly during the "hand over" of the control plane the way it does in a single-node setup.
@smira
Member

smira commented Apr 5, 2024

We don't recommend hosting CAPI components in the cluster managed by the same CAPI setup. It is going to cause various issues.

@omniproc
Author

omniproc commented Apr 5, 2024

@smira Thanks for the reply. Is that specifically mentioned somewhere in the docs?
It's a setup supported by CAPI in general, so what issues did you observe with it?

@smira
Member

smira commented Apr 5, 2024

If your management cluster goes down for whatever reason, there is no easy way to recover. You can try this setup, but I would never recommend it.

@omniproc
Author

omniproc commented Apr 5, 2024

Well, sure. But that's a general design flaw of CAPI. It's even worse than that because kubernetes-sigs/cluster-api#7061 exists and it doesn't seem like there will be a fix for it anytime soon.
You could still use talosctl to get the management cluster up and running again, couldn't you? Besides that: having an etcd backup and restore process is another, unrelated requirement for production systems, I'd argue.

@Preisschild
Contributor

Preisschild commented May 13, 2024

I think the issue could be fixed by deleting the machine prior to gracefulEtcdLeave.

https://github.com/siderolabs/cluster-api-control-plane-provider-talos/blob/main/controllers/scale.go#L125
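A minimal sketch of that reordering, assuming a controller-runtime client and a gracefulEtcdLeave helper similar to the one linked above (names and signatures are illustrative, not the actual scale.go code):

```go
package controllers

import (
	"context"

	apierrors "k8s.io/apimachinery/pkg/api/errors"
	clusterv1 "sigs.k8s.io/cluster-api/api/v1beta1"
	"sigs.k8s.io/controller-runtime/pkg/client"
	"sigs.k8s.io/controller-runtime/pkg/log"
)

// deleteMachineThenLeaveEtcd deletes the CAPI Machine before attempting the
// etcd leave, so a controller restart mid-rollout cannot strand the old Machine.
func deleteMachineThenLeaveEtcd(
	ctx context.Context,
	c client.Client,
	machine *clusterv1.Machine,
	gracefulEtcdLeave func(context.Context, *clusterv1.Machine) error,
) error {
	logger := log.FromContext(ctx)

	// Delete the Machine first; scaling down no longer depends on etcd being
	// healthy on both the old and the new node at the same time.
	if err := c.Delete(ctx, machine); err != nil && !apierrors.IsNotFound(err) {
		return err
	}

	// Best-effort etcd member removal afterwards; the member may already be gone.
	if err := gracefulEtcdLeave(ctx, machine); err != nil {
		logger.Error(err, "graceful etcd leave failed", "machine", machine.Name)
	}
	return nil
}
```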

@omniproc
Author

omniproc commented May 18, 2024

I think the issue could be fixed by deleting the machine prior to gracefulEtcdLeave.

https://github.com/siderolabs/cluster-api-control-plane-provider-talos/blob/main/controllers/scale.go#L125

The machine could be annotated with a CAPI pre-terminate lifecycle hook to block infraMachine deletion until gracefulEtcdLeave() is finished
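A rough sketch of what setting and clearing such a hook could look like; the annotation prefix is the standard CAPI pre-terminate lifecycle-hook prefix, but the hook name, value, and helper functions below are made up for illustration:

```go
package controllers

import (
	"context"

	clusterv1 "sigs.k8s.io/cluster-api/api/v1beta1"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// Hypothetical hook name; only the prefix before the slash is the standard CAPI one.
const etcdLeaveHook = "pre-terminate.delete.hook.machine.cluster.x-k8s.io/cacppt-etcd-leave"

// addEtcdLeaveHook blocks InfraMachine deletion until the annotation is removed again.
func addEtcdLeaveHook(ctx context.Context, c client.Client, machine *clusterv1.Machine) error {
	patch := client.MergeFrom(machine.DeepCopy())
	if machine.Annotations == nil {
		machine.Annotations = map[string]string{}
	}
	machine.Annotations[etcdLeaveHook] = "cacppt"
	return c.Patch(ctx, machine, patch)
}

// removeEtcdLeaveHook lets the Machine deletion proceed once gracefulEtcdLeave()
// has finished.
func removeEtcdLeaveHook(ctx context.Context, c client.Client, machine *clusterv1.Machine) error {
	patch := client.MergeFrom(machine.DeepCopy())
	delete(machine.Annotations, etcdLeaveHook)
	return c.Patch(ctx, machine, patch)
}
```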

I can confirm that the issue seems to be exactly that: the controller is waiting for etcd to become healthy on 2 nodes (single control-plane scenario in this case), which is only the case for a very short time. If the controller reconciles exactly during that time, the upgrade process will continue. Otherwise it will get stuck waiting for two nodes to become healthy while the old one is already being shut down:

controllers.TalosControlPlane verifying etcd health on all nodes {"node": "old", "node": "new"}
controllers.TalosControlPlane rolling out control plane machines {"namespace": "default", "talosControlPlane": "xxx", "needRollout": ["new"]}
controllers.TalosControlPlane waiting for etcd to become healthy before scaling down

@Preisschild
Contributor

kubernetes-sigs/cluster-api#2651

It seems that the Kubeadm Control Plane Provider had the same issue, but they fixed it (by, as far as I understand, treating control-plane machines whose etcd member has already been stopped as healthy, so that when the reconcile loop runs again the machine gets deleted).
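A hedged sketch of that idea applied here, assuming the health check iterates over the control plane's Machines (helper names are illustrative, not the kubeadm provider's actual code):

```go
package controllers

import (
	clusterv1 "sigs.k8s.io/cluster-api/api/v1beta1"
)

// etcdHealthBlocksScaleDown reports whether an unhealthy etcd member should
// still block scale-down. Machines that are already being deleted are skipped,
// so a restarted reconcile loop can finish removing the old control-plane node.
func etcdHealthBlocksScaleDown(machines []*clusterv1.Machine, healthy func(*clusterv1.Machine) bool) bool {
	for _, m := range machines {
		if !m.DeletionTimestamp.IsZero() {
			// The old control-plane machine is already going away; don't wait
			// for its etcd member to report healthy.
			continue
		}
		if !healthy(m) {
			return true
		}
	}
	return false
}
```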

@Preisschild
Contributor

I noticed today that this problem occurs whenever the capi-controller-manager deployment in the capi-system namespace is restarted while a control-plane rollout is in progress.

It doesn't matter which workload cluster is being rolled out.
