[design] Integration with Fedora CoreOS auto-updates (Zincati strategy) #216

Closed
lucab opened this issue Oct 30, 2020 · 13 comments

@lucab

lucab commented Oct 30, 2020

This ticket may or may not be a subset of #122.

I'd like to explore the design space to see how kured and Fedora CoreOS (FCOS) auto-updates could be plugged together. In particular, I'd like to see kured as the in-cluster reboot coordinator directing the local update-agent (Zincati) on each node. This possibly requires some enhancements on both sides, and I'll be happy to drive the changes on FCOS side.

Zincati core logic implements a finite-state machine: https://github.com/coreos/zincati/blob/v0.0.13/docs/images/zincati-fsm.png
One key point here is that an update is first staged, and only later explicitly finalized (with a reboot) based on the locally configured update strategy. This means that a random reboot (i.e. not triggered by Zincati) will not apply a staged update.
It currently offers a few upgrade strategies: https://coreos.github.io/zincati/usage/updates-strategy/
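As a rough illustration of the staged-then-finalized behaviour described above, here is a toy model in Python. It is not Zincati's actual state machine; the `Node` class and its method names are hypothetical and only encode the two claims made in this comment: a plain reboot does not apply a staged update, and an explicit finalize step does.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    """Toy model of a node (hypothetical, not Zincati's real FSM)."""
    booted: str
    staged: Optional[str] = None

    def stage(self, version: str) -> None:
        # An update is staged, but not applied yet.
        self.staged = version

    def plain_reboot(self) -> None:
        # A reboot not triggered by the update agent (power loss, kured, ...):
        # the booted version stays the same. What happens to the staged
        # deployment itself is deliberately not modelled here.
        pass

    def finalize_and_reboot(self) -> None:
        # Explicit finalization: only this step switches to the staged version.
        if self.staged is None:
            raise RuntimeError("nothing staged")
        self.booted, self.staged = self.staged, None


node = Node(booted="v1")
node.stage("v2")
node.plain_reboot()
print(node.booted)   # still the old version
node.finalize_and_reboot()
print(node.booted)   # now the staged version
```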

One option here could be for kured to implement the FleetLock protocol, and to point Zincati at it.
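To make the FleetLock option more concrete, here is a minimal sketch of the server-side lock logic, written in Python for brevity (kured itself is written in Go). It assumes the protocol shape as I understand it: `POST /v1/pre-reboot` and `POST /v1/steady-state`, a `fleet-lock-protocol: true` header, and a JSON body carrying `client_params.id` and `client_params.group`. The `handle` function, the in-memory `locks` map, the status codes, and the error `kind` strings are illustrative assumptions, not kured's or Zincati's API.

```python
import json
from typing import Dict, Optional, Tuple

# In-memory lock state: reboot group -> id of the node holding the slot.
Locks = Dict[str, str]


def handle(path: str, headers: Dict[str, str], body: str,
           locks: Locks) -> Tuple[int, Optional[dict]]:
    """Process one request; header names are assumed to be lower-cased."""
    if headers.get("fleet-lock-protocol") != "true":
        return 400, {"kind": "bad_request", "value": "missing protocol header"}
    try:
        params = json.loads(body)["client_params"]
        node_id, group = params["id"], params["group"]
    except (ValueError, KeyError, TypeError):
        return 400, {"kind": "bad_request", "value": "malformed body"}
    if not (isinstance(node_id, str) and isinstance(group, str)):
        return 400, {"kind": "bad_request", "value": "malformed body"}

    holder = locks.get(group)
    if path == "/v1/pre-reboot":
        # Grant the slot if it is free or already held by the same node.
        if holder in (None, node_id):
            locks[group] = node_id
            return 200, None
        return 409, {"kind": "lock_held", "value": f"held by {holder}"}
    if path == "/v1/steady-state":
        # Release the slot only if this node holds it; otherwise a no-op.
        if holder == node_id:
            del locks[group]
        return 200, None
    return 404, {"kind": "not_found", "value": path}
```

In a kured-based implementation, granting the lock in the pre-reboot branch is presumably where draining the node would happen before answering with success.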

Another option could be that a new strategy is added to Zincati to signal the staged update with a file compatible with --reboot-sentinel, and then kured reboot logic is made configurable to write a file (allowing Zincati to finalize the update) instead of rebooting.

@lukasmrtvy

@lucab Look at https://github.com/poseidon/fleetlock (a few Prometheus metrics, no notifications)

@lucab
Author

lucab commented Nov 9, 2020

@lukasmrtvy thanks for the pointer, I'm aware of that project.

For context, I'm not looking into alternative solutions for "how to orchestrate Fedora CoreOS cluster-reboots".
I do contribute to Fedora CoreOS (and Zincati) development, and I'm specifically interested in hearing what can be done on our side to properly integrate kured, for FCOS users that may want to use it.

@lukasmrtvy

@lucab ah, thanks. It might be better to catch the devs on Slack (https://slack.weave.works), in #kured

@github-actions

github-actions bot commented Jan 9, 2021

This issue was automatically considered stale due to lack of activity. Please update it and/or join our slack channels to promote it, before it automatically closes (in 7 days).

@mysticaltech

@lucab That was a great initiative. Were you able to properly use kured with Zincati to auto-upgrade Kube nodes?

@lucab
Author

lucab commented Jan 26, 2022

@mysticaltech not really, there didn't seem to be much interest in writing down the requirements for a clear interface between the two pieces.
My offer to adapt Zincati for this is still open, but I don't have much knowledge about kured, nor do I use it, so I still need somebody on this project to explore approaches and write down the findings.

@mysticaltech

mysticaltech commented Jan 26, 2022

Thank you @lucab, good to know that's still an option! 🙏

If curious, please have a look at how Kubic supports kured:

https://en.opensuse.org/Kubic:Update_and_Reboot#Kubernetes_Reboot_Daemon

https://github.com/openSUSE/transactional-update/blob/master/sbin/transactional-update.in#L323

@lucab
Author

lucab commented Jan 26, 2022

By design, the "touch a file and let kured reboot" approach does not work in this case.

A random reboot (either triggered by kured, or by an external event like a power loss or a cloud-maintenance-forced-reboot) does not result in any spurious upgrade being applied.
That is a deliberate choice to avoid unexpected time-bombs when going through unrelated reboots (lesson learned after a huge amount of puzzled support tickets).
The FCOS upgrade flow has an explicit atomic finalize-and-reboot step, and it is usually performed by Zincati. As stated above, kured should be somehow tweaked to signal when that should happen, instead of trying to reboot the node.

For all the details, see my initial comment.

@mysticaltech

mysticaltech commented Jan 26, 2022

Thanks for this clarification.

Kured can periodically execute a sentinel command: if the command exits with status 0 (success), it reboots; if it exits with a non-zero status, it does not. The good thing is that, thanks to bash, it's easy to take the output of any command and channel the answer into either a success or a failure exit status.
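The exit-status convention described here can be sketched as follows. This is a Python rendering of the rule as stated in this comment, not kured's actual code; the `reboot_required` helper is a hypothetical name, and the two example commands simply exit with a chosen status.

```python
import subprocess
import sys
from typing import Sequence


def reboot_required(sentinel_command: Sequence[str]) -> bool:
    """Return True when the sentinel command exits with status 0."""
    result = subprocess.run(
        sentinel_command,
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


# Stand-in sentinel commands: one that succeeds, one that fails.
wants_reboot = [sys.executable, "-c", "raise SystemExit(0)"]
no_reboot = [sys.executable, "-c", "raise SystemExit(1)"]

print(reboot_required(wants_reboot))
print(reboot_required(no_reboot))
```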

Does Zincati have a command that would inform of the readiness to reboot? If so, then the matter is settled. If not, such a command would be amazing! 🙏

In the screenshot, we can see a Kured config with the command that was working for a normal Fedora server node.
[Screenshot: ksnip_20220126-142519]

@mysticaltech

mysticaltech commented Jan 26, 2022

Ah, but the reboot needs to be done by Zincati itself ("finalize-and-reboot", as mentioned above), so it won't work in the current form!

Basically, you just need Kured to do the draining and inform Zincati when it can reboot.

@mysticaltech

mysticaltech commented Jan 26, 2022

So the good news, @lucab, is that Kured does have a --reboot-command too (see the Kured config). So, to recap, we need two commands:

  • One to pass to --reboot-sentinel-command, which would inform Kured that the node wants to reboot, so that it starts the draining process etc.
  • And a second one, passed to --reboot-command and run by Kured on the node, to signal back to Zincati (or to the node itself) that it can perform the finalize-and-reboot process, since the node has been made ready by Kured.

What do you think?

@lucab
Author

lucab commented Jan 26, 2022

Yes, something like that.
Although Zincati and kured run in different namespaces, so permissions and bind-mounts will need to be figured out upfront (for FS and/or dbus).
And edge-triggers (i.e. commands) are quite poor primitives when service restarts are taken into account, so we tend to favor level-triggers and controller loops.
Overall this will need a new strategy in Zincati, but perhaps it can piggy-back on coreos/zincati#540.

@mysticaltech

mysticaltech commented Jan 26, 2022

Sounds good! Though the terminology is a little over my head, as I am new to CoreOS. Maybe "touching" the respective empty files could be enough to pass the info without worrying about namespaces and permissions.

  • --reboot-sentinel '/tmp/pending_finalize_reboot' (as soon as found Kured will drain the node)
  • --reboot-command 'touch /tmp/node_drained_ready_to_reboot' (Zincati would know it can proceed)

That would most likely work if coreos/zincati#540 is implemented.
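A minimal simulation of this two-file handshake, assuming both sides poll for the marker files (level-triggered) rather than react to one-off commands. Everything here is hypothetical: the function names, the callbacks, and the use of a temporary directory in place of `/tmp`. Neither kured nor Zincati is confirmed to behave this way.

```python
import tempfile
from pathlib import Path
from typing import Callable, List

PENDING = "pending_finalize_reboot"        # written by the update agent
DRAINED = "node_drained_ready_to_reboot"   # written by the reboot coordinator


def agent_stage(root: Path) -> None:
    """Update agent: announce that a staged update awaits finalization."""
    (root / PENDING).touch()


def coordinator_tick(root: Path, drain: Callable[[], None]) -> bool:
    """Coordinator: drain once a pending marker exists, then signal back."""
    if (root / PENDING).exists() and not (root / DRAINED).exists():
        drain()
        (root / DRAINED).touch()
        return True
    return False


def agent_tick(root: Path, finalize: Callable[[], None]) -> bool:
    """Update agent: finalize only after the coordinator has signalled."""
    if (root / PENDING).exists() and (root / DRAINED).exists():
        # Remove both markers so stale state does not survive the reboot.
        (root / PENDING).unlink()
        (root / DRAINED).unlink()
        finalize()
        return True
    return False


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    events: List[str] = []
    agent_stage(root)
    coordinator_tick(root, lambda: events.append("drain"))
    coordinator_tick(root, lambda: events.append("drain"))  # no-op: already drained
    agent_tick(root, lambda: events.append("finalize-and-reboot"))
    print(events)
```

Because each tick only inspects the current state of the files, either side can restart at any point and pick up where it left off, which is the property the level-trigger remark above is asking for.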
