
Feature Request: Make it possible to pin the target release #947

Open
sgohl opened this issue Sep 1, 2021 · 6 comments

Comments

sgohl commented Sep 1, 2021

Describe the enhancement

To ensure a stable and consistent working landscape, it would be very helpful to pin Fedora CoreOS to a specific version: the highest release that Zincati is allowed to upgrade to.

Otherwise, systems may end up on arbitrary versions, which gets even more complicated with update schedules and different wariness settings.

If I encounter a bug in a new release, I want to prevent all systems from upgrading to that version.
Rollback is not sufficient in this case because it cannot be done preventively, and only machine by machine.

I wish this could be made possible with a drop-in file in /etc/zincati/config.d, like this:

[updates]
max_target_release = "34.20210808.3.0"

If this setting does not exist, nothing will be changed for anyone.
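For illustration, the gate such a setting implies is just a field-by-field numeric comparison of FCOS release strings. A minimal sketch (note that `max_target_release` is only proposed here, not an existing Zincati option):

```python
# Hypothetical sketch of the gate a "max_target_release" setting implies.
# FCOS release strings like "34.20210808.3.0" order correctly when each
# dot-separated field is compared numerically.

def parse_release(release):
    """Split an FCOS release string into a tuple of ints for comparison."""
    return tuple(int(part) for part in release.split("."))

def allowed_target(candidate, max_target):
    """True if the candidate release does not exceed the pinned maximum."""
    if max_target is None:  # no pin configured: behave exactly as today
        return True
    return parse_release(candidate) <= parse_release(max_target)

print(allowed_target("34.20210808.3.0", "34.20210808.3.0"))  # True
print(allowed_target("34.20210903.3.0", "34.20210808.3.0"))  # False
```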

And yes, I understand that you actually want every system to be as up to date as possible, but it is not your head on the line when systems crash, so please let us decide on our own and keep control over our lifecycle management :)

System details

  • Bare Metal/QEMU/AWS/GCP/etc. -> any
  • Fedora CoreOS version -> any

Additional information
n/a

Edit: To keep this from being rejected outright, I could imagine a compromise: say, a limit of x releases that we are allowed to skip, or a warning printed in the motd, or something along those lines...

dustymabe (Member) commented Sep 1, 2021

Hey @sgohl, thanks for the feature request. Our expert on this topic will be back next week, so we'll probably discuss this in next week's meeting.

dustymabe added the meeting topics for meetings label Sep 1, 2021
travier (Member) commented Sep 1, 2021

Might be similar to coreos/zincati#245 & coreos/zincati#540

lucab (Contributor) commented Sep 16, 2021

@sgohl thanks for the report. This looks like an interesting RFE at its heart, but it possibly needs to be refined/scoped a bit in order to turn it into a viable implementation.

I'd start by putting aside the initial config.d proposal for now. Among other things, the content in /etc gets versioned with OS deployment so upgrades/rollbacks are going to wreak havoc with any kind of rolling/live data.

Taking a step back, it would be useful to get a better view on the actual problem and the surrounding environment you have at hand.
It sounds like you are trying to steer updates through a fleet of nodes (i.e. not handling a single machine), am I reading this right?
And you are looking for a mechanism to obtain homogeneous OS versions, correct? Plus some kind of oracle / canary system to select viable update targets?

If that is the case, we should probably drill down on how cluster coordination is performed in your environment. Specifically, whether there is a central coordinator pushing live signals to all nodes, or whether each node is individually pulling fresh details from a coordinator.

lucab removed the meeting topics for meetings label Sep 16, 2021
sgohl (Author) commented Sep 17, 2021

Hi, and many thanks for your interest in this case!

steer updates through a fleet of nodes (i.e. not handling a single machine), am I reading this right?
And you are looking for a mechanism to obtain homogeneous OS versions, correct? Plus some kind of oracle / canary system to select viable update targets?

Yes, my case is multiple datacenters with lots of single nodes, loosely coupled node groups, and many clusters with a variable node count each (nodes join and leave, think of CloudFormation); all in all about 600-1000 VMs and bare-metal machines, almost all of them Fedora CoreOS.
I mainly want all "important" nodes to run a specific, pinned CoreOS release, to avoid hitting a known bug on auto-updating nodes one by one and then doing rollbacks (god, no). Edit: I am still largely using Docker Swarm, which I suppose will effectively be deprecated by cgroups v2 in the future; since I have not migrated to k8s yet, this is definitely something I'm afraid of ^^

This is closely related to no longer having a manual update mechanism.

Besides the many ephemeral and testing machines on different streams with always/immediate updates, and the more highly available nodes with scheduled updates such as staging systems, the problem starts with HA systems that have no update schedule at all, because not even 10 seconds of downtime is acceptable and some applications need manual preparation and intervention. Schedules and FleetLock are not nearly enough.

The instant update from the old CoreOS is sorely missed; I would pay to have it back. You may think this would be a bad idea that goes against the concept, because it causes systems to be older, but no, the opposite is the case. If I could fire an immediate update with a forced reboot, I could, for example:

  • control when an organized group of nodes is updated, at any time I want (again, schedules are not sufficient)
  • let system users choose the best time for an update themselves, which will happen more often than never, because right now it is off the table
  • include this in host provisioning scripts (which I can't, because Zincati is so asynchronous and won't reboot with an open tty or running rpm-ostree actions, etc.); I spent hours and hours trying to solve this with lock files and checks, but without success (always a chicken-and-egg situation)
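On the provisioning chicken-and-egg problem: one way to avoid racing rpm-ostree is to poll `rpm-ostree status --json` and only proceed once no transaction is in flight. A rough sketch, assuming the `transaction` field in the JSON output is null or absent when the daemon is idle (worth verifying against your rpm-ostree version):

```python
import json
import subprocess
import time

def rpm_ostree_idle(status):
    """True when parsed `rpm-ostree status --json` output shows no
    transaction in flight (assumes the "transaction" key is null or
    absent when the daemon is idle)."""
    return not status.get("transaction")

def wait_until_idle(poll_seconds=5):
    """Block until rpm-ostree reports idle (needs a CoreOS host to run)."""
    while True:
        raw = subprocess.check_output(["rpm-ostree", "status", "--json"])
        if rpm_ostree_idle(json.loads(raw)):
            return
        time.sleep(poll_seconds)

# The parsing half can be exercised without a CoreOS host:
print(rpm_ostree_idle({"transaction": None}))              # True
print(rpm_ostree_idle({"transaction": ["upgrade", "-"]}))  # False
```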

Auto-updates are nice for systems where high availability is not a big concern.

Some applications need seconds to minutes to return to full availability: health checks need time to drain backends, schedulers need to move services and prepare/pull images on the new host right before a Zincati update, and so on.
Intelligence is needed where we don't have any, so we have to do certain things manually; these are just real-world issues. High availability is simply more important.

the content in /etc gets versioned with OS deployment so upgrades/rollbacks are going to wreak havoc with any kind of rolling/live data.

Even if we put it there via Ignition? We already do this with lots of other files anyway, and the update strategy is already modified via Ignition.

putting aside the initial config.d proposal

Yes, that would be very static if you see it as just that. I would then add a Consul watcher service, or another simple approach, to make it centrally manageable.

dustymabe (Member) commented

all in all about 600-1000 vms and bare-metals - almost everything Fedora CoreOS

❤️

sgohl (Author) commented Sep 21, 2021

Out of curiosity, couldn't we have something like a Cincinnati proxy application for the purpose of "lying" about what the current release is? :D

Unfortunately, this page (https://github.com/coreos/zincati/blob/main/docs/development/cincinnati/protocol.md)
is not really helpful about what a request should look like.

But if we had a web app acting as a proxy server that intercepts the response from the "real" Cincinnati server, we could build a web app to pin specific servers to a release (with optional expiration) and change the release value on the fly before relaying the response back to our Zincati client.
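The graph-rewriting core of such a proxy could be a pure function over the Cincinnati graph JSON. A sketch, assuming the graph shape of a `nodes` list (each node carrying a `version`) plus `edges` as index pairs; the HTTP proxy plumbing around it is left out:

```python
# Sketch of the rewrite step a pinning proxy would apply to a Cincinnati
# update graph before relaying it to Zincati. Assumed graph shape:
#   {"nodes": [{"version": "..."}, ...], "edges": [[from_idx, to_idx], ...]}

def release_key(version):
    """Turn an FCOS release string into a numerically comparable tuple."""
    return tuple(int(p) for p in version.split("."))

def pin_graph(graph, max_release):
    """Drop every node newer than max_release and remap the edge indices."""
    keep = [i for i, node in enumerate(graph["nodes"])
            if release_key(node["version"]) <= release_key(max_release)]
    remap = {old: new for new, old in enumerate(keep)}
    return {
        "nodes": [graph["nodes"][i] for i in keep],
        "edges": [[remap[a], remap[b]] for a, b in graph["edges"]
                  if a in remap and b in remap],
    }

graph = {
    "nodes": [{"version": "34.20210725.3.0"},
              {"version": "34.20210808.3.0"},
              {"version": "34.20210903.3.0"}],
    "edges": [[0, 1], [1, 2]],
}
pinned = pin_graph(graph, "34.20210808.3.0")
print([n["version"] for n in pinned["nodes"]])  # ['34.20210725.3.0', '34.20210808.3.0']
print(pinned["edges"])                          # [[0, 1]]
```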

❤️

Yes, I love CoreOS 🥇

aleskandro added a commit to aleskandro/openshift-release that referenced this issue Feb 23, 2023
Some servers' firmware pushes any newly detected boot option to the tail of the boot order.
When other boot options are present and bootable, such a server will boot from them instead of the new one.
As a (temporary?) workaround, we manually add the boot option.
NOTE: it's assumed that old OSes' boot options are removed from the boot options list during the wipe operations.
 xrefs: https://bugzilla.redhat.com/show_bug.cgi?id=1997805
        coreos/fedora-coreos-tracker#946
        coreos/fedora-coreos-tracker#947
openshift-merge-robot pushed a commit to openshift/release that referenced this issue Feb 23, 2023
* Support Dell IPMI power commands

On Dell servers, `ipmi power (off|on|reset)` returns errors when the host is in a state that doesn't allow the requested transition. We enforce two commands (on + off) instead of reset, and ignore any power-off errors to sidestep those validation errors.

* Set the efi boot order after installing RHCOS in UPI/UEFI/PXE scenarios

Some servers' firmware pushes any newly detected boot option to the tail of the boot order.
When other boot options are present and bootable, such a server will boot from them instead of the new one.
As a (temporary?) workaround, we manually add the boot option.
NOTE: it's assumed that old OSes' boot options are removed from the boot options list during the wipe operations.
 xrefs: https://bugzilla.redhat.com/show_bug.cgi?id=1997805
        coreos/fedora-coreos-tracker#946
        coreos/fedora-coreos-tracker#947