Add factory reset capability #399

bgilbert · 2020-02-26T04:39:35Z

Under the immutable infrastructure model, we encourage users to handle config changes by reprovisioning their systems, rather than using configuration management or manually modifying the running system. This is okay in virtualized setups or in the cloud, and is also okay on bare metal where PXE infrastructure or IPMI-assisted ISO boot is available. But on single-node bare metal setups without external infrastructure, manual reinstall is impractical. (See e.g. this discussion.)

For this use case, it might help to have a command which receives an Ignition config to use for reprovisioning, copies it into /boot, sets first-boot kargs, and reboots. Then the initramfs, before running the usual first-boot steps, would need to restore the system to a pristine state. That process might look a lot like #94, or might look more like clearing /var and resetting /etc from /usr/etc. If those are in non-standard locations, perhaps the user-mode command could add kargs which tell the initramfs where to find them.
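A minimal sketch of the user-mode half of that idea, under stated assumptions: the function name `stage_reprovision` is hypothetical, the destination `ignition/config.ign` under the boot directory is an assumption, and first boot is requested with the `ignition.firstboot` stamp file that appears later in this thread rather than with explicit kargs. The initramfs-side step of restoring a pristine state is not shown.

```shell
#!/bin/sh
# Hypothetical helper: stage an Ignition config for reprovisioning.
# Usage: stage_reprovision CONFIG [BOOT_DIR]
# Assumption (not from the issue): the config is stored as
# BOOT_DIR/ignition/config.ign.
stage_reprovision() {
    config=$1
    boot_dir=${2:-/boot}

    # Refuse to stage a missing or empty config.
    [ -s "$config" ] || { echo "error: no config at $config" >&2; return 1; }

    mkdir -p "$boot_dir/ignition" || return 1
    # Copy to a temporary name, then rename, so a partial copy is never used.
    cp "$config" "$boot_dir/ignition/config.ign.tmp" || return 1
    mv "$boot_dir/ignition/config.ign.tmp" "$boot_dir/ignition/config.ign" || return 1

    # Stamp file that requests an Ignition run on the next boot.
    touch "$boot_dir/ignition.firstboot"
}
```

The sketch deliberately stops before the reboot. The commands later in this thread remount /boot read-write first, so a real call would likely need the same treatment.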

Because we'd be deleting user customizations before rerunning Ignition, this approach wouldn't[*] get into unsupported territory of using Ignition for configuration management. It would slightly complicate OS maintenance, though. At present we can freely change the Ignition initrd glue in ways that are incompatible with old installs (e.g. by changing first-boot kargs) because old installs will never run Ignition again. Supporting factory reset of old nodes would presumably add some additional constraints.

[*] Presumably it'd be infeasible to delete every possible user customization, so that statement might be optimistic.

bgilbert · 2020-02-27T00:26:30Z

Discussed in the community meeting today.

- If we committed the Ignition config to the ostree repo, that could make this a bit easier.
- coreos-installer used to be in the initramfs, which technically allowed every system to reinstall itself outright (assuming network access). We could investigate running the new coreos-installer from the initramfs again, but it requires lsblk, udevadm, and gpg.
- We've discussed splitting the live PXE root squashfs out to a separate stage2 (Publish the initrd and rootfs images separately for PXE booting #390) which could be useful here: assuming network access, the existing installed kernel and initrd could boot themselves into a live PXE image and rerun coreos-installer from there. That doesn't let air-gapped systems reprovision without external assistance, though.
  - Assuming network access, we could also fetch the image before the reboot and stash it somewhere on disk. But if the install fails and overwrites the image in the process, you don't get a second chance.
  - To solve the air-gapped case, we could: 1) before reboot, assemble a live squashfs from the ostree we already have, 2) from the initramfs, mount the old root filesystem and copy the squashfs into RAM, 3) create a temporary partition at the end of the disk and copy the squashfs into it for safety, 4) run coreos-installer, and if it fails, recreate /boot and the safety partition, 5) reboot into the new system. That's complex though.
- We didn't explore it in the meeting, but rerunning coreos-installer is not mandatory if there's a good way to delete user customizations from the existing system. We probably won't be able to get the system to 100% pristine state though.

There was a general sense that the feature would be nice, but there was also concern about implementation complexity depending on the approach taken.

jlebon · 2020-03-19T14:54:11Z

Note with the stage 2 approach, as long as the stage 2 binary is the initrd we append to the base one, it should contain both the osmet file and the root squashfs, so it should be possible to do the re-install completely offline.

jlebon · 2020-04-17T16:17:06Z

Related OSTree issue: ostreedev/ostree#1793

cgwalters · 2020-05-22T19:16:14Z

Also, if we shipped the .osmet file in e.g. /sysroot/fcos.osmet, then we could always support a factory reset back to the aleph image. If we wanted to support reset back to the current ostree commit we'd need a tool to fetch it out of band.

(This all said I think a lot of cases are going to be happy enough with the "ostree level" factory reset and not "wipe and replace disk image", particularly because the former can be made transactional)

cgwalters · 2021-04-09T14:25:41Z

As is today, it will work in many basic scenarios to just do:

$ unshare -m /bin/sh -c 'mount -o remount,rw /boot && touch /boot/ignition.firstboot'
$ reboot

A much stronger version would look like touch /run/factory-reset and we go back into the initramfs at shutdown time and do e.g. rm /var/* -rf && rsync -rlv --delete /usr/etc/ etc/.

The strongest version here of course is re-fetching and re-imaging the target disk from the initramfs, but that's even more involved.

jlebon · 2021-04-09T14:54:33Z

> Also, if we shipped the .osmet file in e.g. /sysroot/fcos.osmet, then we could always support a factory reset back to the aleph image. If we wanted to support reset back to the current ostree commit we'd need a tool to fetch it out of band.

Truly shipping the osmet file in the image is impossible because it's a circular dependency. So it would probably instead require coreos-installer to propagate it there manually at install time. But the problem is that the osmet file is tied to the OSTree commit of the aleph version, so if the node upgraded at all, we would need to fetch the aleph OSTree commit. One thing we could easily do is have coreos-installer add a ref to it so it's not GC'ed, but then every node is paying that storage cost.

bgilbert · 2021-04-09T17:49:43Z

> $ unshare -m /bin/sh -c 'mount -o remount,rw /boot && touch /boot/ignition.firstboot'
> $ reboot

This won't remove any existing customizations, so I don't think we should encourage it.

jlebon · 2021-10-25T19:35:00Z

A variation on #399 (comment) using kexec would be:

1. download the rootfs CPIO corresponding to the version we're at from stream metadata
2. concatenate base OSTree initrd with rootfs initrd and target Ignition config initrd
3. kexec with base OSTree kernel + concatenated initrd and --append 'coreos.inst.install_dev=... ...' etc., to reinstall on the same device

Since the rootfs CPIO includes the osmet file, no network is required during install. But air-gapped systems will need a way to obtain the rootfs initrd in the first place.

This would require some tweaks to support pointing at a local Ignition config from coreos.inst.* kargs, but if we're wrapping all this in, e.g., a coreos-installer reinstall subcommand, then we could also just generate a live Ignition config and do away with coreos.inst.* kargs entirely.
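The concatenate-and-kexec steps can be sketched roughly as follows, assuming the three initrd images are already on local disk. The function names are hypothetical, and the kernel arguments are passed in by the caller because the thread leaves them elided. The kexec call uses the `-l`, `--initrd` and `--append` options from kexec-tools, and it is only printed unless `DRY_RUN=0` is set.

```shell
#!/bin/sh
# Hypothetical helpers for the kexec-based reinstall idea.

# concat_initrds OUT IMG...
# The kernel unpacks several concatenated initramfs archives in order,
# so plain `cat` is enough to combine them.
concat_initrds() {
    out=$1
    shift
    [ "$#" -gt 0 ] || { echo "error: no initrds given" >&2; return 1; }
    for img in "$@"; do
        [ -r "$img" ] || { echo "error: cannot read $img" >&2; return 1; }
    done
    cat "$@" > "$out"
}

# load_reinstall_kernel KERNEL INITRD KARGS
# Stages the kernel for kexec. By default the command is only printed,
# because loading it is the first step toward rewriting the disk.
load_reinstall_kernel() {
    kernel=$1
    initrd=$2
    kargs=$3
    if [ "${DRY_RUN:-1}" = 0 ]; then
        kexec -l "$kernel" --initrd="$initrd" --append="$kargs"
    else
        printf 'kexec -l %s --initrd=%s --append="%s"\n' "$kernel" "$initrd" "$kargs"
    fi
}
```

After a successful load, `systemctl kexec` (or `kexec -e`) would switch into the staged kernel. That step is left out here on purpose.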

jlebon · 2021-12-13T17:25:40Z

> A variation on #399 (comment) using kexec would be:

Opened a proof of concept for this in coreos/coreos-installer#712.

dustymabe · 2022-02-16T20:54:39Z

We discussed this in the community meeting today.

12:04:26    dustymabe | #agreed For our factory reset capability we'll use kexec and initially
                      | require either network access or local copies of the install media to
                      | exist. In the future we may generate install media from the existing system.
                      | We also will limit our scope to just re-installing FCOS on FCOS and
                      | disallow/discourage running it from other distros.

ericcurtin · 2024-03-14T20:52:49Z

I notice kexec was brought up. I only started learning about this recently: while kexec works great on x86 platforms, there are some embedded platforms it doesn't work on, Apple Silicon/Asahi being one example.

ericcurtin · 2024-03-14T21:06:50Z

> As is today, it will work in many basic scenarios to just do:
>
> $ unshare -m /bin/sh -c 'mount -o remount,rw /boot && touch /boot/ignition.firstboot'
> $ reboot
>
> A much stronger version would look like touch /run/factory-reset and we go back into the initramfs at shutdown time and do e.g. rm /var/* -rf && rsync -rlv --delete /usr/etc/ etc/.
>
> The strongest version here of course is re-fetching and re-imaging the target disk from the initramfs, but that's even more involved.

I may look into this as it could be useful for an Automotive feature.

I think bootc will ultimately have the cleanest approach but it's hard to beat a neat tool like bootc that has an in-built installer right? 😄

How about this as an option? It was hinted at in the approach above as a more involved solution...

In ostree-prepare-root, check some directory in the sysroot for a file (/run/factory-reset is fine, that persists across reboots, right?)

If that file exists, in the C code during initrd do:

rm -rf var/*
rm -rf etc
cp -r usr/etc etc

and continue boot, more complex than this of course, but the above would be the gist of things.

This wouldn't be a live factory reset of course, but live resets are different.
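For illustration only, the three commands above can be wrapped in a shell function that takes the deployment root as an argument, which makes it possible to try the logic against a scratch directory. This is a sketch, not the proposed C code in ostree-prepare-root: the name `reset_state` is hypothetical, and it ignores SELinux labels, mount points under var, and any state that should be preserved.

```shell
#!/bin/sh
# Hypothetical sketch: reset /var and /etc inside a deployment root.
# Usage: reset_state ROOT   (ROOT contains usr/etc, etc and var)
reset_state() {
    root=$1
    # Guard against an empty or wrong argument before deleting anything.
    [ -n "$root" ] && [ -d "$root/usr/etc" ] && [ -d "$root/var" ] || {
        echo "error: '$root' does not look like a deployment root" >&2
        return 1
    }

    # Empty var but keep the directory itself. Unlike `rm -rf var/*`,
    # this also removes dotfiles.
    find "$root/var" -mindepth 1 -delete || return 1

    # Build the pristine etc beside the old one, then swap it in.
    rm -rf "$root/etc.new"
    cp -a "$root/usr/etc" "$root/etc.new" || return 1
    rm -rf "$root/etc"
    mv "$root/etc.new" "$root/etc"
}
```

Copying to `etc.new` first keeps the old etc intact if the copy fails, though the final remove-and-rename pair is still not atomic.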

jlebon · 2024-03-14T21:34:52Z

> I notice kexec was brought up, I only started learning this recently, while kexec works great on x86 platforms, there's some embedded platforms it doesn't work on. Apple Silicon/Asahi being one for example

Possibly there are kinks specific to Apple Silicon that need to be worked out, but AFAICT kexec is supported on aarch64 in general. We also have a kdump test which has no arch restriction (and so runs on all the arches we currently support).

cgwalters · 2024-03-14T22:42:25Z

It's already possible to do a form of this via ostree admin deploy --no-merge - and the really nice thing about that is it becomes trivial to carry forward state you do want from /etc by just copying it into the new deployment's /etc.

It's also transactional.
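As a rough sketch of the "carry forward state from /etc" part: after something like `ostree admin deploy --no-merge <refspec>` has created a fresh deployment, selected files could be copied from the running /etc into the new deployment's etc directory. The helper name `carry_forward_etc` is hypothetical, and locating the new deployment's directory (normally somewhere under /ostree/deploy/) is left to the caller.

```shell
#!/bin/sh
# Hypothetical helper: copy selected files from the current /etc into
# the etc directory of a freshly created deployment.
# Usage: carry_forward_etc SRC_ETC DST_ETC RELPATH...
carry_forward_etc() {
    src=$1
    dst=$2
    shift 2
    for rel in "$@"; do
        # Reject absolute paths and anything containing "..", so every
        # copy stays inside SRC_ETC and DST_ETC.
        case $rel in
            /*|*..*) echo "error: bad path $rel" >&2; return 1 ;;
        esac
        [ -e "$src/$rel" ] || { echo "error: $src/$rel missing" >&2; return 1; }
        mkdir -p "$dst/$(dirname "$rel")" || return 1
        # -a preserves mode, ownership and timestamps where permitted.
        cp -a "$src/$rel" "$dst/$rel" || return 1
    done
}
```

A call might look like `carry_forward_etc /etc "$new_deployment/etc" hostname`, where `$new_deployment` is a placeholder for the new deployment's path.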

Resetting /var...hmm; I'm not sure it needs to be part of ostree, or at least not part of ostree-prepare-root.service which already does too much. The tricky thing here is offering some mechanism to only reset certain state.

I think I'd argue that this can be done in the real root during shutdown (after we've finalized the new bootloader entry). Something we likely should do in this scenario though is rerun through the new ostree logic to restore the factory /var state. If the /var cleanup happens in the real root then arbitrary complex tooling can operate on it using the full real root userspace.

There are some other bits of state though:

- origin file
- kernel arguments (it'd be nice to retain the original install kargs for sure...)

For these...it'd definitely be nice to retain the "aleph"/originally-installed state of these.

Ironically our support for `--replace-mode=alongside` breaks when we're targeting an already extant ostree host, because when we first blow away the `/boot` directory, this means the ostree stack loses its knowledge that we're in a booted deployment, and will attempt to GC it... ostreedev/ostree-rs-ext@8fa019b is a key part of the fix for that. However, a notable improvement we can do here is to grow this whole thing into a real "factory reset" mode, and this will be a compelling answer to coreos/fedora-coreos-tracker#399 To implement this though we need to support configuring the stateroot and not just hardcode `default`. Signed-off-by: Colin Walters <[email protected]>

cgwalters · 2024-11-07T17:01:37Z

FTR some things related to this continue on the bootc side in containers/bootc#404

bgilbert added kind/new-feature meeting topics for meetings area/usability labels Feb 26, 2020

dustymabe removed the meeting topics for meetings label Feb 26, 2020

jlebon mentioned this issue Mar 19, 2020

Publish the initrd and rootfs images separately for PXE booting #390

Closed

travier mentioned this issue May 26, 2020

re-using existing disks when doing an install; re-creating partitions with Ignition #418

Closed

cgwalters mentioned this issue Jul 13, 2020

Bug 1855821: pkg/controller/render: log actions on machine configs openshift/machine-config-operator#1921

Closed

jlebon mentioned this issue Nov 2, 2020

Support kargs.d directories for default kargs (rebased) ostreedev/ostree#2217

Closed

cgwalters mentioned this issue Feb 23, 2021

Ignition Kernel Argument Support #752

Closed

cgwalters mentioned this issue Apr 9, 2021

supporting factory reset openshift/machine-config-operator#2520

Closed

bgilbert mentioned this issue May 28, 2021

Document automated reprovisioning for bare metal coreos/fedora-coreos-docs#299

Open

jlebon mentioned this issue Jun 16, 2021

Add support for minimal ISO packing/unpacking coreos/coreos-installer#559

Merged

cgwalters mentioned this issue Sep 15, 2021

35coreos-ignition: randomize partition GUIDs on first boot coreos/fedora-coreos-config#1207

Closed

jlebon mentioned this issue Dec 10, 2021

POC/RFC: Support reinstalls via kexec coreos/coreos-installer#712

Closed

bgilbert mentioned this issue Feb 3, 2022

Add filesystem cleanExcept directive to preserve wanted files coreos/ignition#1316

Closed

dustymabe mentioned this issue Feb 15, 2022

bare-metal: add small section about reinstallation coreos/fedora-coreos-docs#359

Merged

jlebon added the meeting topics for meetings label Feb 16, 2022

dustymabe added status/pending-action Needs action status/decided and removed meeting topics for meetings labels Feb 16, 2022

cgwalters mentioned this issue Mar 19, 2024

install: Support reinstallation (factory reset) containers/bootc#404

Open
