Support mirroring /boot, ESP, BIOS bootloader on first boot #718
Conversation
Do you have any concerns about the complexity of the config needed for this? It looks like a lot of things that could be mistyped. :) And even if we have FCC sugar for it, in the end we're still on the hook for maintaining that interface at the Ignition level. I wonder if we should instead just key off of something simpler in the config. For example, we can check if the config is trying to mirror the rootfs on RAID1. E.g. the logic could be: "if the user wants to make the root partition from the primary boot disk part of a RAID1, then assume that they want full disk RAID1". Because the use case for putting just the rootfs on RAID1 while leaving out the other partitions seems dubious. Simplifying the interface also means we get more flexibility over how it's done, and we can ensure better consistency of state when thinking about upgrades.
No, not at all. Ignition is designed as a pretty low-level interface, with no magic in the spec and hopefully very little magic in the surrounding glue. All of the inferences made by the glue logic are a small logical leap: if you make a boot filesystem you probably want it copied over; if you make an ESP or BIOS-BOOT partition you probably want it copied. The largest inference is that we should copy the boot sector whenever we're copying the BIOS-BOOT partition, but I think that follows pretty naturally from how BIOS booting works. We'll sugar this down to a couple lines of FCC for the primary use case, and OCP can declare unsupported anything that doesn't use the official sugar. But a benefit of this approach is that there's no narrowly-scoped magic "peephole optimization", as it were. An FCOS user who wants to do something clever (boot RAID 1 + root RAID 5 or whatever) retains the full ability to do that, since the config just specifies the desired disk layout in the natural way.
Proposal in coreos/enhancements#3.
Skimmed; seems sane.
Maybe in the future we reimplement this in rdcore, but seems OK for now.
Ready for review!
Some minor comments, but LGTM overall! Did you successfully test this in a 4Kn RAID setup as well?
We'll be generalizing the rootfs save/restore code to support saving and restoring other partitions. Generalize the name to "transposefs" and move the saved rootfs data to /run/ignition-ostree-transposefs/root.
Add function to generate a partial jq query string to find wiped filesystems.
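A minimal sketch of what such a helper might produce, assuming the Ignition spec-v3 JSON layout (`storage.filesystems[].label`, `.wipeFilesystem`) and an illustrative function name and config path rather than the script's real ones:

```bash
# Hypothetical helper: emit a partial jq filter selecting filesystems with a
# given label that the config asks to wipe.
query_wiped() {
    local label="$1"
    echo ".storage.filesystems // [] | map(select(.label == \"${label}\" and .wipeFilesystem == true))"
}

# Example: list the device nodes of wiped "boot" filesystems in the rendered
# Ignition config (path assumed here).
jq -r "$(query_wiped boot) | .[].device" /run/ignition.json
```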
If the Ignition config creates any BIOS Boot partitions, save the existing BIOS-BOOT contents and the corresponding boot sector before ignition-disks (in case the partition is overwritten) and copy them to the new partitions (and corresponding disks) afterward. Also verify that the offset of the new BIOS-BOOT partitions matches the old one, since otherwise GRUB will fail when it tries to use them. We don't require the config to create a BIOS-BOOT RAID array because the OS doesn't use or modify the BIOS-BOOT partition at runtime.
If the Ignition config creates any PowerPC PReP partitions, save the existing PReP contents before ignition-disks (in case the partition is overwritten) and copy them to the new partitions afterward. We don't require the config to create a PReP RAID array because the OS doesn't use or modify the PReP partition at runtime.
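Under the hood this amounts to stashing the raw partition contents (plus, for BIOS-BOOT, the first 440 bytes of boot code) in RAM before ignition-disks and writing them back afterward. A hand-wavy sketch of the BIOS-BOOT case, with illustrative device names and paths rather than the script's actual variables (PReP is analogous, minus the boot sector and offset check):

```bash
# Illustrative only; the real logic lives in ignition-ostree-transposefs.sh.
save=/run/ignition-ostree-transposefs/bios
mkdir -p "${save}"
disk=/dev/vda    # disk holding the existing BIOS-BOOT partition (example)
part=/dev/vda1   # the BIOS-BOOT partition itself (example)

# Before ignition-disks: remember the partition offset, save the partition
# contents and the 440-byte MBR boot code.
sgdisk -i 1 "${disk}" | awk '/^First sector/ {print $3}' > "${save}/start"
dd if="${part}" of="${save}/partition.img" bs=1M status=none
dd if="${disk}" of="${save}/bootsector.img" bs=440 count=1 status=none

# ...ignition-disks runs and may rewrite the partition table...

# After ignition-disks: refuse to restore if the offset moved, since GRUB's
# embedded core image points at a fixed location, then copy everything back.
new_start=$(sgdisk -i 1 "${disk}" | awk '/^First sector/ {print $3}')
[ "${new_start}" = "$(cat "${save}/start")" ] || { echo "BIOS-BOOT moved" >&2; exit 1; }
dd if="${save}/partition.img" of="${part}" bs=1M conv=notrunc status=none
dd if="${save}/bootsector.img" of="${disk}" bs=440 count=1 conv=notrunc status=none
```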
Updated, and tested on 4Kn.
🎉
This has come up a few times on linux-raid@ and upstream developers have consistently been critical of the idea of using mdadm, with any metadata version, for the EFI System Partition. There's no guarantee the firmware itself won't write to the ESP, including by some other EFI program. In that case the RAID becomes broken, and it's not repairable. I think keeping ESPs synchronized should be the responsibility of something like bootupd. An alternative might be firmware RAID, which mdadm also supports.

As for the $BOOT volume, upstream GRUB puts grubenv there. In Fedora it's there for BIOS systems, while UEFI systems put it on the ESP. The grubenv is used in Fedora for the GRUB hidden-menu feature. If GRUB knows grubenv is on md RAID (or Btrfs) it will refuse to write to it, to avoid causing an inconsistent state. But if GRUB doesn't know it's md RAID, it'll permit writes to grubenv. This probably isn't that bad of an inconsistency, because GRUB writes grubenv by overwriting only the two 512-byte blocks making up grubenv; there's no filesystem metadata update at all. The bigger concern with metadata 1.0 has always been that it invites inadvertent mounting of the member device rather than the array device. Once this happens, again the RAID is broken and it's not reversible or repairable.
If GRUB doesn't know $BOOT is an mdadm device, how does fallback work when there's a read error or the device is missing? I'd expect the whole point of going to the trouble of making $BOOT RAID 1 is that, if there's a problem with a member device, the bootloader can automatically use the other one and still boot the system. That's built into GRUB's mdraid1x.mod. I think it's better to make the prefix the md device, and expect GRUB to know the true nature of the stack, which is that $BOOT is an array device. And then use mdadm metadata version 1.2, which is both the recommended and the default version, because it prevents inadvertent use of the md member device.
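For context, the disagreement is essentially about where the superblock lives; a quick illustration with made-up device names (neither command is taken from this PR):

```bash
# Superblock at the end of each member (format 1.0): the start of the
# partition still looks like a plain FAT/ext4 filesystem, so firmware and a
# GRUB without mdraid modules can read member partitions directly.
mdadm --create /dev/md/esp --level=raid1 --raid-devices=2 \
      --metadata=1.0 /dev/vda2 /dev/vdb2

# Superblock near the start (format 1.2, the mdadm default): members are not
# readable on their own, which prevents accidental direct use but requires
# the bootloader/firmware to understand md RAID.
mdadm --create /dev/md/boot --level=raid1 --raid-devices=2 \
      --metadata=1.2 /dev/vda3 /dev/vdb3
```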
Thanks for the comments!
An earlier draft proposed to maintain independent ESPs on each disk, but that would make it infeasible to mount "the ESP" inside the OS. Periodic RAID resync would still fix any breakage, no?
Fedora CoreOS (and RHEL CoreOS) doesn't use that feature. We're reading the grubenv but nothing in our configs writes to it.
We always boot from the first disk. If the first disk is missing, the second disk becomes the first disk.
Right, but the firmware might not.
Yep, I understand. This immediately exposes the unfortunate paradigm of having the ESP persistently mounted. It was always a bad idea, but we did it because we didn't have a smarter way of doing it. On Windows and macOS it is never persistently mounted and never exposed to the user in any way. The thing that "owns" the ESP, for modifications and updates, is responsible for mounting this filesystem, making changes, then unmounting it. So again, I'd say this is the realm of bootupd, and/or maybe fwupd. And we should stop putting the ESP in fstab.

In the fwupd case, it could do some test: prefer the ESP listed first in NVRAM boot order, mount that ESP and check if it has enough free space, and if not, try the other one and set a "boot next" NVRAM entry. Maybe clean-up/garbage collection needs to go in there somewhere too. For bootupd it might be simpler: have an ESP files list on the sysroot and use that as the source of authority, with the ESPs being just clones of it that should always be identical (at least their directory on the ESP should be identical; there may also be a question about the BLS directory, which I'll ignore for now).
Nope. It's ambiguous which drive is correct; there are no checksums. In the degraded-array case (a legitimately degraded assembly, still mounting the md array device), the event count in the mdadm superblock for the active member device is updated, so it's determinable how to scrub and "catch up" the device with the lower event count. But in this example, where writes happen outside the mdadm infrastructure, the mdadm superblocks are still identical. A scrub repair will make things worse: it will actually break both of the ESPs because, without checksums or versions, it just picks a block that's assumed to be correct and overwrites the mismatching block. And it's not consistent in how it does this.
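For illustration, the scrub being discussed is driven through sysfs (the array name here is made up); `check` only counts mismatches, while `repair` resolves them by overwriting one copy with the other, with no checksum to say which copy the firmware actually wrote:

```bash
# Count mismatched sectors without modifying anything (read the counter back
# once the check completes).
echo check > /sys/block/md127/md/sync_action
cat /sys/block/md127/md/mismatch_cnt

# "Repair" silently picks one copy and overwrites the other with it.
echo repair > /sys/block/md127/md/sync_action
```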
OK, it may not be an issue for CoreOS now. But there's a pre-proposal for Silverblue to rebase on CoreOS, and right now Silverblue does use this feature. There are any number of reasons why the grubenv design is suboptimal, so really the solution is to fix that, but ... resources.
The fallback is built into GRUB's mdraid support; it handles degraded operation. I'm not sure what "the second disk becomes the first disk" looks like in grub.cfg. But even if it's possible to script the fallback, I'm still skeptical about the handling of a failed drive that isn't missing but instead spews zeros or garbage, a failure mode pretty common with SSDs.
Okay, thanks for pursuing this. I'm seeing three issues:
That's less bad than I thought, but I agree ambiguity and risk remain. This Workstation WG issue starts out about grubenv and /boot on Btrfs, but eventually comes around to drawing on bootupd as a possible way to decouple /boot and /boot/efi from RPM, i.e. a single source of truth on the sysroot for "how to boot the system", and likewise to simplify boot and make it more reliable. My pipe dream is that the ESP, BIOS-BOOT, and boot partitions become the sort of plumbing that isn't user-facing at all, whether at install time or at repair/replacement time.
Pipe dream cont'd ... In the simple single-disk case, only bootupd/fwupd touch either of them. But in the dual-ESP case, systemd ignores extra ESPs: it only automounts the ESP on the drive that the bootloader says it used for booting. So now what? (a) Teach systemd about multiple ESPs? (b) Inhibit the systemd ESP automount and have bootupd/fwupd deal with all of it (e.g. mount each of them in turn to some location in /run, do what needs to be done, then umount them)?
We don't ship the gpt-auto-generator. The plan is to pursue option (b); see for example coreos/bootupd#127.
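A rough sketch of what option (b) could look like, keeping each ESP a clone of a canonical tree and never leaving it mounted (the source path and device names are illustrative; the real design discussion is in coreos/bootupd#127):

```bash
# Hypothetical: sync a source-of-truth tree on the sysroot to every ESP,
# mounting each one transiently under /run instead of listing it in fstab.
src=/usr/lib/bootupd/esp-content        # illustrative source-of-truth path
for dev in /dev/vda2 /dev/vdb2; do      # illustrative ESP partitions
    mnt=$(mktemp -d /run/esp.XXXXXX)
    mount "${dev}" "${mnt}"
    rsync -a --delete "${src}/" "${mnt}/EFI/"
    umount "${mnt}"
    rmdir "${mnt}"
done
```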
Supporting redundant bootable disks for coreos/fedora-coreos-tracker#581. This change supports the following: moving `/boot` and `/boot/efi` to RAID 1 volumes if the Ignition config has a filesystem with a `boot` or `EFI-SYSTEM` label and `wipe_filesystem: true`, similar to how we move the contents of the root filesystem. Because BIOS GRUB is configured to set `prefix` to the first disk, it must not have MD-RAID support preloaded (it currently does not). The MD-RAID superblocks must be at the end of the component partitions (superblock format 1.0) so BIOS GRUB resp. the UEFI firmware can treat `/boot` resp. `/boot/efi` as normal filesystems.

Design document in coreos/enhancements#3. This functionality can actually be used to completely repartition the boot disk, since we copy everything into RAM before the Ignition disks stage. The only requirement is that BIOS-BOOT starts at the same offset (which we check for). The corresponding FCC sugar is in coreos/butane#162.
Test with:

Drop `-bios /usr/share/edk2/ovmf/OVMF_CODE.fd` to test in BIOS mode. Drop `-drive if=virtio,file=./fedora*.qcow2` to test a failure of the first drive. Use the following FCC (assumes coreos/butane#162):
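The original QEMU invocation and FCC aren't reproduced above; as a stand-in, here is a hedged sketch of what such a test setup might look like, assuming the `boot_device.mirror` sugar proposed in coreos/butane#162 and the flags referenced in the testing notes (image paths, disk names, spec version, and the `fcct` invocation are all illustrative):

```bash
# Hypothetical FCC using the proposed mirroring sugar; transpile to Ignition.
cat > mirror.fcc <<'EOF'
variant: fcos
version: 1.3.0
boot_device:
  mirror:
    devices:
      - /dev/vda
      - /dev/vdb
EOF
fcct --pretty --strict < mirror.fcc > mirror.ign

# A second blank disk to mirror onto.
qemu-img create -f qcow2 ./second-disk.qcow2 16G

# Boot in UEFI mode with two virtio disks; drop -bios for BIOS mode, drop the
# first -drive to simulate losing the first disk.
qemu-system-x86_64 -m 4096 -enable-kvm -nographic \
  -bios /usr/share/edk2/ovmf/OVMF_CODE.fd \
  -fw_cfg name=opt/com.coreos/config,file=mirror.ign \
  -drive if=virtio,file=./fedora-coreos.qcow2 \
  -drive if=virtio,file=./second-disk.qcow2
```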