CoreOS autoinstall creates huge number of XFS allocation groups #1183
Hi, thanks for filing this. I agree this is a problem. We should document this, but my initial take here is that:
We could add a built-in check that emits a warning to the console if it detects this situation. That said, another approach we could take here is to:
This should be an issue on any system that auto-grows an XFS filesystem on first boot, which is the case for at least the RHEL guest image (but not Fedora Cloud, since it moved to btrfs) and likely a lot of other cloud images out there.
A. Consider Btrfs by default; knowledgeable users can make an XFS /var appropriate for the actual storage stack. XFS devs recommend a lifetime maximum of one order of magnitude for xfs_growfs: if the XFS starts at 100G, it shouldn't be grown beyond 1T; if it starts at 10G, don't grow beyond 100G. Even one order of magnitude is considered suboptimal; it's really the top end of what's recommended. In this case it's, what, 3 orders of magnitude? It's way outside the design. Rather than block-replicating a filesystem and then growing it, XFS developers expect a new filesystem to be created each time; the mkfs.xfs code adapts the defaults to the storage stack actually being used. More importantly, XFS developers expect XFS users to know these things [1]. Users who can't evaluate filesystem technologies for their environment are unqualified. It's a long-standing view that XFS is a filesystem for experts, and that really hasn't changed in 10 years. A non-optimal XFS can perform worse than either ext4 or Btrfs. Meanwhile, Btrfs will do a 200M-to-8EiB resize in two seconds. It can live shrink, which is perhaps uncommonly needed but not unheard of as a nice-to-have. Btrfs also supports live migration, e.g.
We discussed this in the community meeting earlier this week. Barring us changing the default filesystem (I think that would need a separate ticket/discussion) we threw around a few ideas for how to handle this:
If you're hitting this today, I would emphasize that you should likely be creating a separate /var partition on such large disks.
@sandeen do you have any suggestions for this situation?
A few thoughts... At the highest level I agree that a 10T root filesystem is rarely optimal for anyone, and a separate /var makes sense. Switching to ext4 may not help; tytso gave a talk about similar problems in ext4 at LSF last week ("changing filesystem resize patterns" [1]). That said, a few-gigs-to-several-terabytes growth will still have some suboptimal corners.
From this week's meeting:
@sandeen Would you be open to making such a change to the xfs tools?
In general I'm reluctant to turn sub-optimal configurations into hard failures; that leads to its own set of complications and bug reports. And that wouldn't actually solve the case reported here, would it? I think the end result would simply be that the CoreOS auto-installer fails on a 10T target device, right?
https://lwn.net/Articles/894629/ just appeared
If this is something you can do safely and efficiently (copy rootfs into RAM, create a filesystem, and copy it back out), why not do that unconditionally, and get a filesystem geometry appropriate for the device size? More generally, what is it about the coreos deployment strategy that requires copying a small fs-geometry image to a block device then growing it, as opposed to creating a device-sized filesystem on the target, and copying the root image contents into that? (I admittedly don't know a lot about coreos deployment methods or constraints, but would like to understand more.)
We want to work on small cloud nodes too (and for that matter, quick disposable VMs run in qemu). Reprovisioning on boot adds latency and will likely fail on small VMs with just a few GB of RAM. To say it another way: if you have a large disk, we're going to assume you have a reasonable amount of RAM to go with it.
I thought the option of growing the rootfs to a certain extent, then creating a separate partition, creating a filesystem on it, and mounting it on /var was reasonable too. I guess that's option D in the list. So come up with a threshold for the primary partition size, say 100G, and if the disk is bigger than that, create a separate partition for the rest of the space, create a filesystem on it, and mount it on /var. Do people see issues with this approach?
I don't know a lot about the image generation part, so I am assuming a lot. Assuming we are generating qcow2 images and booting from them, and that these images' virtual sizes are very small, say 2G: I am wondering whether we can keep the virtual sizes of these images bigger, say 200GB, and then create the filesystem on that. If I use qemu-nbd to access that image via /dev/nbd0, it shows up as a 200GB disk.
Creating XFS on a 200G disk consumes around 1.5GB of disk space. Well, df shows 1.5GB consumed, but when I look at the image with qemu-img, it is 100M in size.
This seems to create allocation groups of roughly 50G each.
If this is true and we grow the image one order of magnitude, to say 2TB, it probably should be fine. Anything beyond that can probably be converted to a /var partition. And the qcow2 image size is still small enough for easy transport. Am I missing something?
That's already listed as option D. I think no one was a big fan of it in the meeting. I personally am not opposed but I prefer A for now myself.
The problem is though that people will deploy images to physical or virtual block devices that are smaller than that, and it will not work to have the filesystem write beyond the address space allocated to it.
Copying the existing filesystem into RAM, recreating a new filesystem, copying the rootfs back into the new filesystem, and switching to it sounds interesting. That means you need to pack the filesystem creation utilities and their dependencies in the initramfs. I guess that probably is not too bad. This feels like an installer which creates a root filesystem and installs the files, except that an installer might read RPMs from disk while this mini installer will copy and read everything from the initramfs.
Fair enough. Once we create a filesystem with a 200GB image size, that is essentially saying the minimum disk size is 200GB, and some people might not like that limitation.
This is how Ignition-based systems work. A key point here, for example, is that we support enabling TPM-bound LUKS for the root filesystem. We've shipped the code to "copy rootfs to RAM for reprovisioning" since coreos/fedora-coreos-config@a444e69. This model is a key part of how CoreOS works as identically as possible in cloud and bare metal deployments: both can be configured via Ignition in exactly the same way (unlike kickstart vs cloud-init).
Well, we can enable
This all said, while I am not an FS expert, ISTM it should be easier for the filesystem implementation to handle the "growfs while not mounted" case (in userspace even?) without a full copy. If filesystem implementations start supporting that in a more efficient way, we can easily make use of it from the initramfs. Basically, our case differs from the cloud-init and "generic data" cases in that we know we're offline. But OTOH, I can't say for sure whether writing and maintaining code that only optimizes this case would be worth it.
Looking at implementing this... I am actually kind of changing my mind a bit and thinking we should do option D, "auto-create /var", instead. The thing is, that's what we would actually recommend at any nontrivial scale anyway. Either way, the code also seems really ugly without something like injecting bits into the fetched Ignition config, which... hmm, I guess we could in theory do by mutating the fetched config.
An aspect I am just remembering here is that in e.g. Azure, IOPS scale based on the size of the disk. In OCP we end up provisioning large disks for control plane nodes primarily to get the IOPS for etcd. So it seems likely this path will trigger in a lot of those cases, which... I think is likely good and correct, but something to bear in mind.
If a btrfs by default proposal will be seriously considered, I'll file a separate ticket for it, and make the case. |
Note that Ignition currently lacks advanced support for btrfs; see coreos/ignition#815 and coreos/ignition#890. That doesn't necessarily prevent us from making it the default, though, since it's more relevant to users who want to create non-default btrfs filesystems. |
Worth discussing. Personally, I feel a bit uneasy about possibly having this kind of divergence on a per-node basis.
Yeah... was looking at this too. Modifying the Ignition config indeed feels hacky. Maybe simplest is to detect this in `ignition-ostree-growfs.sh` itself:

```diff
diff --git a/overlay.d/05core/usr/lib/dracut/modules.d/40ignition-ostree/ignition-ostree-growfs.sh b/overlay.d/05core/usr/lib/dracut/modules.d/40ignition-ostree/ignition-ostree-growfs.sh
index d20b6a08..fa4326e8 100755
--- a/overlay.d/05core/usr/lib/dracut/modules.d/40ignition-ostree/ignition-ostree-growfs.sh
+++ b/overlay.d/05core/usr/lib/dracut/modules.d/40ignition-ostree/ignition-ostree-growfs.sh
@@ -111,7 +111,19 @@ wipefs -af -t "no${ROOTFS_TYPE}" "${src}"
# TODO: Add XFS to https://github.com/systemd/systemd/blob/master/src/partition/growfs.c
# and use it instead.
case "${ROOTFS_TYPE}" in
- xfs) xfs_growfs "${path}" ;;
+ xfs)
+ if ! xfs_will_grow_too_much; then
+ xfs_growfs "${src}"
+ else
+ source /usr/lib/ignition-ostree/transposefs-lib.sh
+ dev=$(activate_zram)
+ save_filesystem root
+ wipefs -a "${src}"
+ mkfs.xfs "${root_part}" -L root -m reflink=1
+ restore_filesystem root
+ deactivate_zram "${dev}"
+ fi
+ ;;
ext4) resize2fs "${src}" ;;
btrfs) btrfs filesystem resize max ${path} ;;
 esac
```
Why is it recommended to have a separate /var? I can think of preventing the disk from filling up, but that seems more cleanly done with cgroups on machine.slice.
Add a new transposefs unit: `autosave-xfs`. This unit runs after `ignition-disks` and `ignition-ostree-growfs`, but before the `restore` transposefs unit. If the XFS root was grown, it checks whether the allocation group count (agcount) is within a reasonable amount (128 is chosen here). If it isn't, it saves the rootfs and reformats the filesystem. The `restore` unit will then restore it as usual. In the case of in-place reprovisioning like LUKS (i.e. where the partition table isn't modified by the Ignition config), the rootfs is still saved only once.
Ideally, instead of adding a new transposefs unit, we would make it part of the initial `save` unit. But at that point, there's no way to tell whether we should autosave without gazing even more deeply into the Ignition config. We also don't want to unconditionally save the rootfs when we may not need it.
Closes: coreos/fedora-coreos-tracker#1183
Option A implemented in coreos/fedora-coreos-config#2320.
The fix for this went into
https://lore.kernel.org/all/[email protected]/ was posted,
Yes, such a thing would be trivial for systems like CoreOS that use Ignition in the initramfs. However, today cloud-init builds (e.g. many stock Fedora/RHEL cloud images) don't do provisioning in the initramfs, but do so in the real root.
Using the workaround in [1] until the issue is fixed in RHCOS. [1] coreos/fedora-coreos-tracker#1183 (comment)
Describe the bug
CoreOS resizing the filesystem during the install process results in a pathological number of allocation groups. On a large disk (10TB) the agcount is >20k and the system is not bootable, with the initial mount timing out.
Reproduction steps
Steps to reproduce the behavior:
Expected behavior
System boots normally after install.
Actual behavior
https://access.redhat.com/solutions/5587281 describes this exact issue.
System details
proxmox
next
Ignition config
nothing filesystem related
Additional information
I don't think you need a 10TB disk to see the problem. The same image provisioned on a 200GB disk has an agcount of 430, which still seems too high.