CoreOS autoinstall creates huge number of XFS allocation groups #1183

Closed
nivekuil opened this issue Apr 24, 2022 · 35 comments · Fixed by coreos/fedora-coreos-config#2320
Labels: jira (for syncing to jira), kind/bug

Comments

@nivekuil

nivekuil commented Apr 24, 2022

Describe the bug
CoreOS resizing the filesystem during the install process results in a pathological number of allocation groups. On a large disk (10 TB) the agcount is >20k and the system is not bootable: the initial mount times out.

Reproduction steps
Steps to reproduce the behavior:

  1. Autoinstall (--dest-ignition/--dest-device) CoreOS on a VM with a 10 TB /dev/vda.
  2. The installer runs successfully, taking a while to do some resizing.
  3. Boot and reboot; mounting the XFS filesystem times out and the system is not bootable.

Expected behavior
System boots normally after install.

Actual behavior
https://access.redhat.com/solutions/5587281 describes this exact issue.
System details

  • Bare Metal/QEMU/AWS/GCP/etc.: Proxmox
  • Fedora CoreOS version: next

Ignition config
Nothing filesystem-related.

Additional information
(screenshot omitted)

I don't think you need a 10 TB disk to see the problem. The same image provisioned on a 200 GB disk has an agcount of 430, which still seems too high.
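
For anyone checking an affected node, the allocation group count is visible via xfs_info; a quick diagnostic, assuming the root filesystem is XFS and mounted at /:

$ xfs_info / | grep -o 'agcount=[0-9]*'

A freshly formatted small image typically reports agcount=4; values in the hundreds or thousands indicate the pathological growth described above.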

@cgwalters
Member

cgwalters commented Apr 24, 2022

Hi, thanks for filing this. I agree this is a problem.

We should document this, but my initial take here is that:

  • Fedora CoreOS defaults to a single partition so that you don't need to think about disk space management by default, particularly at smaller scales
  • If you have > 100G of disk space, and particularly anything 1TB or beyond, you should at least make a separate partition for /var that holds most of your data.

We could add a built-in check that emits a warning to the console if it detects this situation.

That said, another approach we could take here is to:

  • Detect on boot if the disk would grow "significantly" (needs input from filesystem (xfs particularly) developers for what "significantly" is)
  • If so, perform the same flow we do as if one is using LUKS or switching to e.g. ext4/btrfs on boot, which is (in the initramfs, before switchroot) to copy the rootfs into RAM, then create a new filesystem at the full target size, then copy the rootfs back

@dustymabe dustymabe added the meeting topics for meetings label Apr 25, 2022
@jlebon
Member

jlebon commented Apr 25, 2022

This should be an issue on any system that auto-grows an XFS filesystem on first boot, which is the case for at least the RHEL guest image (but not Fedora Cloud, since it moved to Btrfs) and likely a lot of other cloud images out there.

@cmurf

cmurf commented Apr 26, 2022

A. Consider Btrfs by default; knowledgeable users can make an XFS /var appropriate for the actual storage stack.
B. Ask XFS devs to make XFS more like Btrfs with respect to optimal growing and shrinking.

XFS devs recommend a lifetime maximum of one order of magnitude for xfs_growfs, so if the filesystem starts at 100G it shouldn't be grown beyond 1T, and if it starts at 10G it shouldn't be grown beyond 100G. Even one order of magnitude is considered suboptimal; it's really the top end of what's recommended. In this case it's, what, three orders of magnitude? That's way outside the design. Rather than block-replicating a filesystem and then growing it, XFS developers expect a new filesystem to be created each time; the mkfs.xfs code adapts its defaults to the storage stack actually being used.

More importantly, though, XFS developers expect XFS users to know these things. Users who can't evaluate filesystem technologies for their environment are, in that view, unqualified. The long-standing view is that XFS is a filesystem for experts, and that really hasn't changed in 10 years. A non-optimal XFS can perform worse than either ext4 or Btrfs.

Meanwhile, Btrfs will do a 200M-to-8EiB resize in two seconds. It can shrink live, which is perhaps uncommonly needed but is a nice-to-have. Btrfs also supports live migration, e.g. btrfs replace, and you keep the overlayfs reflink optimizations. As for framing this hybrid approach: Fedora CoreOS makes the root filesystem Btrfs because it's a completely hands-off approach for all users as we continue to grow our market, while still actively recommending that users create a separate /var on an XFS volume for the typical CoreOS workloads we see in the real world. Something like that.
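
For illustration only (not part of any agreed proposal), the Btrfs operations referred to above look roughly like this; the device names are hypothetical:

$ btrfs filesystem resize max /              # grow the mounted filesystem to fill its device
$ btrfs filesystem resize -10G /             # shrink it live by 10 GiB
$ btrfs replace start /dev/vdb3 /dev/vdc3 /  # migrate to a new device while mounted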

@dustymabe
Member

We discussed this in the community meeting earlier this week. Barring us changing the default filesystem (I think that would need a separate ticket/discussion), we threw around a few ideas for how to handle this:

  • A. Autodetect and correct it in the initramfs by triggering a root reprovision.
  • B. Don't correct automatically, but warn the user of the potential issue (this is the case where the FS can still mount).
    • The suggestion was to put something in the motd, but the motd isn't necessarily a good warning mechanism because people might not see it.
  • C. Have coreos-installer warn the user.
    • This would require coreos-installer to have some insight into the Ignition config.
    • It also doesn't apply to systems that don't use the bare metal workflow.
  • D. Create a /var partition by default if we detect this potentially degraded state.
    • This could be interesting, but adds "magic".
  • E. Autogrow the partition, but not past the point where degradation would happen.
    • This could lead to user confusion.
  • F. Make XFS tools fail if the user asks it to do something that would lead to this state.
    • We would still need to handle this somehow, but at least no user would ever get to the point of having to root-cause the issue.

@cgwalters
Member

If you're hitting this today, I would emphasize you should likely be creating a separate /var partition on such large disks.
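
As a concrete sketch of that recommendation (based on the documented Fedora CoreOS storage examples; the device, partition number, and sizes here are illustrative and should be adapted to the actual hardware), a Butane config that caps the root partition and gives the rest of the disk to an XFS /var could look like:

$ cat > var.bu <<'EOF'
variant: fcos
version: 1.4.0
storage:
  disks:
    - device: /dev/vda
      wipe_table: false
      partitions:
        - number: 4
          label: root
          size_mib: 98304        # cap the root partition at ~96 GiB
          resize: true
        - size_mib: 0            # give the remainder of the disk to var
          label: var
  filesystems:
    - path: /var
      device: /dev/disk/by-partlabel/var
      format: xfs
      with_mount_unit: true
EOF
$ butane --pretty --strict var.bu > var.ign

With a layout like this, the root filesystem stays small and /var is created by mkfs.xfs at its final size, so no multi-order-of-magnitude xfs_growfs ever happens.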

@miabbott
Member

@sandeen do you have any suggestions for this situation?

@sandeen

sandeen commented May 11, 2022

A few thoughts ...

At the highest level, I agree that a 10T root filesystem is rarely optimal for anyone, and a separate /var provisioned on the target would make much more sense. If CoreOS is automatically creating a multi-terabyte root fs, I would try to avoid that, particularly if it is doing so by growing the filesystem by several orders of magnitude. Could CoreOS do this by default?

Switching to ext4 may not help; tytso gave a talk about similar problems in ext4 at LSF last week ("Changing filesystem resize patterns" [1]).

The inobtcount option in mkfs.xfs (available as of xfsprogs-5.10.0, enabled by default as of xfsprogs-5.15.0) will greatly reduce mount times for a filesystem with many AGs and the free inode btree enabled (see mkfs.xfs man page for details). That should solve the problem as reported.

That said, a few-gigs-to-several-terabytes growth will still have some suboptimal corners.

[1] https://lwn.net/Articles/894629/
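
For reference, a minimal sketch of working with this option (the device path is hypothetical, and checking the flag assumes an xfsprogs new enough to report the field in xfs_info output):

$ xfs_info / | grep -o 'inobtcount=[01]'       # check whether the root filesystem has the feature
$ mkfs.xfs -m inobtcount=1 -L root /dev/vdb1   # enable it explicitly when creating a new filesystem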

@travier
Member

travier commented May 11, 2022

From this week's meeting:

  * AGREED: Given that we have a valid and recommended workaround for
    this issue, we will investigate option A (adding auto-detection and
    auto re-provisioning). We will reach out to XFS folks to get a
    better understanding of our options and to see if F is also doable.
    (travier, 17:08:22)

@travier travier removed the meeting topics for meetings label May 11, 2022
@travier
Member

travier commented May 11, 2022

  • F. Make XFS tools fail if the user asks it to do something that would lead to this state.

@sandeen Would you be open to making such a change to the xfs tools?

@sandeen

sandeen commented May 11, 2022

  • F. Make XFS tools fail if the user asks it to do something that would lead to this state.

@sandeen Would you be open to making such a change to the xfs tools?

In general I'm reluctant to turn sub-optimal configurations into hard failures; that leads to its own set of complications and bug reports. And that wouldn't actually solve the case reported here, would it? I think the end result would simply be that the coreos auto-installer fails on a 10T target device, right?

@cgwalters
Member

https://lwn.net/Articles/894629/ just appeared

@sandeen

sandeen commented May 11, 2022

* If so, perform the same flow we do as if one is using LUKS or switching to e.g. ext4/btrfs on boot, which is (in the initramfs, before switchroot) to copy the rootfs into RAM, then create a _new_ filesystem at the full target size, then copy the rootfs back

If this is something you can do safely and efficiently (copy rootfs into RAM, create a filesystem, and copy it back out) why not do that unconditionally, and get a filesystem geometry appropriate for the device size?

More generally, what is it about the coreos deployment strategy that requires copying a small fs-geometry image to a block device then growing it, as opposed to creating a device-sized filesystem on the target, and copying the root image contents into that? (I admittedly don't know a lot about coreos deployment methods or constraints, but would like to understand more.)

@cgwalters
Member

cgwalters commented May 11, 2022

If this is something you can do safely and efficiently (copy rootfs into RAM, create a filesystem, and copy it back out) why not do that unconditionally, and get a filesystem geometry appropriate for the device size?

We want to work on small cloud nodes too (and for that matter, quick disposable VMs run in qemu). Reprovisioning on boot adds latency and will likely fail on small VMs with just a few GB of RAM.

To say it another way: If you have a large disk, we're going to assume you have a reasonable amount of RAM to go with it.

@bgilbert bgilbert added the status/pending-action Needs action label May 12, 2022
@rhvgoyal

I thought the option of growing the rootfs to a certain extent and then creating a separate partition, creating a filesystem on it, and mounting it on /var was reasonable too. I guess that's option D in the list.

So come up with a threshold for the primary partition size, say 100G. If the disk is bigger than that, just create a separate partition for the rest of the space, create a filesystem on it, and mount it on /var.

Do people see issues with this approach?

@rhvgoyal

rhvgoyal commented May 12, 2022

I don't know a lot about the image generation part, so I am assuming a lot.

Assume we are generating qcow2 images and booting from them, and that these images' virtual sizes are very small, say 2G.

I am wondering: can we keep the virtual sizes of these images bigger? Say 200GB. And then create the filesystem on that. If I use qemu-nbd to access that image via /dev/nbd0, it shows up as a 200GB disk.

$ qemu-img create -f qcow2 test-image.qcow2 200G
$ qemu-nbd -c /dev/nbd0 test-image.qcow2
$ mkfs.xfs /dev/nbd0
$ mount /dev/nbd0 /mnt/test-xfs
$ df -h
/dev/nbd0            200G  1.5G  199G   1% /mnt/test-xfs

So creating XFS on a 200G disk consumes around 1.5GB of disk space. Well, df shows 1.5GB consumed, but when I look at the qcow2 image, it is only about 100M in size.

-rw-r--r--. 1 root root 101M May 12 13:39 test-image.qcow2

$ qemu-img info test-image.qcow2
image: test-image.qcow2
file format: qcow2
virtual size: 200 GiB (214748364800 bytes)
disk size: 102 MiB
cluster_size: 65536
Format specific information:
    compat: 1.1
    compression type: zlib
    lazy refcounts: false
    refcount bits: 16
    corrupt: false
    extended l2: false

Now this seems to create allocation groups of roughly 50G each (agsize=13107200 blocks at the default 4 KiB block size is 50 GiB):

meta-data=/dev/nbd1              isize=512    agcount=4, agsize=13107200 blks

If this is true, and if we grow the image one order of magnitude to, say, 2TB, it should probably be fine. Anything beyond that can probably be turned into a /var partition.

And the qcow2 image size is still small enough for easy transport.

Am I missing something?

@cgwalters
Member

So come up with a threshold for the primary partition size, say 100G. If the disk is bigger than that, just create a separate partition for the rest of the space, create a filesystem on it, and mount it on /var.

That's already listed as option D. I think no one was a big fan of it in the meeting. I personally am not opposed but I prefer A for now myself.

I am wondering: can we keep the virtual sizes of these images bigger? Say 200GB.

The problem, though, is that people will deploy images to physical or virtual block devices that are smaller than that, and it will not work to have the filesystem write beyond the address space allocated to it.

@rhvgoyal

So come up with a threshold for the primary partition size, say 100G. If the disk is bigger than that, just create a separate partition for the rest of the space, create a filesystem on it, and mount it on /var.

That's already listed as option D. I think no one was a big fan of it in the meeting. I personally am not opposed but I prefer A for now myself.

Copying the existing filesystem into RAM, then recreating a new filesystem, copying the rootfs back into it, and switching to it sounds interesting. That means you need to pack the filesystem creation utilities and their dependencies into the initramfs. I guess that probably is not too bad.

This feels like an installer which creates a root filesystem and installs the files, except that a regular installer might read RPMs from disk while this mini-installer copies and reads everything from the initramfs.

I am wondering: can we keep the virtual sizes of these images bigger? Say 200GB.

The problem, though, is that people will deploy images to physical or virtual block devices that are smaller than that, and it will not work to have the filesystem write beyond the address space allocated to it.

Fair enough. Once we create the filesystem with a 200GB image size, that essentially says the minimum disk size is 200GB, and some people might not like that limitation.

@cgwalters
Member

Copying the existing filesystem into RAM, then recreating a new filesystem, copying the rootfs back into it, and switching to it sounds interesting. That means you need to pack the filesystem creation utilities and their dependencies into the initramfs. I guess that probably is not too bad.

This is how Ignition-based systems work. A key point here, for example, is that we support enabling TPM-bound LUKS for / even in cloud instances. That's not something one can do on cloud-init-based systems, where the instance config is only fetched in the middle of the real boot; Ignition runs in the initramfs.

We've shipped the code to "copy rootfs to RAM for reprovisioning" since coreos/fedora-coreos-config@a444e69

This model is a key part of how CoreOS works as identically as possible in cloud and bare-metal deployments: both can be configured via Ignition in exactly the same way (unlike kickstart vs. cloud-init).

@cgwalters
Member

The inobtcount option in mkfs.xfs (available as of xfsprogs-5.10.0, enabled by default as of xfsprogs-5.15.0) will greatly reduce mount times for a filesystem with many AGs and the free inode btree enabled (see mkfs.xfs man page for details). That should solve the problem as reported.

Well, we can enable -m inobtcount=1 on our own too (although I see the RHEL 8 mkfs.xfs doesn't have that). For that reason, and because as I understand things we really do want to solve the pathological number of AGs anyway, I think we should focus on either option A or D.

@cgwalters
Member

This all said, while I am not an FS expert, ISTM it should be easier for the filesystem implementation to handle the "growfs while not mounted" case (in userspace even?) without a full copy. If filesystem implementations start supporting that in a more efficient way, we can easily make use of it from the initramfs.

Basically, our case differs from the cloud-init and "generic data" cases in that we know we're offline. But OTOH I can't say for sure whether writing and maintaining code that only optimizes this case would be worth it.

@cgwalters
Member

Looking at implementing this...I am actually kind of changing my mind a bit and thinking we should actually do option D "auto-create /var" instead. The thing is, that's what we would actually recommend for any nontrivial scale anyways.

Either way...the code also seems really ugly without something like injecting bits into the fetched Ignition config. Which...hmm, I guess we could in theory do by mutating /run/ignition.json, but... ew?

An aspect I am just remembering here is that in e.g. Azure, IOPS scale with the size of the disk. In OCP we end up provisioning large disks for control-plane nodes primarily to get the IOPS for etcd. So it seems likely this path will trigger in a lot of those cases, which...I think is likely good and correct, but something to bear in mind.

@cmurf

cmurf commented May 17, 2022

If a Btrfs-by-default proposal would be seriously considered, I'll file a separate ticket for it and make the case.

@bgilbert
Contributor

Note that Ignition currently lacks advanced support for btrfs; see coreos/ignition#815 and coreos/ignition#890. That doesn't necessarily prevent us from making it the default, though, since it's more relevant to users who want to create non-default btrfs filesystems.

@jlebon
Member

jlebon commented May 18, 2022

Looking at implementing this...I am actually kind of changing my mind a bit and thinking we should actually do option D "auto-create /var" instead. The thing is, that's what we would actually recommend for any nontrivial scale anyways.

Worth discussing. Personally, I feel a bit uneasy about possibly having this kind of divergence on a per-node basis.

Either way...the code also seems really ugly without something like injecting bits into the fetched Ignition config. Which...hmm, I guess we could in theory do by mutating /run/ignition.json, but... ew?

Yeah... was looking at this too. Modifying the Ignition config indeed feels hacky.

I think trying to detect this in ignition-ostree-transposefs-detect.service would be tricky. We actually already have all the implementation logic we need for this: bits to save and restore the rootfs to/from RAM, and bits which know how to grow the partition and possibly the LUKS device underneath it too.

Maybe the simplest approach is to detect this at xfs_growfs time in ignition-ostree-growfs.service? We'd have to move it to run before the sysroot mount. E.g. something like:

diff --git a/overlay.d/05core/usr/lib/dracut/modules.d/40ignition-ostree/ignition-ostree-growfs.sh b/overlay.d/05core/usr/lib/dracut/modules.d/40ignition-ostree/ignition-ostree-growfs.sh
index d20b6a08..fa4326e8 100755
--- a/overlay.d/05core/usr/lib/dracut/modules.d/40ignition-ostree/ignition-ostree-growfs.sh
+++ b/overlay.d/05core/usr/lib/dracut/modules.d/40ignition-ostree/ignition-ostree-growfs.sh
@@ -111,7 +111,19 @@ wipefs -af -t "no${ROOTFS_TYPE}" "${src}"
 # TODO: Add XFS to https://github.com/systemd/systemd/blob/master/src/partition/growfs.c
 # and use it instead.
 case "${ROOTFS_TYPE}" in
-    xfs) xfs_growfs "${path}" ;;
+    xfs)
+        if ! xfs_will_grow_too_much; then
+            xfs_growfs "${src}"
+        else
+            source /usr/lib/ignition-ostree/transposefs-lib.sh
+            dev=$(activate_zram)
+            save_filesystem root
+            wipefs -a "${src}"
+            mkfs.xfs "${root_part}" -L root -m reflink=1
+            restore_filesystem root
+            deactivate_zram "${dev}"
+        fi
+        ;;
     ext4) resize2fs "${src}" ;;
     btrfs) btrfs filesystem resize max ${path} ;;
 esac
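
The xfs_will_grow_too_much helper above isn't defined in the diff. Purely as a hypothetical sketch (the variable names, the 128 threshold, and the assumption that xfs_info can read the not-yet-mounted root device are all illustrative), such a check could project the post-growth agcount from the existing geometry, with ${src} being the root block device as in the surrounding script:

xfs_will_grow_too_much() {
    local agsize_blocks block_size partition_bytes projected_agcount
    # Geometry of the existing (still small) filesystem on the root device.
    agsize_blocks=$(xfs_info "${src}" | sed -rn 's/.*agsize=([0-9]+) blks.*/\1/p')
    block_size=$(xfs_info "${src}" | sed -rn 's/.*bsize=([0-9]+) .*/\1/p' | head -n1)
    # Size of the partition the filesystem would be grown into.
    partition_bytes=$(blockdev --getsize64 "${src}")
    projected_agcount=$(( partition_bytes / (agsize_blocks * block_size) ))
    # Reprovision instead of growing once we would exceed ~128 allocation groups.
    [ "${projected_agcount}" -gt 128 ]
}

The fix that eventually merged takes a slightly different route (it checks the actual agcount after the grow), but it uses the same cutoff of 128; see the commit message quoted below.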

@nivekuil
Author

nivekuil commented May 18, 2022

Why is it recommended to have a separate /var? I can think of full disk prevention, but that seems more cleanly done with cgroups on machine.slice

@jlebon jlebon added the jira for syncing to jira label Mar 6, 2023
jlebon added a commit to jlebon/fedora-coreos-config that referenced this issue Mar 22, 2023
Add a new transposefs unit: `autosave-xfs`. This unit runs after
`ignition-disks` and `ignition-ostree-growfs`, but before the `restore`
transposefs unit.

If the XFS root was grown, it checks if the allocation group count
(agcount) is within a reasonable amount (128 is chosen here). If
it isn't, it saves the rootfs and reformats the filesystem. The
`restore` unit will then restore it as usual. In the case of in-place
reprovisioning like LUKS (i.e. where the partition table isn't modified
by the Ignition config), the rootfs is still saved only once.

Ideally, instead of adding a new transposefs unit, we would make it
part of the initial `save` unit. But at that point, there's no way to
tell whether we should autosave without gazing even more deeply into the
Ignition config. We also don't want to unconditionally save the rootfs
when we may not need it.

Closes: coreos/fedora-coreos-tracker#1183
@jlebon
Member

jlebon commented Mar 22, 2023

Option A implemented in coreos/fedora-coreos-config#2320.

jlebon added a commit to jlebon/fedora-coreos-config that referenced this issue Apr 5, 2023
jlebon added a commit to jlebon/fedora-coreos-config that referenced this issue Apr 6, 2023
jlebon added a commit to coreos/fedora-coreos-config that referenced this issue Apr 6, 2023
@dustymabe dustymabe added status/pending-testing-release Fixed upstream. Waiting on a testing release. status/pending-next-release Fixed upstream. Waiting on a next release. and removed status/pending-action Needs action labels Apr 7, 2023
@dustymabe
Member

The fix for this went into next stream release 38.20230408.1.1. Please try out the new release and report issues.

@dustymabe dustymabe removed the status/pending-next-release Fixed upstream. Waiting on a next release. label Apr 14, 2023
@dustymabe dustymabe changed the title coreos autoinstall creates huge number of xfs allocation groups CoreOS autoinstall creates huge number of xfs allocation groups Apr 14, 2023
@dustymabe dustymabe changed the title CoreOS autoinstall creates huge number of xfs allocation groups CoreOS autoinstall creates huge number of XFS allocation groups Apr 14, 2023
@dustymabe
Member

The fix for this went into testing stream release 38.20230414.2.0. Please try out the new release and report issues.

@dustymabe dustymabe added status/pending-stable-release Fixed upstream and in testing. Waiting on stable release. and removed status/pending-testing-release Fixed upstream. Waiting on a testing release. labels Apr 18, 2023
@dustymabe
Member

The fix for this went into stable stream release 38.20230414.3.0.

@dustymabe dustymabe removed the status/pending-stable-release Fixed upstream and in testing. Waiting on stable release. label May 3, 2023
c4rt0 pushed a commit to c4rt0/fedora-coreos-config that referenced this issue May 17, 2023
HuijingHei pushed a commit to HuijingHei/fedora-coreos-config that referenced this issue Oct 10, 2023
HuijingHei pushed a commit to HuijingHei/fedora-coreos-config that referenced this issue Oct 10, 2023
@cgwalters
Member

https://lore.kernel.org/all/[email protected]/ was posted:

For container/vm orchestration software, this isn't a huge issue as they
generally grow the image from within the initramfs context on first boot. That
is currently a "mount; xfs_growfs" operation pair; adding expansion to this
would simply require adding expansion before the mount. i.e. first boot becomes
a "xfs_expand; mount; xfs_growfs" operation. Depending on the eventual size of
the target filesystem, the xfs-growfs operation may be a no-op.

Yes, such a thing would be trivial for systems like CoreOS that use Ignition in the initramfs; the change to add an xfs_expand invocation would probably be a one-liner.

However...today cloud-init-based builds (e.g. many stock Fedora/RHEL cloud images) don't do provisioning in the initramfs; they do it in the real root.
A large part of the complexity is that some users today may be explicitly configuring distinct partitions (e.g. instead of expanding /, they configure /var/database or whatever), and how that gets configured would (given a generic image) need to be driven by fetching instance metadata, which again happens in the real root for those systems. (xref https://cloudinit.readthedocs.io/en/latest/reference/examples.html#create-partitions-and-filesystems)

r4f4 added a commit to r4f4/installer that referenced this issue Dec 11, 2024
Using the workaround in [1] until the issue is fixed in RHCOS.

[1] coreos/fedora-coreos-tracker#1183 (comment)