bogus min size estimates by 'btrfs inspect min' #271
Contrived reproduce steps pretty much show the problem (sketched below):

1. Terrible script to try to get a bunch of partially filled bgs.
2. Check out the bg usage map.
3. Get a minimum shrink estimate.
4. Shrink below the suggested minimum.
5. Shrink succeeds.
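For reference, a minimal sketch of steps like these; the loop device, mount point, and sizes are assumptions, not the original script:

```bash
# Hypothetical reproduction sketch; device, mount point and sizes are made up.
truncate -s 10G /tmp/btrfs.img
dev=$(losetup --find --show /tmp/btrfs.img)
mkfs.btrfs -f "$dev"
mkdir -p /mnt/test && mount "$dev" /mnt/test

# Fill the fs, then delete every other file so many block groups end up
# only partially used.
for i in $(seq 1 25); do
    dd if=/dev/zero of=/mnt/test/f$i bs=1M count=200 status=none
done
rm -f /mnt/test/f{1..25..2}
sync

btrfs filesystem usage /mnt/test                 # block group usage map
btrfs inspect-internal min-dev-size /mnt/test    # minimum shrink estimate
btrfs filesystem resize 5G /mnt/test             # try shrinking below the estimate
```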
Related
I tried to investigate this issue.
92GB is very pessimistic when you have more than 52GB free. Try resizing to 50GB:
Success! In fact, 'btrfs inspect min' assumes that btrfs is capable of moving only a full BG, so it tries to relocate BGs into the holes between the allocated BGs. When the holes are exhausted, the position of the last BG sets the minimum size of the disk. However, looking at the btrfs kernel code, I read a different story. When a BG is relocated, all its extents are relocated one at a time. This means that a BG can be emptied by relocating its contents to other BGs, and then (if empty) it can be deleted. My impression is that in the past btrfs moved only whole BGs when it shrank a disk, and 'btrfs inspect min' reflects this behavior. However, now btrfs is capable of relocating the individual extents of a BG, so the 'btrfs inspect min' output is quite unrealistic.
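To make the difference concrete, here is a toy back-of-the-envelope comparison of the two models; the layout and numbers are made up, and this is not the actual code in btrfs-progs or the kernel:

```bash
# Hypothetical layout: four 1 GiB data block groups, 1200 MiB used in total.
bg=1024     # block group size in MiB
bgs=4       # allocated block groups
used=1200   # MiB of data actually used

# Whole-BG model (what the estimate seems to assume): BGs can only be packed
# into existing holes, never emptied, so every allocated BG still needs room.
echo "whole-BG estimate : $(( bgs * bg )) MiB"                      # 4096

# Per-extent model (what relocation can actually do): a BG can be drained into
# other BGs and deleted, so roughly only the used data rounded up to BGs remains.
echo "per-extent estimate: $(( (used + bg - 1) / bg * bg )) MiB"    # 2048
```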
When a BG is relocated, the entire block group is relocated one extent at a time. Not quite the same thing as relocating one extent at a time--if it fails, the whole block group stays where it is. Each extent will pass individually through the allocator, so e.g. if you have a 1G block group and it contains 8x 128M extents, and every other block group contains 4K, balancing will decrease available space by 128M-4K because it can't pack the block group as efficiently as it was when the balance started. So in general, you can increase the minimum filesystem size by resizing it. There are also a lot of unexpected allocator behaviors I haven't had time to investigate.

I've been running a project that I call "Century Balance" (named after the ETA at the start of the balance, which was over 100 years). Century Balance is an attempt to balance a 50TB filesystem while converting between single and raid5 profiles. The running time for each iteration is measured in years, and the filesystem size I expect on paper and the filesystem size I get from btrfs differ by more than 10% at times.

I'm not even sure the top-level concept of a "minimum" filesystem size as a single number makes sense. The minimum size changes by a few GB depending on whether you are using the discard mount option, whether you are running a balance or scrub, how many scrubs you're running at a time, whether you plan to delete any snapshots, the final total filesystem size, and probably several other factors I haven't discovered yet.
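One way to read the 8x128M example above, with the sizes spelled out in KiB; this is just the arithmetic, not allocator code:

```bash
# 1 GiB block group, 128 MiB extents, one stray 4 KiB allocation; sizes in KiB.
bg=$((1024 * 1024)); ext=$((128 * 1024)); stray=4

# Before balance: the BG is packed perfectly with 8 x 128 MiB extents.
echo "wasted before: $(( bg - 8 * ext )) KiB"               # 0

# After balance: if a 4 KiB allocation lands in the new BG first, only 7 of the
# 8 extents fit there, and the 8th must go into yet another block group.
echo "wasted after : $(( bg - (7 * ext + stray) )) KiB"     # 131068 = 128M - 4K
```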
Ok, I read this as "the relocation is done on an extent basis; however, if something goes wrong, the full BG stays where it is".
Sorry, but I didn't understand this. If you have two BGs (BG = 1 GB), after a shrink/balancing you will end up with the same data, so the number of BGs will still be 2. I think that the worst case is that the minimum size of the filesystem is the sum of the BGs (without any extent relocation); at a minimum, what you gain is to "pack" all the BGs into the first part of the disk.

A typical scenario is

    1 x BG [ 1 x 600MB extent ]
    ...

You will end up with

    1 x BG [ 1 x 600MB extent ]
    ...

So you can reduce the size of the disk from 4 GB (allocated) to 3 GB, even if the space used is 1.7 GB, which would fit in 2 BGs (see the arithmetic sketch below).
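A hedged sketch of that packing arithmetic; the exact extent split (600 + 600 + 500 MB ≈ 1.7 GB) is an assumption, and it assumes balance does not split these extents:

```bash
# Sizes in MiB; assume balance relocates these extents whole.
bg=1024
extents=(600 600 500)   # assumed split of the ~1.7 GB of used space
used=0; for e in "${extents[@]}"; do used=$(( used + e )); done

echo "naive minimum: $(( (used + bg - 1) / bg )) BGs"   # 2 (1700 MiB / 1 GiB)
# But 600+600 > 1024 and 600+500 > 1024, so no two of these extents can share
# a block group, and the real minimum stays at one BG per extent:
echo "real minimum : ${#extents[@]} BGs"                # 3
```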
The allocator doesn't guarantee that it will find the optimal layout even for single-pass algorithms (and single-pass is a huge constraint against optimal layout all by itself). So you could start with

    7x128M in a 1G block group

and end up with two block groups:

    7x128M data
    1x(128M-4K) data

and you can have this pattern occur multiple times across a large filesystem. This seems to be due to intentional optimizations in the allocator that favor speed over size, though I have observed some curious nondeterministic behavior as well. Over hundreds or thousands of block groups this can introduce significant differences between theoretical and achievable size.

Also, not all block groups or holes are the same size, e.g. if your filesystem looks like this:

    1G BG
    512M hole
    1G BG

you'll need some temporary space to get rid of the 512M hole, or a relocation function that can move data smaller than a block group (e.g. copying the data in userspace and deduping it over the original to remove the block group can be much more flexible than balance because it can split extents arbitrarily, though it's much slower). You could just get lucky and one of the BGs has less than 512M of data in it, and you're also lucky enough to balance it first; but you could have bad luck and move all the data in the filesystem without reducing its size, or you could move the 512M hole to the beginning of the filesystem. More complex cases require multiple passes: a first pass to redistribute free space to make contiguous dev_extent holes >1G, and later passes to move data into 1G block groups (or 1G-aligned block groups, to prevent <1G holes from reappearing).

You can also have more complex cases, like a filesystem full of 960MB holes between 1G block groups and 128MB data extents. This is fairly common because system chunks are not 1GB long while all other chunks are 1GB (on filesystems >50GB). Each block group balanced could add a requirement for 64MB more space if they are balanced in the worst possible order (and the worst possible order introduces more dev_extent fragmentation, so the next resize will be even worse). The best possible order balances the first block group after a 960MB hole first, creating a 1.96GB hole that can then be filled with 1G block groups on the low side and extended by deleting 1G block groups on the high side. All other orders waste at least 960MB of space, and the worst order breaks the filesystem into 192MB dev_extents with 64MB holes between them, wastes 50% of the space, and can't run balance any more (no extent will fit in any possible new block group). Balance has no logic capable of choosing the best possible order, so space will be lost--it's a question of how much, and whether it happens to stumble on an ordering that isn't the worst possible one. The worst cases seem to happen on block groups using striping profiles (raid0, 10, 5, and 6). Century Balance ended up with 40 block groups of each of the 1024 possible sizes from 1M to 1G.

Even very simple filesystems (1TB single disk) can end up with a dozen different block group sizes because of system chunks, and require a multi-pass algorithm to get their dev_extents defragmented. If a multi-pass algorithm is required, then the minimum filesystem size for the first pass must be larger than the minimum filesystem size for the second pass, except under very specific conditions (e.g. all data extents are the same size).

At the low extreme, the one block group that balance locks has a significant impact. e.g.
a filesystem consisting of only one block group has a minimum size that is larger than the current size of the filesystem. Because of the block group locking, none of the existing space can be counted when calculating the filesystem's minimum size for resize, so the "minimum size" is the current size of the filesystem (which is unusable because of block group locking) plus the size of all the data in the filesystem (which has to be relocated somewhere else during resize). Or put another way: you do the minimum size calculation and then don't do the resize, because the best-case outcome of one pass of the resize algorithm is a larger filesystem than what you started with.
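For that single-block-group extreme, the arithmetic looks roughly like this; the sizes are made up:

```bash
# One 1 GiB block group holding 700 MiB of data; sizes in MiB (assumptions).
bg=1024; data=700; current=$bg

# The only BG is locked by balance, so its own space cannot receive the
# relocated data; the data needs at least one fresh BG somewhere else.
minimum=$(( current + ( (data + bg - 1) / bg ) * bg ))
echo "current: ${current} MiB, computed minimum: ${minimum} MiB"   # 2048 > 1024
```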
Example 1:
I'm expecting a minimum size of around 25-26G; 1MiB is plainly not possible.
Example 2:
I'm expecting a minimum shrink value of ~20-21G. The estimate seems to be based on the 'Device allocated' amount.
strace-example1.txt
strace-example2.txt
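For reference, a hedged sketch of the commands behind such a comparison; the mount point is hypothetical, and the actual runs are in the attached strace files:

```bash
# Compare the tool's estimate against the allocated/used figures.
btrfs inspect-internal min-dev-size /mnt    # the questionable minimum
btrfs filesystem usage /mnt                 # 'Device allocated' vs actual usage
btrfs filesystem show /mnt                  # per-device size and usage
```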