bogus min size estimates by 'btrfs inspect min' #271

Open
cmurf opened this issue Jul 12, 2020 · 6 comments
cmurf commented Jul 12, 2020

5.8.0-0.rc4.1.fc33.x86_64+debug
btrfs-progs v5.7 

Example 1

$ sudo btrfs insp min /
1048576 bytes (1.00MiB)
$ sudo btrfs fi us /
Overall:
    Device size:		 105.00GiB
    Device allocated:		  55.03GiB
    Device unallocated:		  49.97GiB
    Device missing:		     0.00B
    Used:			  24.80GiB
    Free (estimated):		  78.75GiB	(min: 78.75GiB)
    Data ratio:			      1.00
    Metadata ratio:		      1.00
    Global reserve:		  62.25MiB	(used: 0.00B)
    Multiple profiles:		        no

Data,single: Size:53.00GiB, Used:24.22GiB (45.69%)
   /dev/sda4	  53.00GiB

Metadata,single: Size:2.00GiB, Used:596.48MiB (29.13%)
   /dev/sda4	   2.00GiB

System,single: Size:32.00MiB, Used:16.00KiB (0.05%)
   /dev/sda4	  32.00MiB

Unallocated:
   /dev/sda4	  49.97GiB

I'm expecting a minimum size of around 25-26G. 1MiB is plainly not possible.

Example 2:

$ sudo btrfs inspect min /
50542411776 bytes (47.07GiB)
$ sudo btrfs fi us /
Overall:
    Device size:		 178.00GiB
    Device allocated:		  46.04GiB
    Device unallocated:		 131.96GiB
    Device missing:		     0.00B
    Used:			  19.42GiB
    Free (estimated):		 156.41GiB	(min: 156.41GiB)
    Data ratio:			      1.00
    Metadata ratio:		      1.00
    Global reserve:		  64.09MiB	(used: 0.00B)
    Multiple profiles:		        no

Data,single: Size:43.01GiB, Used:18.56GiB (43.15%)
   /dev/nvme0n1p7	  43.01GiB

Metadata,single: Size:3.00GiB, Used:879.53MiB (28.63%)
   /dev/nvme0n1p7	   3.00GiB

System,single: Size:32.00MiB, Used:32.00KiB (0.10%)
   /dev/nvme0n1p7	  32.00MiB

Unallocated:
   /dev/nvme0n1p7	 131.96GiB

I'm expecting a minimum shrink value of ~20-21G. The estimate seems to be based on the 'device allocated' amount.
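For reference, here is the back-of-envelope arithmetic behind the ~20-21G expectation (my own naive estimate from the `btrfs fi us` output above, not the tool's algorithm):

```python
# Back-of-envelope estimate from the 'btrfs fi us' output in example 2:
# data can in principle be repacked, so the floor should be near the data
# actually used, plus the metadata and system chunks that stay allocated.
GiB = 1024 ** 3
MiB = 1024 ** 2

data_used  = 18.56 * GiB   # Data,single: Used
meta_alloc = 3.00 * GiB    # Metadata,single: Size
sys_alloc  = 32 * MiB      # System,single: Size

expected_min = data_used + meta_alloc + sys_alloc
print(f"{expected_min / GiB:.2f} GiB")  # ~21.59 GiB, in line with the ~20-21G guess
```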

strace-example1.txt
strace-example2.txt

cmurf commented Jul 12, 2020

Contrived reproduction steps pretty much show that btrfs insp min isn't working correctly: resizing below its reported minimum succeeds.

truncate -s 100g test.raw
losetup /dev/loop0 test.raw
mkfs.btrfs -msingle -dsingle /dev/loop0
mount /dev/loop0 /media
cd /media

Terrible script to try to get a bunch of partially filled bgs.

#!/bin/bash
for i in $(seq 90); do
    fallocate -l 1g "File${i}"
done
sync
for i in $(seq 90); do
    if (( i % 2 )); then
        rm "File${i}"
    fi
done

Check out the bg usage map.

# /home/chris/Applications/btrfs-debugfs -b /media
block group offset       22020096 len 1073741824 used   24379392 chunk_objectid 256 flags 1 usage 0.02
block group offset     1095761920 len 1073741824 used          0 chunk_objectid 256 flags 1 usage 0.00
block group offset     2169503744 len 1073741824 used          0 chunk_objectid 256 flags 1 usage 0.00
block group offset     3243245568 len 1073741824 used          0 chunk_objectid 256 flags 1 usage 0.00
block group offset     4316987392 len 1073741824 used          0 chunk_objectid 256 flags 1 usage 0.00
block group offset     5390729216 len 1073741824 used          0 chunk_objectid 256 flags 1 usage 0.00
block group offset     6464471040 len 1073741824 used          0 chunk_objectid 256 flags 1 usage 0.00
block group offset     7538212864 len 1073741824 used          0 chunk_objectid 256 flags 1 usage 0.00
block group offset     8611954688 len 1073741824 used          0 chunk_objectid 256 flags 1 usage 0.00
block group offset     9685696512 len 1073741824 used          0 chunk_objectid 256 flags 1 usage 0.00
block group offset    10759438336 len 1073741824 used          0 chunk_objectid 256 flags 1 usage 0.00
block group offset    11833180160 len 1073741824 used  805306368 chunk_objectid 256 flags 1 usage 0.75
block group offset    12906921984 len 1073741824 used  268435456 chunk_objectid 256 flags 1 usage 0.25
block group offset    13980663808 len 1073741824 used  805306368 chunk_objectid 256 flags 1 usage 0.75
block group offset    15054405632 len 1073741824 used  268435456 chunk_objectid 256 flags 1 usage 0.25
block group offset    16128147456 len 1073741824 used  805306368 chunk_objectid 256 flags 1 usage 0.75
block group offset    17201889280 len 1073741824 used  268435456 chunk_objectid 256 flags 1 usage 0.25
block group offset    18275631104 len 1073741824 used  805306368 chunk_objectid 256 flags 1 usage 0.75
block group offset    19349372928 len 1073741824 used  268435456 chunk_objectid 256 flags 1 usage 0.25
block group offset    20423114752 len 1073741824 used  805306368 chunk_objectid 256 flags 1 usage 0.75
block group offset    21496856576 len 1073741824 used  268435456 chunk_objectid 256 flags 1 usage 0.25
block group offset    22570598400 len 1073741824 used  805306368 chunk_objectid 256 flags 1 usage 0.75
block group offset    23644340224 len 1073741824 used  268435456 chunk_objectid 256 flags 1 usage 0.25
block group offset    24718082048 len 1073741824 used  805306368 chunk_objectid 256 flags 1 usage 0.75
block group offset    25791823872 len 1073741824 used  268435456 chunk_objectid 256 flags 1 usage 0.25
block group offset    26865565696 len 1073741824 used  805306368 chunk_objectid 256 flags 1 usage 0.75
block group offset    27939307520 len 1073741824 used  268435456 chunk_objectid 256 flags 1 usage 0.25
block group offset    29013049344 len 1073741824 used  805306368 chunk_objectid 256 flags 1 usage 0.75
block group offset    30086791168 len 1073741824 used  268435456 chunk_objectid 256 flags 1 usage 0.25
block group offset    31160532992 len 1073741824 used  805306368 chunk_objectid 256 flags 1 usage 0.75
block group offset    32234274816 len 1073741824 used  268435456 chunk_objectid 256 flags 1 usage 0.25
block group offset    33308016640 len 1073741824 used  805306368 chunk_objectid 256 flags 1 usage 0.75
block group offset    34381758464 len 1073741824 used  268435456 chunk_objectid 256 flags 1 usage 0.25
block group offset    35455500288 len 1073741824 used  805306368 chunk_objectid 256 flags 1 usage 0.75
block group offset    36529242112 len 1073741824 used  268435456 chunk_objectid 256 flags 1 usage 0.25
block group offset    37602983936 len 1073741824 used  805306368 chunk_objectid 256 flags 1 usage 0.75
block group offset    38676725760 len 1073741824 used  268435456 chunk_objectid 256 flags 1 usage 0.25
block group offset    39750467584 len 1073741824 used  805306368 chunk_objectid 256 flags 1 usage 0.75
block group offset    40824209408 len 1073741824 used  268435456 chunk_objectid 256 flags 1 usage 0.25
block group offset    41897951232 len 1073741824 used  805306368 chunk_objectid 256 flags 1 usage 0.75
block group offset    42971693056 len 1073741824 used  268435456 chunk_objectid 256 flags 1 usage 0.25
block group offset    44045434880 len 1073741824 used  805306368 chunk_objectid 256 flags 1 usage 0.75
block group offset    45119176704 len 1073741824 used  268435456 chunk_objectid 256 flags 1 usage 0.25
block group offset    46192918528 len 1073741824 used  805306368 chunk_objectid 256 flags 1 usage 0.75
block group offset    47266660352 len 1073741824 used  268435456 chunk_objectid 256 flags 1 usage 0.25
block group offset    48340402176 len 1073741824 used  805306368 chunk_objectid 256 flags 1 usage 0.75
block group offset    49414144000 len 1073741824 used  268435456 chunk_objectid 256 flags 1 usage 0.25
block group offset    50487885824 len 1073741824 used  805306368 chunk_objectid 256 flags 1 usage 0.75
block group offset    51561627648 len 1073741824 used  268435456 chunk_objectid 256 flags 1 usage 0.25
block group offset    52635369472 len 1073741824 used  805306368 chunk_objectid 256 flags 1 usage 0.75
block group offset    53709111296 len 1073741824 used  268435456 chunk_objectid 256 flags 1 usage 0.25
block group offset    54782853120 len 1073741824 used  805306368 chunk_objectid 256 flags 1 usage 0.75
block group offset    55856594944 len 1073741824 used  268435456 chunk_objectid 256 flags 1 usage 0.25
block group offset    56930336768 len 1073741824 used  805306368 chunk_objectid 256 flags 1 usage 0.75
block group offset    58004078592 len 1073741824 used  268435456 chunk_objectid 256 flags 1 usage 0.25
block group offset    59077820416 len 1073741824 used  805306368 chunk_objectid 256 flags 1 usage 0.75
block group offset    60151562240 len 1073741824 used  268435456 chunk_objectid 256 flags 1 usage 0.25
block group offset    61225304064 len 1073741824 used  805306368 chunk_objectid 256 flags 1 usage 0.75
block group offset    62299045888 len 1073741824 used  268435456 chunk_objectid 256 flags 1 usage 0.25
block group offset    63372787712 len 1073741824 used  805306368 chunk_objectid 256 flags 1 usage 0.75
block group offset    64446529536 len 1073741824 used  268435456 chunk_objectid 256 flags 1 usage 0.25
block group offset    65520271360 len 1073741824 used  805306368 chunk_objectid 256 flags 1 usage 0.75
block group offset    66594013184 len 1073741824 used  268435456 chunk_objectid 256 flags 1 usage 0.25
block group offset    67667755008 len 1073741824 used  805306368 chunk_objectid 256 flags 1 usage 0.75
block group offset    68741496832 len 1073741824 used  268435456 chunk_objectid 256 flags 1 usage 0.25
block group offset    69815238656 len 1073741824 used  805306368 chunk_objectid 256 flags 1 usage 0.75
block group offset    70888980480 len 1073741824 used  268435456 chunk_objectid 256 flags 1 usage 0.25
block group offset    71962722304 len 1073741824 used  805306368 chunk_objectid 256 flags 1 usage 0.75
block group offset    73036464128 len 1073741824 used  268435456 chunk_objectid 256 flags 1 usage 0.25
block group offset    74110205952 len 1073741824 used  805306368 chunk_objectid 256 flags 1 usage 0.75
block group offset    75183947776 len 1073741824 used  268435456 chunk_objectid 256 flags 1 usage 0.25
block group offset    76257689600 len 1073741824 used  805306368 chunk_objectid 256 flags 1 usage 0.75
block group offset    77331431424 len 1073741824 used  268435456 chunk_objectid 256 flags 1 usage 0.25
block group offset    78405173248 len 1073741824 used  805306368 chunk_objectid 256 flags 1 usage 0.75
block group offset    79478915072 len 1073741824 used  268435456 chunk_objectid 256 flags 1 usage 0.25
block group offset    80552656896 len 1073741824 used  805306368 chunk_objectid 256 flags 1 usage 0.75
block group offset    81626398720 len 1073741824 used  268435456 chunk_objectid 256 flags 1 usage 0.25
block group offset    82700140544 len 1073741824 used  805306368 chunk_objectid 256 flags 1 usage 0.75
block group offset    83773882368 len 1073741824 used  268435456 chunk_objectid 256 flags 1 usage 0.25
block group offset    84847624192 len 1073741824 used  805306368 chunk_objectid 256 flags 1 usage 0.75
block group offset    85921366016 len 1073741824 used  268435456 chunk_objectid 256 flags 1 usage 0.25
block group offset    86995107840 len 1073741824 used  805306368 chunk_objectid 256 flags 1 usage 0.75
block group offset    88068849664 len 1073741824 used  268435456 chunk_objectid 256 flags 1 usage 0.25
block group offset    89142591488 len 1073741824 used  805306368 chunk_objectid 256 flags 1 usage 0.75
block group offset    90216333312 len 1073741824 used  268435456 chunk_objectid 256 flags 1 usage 0.25
block group offset    91290075136 len 1073741824 used  805306368 chunk_objectid 256 flags 1 usage 0.75
block group offset    92363816960 len 1073741824 used  268435456 chunk_objectid 256 flags 1 usage 0.25
block group offset    93437558784 len 1073741824 used  805306368 chunk_objectid 256 flags 1 usage 0.75
block group offset    94511300608 len 1073741824 used  268435456 chunk_objectid 256 flags 1 usage 0.25
block group offset    95585042432 len 1073741824 used  805306368 chunk_objectid 256 flags 1 usage 0.75
block group offset    96658784256 len 1073741824 used  268435456 chunk_objectid 256 flags 1 usage 0.25
block group offset    97732526080 len 1073741824 used  805306368 chunk_objectid 256 flags 1 usage 0.75
block group offset    98806267904 len 1073741824 used  268435456 chunk_objectid 256 flags 1 usage 0.25
total_free 55810195456 min_used 0 free_of_min_used 1073741824 block_group_of_min_used 10759438336
balance block group (10759438336) can reduce the number of data block group

# btrfs fi us /media
Overall:
    Device size:		 100.00GiB
    Device allocated:		  71.01GiB
    Device unallocated:		  28.99GiB
    Device missing:		     0.00B
    Used:			  45.02GiB
    Free (estimated):		  54.97GiB	(min: 54.97GiB)
    Data ratio:			      1.00
    Metadata ratio:		      1.00
    Global reserve:		   3.25MiB	(used: 0.00B)
    Multiple profiles:		        no

Data,single: Size:71.00GiB, Used:45.02GiB (63.40%)
   /dev/loop0	  71.00GiB

Metadata,single: Size:8.00MiB, Used:240.00KiB (2.93%)
   /dev/loop0	   8.00MiB

System,single: Size:4.00MiB, Used:16.00KiB (0.39%)
   /dev/loop0	   4.00MiB

Unallocated:
   /dev/loop0	  28.99GiB

Get a minimum shrink estimate.

# btrfs insp min /media
77356597248 bytes (72.04GiB)

# btrfs fi show
Label: none  uuid: d2c502c8-0a13-4319-a2f5-1e9b8cde6e21
	Total devices 1 FS bytes used 45.02GiB
	devid    1 size 100.00GiB used 71.01GiB path /dev/loop0

Shrink below the minimum suggested.

# btrfs fi resize 1:60g /media
Resize '/media' of '1:60g'

Shrink succeeds.

# btrfs fi show
Label: none  uuid: d2c502c8-0a13-4319-a2f5-1e9b8cde6e21
	Total devices 1 FS bytes used 45.01GiB
	devid    1 size 60.00GiB used 47.01GiB path /dev/loop0

# dmesg
...
[  809.184095] loop: module loaded
[  819.331244] BTRFS: device fsid d2c502c8-0a13-4319-a2f5-1e9b8cde6e21 devid 1 transid 5 /dev/loop0 scanned by mkfs.btrfs (1631)
[  825.551808] BTRFS info (device loop0): disk space caching is enabled
[  825.551820] BTRFS info (device loop0): has skinny extents
[  825.551826] BTRFS info (device loop0): flagging fs with big metadata feature
[  825.557609] BTRFS info (device loop0): enabling ssd optimizations
[  825.557832] BTRFS info (device loop0): checking UUID tree
[ 1940.201619] BTRFS info (device loop0): resizing devid 1
[ 1940.230687] BTRFS info (device loop0): relocating block group 96658784256 flags data
[ 1941.319242] BTRFS info (device loop0): found 1 extents, stage: move data extents
[ 1941.916171] BTRFS info (device loop0): found 1 extents, stage: update data pointers
[ 1941.981280] BTRFS info (device loop0): relocating block group 95585042432 flags data
[ 1945.458993] BTRFS info (device loop0): found 3 extents, stage: move data extents
[ 1946.568904] BTRFS info (device loop0): found 3 extents, stage: update data pointers
[ 1946.618748] BTRFS info (device loop0): relocating block group 93437558784 flags data
[ 1947.674701] BTRFS info (device loop0): found 1 extents, stage: move data extents
[ 1948.338065] BTRFS info (device loop0): found 1 extents, stage: update data pointers
[ 1948.392810] BTRFS info (device loop0): relocating block group 92363816960 flags data
[ 1952.989938] BTRFS info (device loop0): found 4 extents, stage: move data extents
[ 1953.745552] BTRFS info (device loop0): found 4 extents, stage: update data pointers
[ 1953.799482] BTRFS info (device loop0): relocating block group 91290075136 flags data
[ 1958.807490] BTRFS info (device loop0): found 3 extents, stage: move data extents
[ 1958.967339] BTRFS info (device loop0): found 3 extents, stage: update data pointers
[ 1959.053767] BTRFS info (device loop0): relocating block group 89142591488 flags data
[ 1960.148365] BTRFS info (device loop0): found 1 extents, stage: move data extents
[ 1961.220043] BTRFS info (device loop0): found 1 extents, stage: update data pointers
[ 1961.296691] BTRFS info (device loop0): relocating block group 88068849664 flags data
[ 1967.239693] BTRFS info (device loop0): found 4 extents, stage: move data extents
[ 1968.318766] BTRFS info (device loop0): found 4 extents, stage: update data pointers
[ 1968.398055] BTRFS info (device loop0): relocating block group 86995107840 flags data
[ 1973.328082] BTRFS info (device loop0): found 3 extents, stage: move data extents
[ 1974.514197] BTRFS info (device loop0): found 3 extents, stage: update data pointers
[ 1974.606972] BTRFS info (device loop0): relocating block group 84847624192 flags data
[ 1975.682479] BTRFS info (device loop0): found 1 extents, stage: move data extents
[ 1976.860417] BTRFS info (device loop0): found 1 extents, stage: update data pointers
[ 1976.936851] BTRFS info (device loop0): relocating block group 83773882368 flags data
[ 1979.257083] register-python (2425) used greatest stack depth: 11320 bytes left
[ 1984.156603] BTRFS info (device loop0): found 4 extents, stage: move data extents
[ 1984.234248] BTRFS info (device loop0): found 4 extents, stage: update data pointers
[ 1984.320864] BTRFS info (device loop0): relocating block group 82700140544 flags data
[ 1990.487070] BTRFS info (device loop0): found 3 extents, stage: move data extents
[ 1990.823143] BTRFS info (device loop0): found 3 extents, stage: update data pointers
[ 1990.905550] BTRFS info (device loop0): relocating block group 80552656896 flags data
[ 1992.006413] BTRFS info (device loop0): found 1 extents, stage: move data extents
[ 1993.385682] BTRFS info (device loop0): found 1 extents, stage: update data pointers
[ 1993.476219] BTRFS info (device loop0): relocating block group 79478915072 flags data
[ 2002.053800] BTRFS info (device loop0): found 4 extents, stage: move data extents
[ 2002.135986] BTRFS info (device loop0): found 4 extents, stage: update data pointers
[ 2002.224580] BTRFS info (device loop0): relocating block group 78405173248 flags data
[ 2007.903679] BTRFS info (device loop0): found 3 extents, stage: move data extents
[ 2009.986309] BTRFS info (device loop0): found 3 extents, stage: update data pointers
[ 2010.073363] BTRFS info (device loop0): relocating block group 76257689600 flags data
[ 2011.173721] BTRFS info (device loop0): found 1 extents, stage: move data extents
[ 2013.002750] BTRFS info (device loop0): found 1 extents, stage: update data pointers
[ 2013.080625] BTRFS info (device loop0): relocating block group 75183947776 flags data
[ 2023.202883] BTRFS info (device loop0): found 4 extents, stage: move data extents
[ 2023.273449] BTRFS info (device loop0): found 4 extents, stage: update data pointers
[ 2023.361832] BTRFS info (device loop0): relocating block group 74110205952 flags data
[ 2032.853760] BTRFS info (device loop0): found 3 extents, stage: move data extents
[ 2032.936725] BTRFS info (device loop0): found 3 extents, stage: update data pointers
[ 2033.032209] BTRFS info (device loop0): relocating block group 71962722304 flags data
[ 2034.140232] BTRFS info (device loop0): found 1 extents, stage: move data extents
[ 2036.639227] BTRFS info (device loop0): found 1 extents, stage: update data pointers
[ 2036.755283] BTRFS info (device loop0): relocating block group 70888980480 flags data
[ 2049.507142] BTRFS info (device loop0): found 4 extents, stage: move data extents
[ 2049.608053] BTRFS info (device loop0): found 4 extents, stage: update data pointers
[ 2049.712009] BTRFS info (device loop0): relocating block group 69815238656 flags data
[ 2060.183711] BTRFS info (device loop0): found 3 extents, stage: move data extents
[ 2060.268794] BTRFS info (device loop0): found 3 extents, stage: update data pointers
[ 2060.367497] BTRFS info (device loop0): relocating block group 67667755008 flags data
[ 2061.476323] BTRFS info (device loop0): found 1 extents, stage: move data extents
[ 2063.679664] BTRFS info (device loop0): found 1 extents, stage: update data pointers
[ 2063.770147] BTRFS info (device loop0): relocating block group 66594013184 flags data
[ 2078.041004] BTRFS info (device loop0): found 4 extents, stage: move data extents
[ 2078.126446] BTRFS info (device loop0): found 4 extents, stage: update data pointers
[ 2078.196914] BTRFS info (device loop0): relocating block group 65520271360 flags data
[ 2089.814043] BTRFS info (device loop0): found 3 extents, stage: move data extents
[ 2089.914795] BTRFS info (device loop0): found 3 extents, stage: update data pointers
[ 2089.994960] BTRFS info (device loop0): relocating block group 63372787712 flags data
[ 2091.081320] BTRFS info (device loop0): found 1 extents, stage: move data extents
[ 2094.636617] BTRFS info (device loop0): found 1 extents, stage: update data pointers
[ 2094.727825] BTRFS info (device loop0): resize device /dev/loop0 (devid 1) from 107374182400 to 64424509440
$ 

cmurf commented Jul 12, 2020

Related
storaged-project/libblockdev#548

kdave added the bug label on Jul 14, 2020
kreijack (Contributor) commented

I tried to investigate this issue.
First, the test case reported in #271 (comment) didn't work for me. I used a slightly different one, changing the file size from 1GB to 256MB and increasing the number of files from 90 to 90*4; with these new parameters, the test case was able to highlight the problem.

# truncate -s 100g test.raw
# losetup /dev/loop0 test.raw
# mkfs.btrfs /dev/loop0
# mount /dev/loop0 /media
# cd /media

# cat fill.sh
for i in $(seq 360); do
    fallocate -l 256m "File${i}"
done
sync
for i in $(seq 360); do
    if (( i % 2 )); then
        rm "File${i}"
    fi
done

# sh fill.sh

# btrfs fi us /mnt/test | egrep "Free \(estimated|Device size"
    Device size:		 100.00GiB
    Free (estimated):		  52.98GiB	(min: 48.99GiB)

# btrfs insp min /mnt/test             
98823045120 bytes (92.04GiB)

92GB is very pessimistic when there is more than 52GB free. Try resizing to 50GB:

# btrfs fi resize $((100-50))G /mnt/test
Resize device id 1 (/dev/vdc) from 100.00GiB to 50.00GiB
# btrfs fi us /mnt/test | egrep "Free \(estimated|Device size"
    Device size:		  50.00GiB
    Free (estimated):		   2.98GiB	(min: 2.49GiB)

Success!

In fact, 'btrfs inspect min' assumes that btrfs can only move a full BG, so it tries to relocate BGs into the holes between the allocated BGs. When the holes are exhausted, the position of the last BG sets the minimum size of the disk.
This means that if the disk is entirely filled with BGs that are 50% full, half of the disk is free but no space can be reclaimed.

However, looking at the btrfs kernel code, I read a different story. When a BG is relocated, its extents are relocated one at a time. This means that a BG can be emptied by relocating its contents into other BGs, and then (once empty) it can be deleted.

My impression is that in the past btrfs moved only whole BGs when shrinking a disk, and 'btrfs inspect min' reflects that behavior. Now, however, btrfs can relocate the individual extents of a BG, so the output of 'btrfs inspect min' is quite unrealistic.
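The two models contrasted here can be sketched in a few lines (a hypothetical illustration, not the actual btrfs-progs or kernel code; `min_size_whole_bg` and `min_size_per_extent` are made-up names):

```python
import math

MiB = 1024 ** 2
GiB = 1024 ** 3

def min_size_whole_bg(bg_lengths, reserved=1 * MiB):
    """Old model: block groups move only as whole units, so after packing
    them contiguously the end of the last BG sets the minimum device size."""
    return reserved + sum(bg_lengths)

def min_size_per_extent(bg_used, bg_size=1 * GiB, reserved=1 * MiB):
    """Kernel reality: extents are relocated one at a time, so BGs can be
    emptied into each other and only the used bytes matter (rounded up to
    whole block groups)."""
    return reserved + math.ceil(sum(bg_used) / bg_size) * bg_size

# Reproducer-like layout: data BGs alternating 75% and 25% full.
lengths = [1 * GiB] * 4
used    = [768 * MiB, 256 * MiB, 768 * MiB, 256 * MiB]
print(min_size_whole_bg(lengths) // MiB)   # 4097 MiB: all four BGs kept
print(min_size_per_extent(used) // MiB)    # 2049 MiB: repacked into two BGs
```

On the reproducer's alternating 0.75/0.25 usage map the two models diverge by roughly a factor of two, which matches the gap between the reported and expected minimums.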


Zygo commented Jan 29, 2022

When a BG is relocated, the entire block group is relocated one extent at a time. Not quite the same thing as relocating one extent at a time--if it fails, the whole block group stays where it is.

Each extent will pass individually through the allocator, so e.g. if you have a 1G block group and it contains 8x 128M extents, and every other block group contains 4K, balancing will decrease available space by 128M-4K because it can't pack the block group as efficiently as it was when the balance started. So in general, you can increase the minimum filesystem size by resizing it.
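The 128M-4K figure can be checked with rough first-fit arithmetic (a hypothetical model of the allocator, assuming each extent lands in the first block group with room):

```python
# Hypothetical first-fit check of the example above: a 1 GiB BG holds
# 8 x 128 MiB extents; every candidate target BG already holds a 4 KiB
# extent, leaving 1 GiB - 4 KiB free.
KiB, MiB, GiB = 1024, 1024 ** 2, 1024 ** 3

free_per_target = 1 * GiB - 4 * KiB
extents_that_fit = free_per_target // (128 * MiB)
print(extents_that_fit)  # 7 -- the 8th extent must open a new block group

# Net effect: space that was perfectly packed before the balance now loses
# about one extent's worth, minus the small extent blocking it.
loss = 128 * MiB - 4 * KiB
print(loss)  # 134213632 bytes, i.e. 128M-4K
```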

There are also a lot of unexpected allocator behaviors I haven't had time to investigate. I've been running a project that I call "Century Balance" (named after the ETA at the start of the balance, which was over 100 years). Century Balance is an attempt to balance a 50TB filesystem while converting between single and raid5 profiles. The running time for each iteration is measured in years, and the filesystem size I expect on paper and the size I get from btrfs differ by more than 10% at times.

I'm not even sure the top-level concept of a "minimum" filesystem size as a single number makes sense. The minimum size changes by a few GB depending on whether you are using the discard mount option, running a balance or scrub, how many scrubs you're running at a time, whether you plan to delete any snapshots, the final total filesystem size, and probably several other factors I haven't discovered yet.

kreijack (Contributor) commented

When a BG is relocated, the entire block group is relocated one extent at a time. Not quite the same thing as relocating one extent at a time--if it fails, the whole block group stays where it is.

OK, I read this as "the relocation is done on an extent basis; however, if something goes wrong, the full BG stays where it is."
The point is the likelihood of "something going wrong".

Each extent will pass individually through the allocator, so e.g. if you have a 1G block group and it
contains 8x 128M extents, and every other block group contains 4K, balancing will decrease available
space by 128M-4K because it can't pack the block group as efficiently as it was when the balance
started. So in general, you can increase the minimum filesystem size by resizing it.

Sorry, but I didn't understand this. If you have (BG = 1 GB):
1 x BG: 1 x 4k extent
1 x BG: 8 x 128M extents

After a shrink/balance you will end up with:
1 x BG: 1x 4k extent + 7 x 128M extents
1 x BG: 1 x 128M extent

So the number of BGs is still 2.

I think the worst case is that the minimum size of the filesystem is the sum of the BGs (without any extent relocation). At minimum, what you gain is to "pack" all the BGs into the first part of the disk.
The best case is that you can pack all the extents into the first part of the disk; the difficulty is finding space for each extent.

A typical scenario is

1 x BG [ 1 x 600MB extent ]
[ hole: 1GB, a BG fully reclaimed in the past ]
1 x BG [ 1 x 600MB extent ]
1 x BG [ 1 x 500MB extent ]

You will end up with:

1 x BG [ 1 x 600MB extent ]
1 x BG [ 1 x 600MB extent ]
1 x BG [ 1 x 500MB extent ]

So you can reduce the size of the disk from 4GB (allocated) to 3GB, even though the space used is 1.7GB, which would fit in 2 BGs.


Zygo commented Jan 30, 2022

The allocator doesn't guarantee that it will find the optimal layout even for single-pass algorithms (and single-pass is a huge constraint against optimal layout all by itself). So you could start with:

7x128M
1x4K
1x(128M-4K)

in a 1G block group, and end up with two block groups:

7x128M data
1x4K data
1x(128M-4K) free space

1x(128M-4K) data
1x(1G+4K-128M) free space

and you can have this pattern occur multiple times across a large filesystem.

This seems to be due to intentional optimizations in the allocator that favor speed over size, though I have observed some curious nondeterministic behavior as well. Over hundreds or thousands of block groups this can introduce significant differences between theoretical and achievable size.

Also not all block groups or holes are the same size, e.g. if your filesystem looks like this:

1G BG
512M empty space (e.g. where a system chunk used to be)
1G BG

you'll need some temporary space to get rid of the 512M hole, or a relocation function that can move data smaller than a block group (e.g. copying the data in userspace and deduping it over the original to remove the block group can be much more flexible than balance because it can split extents arbitrarily, though it's much slower). You could just get lucky and one of the BGs has less than 512M of data in it, and you're also lucky enough to balance it first, but you could have bad luck and move all the data in the filesystem without reducing its size, or you could move the 512M hole to the beginning of the filesystem. More complex cases require multiple passes: first to redistribute free space to make contiguous dev_extent holes >1G, and later passes to move data into 1G block groups (or 1G-aligned block groups to prevent <1G holes from reappearing).

You can also have more complex cases, like a filesystem full of 960MB holes between 1G block groups and 128MB data extents. This is fairly common because system chunks are not 1GB long while all other chunks are 1GB (on filesystems >50GB). Each block group balanced could add a requirement for 64MB more space if they are balanced in the worst possible order (and the worst possible order introduces more dev_extent fragmentation, so the next resize will be even worse). The best possible order balances the first block group after a 960MB hole first, creating a 1.96GB hole that can then be filled with 1G block groups on the low side and extended by deleting 1G block groups on the high side. All other orders waste at least 960MB of space, and the worst order breaks the filesystem into 192MB dev_extents with 64MB holes between, wastes 50% of the space, and can't run balance any more (no extent will fit in any possible new block group). Balance has no logic capable of choosing the best possible order, so space will be lost--it's a question of how much, and whether it happens to stumble on an ordering that isn't the worst possible order.

The worst cases seem to happen on block groups using striping profiles (raid0, 10, 5, and 6). Century Balance ended up with 40 block groups of each of the 1024 possible sizes from 1M to 1G. Even very simple filesystems (1TB single disk) can end up with a dozen different block group sizes because of system chunks, and require a multi-pass algorithm to get their dev_extents defragmented. If a multi-pass algorithm is required, then the minimum filesystem size for the first pass must be larger than the minimum filesystem size for the second pass, except under very specific conditions (e.g. all data extents are the same size).

At the low extreme, the one block group that balance locks has a significant impact. e.g. a filesystem consisting of only one block group has a minimum size that is larger than the current size of the filesystem. Because of the block group locking, all of the existing space cannot be considered when calculating the filesystem's minimum size for resize, so the "minimum size" is the current size of the filesystem (which is unusable because of block group locking) plus the size of all the data in the filesystem (which has to be relocated somewhere else during resize). Or put another way, you do the minimum size calculation and then don't do the resize, because the best case outcome of one pass of the resize algorithm is a larger filesystem than what you started with.
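The degenerate single-block-group case in the last paragraph, as rough arithmetic (hypothetical numbers):

```python
# Hypothetical one-BG filesystem: a 1 GiB block group holding 600 MiB of data.
# During resize the BG being relocated is locked, so its own free space cannot
# receive its data; the naive "minimum size" is the current size plus the data
# that must land somewhere else first -- larger than the filesystem itself.
MiB, GiB = 1024 ** 2, 1024 ** 3

current_size = 1 * GiB
data_in_fs   = 600 * MiB

naive_min = current_size + data_in_fs
print(naive_min > current_size)  # True: "shrinking" would require growing
```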
