
fstrim on a xfs lv backed by vdo renders my computer useless for a very long time #64

Open
beertje44 opened this issue May 8, 2023 · 9 comments


@beertje44

beertje44 commented May 8, 2023

I've experienced this twice on an AlmaLinux box with a single PV and VG on a Samsung PM9A3 3.8 TB NVMe SSD: when fstrim runs, the system doesn't crash, but the load becomes unworkable (in excess of 180!). It seems to be related to my VDO-backed (compression and deduplication) main XFS LV. Last time I somehow managed to kill fstrim, and everything returned to normal shortly after that.

Apart from the excessive load, dmesg tells me:

```
[79135.790363] INFO: task kworker/11:3:21829 blocked for more than 122 seconds.
[79135.790368]       Tainted: P           OE      6.2.12-1.el9_1.******x86_64 #1
[79135.790371] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[79135.790373] task:kworker/11:3    state:D stack:0     pid:21829 ppid:2      flags:0x00004000
[79135.790380] Workqueue: xfs-inodegc/dm-2 xfs_inodegc_worker [xfs]
[79135.790575] Call Trace:
[79135.790577]  <TASK>
[79135.790581]  __schedule+0x1fb/0x550
[79135.790589]  schedule+0x5d/0xd0
[79135.790595]  schedule_timeout+0x148/0x160
[79135.790602]  ___down_common+0x111/0x170
[79135.790612]  ? down+0x1a/0x60
[79135.790621]  __down_common+0x1e/0xc0
[79135.790647]  down+0x43/0x60
[79135.790659]  xfs_buf_lock+0x2d/0xe0 [xfs]
[79135.790857]  xfs_buf_find_lock+0x45/0xf0 [xfs]
[79135.791039]  xfs_buf_lookup.constprop.0+0xe4/0x170 [xfs]
[79135.791222]  xfs_buf_get_map+0xc1/0x3a0 [xfs]
[79135.791407]  xfs_buf_read_map+0x54/0x2a0 [xfs]
[79135.791593]  ? xfs_read_agf+0x89/0x130 [xfs]
[79135.791822]  xfs_trans_read_buf_map+0x115/0x300 [xfs]
[79135.792068]  ? xfs_read_agf+0x89/0x130 [xfs]
[79135.792253]  xfs_read_agf+0x89/0x130 [xfs]
[79135.792427]  xfs_alloc_read_agf+0x50/0x210 [xfs]
[79135.792602]  xfs_alloc_fix_freelist+0x3dd/0x510 [xfs]
[79135.792801]  ? preempt_count_add+0x70/0xa0
[79135.792809]  ? _raw_spin_lock+0x13/0x40
[79135.792816]  ? _raw_spin_unlock+0x15/0x30
[79135.792823]  ? xfs_inode_to_log_dinode+0x210/0x410 [xfs]
[79135.793039]  ? xfs_efi_item_format+0x72/0xd0 [xfs]
[79135.793228]  xfs_free_extent_fix_freelist+0x61/0xa0 [xfs]
[79135.793409]  __xfs_free_extent+0x72/0x1c0 [xfs]
[79135.793584]  xfs_trans_free_extent+0x45/0x100 [xfs]
[79135.793809]  xfs_extent_free_finish_item+0x69/0xa0 [xfs]
[79135.793998]  xfs_defer_finish_noroll+0x187/0x530 [xfs]
[79135.794220]  xfs_defer_finish+0x11/0x70 [xfs]
[79135.794398]  xfs_itruncate_extents_flags+0xca/0x250 [xfs]
[79135.794608]  xfs_inactive_truncate+0xab/0xe0 [xfs]
[79135.794800]  xfs_inactive+0x154/0x170 [xfs]
[79135.794970]  xfs_inodegc_worker+0xa3/0x170 [xfs]
[79135.795156]  process_one_work+0x1e5/0x3f0
[79135.795165]  ? __pfx_worker_thread+0x10/0x10
[79135.795172]  worker_thread+0x50/0x3a0
[79135.795179]  ? __pfx_worker_thread+0x10/0x10
[79135.795185]  kthread+0xe8/0x110
[79135.795189]  ? __pfx_kthread+0x10/0x10
[79135.795194]  ret_from_fork+0x2c/0x50
[79135.795205]  </TASK>
```

vdo-8.2.0.2-1.el9.x86_64
kmod-kvdo-8.2.1.6-1.el9_1.*****.x86_64

The last one I got from here, so that I can use kernel-ml from ELRepo. I can also see several VDO kernel threads working while the problem occurs.

@tigerblue77

Hello, I can also confirm this behavior on professional hardware (a Dell PowerEdge R720XD) with robust storage (12x 6 TB in hardware RAID 6): fstrim does indeed generate a lot of I/O that slows the system down drastically. So I imagine that on a typical system (with a single disk, even an SSD) this can be quite problematic.

@beertje44
Author

Did some more digging around:

  • I booted the default kernel for AlmaLinux 9.1: 5.14.0-162.23.1.el9_1.x86_64. As soon as I entered time fstrim -v /, the system almost completely locked up. That is: running processes continued to work, but even a new shell would not launch at all. I did, however, manage to reboot cleanly to recover from this.
  • From what I understand, fstrim should not take very long or make the system completely unresponsive through excessive load. But I'm no expert here :) FWIW: the Red Hat manual for VDO does mention enabling the fstrim service as good practice....
  • I also booted into single-user mode and entered time fstrim -v /, with the same result: complete lockup. From the looks of it there was high load on the SSD for a brief amount of time and then not so much (judging by the drive LED, since I was in single-user mode without any tooling available).
  • I also use the SSD as log and cache device for my ZFS storage pool. ZFS managed to trim both devices on demand without any problems or noticeable load.
  • Prometheus logged IOPS in excess of 200k related to discard operations the last time the fstrim service ran.
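For comparison, the on-demand ZFS trim mentioned above can be issued like this (the pool name tank is a placeholder, not the actual pool from this setup):

```shell
# Trim all supporting devices of a ZFS pool on demand
zpool trim tank

# Watch trim progress per device
zpool status -t tank
```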

Currently running on kernel 6.2.12.

@tigerblue77

Two pieces of advice:

  • Run fstrim in the background (add "&" at the end of your command) so the shell prints the PID and you can simply kill <PID>.
  • Don't mix VDO and ZFS.
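The backgrounding approach can be sketched like this (fstrim needs root, and the mount point is just an example):

```shell
# Start fstrim in the background so we keep a PID to kill
# if the system starts to bog down.
fstrim -v / &
FSTRIM_PID=$!
echo "fstrim running as PID ${FSTRIM_PID}"

# If the load becomes unworkable, from another shell:
#   kill "${FSTRIM_PID}"
```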

@beertje44
Author

Two pieces of advice:

Run fstrim in the background (add "&" at the end of your command) so the shell prints the PID and you can simply kill <PID>.

I could have done that, but the last time the fstrim service ran, it took the kernel about 15 minutes to actually force-kill the fstrim process; after that the system returned to its normal state :/

don't mix VDO and ZFS

Yeah, you are probably right on that one. I just couldn't help myself; it looked like a great idea at the time: more space on my root device, since I always seem to have been short on that in the past :D

FWIW: it's just a homelab server, not a big production server continually running at peak performance. And for now I'd rather help solve the issue (if there is one to begin with) than run away from it ;)

@raeburn
Member

raeburn commented May 8, 2023

Yes, discards are a slow area in VDO currently. Right now each block is processed separately, making it about as costly as writing zero blocks to the same locations. If you do use fstrim, it would be best to use it at a time when other load on the system is likely to be light. The fstrim docs even point out that non-queued trim can have a performance impact on other work; in VDO’s case the penalty is a bit more severe, because VDO uses system resources (CPU, I/O bandwidth) rather than handling it all within the disk drive, and it doesn’t use them as efficiently as it probably could.

The “task … blocked” message is expected. It just means the calling thread has been waiting a while as VDO crunches away on a (possibly very large) discard request. If the worker thread sends a discard of 1 GB, for example, then VDO has to process 256k blocks before it can report that the operation is done, each possibly requiring journal updates, ref count updates, etc. There’s some parallelism, but that’s not enough to make 256k operations go quickly.
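The arithmetic behind that 256k figure, assuming VDO's 4 KiB block size:

```shell
# A 1 GiB discard divided by the 4 KiB VDO block size gives the
# number of blocks VDO must process before it can complete the request.
blocks=$(( (1024 * 1024 * 1024) / 4096 ))
echo "${blocks} blocks"   # 262144 blocks, i.e. 256k
```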

We’ve got a design that should improve discard handling, but haven’t scheduled the work yet.

In the meantime, there's a way to control how many of VDO's 2k internal I/Os-in-progress can be used for discards, if you don't mind fiddling with low-level controls. Look at /sys/block/dm-N/vdo/discards_limit, where dm-N is the device-mapper name for the VDO device. The number there (default 1536) means about 3/4 of the pool can be used for discards, which can starve or drastically slow other work. If you write a smaller number into that file, it'll slow the discards themselves a bit, but will reserve more of the pool for non-discard work. I hope that's enough to make your system usable.
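As a sketch (run as root; dm-2 is a placeholder for your VDO device's dm name, and 512 is an arbitrary lower value, not a recommended setting):

```shell
# Find the dm name of the VDO device first, e.g. with:
#   dmsetup info -c

# Inspect the current discard limit (default 1536 out of the 2048 I/O pool)
cat /sys/block/dm-2/vdo/discards_limit

# Lower it to reserve more of the pool for non-discard work
echo 512 > /sys/block/dm-2/vdo/discards_limit
```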

@beertje44
Author

Red Hat had some (a lot of, actually) info on tuning VDO for RHEL 7. Of course that does not apply to the current LVM-based version. But hey, I found those settings in lvm.conf, and with some light googling I managed to put them into an lvchange command. Of course this can't be done on a running system, so you need a boot stick or something similar to pull it off:

```
lvchange --vdosettings 'ack_threads=4 bio_threads=8 cpu_threads=32 hash_zone_threads=4 logical_threads=4 physical_threads=4 max_discard=1024' neo/vpool0
```

I noticed the defaults use very few threads, while I have 32 logical cores in this system.
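Since this can't be done while the LV is in use, the rough workflow from rescue media would be something like the following (a sketch; the VG/LV names neo and vpool0 come from the command above, and I haven't verified every step):

```shell
# From rescue/live media with lvm2 available, as root:
vgchange -an neo      # deactivate the volume group holding the VDO pool

# Apply the new settings while the LV is inactive
lvchange --vdosettings 'cpu_threads=32 max_discard=1024' neo/vpool0

vgchange -ay neo      # reactivate, then reboot into the installed system
```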

fstrim still causes some delays (100% usage on the SSD for a short while), but I can't call them lockups anymore, just slower than normal :) And best of all, it finishes now:

```
# time fstrim -v /
/: 3,3 TiB (3660264259584 bytes) is getrimd
fstrim -v /  0,00s user 2,87s system 0% cpu 9:40,82 total
```

I will look into your suggestion, but it's getting way too late over here right now ;)

@beertje44
Author

I'm not marking this as closed yet; I had to do some guessing and googling ... IMHO this should at least be documented.

@tigerblue77

But hey, I found those settings in lvm.conf, and with some light googling I managed to put them into an lvchange command. Of course this can't be done on a running system, so you need a boot stick or something similar to pull it off:

Have a look at this, where you'll see how I configured mine on Debian with LVM and all the VDO parameters.

@beertje44
Author

Wow, that is very cool! Thank you very much; that is way more than I could ask for.

However, IMHO there are two things:

  • fstrim is the recommended way to keep at least a SSD in good working performance over time, also for VDO.
  • fstrim with default settings performs horribly on a VDO volume.

I really don't mind tweaking settings to work around this, but couldn't those defaults be a bit different? I've been a huge fan of Red Hat for a really long time now (yeah, I'm getting old), and I like what they are doing with VDO too. But if all of this is intended for people with extensive storage knowledge and the will to take the time to do a proper setup by testing, benchmarking, and configuring everything the right way, they should at least say so! Their web pages make it sound like an easy add-on for your setup. In a way it is, but not completely ;)
