
Several improvements to ARC shrinking #16197

Merged: 1 commit into openzfs:master on Jul 25, 2024

Conversation

@amotin (Member) commented May 14, 2024

Motivation and Context

Since updating to the Linux 6.6 kernel and increasing the maximum ARC size at the same time in TrueNAS SCALE 24.04, we've started receiving multiple complaints from users about excessive swapping making systems unresponsive. While I attribute a significant part of the problem to the new Multi-Gen LRU code enabled in the 6.6 kernel (disabling it helps), I ended up with this set of smaller tunings on the ZFS side, trying to make it a bit nicer in this terrible environment.

Description

  • When receiving a memory pressure signal from the OS, be more strict about trying to free some memory; otherwise the kernel may come back and request much more. Return as the result how much arc_c was actually reduced due to this request, which may be less than requested.
  • On Linux, when receiving direct reclaim from some file system (which may be ZFS itself), instead of ignoring the request completely, shrink the ARC but do not wait for eviction, since waiting there may cause a deadlock. Ignoring it as before may put extra pressure on other caches and/or swap, and cause OOM if nothing helps. Not waiting may result in more ARC being evicted later, and that may be too late if the OOM killer activates right now, but I hope it is better than doing nothing at all.
  • On Linux, set arc_no_grow before waiting for reclaim, not after, or the ARC may grow back while we are waiting.
  • On Linux, add a new parameter, zfs_arc_shrinker_seeks, to balance the cost of ARC eviction relative to the page cache and other subsystems (the shrinker sketch after this list shows where such a value plugs in).
  • Slightly update the Linux arc_set_sys_free() math for new kernels.
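
For readers less familiar with the kernel interface involved, here is a minimal sketch of a Linux shrinker registration showing where a seeks weight such as zfs_arc_shrinker_seeks plugs in. It assumes the pre-6.7 register_shrinker() API; the my_cache_* names are invented for illustration, and this is not the actual OpenZFS arc_shrinker code.

```c
/*
 * Minimal illustrative shrinker, NOT the OpenZFS arc_shrinker code.
 * Assumes the pre-6.7 register_shrinker() API; my_cache_* are made up.
 */
#include <linux/module.h>
#include <linux/shrinker.h>

static unsigned long my_cache_pages;    /* stand-in for evictable cache size */

static unsigned long
my_cache_count(struct shrinker *sh, struct shrink_control *sc)
{
    /* Report how many pages we could free if asked. */
    return (my_cache_pages);
}

static unsigned long
my_cache_scan(struct shrinker *sh, struct shrink_control *sc)
{
    /* sc->nr_to_scan is how much the kernel wants freed right now. */
    unsigned long freed = sc->nr_to_scan;

    if (freed > my_cache_pages)
        freed = my_cache_pages;
    my_cache_pages -= freed;
    return (freed);     /* report what was actually freed */
}

static struct shrinker my_cache_shrinker = {
    .count_objects = my_cache_count,
    .scan_objects  = my_cache_scan,
    /*
     * "seeks" weights how costly refilling this cache is relative to
     * the page cache; zfs_arc_shrinker_seeks plays this role for the
     * ARC in this patch.
     */
    .seeks         = DEFAULT_SEEKS,
};

/* register_shrinker(&my_cache_shrinker, "my-cache") would run at init. */
```

The count/scan split is also why the first bullet matters: the number returned from the scan callback is what the kernel uses to judge how much progress reclaim is making.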

@amotin amotin requested review from ahrens and behlendorf May 14, 2024 15:56
@amotin amotin added the "Status: Code Review Needed" label May 14, 2024
@amotin amotin requested a review from grwilson May 14, 2024 15:57
@adamdmoss (Contributor) left a comment

Half of these changes are definitely 👍 (I've been running with similar local changes to track and return how much was actually evicted); the rest I feel neutral or suspicious about, as commented.
FWIW, have you tried zfs_arc_shrinker_limit=0 rather than the more complicated approach of estimating eviction cost, etc.? limit=0 allegedly used to cause ARC collapse, but I've not been able to trigger that for a long time, at least in combination with eviction code that accounts for how much was actually evicted.

Review threads (resolved): man/man4/zfs.4, module/os/linux/zfs/arc_os.c (×2), module/zfs/arc.c (outdated)
@snajpa (Contributor) commented May 16, 2024

FWIW, I think there's yet another possible source of excessive swapping in addition to your observations: it might be caused by a too-high zfs_abd_scatter_max_order. In our setup it takes only a few days until excessive reclaim kicks in and we have to add a zram-based swap device. When we lower zfs_abd_scatter_max_order below 3, the excessive reclaim of course doesn't disappear fully, as there are other sources of pressure in the kernel for higher-order buddies, but the difference is very noticeable (load drops by 100 on a machine with 600 first-level containers and tons more nested).

In our situation, since we run with txg_timeout = 15 and a pretty high dirty_data_max so that we really mostly sync on the 15 s mark, it's those syncs that trigger a lot of paging out to swap. Using zram has mitigated it so far, as we tend to have at least 100G+ of free memory, but that memory is easily available only in 4k chunks...

@amotin (Member, author) commented May 16, 2024

@snajpa Yes, I was also thinking about zfs_abd_scatter_max_order. I don't have my own numbers, but my thinking was that on FreeBSD, where the ARC allocates only individual PAGE_SIZE pages, it takes from the OS the least convenient memory, while on Linux the ARC always allocates the best contiguous chunks it can, which leaves the other subsystems that are more sensitive to fragmentation to suffer. Contiguous chunks should be good for I/O efficiency, and on FreeBSD I do measure some per-page overheads, so there must be a sweet spot somewhere.

@snajpa (Contributor) commented May 16, 2024

@amotin I haven't looked at the code yet, but if it doesn't do it already, it might be worth allocating the memory with flags such that it doesn't trigger any reclaim at all, and then decrementing the requested order on failure.

We could also optimize further by saving the last successful order :) and only sometimes (whatever that means for now) going for a higher order.

@amotin (Member, author) commented May 16, 2024

it might be worth allocating the memory with flags such that it doesn't trigger any reclaim at all, and then decrementing the requested order on failure

That is what ZFS does. It tries to allocate big first, but if that fails, it requests smaller and smaller until it gets enough. But that way it consumes all remaining big chunks first.
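
A simplified sketch of that step-down pattern, just to make the behavior concrete; the function below is hypothetical and is not the actual abd_alloc_chunks() code.

```c
/*
 * Hypothetical sketch of the "try big, step down on failure" pattern
 * described above; not the actual abd_alloc_chunks() code.
 */
#include <linux/gfp.h>
#include <linux/list.h>
#include <linux/mm.h>

static void
alloc_backing_pages(struct list_head *pages, size_t size, int max_order)
{
    int order = max_order;

    while (size > 0) {
        /* Ask for the largest remaining chunk, without retrying hard. */
        struct page *pg = alloc_pages_node(NUMA_NO_NODE,
            GFP_KERNEL | __GFP_NORETRY | __GFP_NOWARN, order);

        if (pg == NULL) {
            if (order == 0)
                break;      /* even single pages failed */
            order--;        /* step down to a smaller order */
            continue;
        }
        list_add_tail(&pg->lru, pages);
        if (size <= ((size_t)PAGE_SIZE << order))
            break;
        size -= (size_t)PAGE_SIZE << order;
    }
}
```

The point amotin makes is visible here: every allocation starts at the largest order, so the remaining high-order pages get consumed first, leaving fragmentation-sensitive subsystems with mostly order-0 pages.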

@snajpa (Contributor) commented May 16, 2024

It actually seems to directly call kvmalloc() when HAVE_KVMALLOC is defined. In the 6.8 source I'm looking at, kvmalloc seems to use __GFP_NORETRY, for which the documentation says it does one round of reclaim in this implementation. I'm tempted to change that line to kmalloc_flags &= ~__GFP_DIRECT_RECLAIM; to see what happens :D Not sure what to do (if anything) at the ZFS level with this information, though.
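
For concreteness, a hedged sketch of the experiment being described; only the flag-clearing line corresponds to the change under discussion, and the helper itself is hypothetical rather than the actual spl_kvmalloc().

```c
/*
 * Hypothetical helper sketching the experiment above: attempt a
 * physically contiguous allocation without entering direct reclaim at
 * all, then fall back to vmalloc().  Not the actual spl_kvmalloc().
 */
#include <linux/slab.h>
#include <linux/vmalloc.h>

static void *
try_kvmalloc_no_reclaim(size_t size)
{
    gfp_t kmalloc_flags = GFP_KERNEL | __GFP_NORETRY | __GFP_NOWARN;
    void *buf;

    /* The line under discussion: skip even the single reclaim round. */
    kmalloc_flags &= ~__GFP_DIRECT_RECLAIM;

    buf = kmalloc(size, kmalloc_flags);
    if (buf != NULL)
        return (buf);

    /* vmalloc() stitches together order-0 pages instead. */
    return (vmalloc(size));
}
```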

@amotin (Member, author) commented May 17, 2024

@snajpa Most of ARC capacity is allocated by abd_alloc_chunks() via alloc_pages_node().

@snajpa (Contributor) commented May 17, 2024

I've tried bpftrace-ing spl_kvmalloc calls, and it seems at least dsl_dir_tempreserve_space and dmu_buf_hold_array_by_dnode are calling spl_kvmalloc (which ends up with one round of reclaim). This is running on a comparatively idle staging node, yet it's IMHO way too many calls in too little time...

[[email protected]]
 ~ # timeout --foreground 10 bpftrace -e 'kprobe:spl_kvmalloc{ printf("%s: %s(%d)\n", probe, comm, pid); }' | wc -l
231947

Interestingly, it always seems to be called for pretty similar amounts of memory, ranging from 273408 to 273856 bytes (?)

@shodanshok (Contributor) commented Jun 28, 2024

zfs_arc_shrinker_limit=10000 (the default) seems to strongly favor the ARC, forcing heavy swapping even when not needed. Adjusting vm.swappiness has only a limited effect (unless it is set to 0).

Does this patch address this issue? Can the fix be implemented within the limits of zfs_arc_shrinker_limit, zfs_arc_shrink_shift and zfs_arc_pc_percent, without introducing yet another tunable? It is becoming quite difficult to tune a system to avoid excessive swap.

Side question: in general, why is it so difficult to "emulate" the behavior of the Linux page cache with respect to grow, reclaim and shrink?

Thanks.

@amotin (Member, author) commented Jun 28, 2024

zfs_arc_shrinker_limit=10000 (the default) seems to strongly favor the ARC, forcing heavy swapping even when not needed. Adjusting vm.swappiness has only a limited effect (unless it is set to 0).

I do plan to set it to 0 in our TrueNAS builds, since we control the kernel there. But I have no good ideas about what to do upstream, since some Linux kernels tend to request enormous eviction amounts, even though the original motivation for adding the limit should no longer apply to most users. The default of 10000 is IMHO extremely low, if any value other than 0 is correct there at all. But I am not touching it in this patch; that is left for later.

Does this patch address this issue? Can the fix be implemented within the limits of zfs_arc_shrinker_limit, zfs_arc_shrink_shift and zfs_arc_pc_percent, without introducing yet another tunable? It is becoming quite difficult to tune a system to avoid excessive swap.

This patch is not expected to fix the issue by itself, only to polish some rough edges. As I said, at this point we have removed MGLRU from our kernels, which helped a lot with the excessive swapping, and I am going to set zfs_arc_shrinker_limit=0 and zfs_arc_pc_percent=300 to make the ARC adjust better. The new tunable I've added is more for completeness; I do not insist on it and may remove it if there are objections.

Side question: in general, why is it so difficult to "emulate" the behavior of the Linux page cache with respect to grow, reclaim and shrink?

Because the page cache does not use the crippled shrinker KPIs ZFS has to use. All memory pressure handling in Linux is built around the page cache, and everything else is secondary. The mentioned MGLRU takes this to an extreme, which is why we had to disable it, but disabling is not a long-term solution.

@robn (Member) left a comment

Minor nits above.

I don't know much about the MGLRU (just read the overview) and haven't seen its effect for real, so I don't really have a great sense of what the problems are. But this change looks pretty light, makes sense, and the tuneable allows a little more adjustment as we learn more about it. I'm good with this.

Review threads (resolved): module/os/linux/zfs/arc_os.c (×3), module/zfs/arc.c (outdated)
@shodanshok (Contributor)

As this patch touches zfs_arc_shrinker_limit, any thoughts regarding #16313 (comment) ? Do you feel comfortable leaving zfs_arc_shrinker_limit=10000? The default value seems too small to me.

@amotin (Member, author) commented Jul 5, 2024

As this patch touches zfs_arc_shrinker_limit, any thoughts regarding #16313 (comment) ? Do you feel comfortable leaving zfs_arc_shrinker_limit=10000? The default value seems too small to me.

This patch actually does nothing about zfs_arc_shrinker_limit, and for a reason. While I don't like the current default, I don't see a good alternative. If I were to change it, I would change it to 0 and then try to push Linux developers to be reasonable. While ZFS uses anything other than 0, it does not follow kernel memory pressure requests, and in that situation I see it as hopeless to try to make the kernel cooperate.

@amotin amotin force-pushed the shrinker branch 2 times, most recently from 096a55c to b40085a on July 8, 2024 19:32
@amotin (Member, author) commented Jul 8, 2024

After more thought I've decided to add one more chunk to this patch. When receiving direct reclaim from file systems (which may be ZFS itself), the previous code simply ignored the request to avoid deadlocks. But if ZFS occupies most of the system's RAM, ignoring such requests may put excessive pressure on other caches and swap, and in the longer run may result in OOM killer activation. Instead of ignoring the request, I've made it shrink the ARC and kick the eviction thread, but skip the wait. It may not be perfect, but do we have a better choice?
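
To make the described behavior easier to follow, here is a rough sketch of the decision as it might look in a shrinker scan callback. The cache_* helpers are invented stand-ins; this is not the actual arc_shrinker_scan() code, only the shape of the logic described above.

```c
/*
 * Rough sketch of "shrink but don't wait" for direct reclaim coming
 * from a file system context; the cache_* helpers are invented
 * stand-ins, not OpenZFS functions.
 */
#include <linux/gfp.h>
#include <linux/shrinker.h>

extern void cache_reduce_target(unsigned long pages);          /* hypothetical */
extern void cache_evict_async(unsigned long pages);            /* hypothetical */
extern unsigned long cache_evict_wait(unsigned long pages);    /* hypothetical */

static unsigned long
cache_shrinker_scan(struct shrinker *sh, struct shrink_control *sc)
{
    unsigned long want = sc->nr_to_scan;

    /* Always lower the cache target so it stops growing under pressure. */
    cache_reduce_target(want);

    if (!(sc->gfp_mask & __GFP_FS)) {
        /*
         * Reclaim entered from a file system (possibly ZFS itself):
         * waiting for eviction here could deadlock, so kick the
         * asynchronous eviction thread and return without waiting.
         */
        cache_evict_async(want);
        return (0);
    }

    /* Safe context: evict synchronously and report what was freed. */
    return (cache_evict_wait(want));
}
```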

@amotin (Member, author) commented Jul 12, 2024

I've decided to once more reconsider arc_is_overflowing(). Previously it made the caller never wait for eviction of less than 1/512 of the ARC size or SPA_MAXBLOCKSIZE (16MB), whichever is bigger. But Linux starts the reclaim process at 1/4096 (see DEF_PRIORITY of 12), which means that for the first several iterations ZFS may not react to memory pressure in time, forcing more eviction from the page cache and other caches, which may already be at their minimum if most of memory is consumed by the ARC.

The new code uses zfs_max_recordsize as the minimum wait threshold under memory pressure, which is still 16MB on 64-bit platforms, but only 1MB on 32-bit, which should be nicer to the latter. Not considering zfs_arc_overflow_shift under pressure allows ZFS to be more reactive on large systems, where 1/512 of the ARC may mean gigabytes of RAM, while the kernel may need much less, but right now.

PS: Thinking about it more, with the current zfs_arc_shrinker_limit default of 10000 pages (which means only 40MB of memory reclaimed at a time under absolutely desperate pressure before the OOM killer), I suppose ZFS could almost never react in time to memory pressure on large systems. This change should make that somewhat better, while zfs_arc_shrinker_limit remains evil.
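
To put rough numbers on those ratios, here is a small standalone arithmetic helper; the 256 GiB ARC size is an arbitrary example, not a measurement from this thread.

```c
/*
 * Back-of-the-envelope arithmetic for the ratios discussed above,
 * using an arbitrary 256 GiB ARC as the example.  Plain userspace C.
 */
#include <stdio.h>
#include <stdint.h>

int
main(void)
{
    uint64_t arc_size = 256ULL << 30;   /* example ARC size: 256 GiB */
    uint64_t old_wait = arc_size >> 9;  /* old 1/512 minimum wait threshold */
    uint64_t first_pass = arc_size >> 12; /* kernel DEF_PRIORITY=12: scans 1/4096 first */
    uint64_t limit = 10000ULL * 4096;   /* zfs_arc_shrinker_limit: 10000 x 4 KiB pages (~40 MB) */

    printf("old minimum eviction wait:   %llu MiB\n",
        (unsigned long long)(old_wait >> 20));
    printf("kernel first reclaim pass:   %llu MiB\n",
        (unsigned long long)(first_pass >> 20));
    printf("shrinker_limit cap per pass: %llu MiB\n",
        (unsigned long long)(limit >> 20));
    return (0);
}
```

On this example the old threshold (512 MiB) dwarfs what the kernel asks for in its first pass (64 MiB), while the shrinker limit caps each pass at roughly 39 MiB, which illustrates both points above.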

@amotin (Member, author) commented Jul 12, 2024

BTW, some workarounds growing out of my MGLRU complaints: https://lkml.kernel.org/r/[email protected]

@shodanshok (Contributor)

PS: Thinking about it more, with the current zfs_arc_shrinker_limit default of 10000 pages (which means only 40MB of memory reclaimed at a time under absolutely desperate pressure before the OOM killer), I suppose ZFS could almost never react in time to memory pressure on large systems. This change should make that somewhat better, while zfs_arc_shrinker_limit remains evil.

I agree. It should actually be ~160 MB (40 MB × 4 sublists), but the result does not change: under memory pressure, the ARC forces heavy swapping and/or OOM. A more reasonable default for zfs_arc_shrinker_limit would be in the range of 128K pages, with no limit at all when direct reclaim is requested.

@tonyhutter (Contributor)

I don't see any major issues. Could you please rebase on master to get a good FreeBSD test run?

 - When receiving a memory pressure signal from the OS, be more strict
   about trying to free some memory; otherwise the kernel may come back
   and request much more.  Return as the result how much arc_c was
   actually reduced due to this request, which may be less than
   requested.
 - On Linux, when receiving direct reclaim from some file system (which
   may be ZFS itself), instead of ignoring the request completely,
   shrink the ARC but do not wait for eviction, since waiting there may
   cause a deadlock.  Ignoring it as before may put extra pressure on
   other caches and/or swap, and cause OOM if nothing helps.  Not
   waiting may result in more ARC being evicted later, and that may be
   too late if the OOM killer activates right now, but I hope it is
   better than doing nothing at all.
 - On Linux, set arc_no_grow before waiting for reclaim, not after, or
   the ARC may grow back while we are waiting.
 - On Linux, add a new parameter, zfs_arc_shrinker_seeks, to balance the
   cost of ARC eviction relative to the page cache and other subsystems.
 - Slightly update the Linux arc_set_sys_free() math for new kernels.

Signed-off-by:	Alexander Motin <[email protected]>
Sponsored by:	iXsystems, Inc.
@tonyhutter tonyhutter merged commit 55427ad into openzfs:master Jul 25, 2024
20 of 25 checks passed
lundman pushed a commit to openzfsonwindows/openzfs that referenced this pull request Sep 4, 2024