
Several improvements to ARC shrinking #16197

Merged: 1 commit into openzfs:master on Jul 25, 2024

Conversation

@amotin (Member) commented May 14, 2024

Motivation and Context

Since updating to the Linux 6.6 kernel and increasing the maximum ARC size at the same time in TrueNAS SCALE 24.04, we've started receiving multiple complaints from users about excessive swapping making systems unresponsive. While I attribute a significant part of the problem to the new Multi-Gen LRU code enabled in the 6.6 kernel (disabling it helps), I ended up with this set of smaller tunings on the ZFS side, trying to make it a bit nicer in this terrible environment.

Description

  • When receiving a memory pressure signal from the OS, be more strict about trying to free some memory; otherwise the kernel may come back and request much more. Return as the result how much arc_c was actually reduced due to this request, which may be less than requested.
  • On Linux, when receiving direct reclaim from some file system (which may be ZFS itself), instead of ignoring the request completely, shrink the ARC but do not wait for eviction, since waiting there may cause a deadlock. Ignoring it as before may put extra pressure on other caches and/or swap, and cause OOM if nothing helps. Not waiting may result in more ARC being evicted later, and that may be too late if the OOM killer activates right now, but I hope it is better than doing nothing at all.
  • On Linux, set arc_no_grow before waiting for reclaim, not after, or the ARC may grow back while we are waiting.
  • On Linux, add a new parameter, zfs_arc_shrinker_seeks, to balance the cost of ARC eviction relative to the page cache and other subsystems (the shrinker sketch after this list shows where such a value plugs in).
  • Slightly update the Linux arc_set_sys_free() math for new kernels.
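
For readers less familiar with the kernel interface involved, here is a minimal sketch of a Linux shrinker registration showing where a seeks weight such as zfs_arc_shrinker_seeks plugs in. It assumes the pre-6.7 register_shrinker() API; the my_cache_* names are invented for illustration, and this is not the actual OpenZFS arc_shrinker code.

```c
/*
 * Minimal illustrative shrinker, NOT the OpenZFS arc_shrinker code.
 * Assumes the pre-6.7 register_shrinker() API; my_cache_* are made up.
 */
#include <linux/module.h>
#include <linux/shrinker.h>

static unsigned long my_cache_pages;    /* stand-in for evictable cache size */

static unsigned long
my_cache_count(struct shrinker *sh, struct shrink_control *sc)
{
    /* Report how many pages we could free if asked. */
    return (my_cache_pages);
}

static unsigned long
my_cache_scan(struct shrinker *sh, struct shrink_control *sc)
{
    /* sc->nr_to_scan is how much the kernel wants freed right now. */
    unsigned long freed = sc->nr_to_scan;

    if (freed > my_cache_pages)
        freed = my_cache_pages;
    my_cache_pages -= freed;
    return (freed);     /* report what was actually freed */
}

static struct shrinker my_cache_shrinker = {
    .count_objects = my_cache_count,
    .scan_objects  = my_cache_scan,
    /*
     * "seeks" weights how costly refilling this cache is relative to
     * the page cache; zfs_arc_shrinker_seeks plays this role for the
     * ARC in this patch.
     */
    .seeks         = DEFAULT_SEEKS,
};

/* register_shrinker(&my_cache_shrinker, "my-cache") would run at init. */
```

The count/scan split is also why the first bullet matters: the number returned from the scan callback is what the kernel uses to judge how much progress reclaim is making.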

@amotin amotin requested review from ahrens and behlendorf May 14, 2024 15:56
@amotin amotin added the "Status: Code Review Needed" label May 14, 2024
@amotin amotin requested a review from grwilson May 14, 2024 15:57
@adamdmoss (Contributor) left a comment

Half of these changes are definitely 👍 (I've been running with similar local changes to track and return how much was actually evicted); the rest I feel neutral or suspicious about, as commented.
FWIW, have you tried zfs_arc_shrinker_limit=0 rather than the more complicated approach of estimating eviction cost, etc.? limit=0 allegedly used to cause ARC collapse, but I've not been able to trigger that for a long time, at least in combination with eviction code that accounts for how much was actually evicted.

Review threads (resolved): man/man4/zfs.4, module/os/linux/zfs/arc_os.c (×2), module/zfs/arc.c (outdated)
@snajpa (Contributor) commented May 16, 2024

FWIW, I think there's yet another possible source of excessive swapping in addition to your observations: it might be caused by a too-high zfs_abd_scatter_max_order. In our setup it takes only a few days until excessive reclaim kicks in and we have to add a zram-based swap device. When we lower zfs_abd_scatter_max_order below 3, the excessive reclaim of course doesn't disappear fully, as there are other sources of pressure in the kernel for higher-order buddies, but the difference is very noticeable (load drops by 100 on a machine with 600 first-level containers and tons more nested).

In our situation, since we run with txg_timeout = 15 and a pretty high dirty_data_max so that we really mostly sync on the 15 s mark, it's those syncs that trigger a lot of paging out to swap. Using zram has mitigated it so far, as we tend to have at least 100G+ of free memory, but that memory is easily available only in 4k chunks...

@amotin (Member, author) commented May 16, 2024

@snajpa Yes, I was also thinking about zfs_abd_scatter_max_order. I don't have my own numbers, but my thinking was that on FreeBSD, where the ARC allocates only individual PAGE_SIZE pages, it takes from the OS the least convenient memory, while on Linux the ARC always allocates the best contiguous chunks it can, which leaves the other subsystems that are more sensitive to fragmentation to suffer. Contiguous chunks should be good for I/O efficiency, and on FreeBSD I do measure some per-page overheads, so there must be a sweet spot somewhere.

@snajpa (Contributor) commented May 16, 2024

@amotin I haven't looked at the code yet, but if it doesn't do it already, it might be worth allocating the memory with flags such that it doesn't trigger any reclaim at all, and then decrementing the requested order on failure.

We could also optimize further by saving the last successful order :) and only sometimes (whatever that means for now) going for a higher order.

@amotin (Member, author) commented May 16, 2024

it might be worth allocating the memory with flags such that it doesn't trigger any reclaim at all, and then decrementing the requested order on failure

That is what ZFS does. It tries to allocate big first, but if that fails, it requests smaller and smaller until it gets enough. But that way it consumes all remaining big chunks first.
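
A simplified sketch of that step-down pattern, just to make the behavior concrete; the function below is hypothetical and is not the actual abd_alloc_chunks() code.

```c
/*
 * Hypothetical sketch of the "try big, step down on failure" pattern
 * described above; not the actual abd_alloc_chunks() code.
 */
#include <linux/gfp.h>
#include <linux/list.h>
#include <linux/mm.h>

static void
alloc_backing_pages(struct list_head *pages, size_t size, int max_order)
{
    int order = max_order;

    while (size > 0) {
        /* Ask for the largest remaining chunk, without retrying hard. */
        struct page *pg = alloc_pages_node(NUMA_NO_NODE,
            GFP_KERNEL | __GFP_NORETRY | __GFP_NOWARN, order);

        if (pg == NULL) {
            if (order == 0)
                break;      /* even single pages failed */
            order--;        /* step down to a smaller order */
            continue;
        }
        list_add_tail(&pg->lru, pages);
        if (size <= ((size_t)PAGE_SIZE << order))
            break;
        size -= (size_t)PAGE_SIZE << order;
    }
}
```

The point amotin makes is visible here: every allocation starts at the largest order, so the remaining high-order pages get consumed first, leaving fragmentation-sensitive subsystems with mostly order-0 pages.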

@snajpa (Contributor) commented May 16, 2024

It actually seems to directly call kvmalloc() when HAVE_KVMALLOC is defined. In the 6.8 source I'm looking at, kvmalloc seems to use __GFP_NORETRY, for which the documentation says it does one round of reclaim in this implementation. I'm tempted to change that line to kmalloc_flags &= ~__GFP_DIRECT_RECLAIM; to see what happens :D Not sure what to do (if anything) at the ZFS level with this information, though.
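
For concreteness, a hedged sketch of the experiment being described; only the flag-clearing line corresponds to the change under discussion, and the helper itself is hypothetical rather than the actual spl_kvmalloc().

```c
/*
 * Hypothetical helper sketching the experiment above: attempt a
 * physically contiguous allocation without entering direct reclaim at
 * all, then fall back to vmalloc().  Not the actual spl_kvmalloc().
 */
#include <linux/slab.h>
#include <linux/vmalloc.h>

static void *
try_kvmalloc_no_reclaim(size_t size)
{
    gfp_t kmalloc_flags = GFP_KERNEL | __GFP_NORETRY | __GFP_NOWARN;
    void *buf;

    /* The line under discussion: skip even the single reclaim round. */
    kmalloc_flags &= ~__GFP_DIRECT_RECLAIM;

    buf = kmalloc(size, kmalloc_flags);
    if (buf != NULL)
        return (buf);

    /* vmalloc() stitches together order-0 pages instead. */
    return (vmalloc(size));
}
```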

@amotin (Member, author) commented May 17, 2024

@snajpa Most of ARC capacity is allocated by abd_alloc_chunks() via alloc_pages_node().

@snajpa (Contributor) commented May 17, 2024

I've tried bpftrace-ing spl_kvmalloc calls, and it seems at least dsl_dir_tempreserve_space and dmu_buf_hold_array_by_dnode are calling spl_kvmalloc (which ends up with one round of reclaim). This is running on a comparatively idle staging node, yet it's IMHO way too many calls in too little time...

[[email protected]]
 ~ # timeout --foreground 10 bpftrace -e 'kprobe:spl_kvmalloc{ printf("%s: %s(%d)\n", probe, comm, pid); }' | wc -l
231947

Interestingly, it always seems to be called for pretty similar amounts of memory, ranging from 273408 to 273856 bytes (?)

@shodanshok (Contributor) commented Jun 28, 2024

zfs_arc_shrinker_limit=10000 (the default) seems to strongly favor the ARC, forcing heavy swapping even when not needed. Adjusting vm.swappiness has only a limited effect (unless it is set to 0).

Does this patch address this issue? Can the fix be implemented within the limits of zfs_arc_shrinker_limit, zfs_arc_shrink_shift and zfs_arc_pc_percent, without introducing yet another tunable? It is becoming quite difficult to tune a system to avoid excessive swap.

Side question: in general, why is it so difficult to "emulate" the behavior of the Linux page cache with respect to grow, reclaim and shrink?

Thanks.

@amotin (Member, author) commented Jun 28, 2024

zfs_arc_shrinker_limit=10000 (the default) seems to strongly favor the ARC, forcing heavy swapping even when not needed. Adjusting vm.swappiness has only a limited effect (unless it is set to 0).

I do plan to set it to 0 in our TrueNAS builds, since we control the kernel there. But I have no good ideas about what to do upstream, since some Linux kernels tend to request enormous eviction amounts, even though the original motivation for adding the limit should no longer apply to most users. The default of 10000 is IMHO extremely low, if any value other than 0 is correct there at all. But I am not touching it in this patch; that is left for later.

Does this patch address this issue? Can the fix be implemented within the limits of zfs_arc_shrinker_limit, zfs_arc_shrink_shift and zfs_arc_pc_percent, without introducing yet another tunable? It is becoming quite difficult to tune a system to avoid excessive swap.

This patch is not expected to fix the issue by itself, only to polish some rough edges. As I said, at this point we have removed MGLRU from our kernels, which helped a lot with the excessive swapping, and I am going to set zfs_arc_shrinker_limit=0 and zfs_arc_pc_percent=300 to make the ARC adjust better. The new tunable I've added is more for completeness; I do not insist on it and may remove it if there are objections.

Side question: in general, why is it so difficult to "emulate" the behavior of the Linux page cache with respect to grow, reclaim and shrink?

Because the page cache does not use the crippled shrinker KPIs ZFS has to use. All memory pressure handling in Linux is built around the page cache, and everything else is secondary. The mentioned MGLRU takes this to an extreme, which is why we had to disable it, but disabling is not a long-term solution.

@robn (Member) left a comment

Minor nits above.

I don't know much about the MGLRU (just read the overview) and haven't seen its effect for real, so I don't really have a great sense of what the problems are. But this change looks pretty light, makes sense, and the tuneable allows a little more adjustment as we learn more about it. I'm good with this.

Review threads (resolved): module/os/linux/zfs/arc_os.c (×3), module/zfs/arc.c (outdated)
@shodanshok (Contributor)

As this patch touches zfs_arc_shrinker_limit, any thoughts regarding #16313 (comment) ? Do you feel comfortable leaving zfs_arc_shrinker_limit=10000? The default value seems too small to me.

@amotin (Member, author) commented Jul 5, 2024

As this patch touches zfs_arc_shrinker_limit, any thoughts regarding #16313 (comment) ? Do you feel comfortable leaving zfs_arc_shrinker_limit=10000? The default value seems too small to me.

This patch actually does nothing about zfs_arc_shrinker_limit, and for a reason. While I don't like the current default, I don't see a good alternative. If I were to change it, I would change it to 0 and then try to push Linux developers to be reasonable. While ZFS uses anything other than 0, it does not follow kernel memory pressure requests, and in that situation I see it as hopeless to try to make the kernel cooperate.

@amotin amotin force-pushed the shrinker branch 2 times, most recently from 096a55c to b40085a on July 8, 2024 19:32
@amotin (Member, author) commented Jul 8, 2024

After more thought I've decided to add one more chunk to this patch. When receiving direct reclaim from file systems (which may be ZFS itself), the previous code simply ignored the request to avoid deadlocks. But if ZFS occupies most of the system's RAM, ignoring such requests may put excessive pressure on other caches and swap, and in the longer run may result in OOM killer activation. Instead of ignoring the request, I've made it shrink the ARC and kick the eviction thread, but skip the wait. It may not be perfect, but do we have a better choice?
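
To make the described behavior easier to follow, here is a rough sketch of the decision as it might look in a shrinker scan callback. The cache_* helpers are invented stand-ins; this is not the actual arc_shrinker_scan() code, only the shape of the logic described above.

```c
/*
 * Rough sketch of "shrink but don't wait" for direct reclaim coming
 * from a file system context; the cache_* helpers are invented
 * stand-ins, not OpenZFS functions.
 */
#include <linux/gfp.h>
#include <linux/shrinker.h>

extern void cache_reduce_target(unsigned long pages);          /* hypothetical */
extern void cache_evict_async(unsigned long pages);            /* hypothetical */
extern unsigned long cache_evict_wait(unsigned long pages);    /* hypothetical */

static unsigned long
cache_shrinker_scan(struct shrinker *sh, struct shrink_control *sc)
{
    unsigned long want = sc->nr_to_scan;

    /* Always lower the cache target so it stops growing under pressure. */
    cache_reduce_target(want);

    if (!(sc->gfp_mask & __GFP_FS)) {
        /*
         * Reclaim entered from a file system (possibly ZFS itself):
         * waiting for eviction here could deadlock, so kick the
         * asynchronous eviction thread and return without waiting.
         */
        cache_evict_async(want);
        return (0);
    }

    /* Safe context: evict synchronously and report what was freed. */
    return (cache_evict_wait(want));
}
```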

@amotin (Member, author) commented Jul 12, 2024

I've decided to once more reconsider arc_is_overflowing(). Previously it made the caller never wait for eviction of less than 1/512 of the ARC size or SPA_MAXBLOCKSIZE (16MB), whichever is bigger. But Linux starts the reclaim process at 1/4096 (see DEF_PRIORITY of 12), which means that for the first several iterations ZFS may not react to memory pressure in time, forcing more eviction from the page cache and other caches, which may already be at their minimum if most of memory is consumed by the ARC.

The new code uses zfs_max_recordsize as the minimum wait threshold under memory pressure, which is still 16MB on 64-bit platforms, but only 1MB on 32-bit, which should be nicer to the latter. Not considering zfs_arc_overflow_shift under pressure allows ZFS to be more reactive on large systems, where 1/512 of the ARC may mean gigabytes of RAM, while the kernel may need much less, but right now.

PS: Thinking about it more, with the current zfs_arc_shrinker_limit default of 10000 pages (which means only 40MB of memory reclaimed at a time under absolutely desperate pressure before the OOM killer), I suppose ZFS could almost never react in time to memory pressure on large systems. This change should make that somewhat better, while zfs_arc_shrinker_limit remains evil.
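
To put rough numbers on those ratios, here is a small standalone arithmetic helper; the 256 GiB ARC size is an arbitrary example, not a measurement from this thread.

```c
/*
 * Back-of-the-envelope arithmetic for the ratios discussed above,
 * using an arbitrary 256 GiB ARC as the example.  Plain userspace C.
 */
#include <stdio.h>
#include <stdint.h>

int
main(void)
{
    uint64_t arc_size = 256ULL << 30;   /* example ARC size: 256 GiB */
    uint64_t old_wait = arc_size >> 9;  /* old 1/512 minimum wait threshold */
    uint64_t first_pass = arc_size >> 12; /* kernel DEF_PRIORITY=12: scans 1/4096 first */
    uint64_t limit = 10000ULL * 4096;   /* zfs_arc_shrinker_limit: 10000 x 4 KiB pages (~40 MB) */

    printf("old minimum eviction wait:   %llu MiB\n",
        (unsigned long long)(old_wait >> 20));
    printf("kernel first reclaim pass:   %llu MiB\n",
        (unsigned long long)(first_pass >> 20));
    printf("shrinker_limit cap per pass: %llu MiB\n",
        (unsigned long long)(limit >> 20));
    return (0);
}
```

On this example the old threshold (512 MiB) dwarfs what the kernel asks for in its first pass (64 MiB), while the shrinker limit caps each pass at roughly 39 MiB, which illustrates both points above.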

@amotin (Member, author) commented Jul 12, 2024

BTW, some workarounds growing out of my MGLRU complaints: https://lkml.kernel.org/r/[email protected]

@shodanshok (Contributor)

PS: Thinking about it more, with the current zfs_arc_shrinker_limit default of 10000 pages (which means only 40MB of memory reclaimed at a time under absolutely desperate pressure before the OOM killer), I suppose ZFS could almost never react in time to memory pressure on large systems. This change should make that somewhat better, while zfs_arc_shrinker_limit remains evil.

I agree. It should actually be ~160 MB (40 MB × 4 sublists), but the result does not change: under memory pressure, the ARC forces heavy swapping and/or OOM. A more reasonable default for zfs_arc_shrinker_limit would be in the range of 128K pages, with no limit at all when direct reclaim is requested.

@tonyhutter (Contributor)

I don't see any major issues. Could you please rebase on master to get a good FreeBSD test run?

 - When receiving a memory pressure signal from the OS, be more strict
   about trying to free some memory; otherwise the kernel may come back
   and request much more.  Return as the result how much arc_c was
   actually reduced due to this request, which may be less than
   requested.
 - On Linux, when receiving direct reclaim from some file system (which
   may be ZFS itself), instead of ignoring the request completely,
   shrink the ARC but do not wait for eviction, since waiting there may
   cause a deadlock.  Ignoring it as before may put extra pressure on
   other caches and/or swap, and cause OOM if nothing helps.  Not
   waiting may result in more ARC being evicted later, and that may be
   too late if the OOM killer activates right now, but I hope it is
   better than doing nothing at all.
 - On Linux, set arc_no_grow before waiting for reclaim, not after, or
   the ARC may grow back while we are waiting.
 - On Linux, add a new parameter, zfs_arc_shrinker_seeks, to balance the
   cost of ARC eviction relative to the page cache and other subsystems.
 - Slightly update the Linux arc_set_sys_free() math for new kernels.

Signed-off-by:	Alexander Motin <[email protected]>
Sponsored by:	iXsystems, Inc.
@tonyhutter tonyhutter merged commit 55427ad into openzfs:master Jul 25, 2024
20 of 25 checks passed
lundman pushed a commit to openzfsonwindows/openzfs that referenced this pull request Sep 4, 2024