Several improvements to ARC shrinking #16197
Conversation
Half of these changes are definitely 👍 (I've been running with similar local changes to track and return how much was actually evicted), the rest I feel neutral or suspicious about as commented.
FWIW, have you tried zfs_arc_shrinker_limit=0 rather than the more complicated approach of estimating eviction cost, etc.? limit=0 allegedly used to cause ARC collapse, but I've not been able to trigger that for a long time, at least in combination with eviction code that accounts for how much was actually evicted.
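For readers following along, here is a minimal user-space sketch (not the actual OpenZFS code) of the idea behind such a limit: a nonzero zfs_arc_shrinker_limit-style cap bounds the evictable amount the ARC shrinker reports to the kernel, while 0 reports everything. The arc_count_model() helper and the numbers are illustrative only.

```c
#include <stdint.h>
#include <stdio.h>

/* Illustrative stand-in; the real value is a module parameter. */
static uint64_t zfs_arc_shrinker_limit = 10000;	/* pages; 0 = unlimited */

/* Toy model of a count-style callback: report evictable pages to the kernel. */
static uint64_t
arc_count_model(uint64_t evictable_pages)
{
	uint64_t n = evictable_pages;

	/* A nonzero limit caps how much reclaim the kernel may then ask for. */
	if (zfs_arc_shrinker_limit != 0 && n > zfs_arc_shrinker_limit)
		n = zfs_arc_shrinker_limit;
	return (n);
}

int
main(void)
{
	printf("reported: %llu pages\n",
	    (unsigned long long)arc_count_model(500000));
	zfs_arc_shrinker_limit = 0;	/* limit=0: report everything evictable */
	printf("reported: %llu pages\n",
	    (unsigned long long)arc_count_model(500000));
	return (0);
}
```

With limit=0 the kernel sees the full evictable size, which is why a badly behaved reclaim path can, in the worst case, request very large evictions at once.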
FWIW, I think there's yet another possible source of excessive swapping in addition to your observations: it might be caused by a too-high […]. In our situation, since we run with […]
@snajpa Yes, I was also thinking about […].
@amotin I haven't looked at the code yet, but if it doesn't do this already, it might be worth allocating the memory with flags so it doesn't trigger any reclaim at all, and then decrementing the requested order on failure. We could also optimize further by saving the last successful order :) and only sometimes (whatever that means for now) going for a higher order.
That is what ZFS does. It tries to allocate big first, but if that fails, it requests smaller and smaller sizes until it gets enough. But that way it consumes all remaining big chunks first.
It actually seems to directly call […].
@snajpa Most of ARC capacity is allocated by abd_alloc_chunks() via alloc_pages_node().
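A compilable toy model of the descending-order fallback being described, assuming the real path (abd_alloc_chunks() calling alloc_pages_node()) follows the same general shape; try_alloc_order() is a stub standing in for the kernel allocator, and all names and constants here are illustrative.

```c
#include <stdbool.h>
#include <stdio.h>

#define MAX_ORDER_MODEL	9	/* e.g. up to 2 MB chunks with 4 KB pages */

/* Stub for alloc_pages_node(): pretend high orders fail under fragmentation. */
static bool
try_alloc_order(unsigned order, unsigned highest_avail)
{
	return (order <= highest_avail);
}

/* Allocate `need` pages, preferring big chunks but falling back to smaller ones. */
static void
alloc_chunks_model(unsigned long need, unsigned highest_avail)
{
	unsigned order = MAX_ORDER_MODEL;

	while (need > 0) {
		/* Don't ask for more than we still need. */
		while (order > 0 && (1UL << order) > need)
			order--;
		if (try_alloc_order(order, highest_avail)) {
			printf("got order-%u chunk (%lu pages)\n",
			    order, 1UL << order);
			need -= (1UL << order);
		} else if (order > 0) {
			order--;	/* fall back to a smaller order */
		} else {
			printf("even order-0 failed, giving up\n");
			break;
		}
	}
}

int
main(void)
{
	alloc_chunks_model(1000, 5);	/* fragmented: nothing above order-5 left */
	return (0);
}
```

As noted above, the side effect of this strategy is that the remaining high-order chunks get consumed first, leaving the rest of the system with a more fragmented pool.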
I've tried […]. Interestingly, it seems to always be called to get pretty similar amounts of memory, ranging from 273408 to 273856 bytes (?).
Does this patch address this issue? Can the fix be implemented within the limit of […]? Side question: in general, why is it so difficult to "emulate" the behavior of the Linux page cache with respect to grow, reclaim, and shrink? Thanks.
I do plan to set it to 0 in our TrueNAS builds, since we control the kernel there. But I have no good ideas what to do about upstream, since some Linux kernels tend to request enormous eviction amounts, even though the original motivation for its addition should no longer apply to most users. The 10000 default IMHO is extremely low, if any value other than 0 there is correct at all. But I am not touching it in this patch, leaving that for later.
This patch is not expected to fix the issue by itself, only to polish some rough spots. As I have said, at this point we have removed MGLRU from our kernels, which helped a lot with excessive swapping, and I am going to set zfs_arc_shrinker_limit to 0.
Because the page cache does not use the crippled shrinker KPIs that ZFS has to use. All memory pressure handling in Linux is built around the page cache, and everything else is secondary. And the mentioned MGLRU takes this to an extreme, which is why we had to disable it, but that is not a long-term solution.
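To make the "shrinker KPI" point concrete, below is a compilable toy model of the two-callback contract that the Linux struct shrinker imposes (count_objects()/scan_objects()): a subsystem may only report an object count and then free roughly the number of objects it is told to, with no direct visibility into overall memory pressure. The types and names are simplified stand-ins, not the kernel API.

```c
#include <stdio.h>

/* Simplified stand-in for the kernel's struct shrink_control. */
struct shrink_control_model {
	unsigned long nr_to_scan;	/* objects the kernel wants freed */
};

/* Pretend ARC state: evictable "objects" (pages). */
static unsigned long arc_evictable = 400000;

/* count_objects()-style callback: report how many objects could be freed. */
static unsigned long
arc_count(void)
{
	return (arc_evictable);
}

/* scan_objects()-style callback: free up to nr_to_scan, return how many were freed. */
static unsigned long
arc_scan(struct shrink_control_model *sc)
{
	unsigned long freed = sc->nr_to_scan;

	if (freed > arc_evictable)
		freed = arc_evictable;
	arc_evictable -= freed;
	return (freed);
}

int
main(void)
{
	/* The kernel asks "how much?", then issues scan requests in batches. */
	printf("count: %lu\n", arc_count());
	struct shrink_control_model sc = { .nr_to_scan = 128 };
	printf("scan freed: %lu, remaining: %lu\n", arc_scan(&sc), arc_count());
	return (0);
}
```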
Minor nits above.
I don't know much about the MGLRU (just read the overview) and haven't seen its effect for real, so I don't really have a great sense of what the problems are. But this change looks pretty light, makes sense, and the tuneable allows a little more adjustment as we learn more about it. I'm good with this.
As this patch touches […].
This patch actually does nothing about […].
Force-pushed from 096a55c to b40085a.
After more thinking, I've decided to add one more chunk to this patch. When receiving direct reclaim from file systems (which may be ZFS itself), the previous code just ignored the request to avoid deadlocks. But if ZFS occupies most of the system's RAM, ignoring such requests may cause excessive pressure on other caches and swap, and in the longer run may result in OOM killer activation. Instead of ignoring the request, I've made it shrink the ARC and kick the eviction thread, but skip the wait. It may not be perfect, but do we have a better choice?
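A compilable toy model of the behavior change described above (a sketch, not the actual patch): when the scan request arrives from filesystem context, where waiting on eviction could deadlock, the target is still reduced and the eviction thread is signaled, but the wait is skipped. The from_fs flag and the helpers are illustrative stand-ins.

```c
#include <stdbool.h>
#include <stdio.h>

static unsigned long arc_c = 1000000;	/* toy ARC target size, in pages */

/* Stand-in for waking the eviction thread without blocking on it. */
static void
kick_evict_thread(void)
{
	printf("eviction thread signaled (not waiting)\n");
}

/*
 * Toy scan handler. `from_fs` models "this reclaim request originated in a
 * filesystem (possibly ZFS itself)", where blocking on eviction may deadlock.
 */
static unsigned long
arc_scan_model(unsigned long nr_to_scan, bool from_fs)
{
	if (nr_to_scan > arc_c)
		nr_to_scan = arc_c;
	arc_c -= nr_to_scan;		/* shrink the target in both cases */

	kick_evict_thread();
	if (from_fs) {
		/*
		 * Old behavior was to ignore such requests entirely; here we
		 * still shrink and kick eviction, we just do not wait for it.
		 */
		return (nr_to_scan);
	}
	printf("waiting for eviction to catch up\n");
	return (nr_to_scan);
}

int
main(void)
{
	arc_scan_model(4096, true);	/* FS-context reclaim: no wait */
	arc_scan_model(4096, false);	/* ordinary reclaim: wait as before */
	printf("arc_c target now %lu pages\n", arc_c);
	return (0);
}
```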
I've decided to once more reconsider […]. The new code uses […]. PS: Thinking about it more, with the current […]
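For reference on the zfs_arc_shrinker_seeks parameter this patch adds: the kernel weighs each shrinker's seeks value when sizing scan requests, so a higher value means the objects are treated as more expensive to recreate and proportionally less is asked of that shrinker. The sketch below only approximates the scaling done in mm/vmscan.c's do_shrink_slab() on recent kernels; the exact formula varies by version, so treat the numbers as illustrative.

```c
#include <stdint.h>
#include <stdio.h>

#define DEFAULT_SEEKS	2	/* the kernel's default weighting */

/*
 * Roughly how the kernel sizes a scan request for one shrinker: the count it
 * reported, scaled down by reclaim priority (12 = light pressure, 0 = desperate)
 * and divided by `seeks`.
 */
static uint64_t
scan_delta(uint64_t freeable, int priority, unsigned seeks)
{
	if (seeks == 0)
		return (freeable / 2);
	return ((freeable >> priority) * 4 / seeks);
}

int
main(void)
{
	uint64_t freeable = 1 << 20;	/* 1M objects reported by count_objects() */

	for (int prio = 12; prio >= 0; prio -= 4) {
		printf("priority %2d: seeks=2 -> %8llu, seeks=8 -> %8llu\n",
		    prio,
		    (unsigned long long)scan_delta(freeable, prio, DEFAULT_SEEKS),
		    (unsigned long long)scan_delta(freeable, prio, 8));
	}
	return (0);
}
```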
BTW, some workarounds growing out of my MGLRU complaints: https://lkml.kernel.org/r/[email protected]
I agree. It should actually be ~160 MB (40 MB * 4 sublists), but the result does not change: under memory pressure, the ARC forces heavy swap and/or OOM. A more reasonable default for […]
I don't see any major issues. Could you please rebase on master to get a good FreeBSD test run? |
- When receiving a memory pressure signal from the OS, be more strict in trying to free some memory. Otherwise the kernel may come back and request much more. Return as the result how much arc_c was actually reduced due to this request, which may be less than requested.
- On Linux, when receiving direct reclaim from some file system (which may be ZFS itself), instead of ignoring the request completely, just shrink the ARC, but do not wait for eviction. Waiting there may cause a deadlock. Ignoring it as before may put extra pressure on other caches and/or swap, and cause OOM if nothing helps. Not waiting may result in more ARC being evicted later, and may be too late if the OOM killer activates right now, but I hope it is better than doing nothing at all.
- On Linux, set arc_no_grow before waiting for reclaim, not after, or the ARC may grow back while we are waiting.
- On Linux, add a new parameter zfs_arc_shrinker_seeks to balance ARC eviction cost relative to the page cache and other subsystems.
- Slightly update the Linux arc_set_sys_free() math for new kernels.

Signed-off-by: Alexander Motin <[email protected]>
Sponsored by: iXsystems, Inc.
Reviewed-by: Rob Norris <[email protected]>
Reviewed-by: Tony Hutter <[email protected]>
Motivation and Context
Since simultaneously updating to the Linux 6.6 kernel and increasing the maximum ARC size in TrueNAS SCALE 24.04, we've started to receive multiple complaints from people about excessive swapping making systems unresponsive. While I attribute a significant part of the problem to the new Multi-Gen LRU code enabled in the 6.6 kernel (disabling it helps), I ended up with this set of smaller tunings on the ZFS side, trying to make it behave a bit nicer in this terrible environment.