Implement parallel ARC eviction #16486
Conversation
Force-pushed from 4cd510d to f45bf2e
Force-pushed from f45bf2e to 146fe45
```c
/* Number of minimum-sized chunks needed to cover the remaining bytes. */
uint64_t nchunks = ((left - 1) >> MIN_EVICT_PERTASK_SHIFT) + 1;
/* Use at most one task per sublist. */
unsigned n = nchunks < num_sublists ? nchunks : num_sublists;
/* Distribute the chunks across the tasks; the first k tasks get one extra. */
uint64_t fullrows = nchunks / n;
unsigned lastrowcols = nchunks % n;
unsigned k = (lastrowcols ? lastrowcols : n);

uint64_t bytes_pertask_low =
    fullrows << MIN_EVICT_PERTASK_SHIFT;
uint64_t bytes_pertask = bytes_pertask_low + (lastrowcols ?
    (1 << MIN_EVICT_PERTASK_SHIFT) : 0);
```
I think you are over-engineering here. I don't think the eviction per task should really be a multiple of `1 << MIN_EVICT_PERTASK_SHIFT`; that only complicates the logic. It merely needs to be bigger than one. So you could just use `MIN_EVICT_PERTASK_SHIFT` to decide the number of tasks, and then split the eviction amount equally between them.

I also wonder if it would make sense to scale the number of tasks with the eviction amount not linearly but in some logarithmic fashion, to avoid spinning up too many threads at once and stressing the system more for diminishing returns.
I am quite new to taskqs, so I trust your judgement. Could I ask you to elaborate a bit more on how you'd like this logic to look? Thank you :)
This has nothing to do with taskqs. It simply makes no sense to create multiple parallel jobs/tasks and wake up multiple threads/CPUs for operations below a certain size, due to the overheads. `SPA_MAXBLOCKSHIFT` is an absolute minimum there, since it may be impossible to free less than that anyway if large blocks are used, but something bigger could be chosen based on some practical tests. So, as I said, remove all this unneeded complexity of trying to be a multiple of `MIN_EVICT_PERTASK_SHIFT`: select the number of tasks based on the total size (divide the total size by the minimum size, rounding up, and limit the result by the number of threads), and then give each task its fraction of the total size (just divide the total size by the number of tasks, rounding up). See the sketch below.
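For illustration, a minimal sketch of that recipe; the `bytes` and `nthreads` names are assumptions for the example, not the patch's actual identifiers:

```c
uint64_t min_pertask = 1ULL << MIN_EVICT_PERTASK_SHIFT;

/* Task count: total size over the per-task minimum, rounded up, capped. */
uint64_t ntasks = (bytes + min_pertask - 1) / min_pertask;
ntasks = MIN(ntasks, nthreads);

/* Each task gets an equal share of the total, rounded up. */
uint64_t bytes_pertask = (bytes + ntasks - 1) / ntasks;
```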
I've been casually testing this out (combined with the parallel_dbuf_evict PR) over the last couple of weeks (most recently, 5b070d1). I've not been hammering it hard or specifically, just letting it do its thing on my messing-around desktop system. I hit a probable regression today, though: while mv'ing a meager 8GB of files from one pool to another, all my ZFS I/O got really high-latency. iotop showed that the copy part of the move (this being a mv across pools, so in reality it's a copy-and-remove) was running at a painful few hundred KB/sec, and the zfs arc_evict thread was taking a whole core... but just one core. In time it all cleared up, and of course I can't conclusively blame this PR's changes, but I was left with two fuzzy observations:
Force-pushed from 146fe45 to e128026
I have updated the patch with different logic for picking the default maximum number of ARC eviction threads. The new logic picks a number that is one-eighth of the available CPUs, with a minimum of 2 and a maximum of 16.
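As a sketch, that default amounts to the following (assuming the `max_ncpus` variable used elsewhere in this review, and an illustrative `nthreads` result name):

```c
/* One-eighth of the CPUs, clamped to the range [2, 16]. */
uint_t nthreads = MIN(MAX(max_ncpus / 8, 2), 16);
```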
Why would we need two evict threads on a single-core system? In that case I would probably prefer to disable taskqs completely. If that is a way to make it more logarithmic, then I would think about
Right now, this is only enabled by a separate tunable, so for the single-CPU case we don't expect it to be enabled. But for something like 4-12 core systems, we would want it to use at least 2 threads, and then grow from there, reaching 16 threads at 128 cores.
Now that you mentioned it, I've noticed it's been disabled by default. I don't like the idea of tuning it manually in production depending on system size. I would prefer to have reasonable automatic defaults.
Force-pushed from b6a65a2 to e99733e
Hey! So, here's what changed in the patch:

Formula

There is now a different formula for automatically scaling the number of evict threads when the parameter is set to `0`. It looks like this (the x axis is the CPU count and the y axis is the evict thread count):

[scaling chart omitted]

Here's also a table:

[thread-count table omitted]

Fewer parameters

This approach has been suggested by @tonyhutter in another PR (#16487 (comment)).

Stability improvements

It is no longer possible to modify the actual evict thread count during runtime. Since the evict taskqs are only created during `arc_init()`, the module saves the actual number of evict threads it is going to use and does not care about later changes to the tunable.
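A rough sketch of that init-time snapshot, using the `zfs_arc_evict_threads_live` name that appears later in this review; the `arc_default_evict_threads()` helper, the `0`-means-auto convention, and the taskq parameters are all assumptions for illustration:

```c
/* At arc_init(): snapshot the tunable; later changes have no effect. */
zfs_arc_evict_threads_live = (zfs_arc_evict_threads != 0) ?
    zfs_arc_evict_threads : arc_default_evict_threads(); /* hypothetical helper */

arc_evict_taskq = taskq_create("arc_evict", zfs_arc_evict_threads_live,
    defclsyspri, 0, INT_MAX, TASKQ_PREPOPULATE);
```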
Thanks for automating it. A few comments on that part below; please also take a look at my earlier comments.
module/zfs/arc.c
```c
zfs_arc_evict_threads_live = MIN(MAX(max_ncpus > 6 ? 2 : 1,
    arc_ilog2(max_ncpus) + (max_ncpus >> 6)), 16);
```
Do we really need the `MAX(max_ncpus > 6 ? 2 : 1, ...)` part? `ilog2(6)` should already be 2.
Good catch! Thanks!
Currently, we get the following thread counts:
| CPU count | Resulting thread count |
|---|---|
| 1 | 1 |
| 2 | 1 |
| 3 | 1 |
| 4 | 2 |
| ... | ... |
So you are right that the `MAX(max_ncpus > 6 ? 2 : 1, ...)` part is not working as expected currently. If we want to stick to 1 thread for 4 and 5 CPUs, then we'd need to use the following formula:
```c
// Version 2a
MIN(max_ncpus < 6 ? 1 : arc_ilog2(max_ncpus) + (max_ncpus >> 6), 16);
```
If we decide to simplify that further and go for 2 threads on systems with 4 or 5 CPUs, then we can just use:
```c
// Version 2b
MAX(1, MIN(arc_ilog2(max_ncpus) + (max_ncpus >> 6), 16));
```
Version 2b looks good to me and is certainly easier to reason about.
2b looks cleaner, but 2 threads out of 4 sounds a bit overkill to me; maybe the curve could be thought through some more. BTW, speaking of cleaner (more readable) code: this is not a performance-critical part and we are not in the 1990s, so there is no point in using bit shifts for division. You can just use `/ 64` and compilers will do it right, and then you would not need the parentheses around it.
Thank you for your feedback! I've cleaned up the formula. Now we use 1 thread for fewer than 6 CPUs, and `MIN((highbit64(max_ncpus) - 1) + max_ncpus / 64, 16)` for larger systems.
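In code form, the described rule would look something like this (a sketch; the result variable name is assumed):

```c
/* 1 thread below 6 CPUs; then a slow, roughly logarithmic climb, capped at 16. */
uint_t nthreads = (max_ncpus < 6) ? 1 :
    MIN((highbit64(max_ncpus) - 1) + max_ncpus / 64, 16);
```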
Thanks. It's just that it seems like you are still creating a taskq with one thread, but never using it. ;)
Good catch, thank you for pointing that out. I've fixed it: all evict-taskq-related bits are now wrapped in a check for whether use of the taskq is enabled.
Thanks. That's fine, but I personally would check for `arc_evict_taskq` being NULL rather than `live > 1`.
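Sketched, that suggestion would look something like this (the flag name is assumed for illustration):

```c
/* Gate the parallel path on the taskq's existence, not the thread count. */
boolean_t use_evict_taskq = (arc_evict_taskq != NULL);

if (use_evict_taskq) {
	/* dispatch per-sublist eviction tasks to arc_evict_taskq */
} else {
	/* evict inline, single-threaded */
}
```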
module/zfs/arc.c
```
@@ -4071,25 +4117,108 @@ arc_evict_state(arc_state_t *state, arc_buf_contents_t type, uint64_t spa,
		multilist_sublist_unlock(mls);
	}

	evict_arg_t *evarg = kmem_alloc(sizeof (*evarg) * num_sublists,
	    KM_SLEEP);
```
Sleepable memory allocation in the eviction path is asking for potential trouble.
I've changed that to NOSLEEP. Now, if we cannot allocate the memory, we just fall back to the regular single-threaded evict.
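Roughly like this (a sketch; the `use_evict_taskq` flag stands in for whatever the patch actually uses to select the serial path):

```c
/* Don't sleep waiting for memory while trying to free memory. */
evict_arg_t *evarg = kmem_alloc(sizeof (*evarg) * num_sublists,
    KM_NOSLEEP);
if (evarg == NULL) {
	/* Allocation failed under pressure: fall back to serial evict. */
	use_evict_taskq = B_FALSE;
}
```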
Thanks. That is one way to solve it. Or we could pre-allocate it similar to markers.
```
@@ -7809,6 +7963,8 @@ arc_init(void)
 void
 arc_fini(void)
 {
	boolean_t useevicttaskq = zfs_arc_evict_threads_live > 1;

```
Extra newline.
I am not sure it is right, but it seems GCC does not like it:
Read and write performance can become limited by the arc_evict process being single threaded. Additional data cannot be added to the ARC until sufficient existing data is evicted. On many-core systems with TBs of RAM, a single thread becomes a significant bottleneck.

With the change we see a 25% increase in read and write throughput.

Sponsored-by: Expensify, Inc.
Sponsored-by: Klara, Inc.
Co-authored-by: Allan Jude <[email protected]>
Co-authored-by: Mateusz Piotrowski <[email protected]>
Signed-off-by: Alexander Stetsenko <[email protected]>
Signed-off-by: Allan Jude <[email protected]>
Signed-off-by: Mateusz Piotrowski <[email protected]>
- Improve the description of the scaling algorithm in the manual page.

Signed-off-by: Mateusz Piotrowski <[email protected]>
This parameter cannot be changed during runtime in any meaningful way anyway, so make it explicitly read-only. The manual page does not need to be updated: it already mentions that the thread count cannot be changed during runtime.
- Use a simple division instead of a bit shift for better readability.
- Make sure that systems with fewer than 6 CPUs auto-scale to 1 eviction thread.
Force-pushed from d899eaf to 3218719
Sponsored-by: Expensify, Inc.
Sponsored-by: Klara, Inc.
Motivation and Context
Read and write performance can become limited by the arc_evict process being single threaded.
Additional data cannot be added to the ARC until sufficient existing data is evicted.
On many-core systems with TBs of RAM, a single thread becomes a significant bottleneck.
With the change we see a 25% increase in read and write throughput.
Description
Use a new taskq to run multiple `arc_evict()` threads at once, each given a fraction of the desired memory to reclaim; a sketch of the dispatch pattern follows.
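A minimal sketch of that pattern, with the struct layout, the `arc_evict_task` worker, and the `ntasks`/`bytes_pertask` names all assumed for illustration rather than taken from the patch:

```c
typedef struct evict_arg {
	taskq_ent_t	eva_tqent;	/* embedded taskq entry */
	multilist_t	*eva_ml;	/* sublist to evict from */
	uint64_t	eva_bytes;	/* this task's share of the target */
} evict_arg_t;

/* Hand each task an equal fraction of the total eviction target. */
for (unsigned i = 0; i < ntasks; i++) {
	evarg[i].eva_bytes = bytes_pertask;
	taskq_dispatch_ent(arc_evict_taskq, arc_evict_task, &evarg[i],
	    0, &evarg[i].eva_tqent);
}
taskq_wait(arc_evict_taskq);
```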
How Has This Been Tested?
Benchmarking with a full ARC to measure the performance difference.