Implement cross-shard consumption fairness #2294
base: master
Conversation
I don't mean to ignore the problem, but I think "very nasty" is an exaggeration. A/0 could have expected 0.44 share of the capacity (1000/2200) and got 0.5. B/1 could expect 0.04 share of the capacity, and got 0.5. So neither has anything to complain about. I suppose with a larger number of shards and only a small number of sg/shard combinations active the problem is more interesting. Especially when the disk isn't too fast. |
Looks reasonable (I don't claim to understand it deeply). How can we assure this doesn't cause regressions? |
Probably
😄 Not quite. Provided A is the query class and B is the compaction class, 0.44 reads vs 0.04 writes is not the same load as 0.5 reads vs 0.5 writes; judging from diskplorer plots, only i4i can handle the latter.
I will :( Currently, if one shard gets stuck, other shards just continue moving. With this patch, they all will get stuck. |
But isn't the 0.5 normalized? So if the disk can do 6GB/s read and 1GB/s write, 0.5r / 0.5w = 3GB/s reads and 0.5GB/s writes, which should work according to the model. |
Yes, from the pure model perspective it's all OK, but in reality some plots are convex, so the 0.5/0.5 point happens to be "more purple" (larger read latency), while 0.4/0.04 is "more cyan" (smaller read latency) |
Force-pushed from 207c14a to 715d4a0
upd: |
Force-pushed from 715d4a0 to a4bc3e0
upd: |
CI fails due to #2296 |
upd:
|
Force-pushed from 26a8d08 to a2b8ec6
upd: |
The value is used to convert request tokens from a floating-point number into some "sane" integer. Currently, on a ~200k IOPS ~2GB/s disk it produces the following token values for requests of different sizes:
512: 150k
1024: 160k
2048: 170k
...
131072: 1.4M
These values are pretty huge and, when accumulated, can overflow even a 64-bit counter on a months time scale. The current code sort of accounts for that by checking for the overflow and resetting the counters, but in the future counters on different shards will need to be reset, and that's going to be problematic. This patch reduces the factor 8 times, so that the costs are now:
512: 19k
1024: 20k
2048: 21k
...
131072: 170k
That's much more friendly to accumulating counters (the overflow now happens on a year's time scale, which is pretty comfortable). Reducing it even further is problematic, here's why. In order to provide cross-class fairness the costs are divided by class shares before accumulation. Given a class with 1000 shares, a 512-byte request would become indistinguishable from a 1k one with a smaller factor. That said, even with the new factor it's worth taking more care when dividing the cost by shares and using div-roundup math.
Signed-off-by: Pavel Emelyanov <[email protected]>
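A minimal sketch of the roundup division the last sentence refers to; the names (`accumulate_cost`, `request_tokens`) are illustrative, not the actual fair_queue fields:

```cpp
#include <cstdint>

using capacity_t = uint64_t;

// Accumulate the dispatched cost of a request for a class with the given
// shares. Rounding the division up keeps small requests (e.g. 512 bytes)
// from collapsing to zero (or to the same cost as larger requests) when
// the shares value is high.
capacity_t accumulate_cost(capacity_t accumulated, capacity_t request_tokens, unsigned shares) {
    capacity_t cost = (request_tokens + shares - 1) / shares; // div-roundup
    return accumulated + cost;
}
```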
The current tests on fair queue try to make the queue submit requests in an extremely controllable way -- one by one. However, the fair queue nowadays is driven by a rated token bucket and is very sensitive to time and durations. It's better to teach the test to accept the fact that it cannot control fair-queue request submissions at per-request granularity and to tune its accounting instead. The change affects two places.
Main loop. Before the change it called fair_queue::dispatch_requests() as many times as the number of requests the test case wants to pass, then performed the necessary checks. Now the method is called indefinitely, and the handler only processes the requested amount of requests; the rest is ignored.
Drain. Before the change it called dispatch_requests() in a loop until it stopped returning anything. Now it's called in a loop until the fair queue explicitly reports that it's empty.
Signed-off-by: Pavel Emelyanov <[email protected]>
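A rough sketch of the drain change described above; the helper and its parameters are assumptions for illustration, not the exact test code:

```cpp
// Hypothetical test helper: drain the queue by dispatching until the queue
// itself says it is empty, instead of stopping as soon as one dispatch call
// returns nothing (the token bucket may simply not have replenished yet).
template <typename Queue, typename Complete>
void drain(Queue& fq, Complete&& complete) {
    while (!fq.empty()) {
        fq.dispatch_requests(complete);
    }
}
```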
For convenience Signed-off-by: Pavel Emelyanov <[email protected]>
On each shard, classes compete with each other by accumulating the sum of the request costs dispatched from them so far. Cost is the request capacity divided by the class shares. The dispatch loop then selects the class with the smallest accumulated value, thus providing shares-aware fairness -- the larger the shares value, the slower the accumulator grows, and the more requests are picked from the class for dispatch.
This patch implements a similar approach across shards. For that, each shard accumulates the dispatched cost from all classes. The IO group keeps track of a vector of accumulated costs, one per shard. When a shard wants to dispatch, it first checks whether it has run too far ahead of all other shards, and if it has, it skips the dispatch loop.
Corner case -- when a queue gets drained, it "withdraws" itself from other shards' decisions by advancing its group counter to infinity. Respectively, when the queue comes back it may forward its accumulator so as not to gain too large an advantage over other shards.
When scheduling classes, a shard has exclusive access to them and uses a log-complexity heap to pick the one with the smallest consumption counter. Cross-shard balancing cannot afford that. Instead, each shard manipulates only its own counter, and to compare it with other shards' it scans the whole vector, which is not very cache-friendly and is race-prone.
Signed-off-by: Pavel Emelyanov <[email protected]>
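A minimal sketch of the balancing idea, assuming hypothetical member names (per_shard_cost, threshold, withdraw) rather than the actual fair_group API:

```cpp
#include <algorithm>
#include <array>
#include <atomic>
#include <cstdint>
#include <limits>

using capacity_t = uint64_t;
constexpr unsigned max_shards = 64; // assumed bound for the sketch

struct group_balance {
    // One accumulated-cost counter per shard: written only by its owner,
    // scanned by everyone else (hence "not very cache-friendly").
    std::array<std::atomic<capacity_t>, max_shards> per_shard_cost{};

    // A shard skips its dispatch loop if it has run "too far ahead"
    // of the least-loaded shard.
    bool may_dispatch(unsigned shard, capacity_t threshold) const {
        capacity_t mine = per_shard_cost[shard].load(std::memory_order_relaxed);
        capacity_t lowest = std::numeric_limits<capacity_t>::max();
        for (unsigned s = 0; s < max_shards; s++) {
            lowest = std::min(lowest, per_shard_cost[s].load(std::memory_order_relaxed));
        }
        return mine - lowest <= threshold;
    }

    // A drained queue withdraws itself from other shards' decisions by
    // pushing its counter to "infinity".
    void withdraw(unsigned shard) {
        per_shard_cost[shard].store(std::numeric_limits<capacity_t>::max(),
                                    std::memory_order_relaxed);
    }
};
```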
The value is used to limit the number of requests one shard is allowed to dispatch in one poll. This is to prevent it from consuming the whole capacity in one go and to let other shards get their portion. Group-wide balancing (the previous patch) made this fuse obsolete.
Signed-off-by: Pavel Emelyanov <[email protected]>
Looking at the group balance counters is not very lightweight and is better avoided when possible. For that -- when balance is achieved, arm a timer for a quiescent period and check again only after it expires. When the group is not balanced, check balance more frequently.
Signed-off-by: Pavel Emelyanov <[email protected]>
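A sketch of how such amortization could look; the names and the quiescent period value are assumptions, not the ones used by the patch:

```cpp
#include <chrono>

class balance_checker {
    using clock = std::chrono::steady_clock;
    clock::time_point _next_check = clock::time_point::min();
    // Assumed back-off interval, purely for illustration.
    static constexpr auto quiescent_period = std::chrono::milliseconds(100);

public:
    // Returns true when it is time to scan the group counters again.
    bool should_check(clock::time_point now) const {
        return now >= _next_check;
    }

    // After a check: if the group looked balanced, back off for the
    // quiescent period; otherwise re-check on the very next poll.
    void checked(bool balanced, clock::time_point now) {
        _next_check = balanced ? now + quiescent_period : now;
    }
};
```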
It's pretty long, so not for automatic execution.
2-shards tests:
{'shard_0': {'iops': 88204.3828, 'shares': 100}, 'shard_1': {'iops': 89686.5156, 'shares': 100}}
IOPS ratio 1.02, expected 1.0, deviation 1%
{'shard_0': {'iops': 60321.3125, 'shares': 100}, 'shard_1': {'iops': 117566.406, 'shares': 200}}
IOPS ratio 1.95, expected 2.0, deviation 2%
{'shard_0': {'iops': 37326.2422, 'shares': 100}, 'shard_1': {'iops': 140555.062, 'shares': 400}}
IOPS ratio 3.77, expected 4.0, deviation 5%
{'shard_0': {'iops': 21547.6152, 'shares': 100}, 'shard_1': {'iops': 156309.891, 'shares': 800}}
IOPS ratio 7.25, expected 8.0, deviation 9%
3-shards tests:
{'shard_0': {'iops': 45211.9336, 'shares': 100}, 'shard_1': {'iops': 45211.9766, 'shares': 100}, 'shard_2': {'iops': 87412.9453, 'shares': 200}}
shard-1 IOPS ratio 1.0, expected 1.0, deviation 0%
shard-2 IOPS ratio 1.93, expected 2.0, deviation 3%
{'shard_0': {'iops': 30992.2188, 'shares': 100}, 'shard_1': {'iops': 30992.2812, 'shares': 100}, 'shard_2': {'iops': 115887.609, 'shares': 400}}
shard-1 IOPS ratio 1.0, expected 1.0, deviation 0%
shard-2 IOPS ratio 3.74, expected 4.0, deviation 6%
{'shard_0': {'iops': 19279.6348, 'shares': 100}, 'shard_1': {'iops': 19279.6934, 'shares': 100}, 'shard_2': {'iops': 139316.828, 'shares': 800}}
shard-1 IOPS ratio 1.0, expected 1.0, deviation 0%
shard-2 IOPS ratio 7.23, expected 8.0, deviation 9%
{'shard_0': {'iops': 26505.9082, 'shares': 100}, 'shard_1': {'iops': 53011.9922, 'shares': 200}, 'shard_2': {'iops': 98369.4453, 'shares': 400}}
shard-1 IOPS ratio 2.0, expected 2.0, deviation 0%
shard-2 IOPS ratio 3.71, expected 4.0, deviation 7%
{'shard_0': {'iops': 17461.8145, 'shares': 100}, 'shard_1': {'iops': 34923.8438, 'shares': 200}, 'shard_2': {'iops': 125470.43, 'shares': 800}}
shard-1 IOPS ratio 2.0, expected 2.0, deviation 0%
shard-2 IOPS ratio 7.19, expected 8.0, deviation 10%
{'shard_0': {'iops': 14812.3037, 'shares': 100}, 'shard_1': {'iops': 58262, 'shares': 400}, 'shard_2': {'iops': 104794.633, 'shares': 800}}
shard-1 IOPS ratio 3.93, expected 4.0, deviation 1%
shard-2 IOPS ratio 7.07, expected 8.0, deviation 11%
Signed-off-by: Pavel Emelyanov <[email protected]>
Force-pushed from a2b8ec6 to 3ef9817
upd: |
Then we should either adjust the model (and teach iotune to learn it) or add a safety factor that pushes the entire line down. We should match the scheduler to the model and the model to the disk. If we skip a step we'll get lost. |
It's already there -- the
I don't disagree. My "not quite" observation wasn't the justification of this PR, as it doesn't touch the model |
Updated the PR description not to claim it fixes #1430, as it still doesn't help high-prio requests advance in the queue in the case of a workload that is symmetrical across shards -- shards forward their accumulators equally, not giving any advantage to each other. However, the workload from #1430 clearly renders ~10% smaller tail latency for the high-prio interactive class (4 shards).
On master
this PR
Possible explanation -- the high-prio class is low-rate (250 RPS) and its queue is empty, so when one shard emits a request, it forwards its accumulator, thus reducing its chance of dispatching requests in the next tick, so the sink has more room for other shards. Still, that's not the true preemption that #1430 implies |
The class in question only controls the output flow of capacities; it's not about fair queueing at all. There is an effort to implement cross-shard fairness, which does need fair_group, but we're not there yet. refs: scylladb#2294 Signed-off-by: Pavel Emelyanov <[email protected]>
scylladb/seastar#2294
* xemul/br-io-queue-cross-shard-fairness:
  test: Add manual test for cross-shard balancing
  fair_queue: Amortize cross-shard balance checking
  fair_queue: Drop per-dispatch-loop threshold
  fair_queue: Introduce group-wide capacity balancing
  fair_queue: Define signed_capacity_t type in fair_group
  fair_queue tests: Remember it is time-based
  fair_queue: Scale fixed-point factor
The current IO queue design assumes that the IO workload is very uniform across shards, in the sense that all classes are more-or-less equally loaded on different shards. In reality that's not true, and some shards can easily get more requests in one of their classes than the others.
This leads to a very nasty consequence. Consider a corner case -- two classes, A and B, with shares 100 and 1000 respectively, and two shards. Class A is active on shard-0 only, while class B is active on shard-1 only. We expect that they share the disk bandwidth capacity in a 1:10 proportion, but in reality it's going to be 1:1, because the cross-shard queue doesn't preempt (it doesn't, because the load is expected to be even on all shards).
The solution here is to apply the cross-class fairness approach at the shard level. For that, each fair_queue accumulates the total cost of the requests it has dispatched (cost = request.capacity / class.shares), and on every poll it checks its accumulator against those of other shards. If the local value is "somewhat ahead of" the others, the poll is skipped until later.
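To make the 1:10 example above concrete, a tiny sketch of the accumulation rule stated by the formula; the type and member names are illustrative only:

```cpp
#include <cstdint>

using capacity_t = uint64_t;

struct shard_accumulator {
    capacity_t total = 0;

    // cost = request.capacity / class.shares, per the description above.
    // With shares 100 vs 1000, the same request capacity advances the shard
    // running class A ten times faster than the shard running class B, so
    // the balancer holds shard-0 back until the dispatched capacities
    // approach the expected 1:10 proportion.
    void on_dispatch(capacity_t request_capacity, unsigned class_shares) {
        total += request_capacity / class_shares;
    }
};
```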
refs: #1430
fixes: #1083
Tested with io_tester on 2-shards setup from #2289
The result is