Implement cross-shard consumption fairness #2294
base: master
Conversation
I don't mean to ignore the problem, but I think "very nasty" is an exaggeration. A/0 could have expected 0.44 share of the capacity (1000/2200) and got 0.5. B/1 could expect 0.04 share of the capacity, and got 0.5. So neither has anything to complain about. I suppose with a larger number of shards and only a small number of sg/shard combinations active the problem is more interesting. Especially when the disk isn't too fast. |
Looks reasonable (I don't claim to understand it deeply). How can we assure this doesn't cause regressions? |
Probably
😄 Not quite. Provided A is the query class and B is the compaction class, 0.44 reads vs 0.04 writes is not the same load as 0.5 reads vs 0.5 writes; judging from diskplorer plots, only i4i can handle the latter.
I will :( Currently, if one shard gets stuck, other shards just continue moving. With this patch, they all will get stuck. |
But isn't the 0.5 normalized? So if the disk can do 6GB/s read and 1GB/s write, 0.5r / 0.5w = 3GB/s reads and 0.5GB/s writes, which should work according to the model. |
Yes, from the pure model perspective it's all OK, but in reality some plots are convex, so the 0.5/0.5 point happens to be "more purple" (larger read latency), while 0.4/0.04 is "more cyan" (smaller read latency) |
Force-pushed from 207c14a to 715d4a0
upd: |
Force-pushed from 715d4a0 to a4bc3e0
upd: |
CI fails due to #2296 |
upd:
|
Force-pushed from 26a8d08 to a2b8ec6
upd: |
The value is used to convert request tokens from a floating-point number into some "sane" integer. Currently, on a ~200k IOPS ~2GB/s disk it produces the following token values for requests of different sizes:
512: 150k
1024: 160k
2048: 170k
...
131072: 1.4M
These values are pretty huge and, when accumulated, can overflow even a 64-bit counter on a months time scale. The current code sort of accounts for that by checking for the overflow and resetting the counters, but in the future counters on different shards will need to be reset, and that's going to be problematic. This patch reduces the factor 8 times, so that the costs are now:
512: 19k
1024: 20k
2048: 21k
...
131072: 170k
That's much more friendly to accumulating counters (the overflow now happens on a year's time scale, which is pretty comfortable). Reducing it even further is problematic, here's why. In order to provide cross-class fairness the costs are divided by class shares before accumulation. Given a class with 1000 shares, a 512-byte request would become indistinguishable from a 1k one with a smaller factor. That said, even with the new factor it's worth taking more care when dividing the cost by shares and using div-roundup math.
Signed-off-by: Pavel Emelyanov <[email protected]>
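A minimal sketch of the roundup division the last sentence refers to; the names (`accumulate_cost`, `request_tokens`) are illustrative, not the actual fair_queue fields:

```cpp
#include <cstdint>

using capacity_t = uint64_t;

// Accumulate the dispatched cost of a request for a class with the given
// shares. Rounding the division up keeps small requests (e.g. 512 bytes)
// from collapsing to zero (or to the same cost as larger requests) when
// the shares value is high.
capacity_t accumulate_cost(capacity_t accumulated, capacity_t request_tokens, unsigned shares) {
    capacity_t cost = (request_tokens + shares - 1) / shares; // div-roundup
    return accumulated + cost;
}
```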
The current tests on fair queue try to make the queue submit requests in an extremely controllable way -- one by one. However, the fair queue nowadays is driven by a rated token bucket and is very sensitive to time and durations. It's better to teach the test to accept the fact that it cannot control fair-queue request submissions at per-request granularity and to tune its accounting instead. The change affects two places.
Main loop. Before the change it called fair_queue::dispatch_requests() as many times as the number of requests the test case wants to pass, then performed the necessary checks. Now the method is called indefinitely, and the handler only processes the requested amount of requests; the rest is ignored.
Drain. Before the change it called dispatch_requests() in a loop until it stopped returning anything. Now it's called in a loop until the fair queue explicitly reports that it's empty.
Signed-off-by: Pavel Emelyanov <[email protected]>
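A rough sketch of the drain change described above; the helper and its parameters are assumptions for illustration, not the exact test code:

```cpp
// Hypothetical test helper: drain the queue by dispatching until the queue
// itself says it is empty, instead of stopping as soon as one dispatch call
// returns nothing (the token bucket may simply not have replenished yet).
template <typename Queue, typename Complete>
void drain(Queue& fq, Complete&& complete) {
    while (!fq.empty()) {
        fq.dispatch_requests(complete);
    }
}
```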
For convenience Signed-off-by: Pavel Emelyanov <[email protected]>
On each shard, classes compete with each other by accumulating the sum of the request costs dispatched from them so far. Cost is the request capacity divided by the class shares. The dispatch loop then selects the class with the smallest accumulated value, thus providing shares-aware fairness -- the larger the shares value, the slower the accumulator grows, and the more requests are picked from the class for dispatch.
This patch implements a similar approach across shards. For that, each shard accumulates the dispatched cost from all classes. The IO group keeps track of a vector of accumulated costs, one per shard. When a shard wants to dispatch, it first checks whether it has run too far ahead of all other shards, and if it has, it skips the dispatch loop.
Corner case -- when a queue gets drained, it "withdraws" itself from other shards' decisions by advancing its group counter to infinity. Respectively, when the queue comes back it may forward its accumulator so as not to gain too large an advantage over other shards.
When scheduling classes, a shard has exclusive access to them and uses a log-complexity heap to pick the one with the smallest consumption counter. Cross-shard balancing cannot afford that. Instead, each shard manipulates only its own counter, and to compare it with other shards' it scans the whole vector, which is not very cache-friendly and is race-prone.
Signed-off-by: Pavel Emelyanov <[email protected]>
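A minimal sketch of the balancing idea, assuming hypothetical member names (per_shard_cost, threshold, withdraw) rather than the actual fair_group API:

```cpp
#include <algorithm>
#include <array>
#include <atomic>
#include <cstdint>
#include <limits>

using capacity_t = uint64_t;
constexpr unsigned max_shards = 64; // assumed bound for the sketch

struct group_balance {
    // One accumulated-cost counter per shard: written only by its owner,
    // scanned by everyone else (hence "not very cache-friendly").
    std::array<std::atomic<capacity_t>, max_shards> per_shard_cost{};

    // A shard skips its dispatch loop if it has run "too far ahead"
    // of the least-loaded shard.
    bool may_dispatch(unsigned shard, capacity_t threshold) const {
        capacity_t mine = per_shard_cost[shard].load(std::memory_order_relaxed);
        capacity_t lowest = std::numeric_limits<capacity_t>::max();
        for (unsigned s = 0; s < max_shards; s++) {
            lowest = std::min(lowest, per_shard_cost[s].load(std::memory_order_relaxed));
        }
        return mine - lowest <= threshold;
    }

    // A drained queue withdraws itself from other shards' decisions by
    // pushing its counter to "infinity".
    void withdraw(unsigned shard) {
        per_shard_cost[shard].store(std::numeric_limits<capacity_t>::max(),
                                    std::memory_order_relaxed);
    }
};
```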
The value is used to limit the number of requests one shard is allowed to dispatch in one poll. This is to prevent it from consuming the whole capacity in one go and to let other shards get their portion. Group-wide balancing (the previous patch) made this fuse obsolete.
Signed-off-by: Pavel Emelyanov <[email protected]>
Looking at the group balance counters is not very lightweight and is better avoided when possible. For that -- when balance is achieved, arm a timer for a quiescent period and check again only after it expires. When the group is not balanced, check balance more frequently.
Signed-off-by: Pavel Emelyanov <[email protected]>
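A sketch of how such amortization could look; the names and the quiescent period value are assumptions, not the ones used by the patch:

```cpp
#include <chrono>

class balance_checker {
    using clock = std::chrono::steady_clock;
    clock::time_point _next_check = clock::time_point::min();
    // Assumed back-off interval, purely for illustration.
    static constexpr auto quiescent_period = std::chrono::milliseconds(100);

public:
    // Returns true when it is time to scan the group counters again.
    bool should_check(clock::time_point now) const {
        return now >= _next_check;
    }

    // After a check: if the group looked balanced, back off for the
    // quiescent period; otherwise re-check on the very next poll.
    void checked(bool balanced, clock::time_point now) {
        _next_check = balanced ? now + quiescent_period : now;
    }
};
```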
It's pretty long, so not for automatic execution.
2-shards tests:
{'shard_0': {'iops': 88204.3828, 'shares': 100}, 'shard_1': {'iops': 89686.5156, 'shares': 100}}
IOPS ratio 1.02, expected 1.0, deviation 1%
{'shard_0': {'iops': 60321.3125, 'shares': 100}, 'shard_1': {'iops': 117566.406, 'shares': 200}}
IOPS ratio 1.95, expected 2.0, deviation 2%
{'shard_0': {'iops': 37326.2422, 'shares': 100}, 'shard_1': {'iops': 140555.062, 'shares': 400}}
IOPS ratio 3.77, expected 4.0, deviation 5%
{'shard_0': {'iops': 21547.6152, 'shares': 100}, 'shard_1': {'iops': 156309.891, 'shares': 800}}
IOPS ratio 7.25, expected 8.0, deviation 9%
3-shards tests:
{'shard_0': {'iops': 45211.9336, 'shares': 100}, 'shard_1': {'iops': 45211.9766, 'shares': 100}, 'shard_2': {'iops': 87412.9453, 'shares': 200}}
shard-1 IOPS ratio 1.0, expected 1.0, deviation 0%
shard-2 IOPS ratio 1.93, expected 2.0, deviation 3%
{'shard_0': {'iops': 30992.2188, 'shares': 100}, 'shard_1': {'iops': 30992.2812, 'shares': 100}, 'shard_2': {'iops': 115887.609, 'shares': 400}}
shard-1 IOPS ratio 1.0, expected 1.0, deviation 0%
shard-2 IOPS ratio 3.74, expected 4.0, deviation 6%
{'shard_0': {'iops': 19279.6348, 'shares': 100}, 'shard_1': {'iops': 19279.6934, 'shares': 100}, 'shard_2': {'iops': 139316.828, 'shares': 800}}
shard-1 IOPS ratio 1.0, expected 1.0, deviation 0%
shard-2 IOPS ratio 7.23, expected 8.0, deviation 9%
{'shard_0': {'iops': 26505.9082, 'shares': 100}, 'shard_1': {'iops': 53011.9922, 'shares': 200}, 'shard_2': {'iops': 98369.4453, 'shares': 400}}
shard-1 IOPS ratio 2.0, expected 2.0, deviation 0%
shard-2 IOPS ratio 3.71, expected 4.0, deviation 7%
{'shard_0': {'iops': 17461.8145, 'shares': 100}, 'shard_1': {'iops': 34923.8438, 'shares': 200}, 'shard_2': {'iops': 125470.43, 'shares': 800}}
shard-1 IOPS ratio 2.0, expected 2.0, deviation 0%
shard-2 IOPS ratio 7.19, expected 8.0, deviation 10%
{'shard_0': {'iops': 14812.3037, 'shares': 100}, 'shard_1': {'iops': 58262, 'shares': 400}, 'shard_2': {'iops': 104794.633, 'shares': 800}}
shard-1 IOPS ratio 3.93, expected 4.0, deviation 1%
shard-2 IOPS ratio 7.07, expected 8.0, deviation 11%
Signed-off-by: Pavel Emelyanov <[email protected]>
Force-pushed from a2b8ec6 to 3ef9817
upd: |
Then we should either adjust the model (and teach iotune to learn it) or add a safety factor that pushes the entire line down. We should match the scheduler to the model and the model to the disk. If we skip a step we'll get lost. |
It's already there -- the
I don't disagree. My "not quite" observation wasn't the justification of this PR, as it doesn't touch the model |
Updated the PR description not to claim it fixes #1430, as it still doesn't help high-prio requests advance in the queue in the case of a workload that is symmetrical across shards -- shards forward their accumulators equally, not giving any advantage to each other. However, the workload from #1430 clearly renders ~10% smaller tail latency for the high-prio interactive class (4 shards).
On master
this PR
Possible explanation -- the high-prio class is low-rate (250 RPS) and its queue is empty, so when one shard emits a request, it forwards its accumulator, thus reducing its chance of dispatching requests in the next tick, so the sink has more room for other shards. Still, that's not the true preemption that #1430 implies |
The class in question only controls the output flow of capacities; it's not about fair queueing at all. There is an effort to implement cross-shard fairness, which does need fair_group, but we're not there yet. refs: scylladb#2294 Signed-off-by: Pavel Emelyanov <[email protected]>
scylladb/seastar#2294
* xemul/br-io-queue-cross-shard-fairness:
  test: Add manual test for cross-shard balancing
  fair_queue: Amortize cross-shard balance checking
  fair_queue: Drop per-dispatch-loop threshold
  fair_queue: Introduce group-wide capacity balancing
  fair_queue: Define signed_capacity_t type in fair_group
  fair_queue tests: Remember it is time-based
  fair_queue: Scale fixed-point factor
The current IO queue design assumes that the IO workload is very uniform across shards, in the sense that all classes are more-or-less equally loaded on different shards. In reality that's not true, and some shards can easily get more requests in one of their classes than the others.
This leads to a very nasty consequence. Consider a corner case -- two classes, A and B, with shares 100 and 1000 respectively, and two shards. Class A is active on shard-0 only, while class B is active on shard-1 only. We expect that they share the disk bandwidth capacity in a 1:10 proportion, but in reality it's going to be 1:1, because the cross-shard queue doesn't preempt (it doesn't, because the load is expected to be even on all shards).
The solution here is to apply the cross-class fairness approach at the shard level. For that, each fair_queue accumulates the total cost of the requests it has dispatched (cost = request.capacity / class.shares), and on every poll it checks its accumulator against those of other shards. If the local value is "somewhat ahead of" the others, the poll is skipped until later.
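To make the 1:10 example above concrete, a tiny sketch of the accumulation rule stated by the formula; the type and member names are illustrative only:

```cpp
#include <cstdint>

using capacity_t = uint64_t;

struct shard_accumulator {
    capacity_t total = 0;

    // cost = request.capacity / class.shares, per the description above.
    // With shares 100 vs 1000, the same request capacity advances the shard
    // running class A ten times faster than the shard running class B, so
    // the balancer holds shard-0 back until the dispatched capacities
    // approach the expected 1:10 proportion.
    void on_dispatch(capacity_t request_capacity, unsigned class_shares) {
        total += request_capacity / class_shares;
    }
};
```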
refs: #1430
fixes: #1083
Tested with io_tester on 2-shards setup from #2289
The result is