Implement cross-shard consumption fairness #2294
base: master
Commits on Jun 20, 2024
fair_queue: Scale fixed-point factor
The value is used to convert request tokens from a floating-point number into some "sane" integer. Currently, on a ~200k IOPS, ~2GB/s disk, it produces the following token values for requests of different sizes:

512: 150k
1024: 160k
2048: 170k
...
131072: 1.4M

These values are pretty huge, and when accumulated even in a 64-bit counter they can overflow it on a time scale of months. The current code sort of accounts for that by checking for overflow and resetting the counters, but in the future counters will need to be reset on different shards as well, and that's going to be problematic.

This patch reduces the factor 8 times, so that the costs are now:

512: 19k
1024: 20k
2048: 21k
...
131072: 170k

That's much more friendly to accumulating counters (the overflow now happens on a year's scale, which is pretty comfortable). Reducing the factor even further is problematic, and here's why. In order to provide cross-class fairness, the costs are divided by class shares before accumulation. Given a class with 1000 shares, a 512-byte request would become indistinguishable from a 1k one with a smaller factor. That said, even with the new factor it's worth taking more care when dividing the cost by shares and using div-roundup math.

Signed-off-by: Pavel Emelyanov <[email protected]>
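A minimal sketch of the div-roundup concern described above, with hypothetical names (`to_fixed_point`, `per_class_cost`) and an illustrative factor value, not the actual Seastar code: with plain truncating division, a small request in a high-share class could contribute zero cost and escape accounting entirely; rounding up guarantees every dispatched request contributes at least one unit.

```cpp
#include <cassert>
#include <cstdint>

// Illustrative fixed-point factor (8x smaller than the old one, per the
// commit message); the real value and conversion differ in Seastar.
constexpr double fixed_point_factor = 19000.0;

// Convert a floating-point token count into a fixed-point integer cost.
uint64_t to_fixed_point(double tokens) {
    return static_cast<uint64_t>(tokens * fixed_point_factor);
}

// Per-class accumulation divides the cost by the class shares. Round up
// so that even a tiny request in a 1000-share class contributes >= 1.
uint64_t per_class_cost(uint64_t cost, uint32_t shares) {
    return (cost + shares - 1) / shares; // div-roundup
}
```

With truncating division, `19000 / 1000` and `19500 / 1000` would both yield 19, erasing the size difference the smaller factor already squeezed; div-roundup preserves a distinction and never produces zero.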
Commit: 5991f91
fair_queue tests: Remember it is time-based
The current fair queue tests try to make the queue submit requests in an extremely controllable way, one by one. However, the fair queue nowadays is driven by a rated token bucket and is very sensitive to time and durations. It's better to teach the test to accept the fact that it cannot control fair-queue request submission at per-request granularity, and to tune its accounting instead.

The change affects two places.

Main loop. Before the change, it called fair_queue::dispatch_requests() as many times as the number of requests the test case wants to pass, then performed the necessary checks. Now the method is called indefinitely, and the handler only processes the requested number of requests; the rest are ignored.

Drain. Before the change, it called dispatch_requests() in a loop until it returned anything. Now it's called in a loop until the fair queue explicitly reports that it's empty.

Signed-off-by: Pavel Emelyanov <[email protected]>
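The drain change above can be sketched as follows, using a hypothetical stand-in for the queue (`mock_queue` is not a Seastar type); the point is the loop condition: terminate on an explicit emptiness report, not on the first call that happens to dispatch something.

```cpp
#include <cassert>
#include <queue>

// Hypothetical stand-in: dispatch_requests() may dispatch zero or more
// requests per call, depending on token-bucket timing; empty() is the
// authoritative signal that nothing is left.
struct mock_queue {
    std::queue<int> reqs;
    int dispatched = 0;
    void dispatch_requests() {
        if (!reqs.empty()) {
            reqs.pop();
            dispatched++;
        }
    }
    bool empty() const { return reqs.empty(); }
};

// Drain pattern after the change: keep calling dispatch_requests()
// until the queue explicitly reports it is empty.
void drain(mock_queue& q) {
    while (!q.empty()) {
        q.dispatch_requests();
    }
}
```

The old pattern (loop until a call "returned anything") can exit early when the rated token bucket lets a later call dispatch more requests than the first productive one.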
Commit: 67c0eb1
fair_queue: Define signed_capacity_t type in fair_group
For convenience.

Signed-off-by: Pavel Emelyanov <[email protected]>
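The commit doesn't show the definition, but a plausible reconstruction of such a convenience alias (all names here are an assumption, not the actual Seastar code) is a signed counterpart of the unsigned capacity type, useful for computing differences between accumulated capacities without unsigned-underflow surprises:

```cpp
#include <cassert>
#include <cstdint>
#include <type_traits>

// Hypothetical reconstruction: capacity_t is the unsigned fixed-point
// token type; signed_capacity_t is its signed counterpart.
using capacity_t = uint64_t;
using signed_capacity_t = std::make_signed_t<capacity_t>;

// Example use: the difference of two accumulators can go negative.
inline signed_capacity_t capacity_delta(capacity_t a, capacity_t b) {
    return static_cast<signed_capacity_t>(a - b);
}
```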
Configuration menu - View commit details
-
Copy full SHA for ecedc02 - Browse repository at this point
Copy the full SHA ecedc02View commit details -
fair_queue: Introduce group-wide capacity balancing
On each shard, classes compete with each other by accumulating the sum of the request costs that have been dispatched from them so far. The cost is the request capacity divided by the class shares. The dispatch loop then selects the class with the smallest accumulated value, thus providing shares-aware fairness: the larger the shares value, the slower the accumulator grows, and the more requests are picked from the class for dispatch.

This patch implements a similar approach across shards. For that, each shard accumulates the dispatched cost from all its classes, and the IO group keeps a vector of accumulated costs, one per shard. When a shard wants to dispatch, it first checks whether it has run too far ahead of all the other shards, and if it has, it skips the dispatch loop.

Corner case: when a queue gets drained, it "withdraws" itself from other shards' decisions by advancing its group counter to infinity. Respectively, when it comes back, it may fast-forward its accumulator so as not to gain too large an advantage over the other shards.

When scheduling classes, a shard has exclusive access to them and uses a logarithmic-complexity heap to pick the one with the smallest consumption counter. Cross-shard balancing cannot afford that. Instead, each shard manipulates only its own counter, and to compare it with the other shards' it scans the whole vector, which is not very cache-friendly and is race-prone.

Signed-off-by: Pavel Emelyanov <[email protected]>
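The shape of the mechanism can be sketched as below. This is a simplified model under stated assumptions (`group_balance`, `may_dispatch`, the fixed shard count, and the threshold are all illustrative, not Seastar's actual structures), showing the own-counter-only writes and the whole-vector scan the commit message describes:

```cpp
#include <algorithm>
#include <array>
#include <atomic>
#include <cassert>
#include <cstdint>
#include <limits>

constexpr unsigned nr_shards = 4; // illustrative; real count is dynamic

// One slot per shard; each shard only ever writes its own slot.
struct group_balance {
    std::array<std::atomic<uint64_t>, nr_shards> consumed{};

    // A shard may dispatch only if it is not too far ahead of the
    // least-consuming shard (threshold value is illustrative).
    bool may_dispatch(unsigned shard, uint64_t threshold) const {
        uint64_t mine = consumed[shard].load(std::memory_order_relaxed);
        uint64_t min = std::numeric_limits<uint64_t>::max();
        // Whole-vector scan, as described: not cache-friendly, race-prone,
        // but avoids any shared heap across shards.
        for (const auto& c : consumed) {
            min = std::min(min, c.load(std::memory_order_relaxed));
        }
        return mine <= min + threshold;
    }

    // Called after dispatching: advance only this shard's accumulator.
    void account(unsigned shard, uint64_t cost) {
        consumed[shard].fetch_add(cost, std::memory_order_relaxed);
    }
};
```

The relaxed atomics mirror the "race-prone" tradeoff: a shard may read slightly stale counters from its peers, which is acceptable because the check only throttles, it does not need to be exact.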
Commit: bd90f9f
fair_queue: Drop per-dispatch-loop threshold
The value is used to limit the number of requests one shard is allowed to dispatch in a single poll. This is to prevent it from consuming the whole capacity in one go and to let other shards get their portion. Group-wide balancing (the previous patch) made this fuse obsolete.

Signed-off-by: Pavel Emelyanov <[email protected]>
Commit: 17309f3
fair_queue: Amortize cross-shard balance checking
Looking at the group balance counters is not very lightweight and is better avoided when possible. For that, when balance is achieved, arm a timer for a quiescent period and check again only after it expires. When the group is not balanced, check the balance more frequently.

Signed-off-by: Pavel Emelyanov <[email protected]>
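The amortization can be sketched as a re-armed deadline (a sketch with hypothetical names and illustrative durations, not the real Seastar timers): the expensive scan runs only when the deadline has passed, and the result of the scan decides how far away the next deadline is.

```cpp
#include <cassert>
#include <chrono>

using bclock = std::chrono::steady_clock;

// Hypothetical helper: gates an expensive balance scan behind a deadline.
struct balance_checker {
    bclock::time_point next_check = bclock::time_point::min();
    std::chrono::milliseconds quiescent{100}; // long period while balanced
    std::chrono::milliseconds eager{1};       // short period while unbalanced

    // True when the (expensive) cross-shard scan should run now.
    bool should_check(bclock::time_point now) const {
        return now >= next_check;
    }

    // After a scan, re-arm: a balanced group earns a long quiet period,
    // an unbalanced one is re-checked much sooner.
    void rearm(bclock::time_point now, bool balanced) {
        next_check = now + (balanced ? quiescent : eager);
    }
};
```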
Commit: 2d444d0
test: Add manual test for cross-shard balancing
It's pretty long, so not for automatic execution.

2-shard tests:

{'shard_0': {'iops': 88204.3828, 'shares': 100}, 'shard_1': {'iops': 89686.5156, 'shares': 100}}
IOPS ratio 1.02, expected 1.0, deviation 1%

{'shard_0': {'iops': 60321.3125, 'shares': 100}, 'shard_1': {'iops': 117566.406, 'shares': 200}}
IOPS ratio 1.95, expected 2.0, deviation 2%

{'shard_0': {'iops': 37326.2422, 'shares': 100}, 'shard_1': {'iops': 140555.062, 'shares': 400}}
IOPS ratio 3.77, expected 4.0, deviation 5%

{'shard_0': {'iops': 21547.6152, 'shares': 100}, 'shard_1': {'iops': 156309.891, 'shares': 800}}
IOPS ratio 7.25, expected 8.0, deviation 9%

3-shard tests:

{'shard_0': {'iops': 45211.9336, 'shares': 100}, 'shard_1': {'iops': 45211.9766, 'shares': 100}, 'shard_2': {'iops': 87412.9453, 'shares': 200}}
shard-1 IOPS ratio 1.0, expected 1.0, deviation 0%
shard-2 IOPS ratio 1.93, expected 2.0, deviation 3%

{'shard_0': {'iops': 30992.2188, 'shares': 100}, 'shard_1': {'iops': 30992.2812, 'shares': 100}, 'shard_2': {'iops': 115887.609, 'shares': 400}}
shard-1 IOPS ratio 1.0, expected 1.0, deviation 0%
shard-2 IOPS ratio 3.74, expected 4.0, deviation 6%

{'shard_0': {'iops': 19279.6348, 'shares': 100}, 'shard_1': {'iops': 19279.6934, 'shares': 100}, 'shard_2': {'iops': 139316.828, 'shares': 800}}
shard-1 IOPS ratio 1.0, expected 1.0, deviation 0%
shard-2 IOPS ratio 7.23, expected 8.0, deviation 9%

{'shard_0': {'iops': 26505.9082, 'shares': 100}, 'shard_1': {'iops': 53011.9922, 'shares': 200}, 'shard_2': {'iops': 98369.4453, 'shares': 400}}
shard-1 IOPS ratio 2.0, expected 2.0, deviation 0%
shard-2 IOPS ratio 3.71, expected 4.0, deviation 7%

{'shard_0': {'iops': 17461.8145, 'shares': 100}, 'shard_1': {'iops': 34923.8438, 'shares': 200}, 'shard_2': {'iops': 125470.43, 'shares': 800}}
shard-1 IOPS ratio 2.0, expected 2.0, deviation 0%
shard-2 IOPS ratio 7.19, expected 8.0, deviation 10%

{'shard_0': {'iops': 14812.3037, 'shares': 100}, 'shard_1': {'iops': 58262, 'shares': 400}, 'shard_2': {'iops': 104794.633, 'shares': 800}}
shard-1 IOPS ratio 3.93, expected 4.0, deviation 1%
shard-2 IOPS ratio 7.07, expected 8.0, deviation 11%

Signed-off-by: Pavel Emelyanov <[email protected]>
Commit: 3ef9817