Implement cross-shard consumption fairness #2294
base: master
Commits on Jun 20, 2024
fair_queue: Scale fixed-point factor
The value is used to convert request tokens from a floating-point number into some "sane" integer. Currently, on a ~200k IOPS, ~2GB/s disk, it produces the following token values for requests of different sizes:

512: 150k
1024: 160k
2048: 170k
...
131072: 1.4M

These values are pretty huge, and when accumulated even in a 64-bit counter they can overflow it on a time scale of months. The current code sort of accounts for that by checking for overflow and resetting the counters, but in the future counters will need to be reset on different shards as well, and that's going to be problematic.

This patch reduces the factor 8 times, so that the costs are now:

512: 19k
1024: 20k
2048: 21k
...
131072: 170k

That's much more friendly to accumulating counters (the overflow now happens on a year's scale, which is pretty comfortable). Reducing the factor even further is problematic, and here's why. In order to provide cross-class fairness, the costs are divided by class shares before accumulation. Given a class with 1000 shares, a 512-byte request would become indistinguishable from a 1k one with a smaller factor. That said, even with the new factor it's worth taking more care when dividing the cost by shares and using div-roundup math.

Signed-off-by: Pavel Emelyanov <[email protected]>
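A minimal sketch of the div-roundup concern described above, with hypothetical names (`to_fixed_point`, `per_class_cost`) and an illustrative factor value, not the actual Seastar code: with plain truncating division, a small request in a high-share class could contribute zero cost and escape accounting entirely; rounding up guarantees every dispatched request contributes at least one unit.

```cpp
#include <cassert>
#include <cstdint>

// Illustrative fixed-point factor (8x smaller than the old one, per the
// commit message); the real value and conversion differ in Seastar.
constexpr double fixed_point_factor = 19000.0;

// Convert a floating-point token count into a fixed-point integer cost.
uint64_t to_fixed_point(double tokens) {
    return static_cast<uint64_t>(tokens * fixed_point_factor);
}

// Per-class accumulation divides the cost by the class shares. Round up
// so that even a tiny request in a 1000-share class contributes >= 1.
uint64_t per_class_cost(uint64_t cost, uint32_t shares) {
    return (cost + shares - 1) / shares; // div-roundup
}
```

With truncating division, `19000 / 1000` and `19500 / 1000` would both yield 19, erasing the size difference the smaller factor already squeezed; div-roundup preserves a distinction and never produces zero.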
Commit: 5991f91
fair_queue tests: Remember it is time-based
The current fair queue tests try to make the queue submit requests in an extremely controllable way, one by one. However, the fair queue nowadays is driven by a rated token bucket and is very sensitive to time and durations. It's better to teach the test to accept the fact that it cannot control fair-queue request submission at per-request granularity, and to tune its accounting instead.

The change affects two places.

Main loop. Before the change, it called fair_queue::dispatch_requests() as many times as the number of requests the test case wants to pass, then performed the necessary checks. Now the method is called indefinitely, and the handler only processes the requested number of requests; the rest are ignored.

Drain. Before the change, it called dispatch_requests() in a loop until it returned anything. Now it's called in a loop until the fair queue explicitly reports that it's empty.

Signed-off-by: Pavel Emelyanov <[email protected]>
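The drain change above can be sketched as follows, using a hypothetical stand-in for the queue (`mock_queue` is not a Seastar type); the point is the loop condition: terminate on an explicit emptiness report, not on the first call that happens to dispatch something.

```cpp
#include <cassert>
#include <queue>

// Hypothetical stand-in: dispatch_requests() may dispatch zero or more
// requests per call, depending on token-bucket timing; empty() is the
// authoritative signal that nothing is left.
struct mock_queue {
    std::queue<int> reqs;
    int dispatched = 0;
    void dispatch_requests() {
        if (!reqs.empty()) {
            reqs.pop();
            dispatched++;
        }
    }
    bool empty() const { return reqs.empty(); }
};

// Drain pattern after the change: keep calling dispatch_requests()
// until the queue explicitly reports it is empty.
void drain(mock_queue& q) {
    while (!q.empty()) {
        q.dispatch_requests();
    }
}
```

The old pattern (loop until a call "returned anything") can exit early when the rated token bucket lets a later call dispatch more requests than the first productive one.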
Commit: 67c0eb1
fair_queue: Define signed_capacity_t type in fair_group
For convenience.

Signed-off-by: Pavel Emelyanov <[email protected]>
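The commit doesn't show the definition, but a plausible reconstruction of such a convenience alias (all names here are an assumption, not the actual Seastar code) is a signed counterpart of the unsigned capacity type, useful for computing differences between accumulated capacities without unsigned-underflow surprises:

```cpp
#include <cassert>
#include <cstdint>
#include <type_traits>

// Hypothetical reconstruction: capacity_t is the unsigned fixed-point
// token type; signed_capacity_t is its signed counterpart.
using capacity_t = uint64_t;
using signed_capacity_t = std::make_signed_t<capacity_t>;

// Example use: the difference of two accumulators can go negative.
inline signed_capacity_t capacity_delta(capacity_t a, capacity_t b) {
    return static_cast<signed_capacity_t>(a - b);
}
```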
Configuration menu - View commit details
-
Copy full SHA for ecedc02 - Browse repository at this point
Copy the full SHA ecedc02View commit details -
fair_queue: Introduce group-wide capacity balancing
On each shard, classes compete with each other by accumulating the sum of the request costs that have been dispatched from them so far. The cost is the request capacity divided by the class shares. The dispatch loop then selects the class with the smallest accumulated value, thus providing shares-aware fairness: the larger the shares value, the slower the accumulator grows, and the more requests are picked from the class for dispatch.

This patch implements a similar approach across shards. For that, each shard accumulates the dispatched cost from all its classes, and the IO group keeps a vector of accumulated costs, one per shard. When a shard wants to dispatch, it first checks whether it has run too far ahead of all the other shards, and if it has, it skips the dispatch loop.

Corner case: when a queue gets drained, it "withdraws" itself from other shards' decisions by advancing its group counter to infinity. Respectively, when it comes back, it may fast-forward its accumulator so as not to gain too large an advantage over the other shards.

When scheduling classes, a shard has exclusive access to them and uses a logarithmic-complexity heap to pick the one with the smallest consumption counter. Cross-shard balancing cannot afford that. Instead, each shard manipulates only its own counter, and to compare it with the other shards' it scans the whole vector, which is not very cache-friendly and is race-prone.

Signed-off-by: Pavel Emelyanov <[email protected]>
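The shape of the mechanism can be sketched as below. This is a simplified model under stated assumptions (`group_balance`, `may_dispatch`, the fixed shard count, and the threshold are all illustrative, not Seastar's actual structures), showing the own-counter-only writes and the whole-vector scan the commit message describes:

```cpp
#include <algorithm>
#include <array>
#include <atomic>
#include <cassert>
#include <cstdint>
#include <limits>

constexpr unsigned nr_shards = 4; // illustrative; real count is dynamic

// One slot per shard; each shard only ever writes its own slot.
struct group_balance {
    std::array<std::atomic<uint64_t>, nr_shards> consumed{};

    // A shard may dispatch only if it is not too far ahead of the
    // least-consuming shard (threshold value is illustrative).
    bool may_dispatch(unsigned shard, uint64_t threshold) const {
        uint64_t mine = consumed[shard].load(std::memory_order_relaxed);
        uint64_t min = std::numeric_limits<uint64_t>::max();
        // Whole-vector scan, as described: not cache-friendly, race-prone,
        // but avoids any shared heap across shards.
        for (const auto& c : consumed) {
            min = std::min(min, c.load(std::memory_order_relaxed));
        }
        return mine <= min + threshold;
    }

    // Called after dispatching: advance only this shard's accumulator.
    void account(unsigned shard, uint64_t cost) {
        consumed[shard].fetch_add(cost, std::memory_order_relaxed);
    }
};
```

The relaxed atomics mirror the "race-prone" tradeoff: a shard may read slightly stale counters from its peers, which is acceptable because the check only throttles, it does not need to be exact.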
Commit: bd90f9f
fair_queue: Drop per-dispatch-loop threshold
The value is used to limit the number of requests one shard is allowed to dispatch in a single poll. This is to prevent it from consuming the whole capacity in one go and to let other shards get their portion. Group-wide balancing (the previous patch) made this fuse obsolete.

Signed-off-by: Pavel Emelyanov <[email protected]>
Commit: 17309f3
fair_queue: Amortize cross-shard balance checking
Looking at the group balance counters is not very lightweight and is better avoided when possible. For that, when balance is achieved, arm a timer for a quiescent period and check again only after it expires. When the group is not balanced, check the balance more frequently.

Signed-off-by: Pavel Emelyanov <[email protected]>
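The amortization can be sketched as a re-armed deadline (a sketch with hypothetical names and illustrative durations, not the real Seastar timers): the expensive scan runs only when the deadline has passed, and the result of the scan decides how far away the next deadline is.

```cpp
#include <cassert>
#include <chrono>

using bclock = std::chrono::steady_clock;

// Hypothetical helper: gates an expensive balance scan behind a deadline.
struct balance_checker {
    bclock::time_point next_check = bclock::time_point::min();
    std::chrono::milliseconds quiescent{100}; // long period while balanced
    std::chrono::milliseconds eager{1};       // short period while unbalanced

    // True when the (expensive) cross-shard scan should run now.
    bool should_check(bclock::time_point now) const {
        return now >= next_check;
    }

    // After a scan, re-arm: a balanced group earns a long quiet period,
    // an unbalanced one is re-checked much sooner.
    void rearm(bclock::time_point now, bool balanced) {
        next_check = now + (balanced ? quiescent : eager);
    }
};
```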
Commit: 2d444d0
test: Add manual test for cross-shard balancing
It's pretty long, so not for automatic execution.

2-shard tests:

{'shard_0': {'iops': 88204.3828, 'shares': 100}, 'shard_1': {'iops': 89686.5156, 'shares': 100}}
IOPS ratio 1.02, expected 1.0, deviation 1%

{'shard_0': {'iops': 60321.3125, 'shares': 100}, 'shard_1': {'iops': 117566.406, 'shares': 200}}
IOPS ratio 1.95, expected 2.0, deviation 2%

{'shard_0': {'iops': 37326.2422, 'shares': 100}, 'shard_1': {'iops': 140555.062, 'shares': 400}}
IOPS ratio 3.77, expected 4.0, deviation 5%

{'shard_0': {'iops': 21547.6152, 'shares': 100}, 'shard_1': {'iops': 156309.891, 'shares': 800}}
IOPS ratio 7.25, expected 8.0, deviation 9%

3-shard tests:

{'shard_0': {'iops': 45211.9336, 'shares': 100}, 'shard_1': {'iops': 45211.9766, 'shares': 100}, 'shard_2': {'iops': 87412.9453, 'shares': 200}}
shard-1 IOPS ratio 1.0, expected 1.0, deviation 0%
shard-2 IOPS ratio 1.93, expected 2.0, deviation 3%

{'shard_0': {'iops': 30992.2188, 'shares': 100}, 'shard_1': {'iops': 30992.2812, 'shares': 100}, 'shard_2': {'iops': 115887.609, 'shares': 400}}
shard-1 IOPS ratio 1.0, expected 1.0, deviation 0%
shard-2 IOPS ratio 3.74, expected 4.0, deviation 6%

{'shard_0': {'iops': 19279.6348, 'shares': 100}, 'shard_1': {'iops': 19279.6934, 'shares': 100}, 'shard_2': {'iops': 139316.828, 'shares': 800}}
shard-1 IOPS ratio 1.0, expected 1.0, deviation 0%
shard-2 IOPS ratio 7.23, expected 8.0, deviation 9%

{'shard_0': {'iops': 26505.9082, 'shares': 100}, 'shard_1': {'iops': 53011.9922, 'shares': 200}, 'shard_2': {'iops': 98369.4453, 'shares': 400}}
shard-1 IOPS ratio 2.0, expected 2.0, deviation 0%
shard-2 IOPS ratio 3.71, expected 4.0, deviation 7%

{'shard_0': {'iops': 17461.8145, 'shares': 100}, 'shard_1': {'iops': 34923.8438, 'shares': 200}, 'shard_2': {'iops': 125470.43, 'shares': 800}}
shard-1 IOPS ratio 2.0, expected 2.0, deviation 0%
shard-2 IOPS ratio 7.19, expected 8.0, deviation 10%

{'shard_0': {'iops': 14812.3037, 'shares': 100}, 'shard_1': {'iops': 58262, 'shares': 400}, 'shard_2': {'iops': 104794.633, 'shares': 800}}
shard-1 IOPS ratio 3.93, expected 4.0, deviation 1%
shard-2 IOPS ratio 7.07, expected 8.0, deviation 11%

Signed-off-by: Pavel Emelyanov <[email protected]>
Commit: 3ef9817