io_queue: Oversubscribe to drain imbalance faster
The natural lack of cross-shard fairness may lead to a nasty imbalance
problem. When a shard gets lots of requests queued (a spike) it will
try to drain its queue by dispatching requests on every tick. However,
if all other shards have something to do so that the disk capacity is
close to exhausted, this overloaded shard will have little chance to
drain itself because every tick it will only get its "fair" amount of
capacity tokens, which is capacity/smp::count and that's it.
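
The arithmetic behind this can be sketched with a toy model. It is not
Seastar code: `fair_share` and `drain_time` are hypothetical helpers,
and the model assumes every shard is granted exactly
capacity/shard_count per second while new requests keep arriving at a
steady rate.

```cpp
#include <cassert>
#include <limits>

// Toy model (assumption): each shard gets exactly capacity / shard_count.
constexpr double fair_share(double capacity, unsigned shard_count) {
    return capacity / shard_count;
}

// Seconds needed to drain `backlog` queued requests when the shard is
// granted `share` requests per second and `arrival_rate` new requests
// per second keep coming in. Returns infinity if the backlog never shrinks.
inline double drain_time(double backlog, double share, double arrival_rate) {
    assert(share > 0);
    const double spare = share - arrival_rate;
    if (spare <= 0) {
        return std::numeric_limits<double>::infinity();
    }
    return backlog / spare;
}
```

In this model a shard whose steady arrival rate already matches its
share never drains a spike, no matter how small the spike is.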

In order to drain the overloaded queue a shard should get more capacity
tokens than other shards. This will increase the pressure on other
shards, of course, "spreading" one shard queue among others thus
reducing the average latency of requests. When increasing the amount of
grabbed tokens there are two pitfalls to avoid.

Both come from the fact that under the described circumstances the
shared capacity is likely all exhausted and shards are "fighting" for
tokens in the "pending" state -- i.e. they line up in the shared token
bucket for _future_ tokens that will get there eventually as requests
complete. So...

1. If the capacity is all claimed by shards and shards continue to claim
   more, they will end up in the "pending" state, i.e. they grab extra
   tokens from the shared capacity and "remember" the position in the
   shared queue at which they are to get them. Thus, if an urgent
   request arrives at a random shard, in the worst case it will have to
   wait for this whole over-claimed line before it can get dispatched.
   Currently, the maximum length of the over-claimed queue is limited to
   one request per shard, which eventually amounts to the
   io-latency-goal. Claiming _more_ than that would exceed this time by
   the amount of over-claimed tokens, so the over-claim shouldn't be
   too large.

2. When increasing the pressure on the shared capacity, a shard has no
   idea if any other shard does the same. This means that a shard should
   avoid increasing the pressure "just because"; there should be some
   yes-no reason for doing it, so that only "overloaded" shards try to
   grab more. If all shards suddenly get into this aggressive state,
   they will compensate each other, but according to point 1 the
   worst-case preemption latency would grow too high.
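
A rough way to put numbers on both points is the toy model below. It is
an illustration only, not Seastar code: `worst_case_wait_ms` is a
hypothetical helper, and it assumes that all requests cost the same and
that a line of one request per shard takes exactly the latency goal to
drain.

```cpp
#include <cassert>

// Toy model (assumptions): equal-cost requests; a line of one request
// per shard drains in exactly `latency_goal_ms`.
inline double worst_case_wait_ms(double latency_goal_ms, unsigned shards,
                                 unsigned oversubscribing_shards) {
    assert(shards > 0 && oversubscribing_shards <= shards);
    // Every over-subscribing shard puts one extra request into the line.
    const double line_length =
        static_cast<double>(shards) + oversubscribing_shards;
    return latency_goal_ms * line_length / shards;
}
```

Under these assumptions one aggressive shard out of 12 stretches the
worst case by 1/12, while all 12 shards being aggressive doubles it.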

With the above two considerations in mind, the proposed solution is to

a. Over-claim at most one (1) request from the local queue
b. Start over-claim once the local queue length goes above some
   threshold, and apply hysteresis on exiting this state to avoid
   resonance.

The thresholds are pretty much arbitrary in this patch -- 12 and 8 --
and that's the biggest problem with it.
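
The start/stop hysteresis can be sketched in isolation as follows. The
two thresholds are the ones from this patch; `oversubscribe_state` and
its method names are hypothetical and only mirror the checks added to
`fair_queue::queue()` and `fair_queue::dispatch_requests()`.

```cpp
#include <cassert>

// Thresholds as chosen in the patch.
constexpr unsigned oversubscribe_start = 12;
constexpr unsigned oversubscribe_stop = 8;

// Hypothetical stand-in for the per-shard bookkeeping.
struct oversubscribe_state {
    unsigned queued = 0;
    bool oversubscribing = false;

    // A request is added to the local queue.
    void on_queue() noexcept {
        if (++queued >= oversubscribe_start) {
            oversubscribing = true;
        }
    }

    // A request leaves the local queue for execution.
    void on_dispatch() noexcept {
        assert(queued > 0);
        if (--queued < oversubscribe_stop) {
            oversubscribing = false;
        }
    }
};
```

Between 8 and 11 queued requests the flag keeps whatever value it had,
so a queue hovering around a single threshold does not flip the state
on every request.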

The issue can be reproduced with the help of recent io-tester over a
/dev/null storage :)

The io-properties.yaml:
```
disks:
  - mountpoint: /dev/null
    read_iops: 1200
    read_bandwidth: 1GB
    write_iops: 1200
    write_bandwidth: 1GB
```

The jobs conf.yaml:
```
- name: latency_reads_1
  shards: all
  type: randread
  data_size: 1GB
  shard_info:
    parallelism: 80
    rps: 1
    reqsize: 512
    shares: 1000

- name: latency_reads_1a
  shards: [0]
  type: randread
  data_size: 1GB
  shard_info:
    parallelism: 10
    limit: 100
    reqsize: 512
    class: latency_reads_1
```

Running it with 1 io group and 12 shards results in shard 0 suffering
from a never-draining queue and huge final latencies:

    shard p99 latency (usec)
       0: 1208561
       1: 14520
       2: 17456
       3: 15777
       4: 15488
       5: 14576
       6: 19251
       7: 20222
       8: 18338
       9: 21267
      10: 17083
      11: 16188

With this patch applied, shard 0 scatters its queue among the other
shards within several ticks, lowering its latency at the cost of the
other shards' latencies:

    shard p99 latency (usec)
       0: 108345
       1: 102907
       2: 106900
       3: 105244
       4: 109214
       5: 107881
       6: 114278
       7: 114289
       8: 113560
       9: 105411
      10: 113898
      11: 112615

However, the longer the test runs, the smaller the latencies become for
the 2nd test (and for the 1st too, but not for shard 0).

refs: #1083

Signed-off-by: Pavel Emelyanov <[email protected]>
xemul committed Feb 28, 2023
1 parent 1347ca9 commit 866ee2a
Showing 2 changed files with 29 additions and 0 deletions.
3 changes: 3 additions & 0 deletions include/seastar/core/fair_queue.hh
@@ -345,6 +345,8 @@ private:
     };

     std::optional<pending> _pending;
+    bool _oversubscribing = false;
+    std::optional<pending> _oversubscribed;

     void push_priority_class(priority_class_data& pc) noexcept;
     void push_priority_class_from_idle(priority_class_data& pc) noexcept;
@@ -355,6 +357,7 @@ private:
     enum class grab_result { grabbed, cant_preempt, pending };
     grab_result grab_capacity(const fair_queue_entry& ent) noexcept;
     grab_result grab_pending_capacity(const fair_queue_entry& ent) noexcept;
+    void oversubscribe_capacity(capacity_t cap) noexcept;
 public:
     /// Constructs a fair queue with configuration parameters \c cfg.
     ///
26 changes: 26 additions & 0 deletions src/core/fair_queue.cc
@@ -266,6 +266,11 @@ auto fair_queue::grab_pending_capacity(const fair_queue_entry& ent) noexcept ->
     }

     _pending.reset();
+    if (_oversubscribed) {
+        _pending = *_oversubscribed;
+        _oversubscribed.reset();
+    }
+
     return grab_result::grabbed;
 }

@@ -284,6 +289,12 @@ auto fair_queue::grab_capacity(const fair_queue_entry& ent) noexcept -> grab_res
     return grab_result::grabbed;
 }

+void fair_queue::oversubscribe_capacity(capacity_t cap) noexcept {
+    assert(_pending);
+    capacity_t want_head = _group.grab_capacity(cap);
+    _oversubscribed.emplace(want_head, cap);
+}
+
 void fair_queue::register_priority_class(class_id id, uint32_t shares) {
     if (id >= _priority_classes.size()) {
         _priority_classes.resize(id + 1);
@@ -326,6 +337,9 @@ fair_queue_ticket fair_queue::resources_currently_executing() const {
     return _resources_executing;
 }

+static constexpr unsigned oversubscribe_start = 12;
+static constexpr unsigned oversubscribe_stop = 8;
+
 void fair_queue::queue(class_id id, fair_queue_entry& ent) noexcept {
     priority_class_data& pc = *_priority_classes[id];
     // We need to return a future in this function on which the caller can wait.
@@ -337,6 +351,9 @@ void fair_queue::queue(class_id id, fair_queue_entry& ent) noexcept {
     pc._queue.push_back(ent);
     _resources_queued += ent._ticket;
     _requests_queued++;
+    if (_requests_queued >= oversubscribe_start) {
+        _oversubscribing = true;
+    }
 }

 void fair_queue::notify_request_finished(fair_queue_ticket desc) noexcept {
@@ -384,6 +401,12 @@ void fair_queue::dispatch_requests(std::function<void(fair_queue_entry&)> cb) {
         auto& req = h._queue.front();
         auto gr = grab_capacity(req);
         if (gr == grab_result::pending) {
+            if (_oversubscribing && !_oversubscribed) {
+                auto cap = _group.ticket_capacity(req._ticket);
+                if (cap > 0) {
+                    oversubscribe_capacity(cap);
+                }
+            }
             break;
         }

@@ -401,6 +424,9 @@ void fair_queue::dispatch_requests(std::function<void(fair_queue_entry&)> cb) {
         _resources_queued -= req._ticket;
         _requests_executing++;
         _requests_queued--;
+        if (_requests_queued < oversubscribe_stop) {
+            _oversubscribing = false;
+        }

         // Usually the cost of request is tens to hundreeds of thousands. However, for
         // unrestricted queue it can be as low as 2k. With large enough shares this
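
To see how the `_pending` / `_oversubscribed` pair behaves, here is a
simplified, self-contained model of the handoff. It is a sketch under
assumptions, not the Seastar implementation: `toy_group`, `toy_queue`
and their members are hypothetical, and the shared token bucket is
reduced to two counters (tokens claimed and tokens released so far).

```cpp
#include <cassert>
#include <cstdint>
#include <optional>

using capacity_t = std::uint64_t;

// Hypothetical stand-in for the shared group: a line of token claims.
struct toy_group {
    capacity_t head = 0;      // total tokens claimed so far
    capacity_t released = 0;  // total tokens made available so far

    // Claim `cap` tokens; returns the line position at which the claim
    // becomes satisfied.
    capacity_t grab_capacity(capacity_t cap) noexcept {
        head += cap;
        return head;
    }
    bool ready(capacity_t want_head) const noexcept {
        return released >= want_head;
    }
};

struct pending {
    capacity_t head;
    capacity_t cap;
};

// Hypothetical stand-in for the per-shard queue.
struct toy_queue {
    toy_group& group;
    std::optional<pending> current;         // plays the role of _pending
    std::optional<pending> oversubscribed;  // plays the role of _oversubscribed

    void claim(capacity_t cap) noexcept {
        assert(!current);
        current = pending{group.grab_capacity(cap), cap};
    }
    // Take a second place in line while the first claim is still waiting.
    void oversubscribe(capacity_t cap) noexcept {
        assert(current && !oversubscribed);
        oversubscribed = pending{group.grab_capacity(cap), cap};
    }
    // True when the current claim is satisfied; the extra claim, if any,
    // is then promoted to be the current one.
    bool try_dispatch() noexcept {
        if (!current || !group.ready(current->head)) {
            return false;
        }
        current.reset();
        if (oversubscribed) {
            current = *oversubscribed;
            oversubscribed.reset();
        }
        return true;
    }
};
```

In this model a shard that over-subscribes gets its second request into
the line ahead of any shard that claims later, which is both the point
of the patch and the reason the over-claim is capped at one request.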