
io_queue: Oversubscribe to drain imbalance faster #1522

Closed

Conversation

xemul
Contributor

@xemul xemul commented Feb 28, 2023

The natural lack of cross-shard fairness may lead to a nasty imbalance problem. When a shard gets lots of requests queued (a spike) it will try to drain its queue by dispatching requests on every tick. However, if all other shards have something to do, so that the disk capacity is close to being exhausted, the overloaded shard has little chance to drain itself, because every tick it only gets its "fair" amount of capacity tokens, which is capacity/smp::count and nothing more.
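As a rough illustration using the reproducer numbers further below (an estimate, not a measured figure): with a 1200-IOPS disk and 12 shards, each shard's fair share is only about 100 requests per second no matter how long its own backlog is, so a shard that accumulates a spike of queued requests can only work it off at that rate while the other shards keep consuming their shares.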

In order to drain the overloaded queue, a shard should get more capacity tokens than the other shards. This will, of course, increase the pressure on the other shards, "spreading" one shard's queue among the others and thus reducing the average latency of requests. When increasing the amount of grabbed tokens there are two pitfalls to avoid.

Both come from the fact that under the described circumstances the shared capacity is likely all exhausted and shards are "fighting" for tokens in the "pending" state -- i.e. they line up in the shared token bucket for future tokens, which will arrive eventually as requests complete. So...

  1. If the capacity is all claimed by shards and shards continue to claim more, they end up in the "pending" state -- they grab extra tokens from the shared capacity and "remember" their position in the shared queue where they are to get them. Thus, if an urgent request arrives at a random shard, in the worst case it will have to wait for this whole over-claimed line before it can get dispatched. Currently, the maximum length of the over-claimed queue is limited to one request per shard, which eventually equals the io-latency-goal. Claiming more than that would exceed this time by the amount of over-claimed tokens, so the over-claim shouldn't be too large.

  2. When increasing the pressure on the shared capacity, a shard has no idea whether any other shard is doing the same. This means a shard should avoid increasing the pressure "just because"; there should be a clear yes/no reason for doing it, so that only "overloaded" shards try to grab more. If all shards suddenly get into this aggressive state, they will compensate for each other, but according to p.1 the worst-case preemption latency would grow too high.

With the above two constraints at hand, the proposed solution is to introduce a per-class capacity-claim measure which grows monotonically with the class queue length and is proportional to the class shares.

a. Over-claim at most one (1) request from the local queue

b. Start over-claiming once the capacity claim goes above some threshold, and apply hysteresis on exiting this state to avoid resonance

The capacity claim is deliberately selected to grow faster for high-prio queues with short requests (the scylla query class) and much slower for low-prio queues with fat requests (the scylla compaction/flush classes). It doesn't care about request lengths, only about the shares value.

Also, since several classes may fluctuate around the claim thresholds, oversubscribing happens as long as at least one class is above its threshold.

The thresholds are pretty much arbitrary in this patch -- 12000 and 8000 -- and that's its biggest problem.
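To make the mechanism concrete, here is a minimal C++ sketch of the claim computation and the hysteresis decision described above. It is illustrative only and not the actual patch code; the names (`priority_class`, `capacity_claim`, the threshold constants) are made up, and 12000/8000 are just the arbitrary values mentioned above.

```cpp
#include <cstddef>
#include <cstdint>

// Illustrative sketch only; names and structure are hypothetical.
struct priority_class {
    unsigned shares;
    size_t queue_length;          // requests currently queued on this shard
    bool oversubscribing = false;

    // The claim grows monotonically with the queue length and is
    // proportional to shares; request sizes are deliberately ignored.
    uint64_t capacity_claim() const {
        return uint64_t(queue_length) * shares;
    }
};

// Arbitrary thresholds from the patch description; hysteresis means a
// class enters the oversubscribing state above 12000 but only leaves it
// below 8000, so it doesn't flap around a single boundary value.
constexpr uint64_t oversubscribe_enter = 12000;
constexpr uint64_t oversubscribe_exit  = 8000;

bool update_oversubscription(priority_class& pc) {
    uint64_t claim = pc.capacity_claim();
    if (!pc.oversubscribing && claim > oversubscribe_enter) {
        pc.oversubscribing = true;
    } else if (pc.oversubscribing && claim < oversubscribe_exit) {
        pc.oversubscribing = false;
    }
    return pc.oversubscribing;
}

// The queue then over-claims at most one extra request's worth of tokens
// per tick while at least one class reports oversubscribing == true.
```

With the numbers from the tables later in the thread: a query class with 50 queued requests and 1000 shares yields a claim of 50000 and crosses the enter threshold, while a compaction class with 60 queued requests and 100 shares stays at 6000 and does not.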

The issue can be reproduced with the help of a recent io-tester over /dev/null storage :)

The io-properties.yaml:

```
disks:
  - mountpoint: /dev/null
    read_iops: 1200
    read_bandwidth: 1GB
    write_iops: 1200
    write_bandwidth: 1GB
```

The jobs conf.yaml:

```
- name: latency_reads_1
  shards: all
  type: randread
  data_size: 1GB
  shard_info:
    parallelism: 80
    rps: 1
    reqsize: 512
    shares: 1000

- name: latency_reads_1a
  shards: [0]
  type: randread
  data_size: 1GB
  shard_info:
    parallelism: 10
    limit: 100
    reqsize: 512
    class: latency_reads_1
```

Running it with 1 io group and 12 shards results in shard 0 suffering from a never-draining queue and huge final latencies:

```
shard p99 latency (usec)
   0: 1208561
   1: 14520
   2: 17456
   3: 15777
   4: 15488
   5: 14576
   6: 19251
   7: 20222
   8: 18338
   9: 21267
  10: 17083
  11: 16188
```

With this patch applied, shard 0 scatters its queue among the other shards within several ticks, lowering its latency at the cost of the other shards' latencies:

```
shard p99 latency (usec)
   0: 108345
   1: 102907
   2: 106900
   3: 105244
   4: 109214
   5: 107881
   6: 114278
   7: 114289
   8: 113560
   9: 105411
  10: 113898
  11: 112615
```

However, the longer the test runs, the smaller the latencies become for the 2nd test (and for the 1st too, but not for shard 0).

refs: #1083

@xemul xemul requested a review from avikivity February 28, 2023 16:23
@avikivity
Member

There's a problem here. Shard 0 could have low-share requests queued. But shards don't communicate the share-mix of the requests, and therefore shard 0 could be preempting high-share work on shards [1..11].

This isn't theoretical: we expect lots of low-priority work to compete with a smaller amount of latency-sensitive work. This allows the low-priority work to dominate the latency-sensitive work.

@xemul xemul force-pushed the br-io-queue-oversubscribe-on-overload branch from 866ee2a to 59b0dc9 on March 1, 2023 07:31
@xemul
Contributor Author

xemul commented Mar 1, 2023

> There's a problem here. Shard 0 could have low-share requests queued. But shards don't communicate the share-mix of the requests, and therefore shard 0 could be preempting high-share work on shards [1..11].

Fair point. But it can be addressed by checking a more abstract "capacity claim". E.g. here's the update where the capacity claim is the per-class queue_length * shares, and it deliberately doesn't depend on request lengths so as not to give compaction/flush a bonus. With that, compaction would (typically) need 10 times more queued requests than the query class to beat it.

However, it's not great either: a 2nd configuration parameter silently appears -- the ratio of high-prio classes vs low-prio classes. In this patch it's 1 * shares_a / shares_b, but it may work better as e.g. 42 * shares_a / shares_b. Dunno.

Another option might be to explicitly mark IO classes as "latency sensitive", so that the fair-queue only drives into oversubscribing when those classes grow large.

@avikivity
Member

I don't see why multiplying by shares helps. The low-share classes can have any number of pending requests; in fact they usually will have a longer queue due to being given less bandwidth.

@avikivity
Member

We can auto-detect latency sensitive classes. If the queue is empty (or there are no running tasks) then that queue is latency sensitive.

Once a scheduling group becomes overloaded, it doesn't help if we say it's latency sensitive: it will be throughput bound, and the latency will be determined by concurrency, not by the disk/cpu/scheduling latency.
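As a rough illustration of this detection idea (a sketch under assumed names, not an existing Seastar API): a class with nothing queued would be treated as latency sensitive, and the property naturally disappears once a backlog builds up.

```cpp
#include <cstddef>

// Hypothetical per-class counters; the real fair_queue bookkeeping differs.
struct class_state {
    size_t queued = 0;     // requests waiting in the local queue
    size_t in_flight = 0;  // requests dispatched but not yet completed
};

bool is_latency_sensitive(const class_state& cs) {
    // "queue is empty (or there are no running tasks)" -- the exact
    // combination is open; here an empty local queue is the trigger.
    return cs.queued == 0;
}
```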

@xemul
Contributor Author

xemul commented Mar 1, 2023

> I don't see why multiplying by shares helps. The low-share classes can have any number of pending requests; in fact they usually will have a longer queue due to being given less bandwidth.

That's not what Vlad has been showing me so far. In the normal case the query class queue length is indeed a "few" requests and the compaction queue length is up to tens of them. In troublesome cases the query class queue length goes up to 100 (then someone immediately starts actively complaining), and so does compaction, but it doesn't go into the 1000s. Thus we have:

```
Normal conditions:
                   queue length       shares     claim (in this patch)
query                        1          1000                    1000
compaction                  20           100                    2000

Emergency:
                   queue length       shares     claim (in this patch)
query                       50          1000                   50000
compaction                  60           100                    6000
```

So a threshold of, say, 12000 tells the query class in emergency apart from the rest. I agree that 12k is a randomly chosen number and in real life it may be different, but the general direction still applies -- comparing queue_length * shares against a threshold can separate the query class in emergency from the rest to some extent.

> We can auto-detect latency sensitive classes. If the queue is empty (or there are no running tasks) then that queue is latency sensitive.

Any queue can be empty. In fact, all the troubles seen so far started happening when the compaction queue turned from being latency-sensitive in the above sense into ... something else.

> Once a scheduling group becomes overloaded, it doesn't help if we say it's latency sensitive, ...

The goal here is not to push a single latency-sensitive request into a contended shared queue as soon as possible, but to give a class with an "accumulated long" queue more room in the shared queue so that it can drain itself faster.

@bhalevy
Member

bhalevy commented Jul 31, 2023

@xemul, @avikivity, where do we stand with this PR?
#1083 is still open and we need to decide on the direction of fixing it (or close it).

@xemul
Contributor Author

xemul commented Aug 20, 2024

Obsoleted by #2294

@xemul xemul closed this Aug 20, 2024