
Implement dynamic tag info functionality for accurate tag calculation with multiple queues configured per server #92

Open

sseshasa wants to merge 6 commits into master from wip-fix-multiple-queue-tag-fix

Conversation

@sseshasa (Contributor) commented Nov 19, 2024

To better understand the fix, the problem with multiple queues configured per server
needs to be understood first. This problem is currently faced by Ceph (specifically
Ceph OSDs), which are clients of the dmClock server. Ceph creates multiple op queues
per OSD (a.k.a. OSD op queue shards), and each op queue runs independently with the
same server id (the OSD ID). With this configuration, tags are calculated on each
queue independently. Therefore, a client could distribute requests across the multiple
mClock queues, but the server cannot meet the client's QoS settings due to the
following problem:

Problem Description

Consider two queues configured on the same server with
items added in the following sequence:

   Enqueue--->|T5|T2|T0|--->Dequeue
                   Queue0

   Enqueue--->|T4|T3|T1|--->Dequeue
                   Queue1

Consider the request arrival times (arr_time) in each queue according
to the number associated with each tag, for example:

 - Req0 arrives on queue0 at time t0
 - Req1 arrives on queue1 at time t1
 - Req2 arrives on queue0 at time t2

and so on, with the final Req5 arriving on queue0 at time t5.

Consider x to be any of the client info parameters, i.e., reservation, weight or limit.

Tag calcs on Queue0                 | Tag calcs on Queue1
------------------------------------|---------------------------------
T0 = max((NO_TAG + (1/x)), arr_time)| T1 = max((NO_TAG + (1/x)), arr_time)
T2 = max((T0 + (1/x)), arr_time)    | T3 = max((T1 + (1/x)), arr_time)
T5 = max((T2 + (1/x)), arr_time)    | T4 = max((T3 + (1/x)), arr_time)

For simplicity, assume reservation and limit are set to 3 IOPS, AtLimit is
set to 'Wait' and all requests arrive within a few milliseconds on
both queues.

It's clear that all three requests from each queue will be scheduled,
resulting in 6 IOPS, which is not correct.
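For concreteness, the arithmetic above can be reproduced with a minimal, self-contained C++ sketch (illustration only, not dmClock source); it shows each queue independently producing tags spaced 1/3 s apart, so all six requests become eligible within the first second:

    #include <algorithm>
    #include <cstdio>
    #include <initializer_list>

    int main() {
      const double inv_x = 1.0 / 3.0;          // 1/x for res = lim = 3 IOPS
      const double arr[6] = {0.000, 0.001, 0.002, 0.003, 0.004, 0.005};

      // Queue0 holds Req0, Req2, Req5; Queue1 holds Req1, Req3, Req4.
      double q0_tag = 0.0, q1_tag = 0.0;       // NO_TAG treated as 0 here
      for (int r : {0, 2, 5}) {
        q0_tag = std::max(q0_tag + inv_x, arr[r]);
        std::printf("Queue0 T%d = %.3f\n", r, q0_tag);
      }
      for (int r : {1, 3, 4}) {
        q1_tag = std::max(q1_tag + inv_x, arr[r]);
        std::printf("Queue1 T%d = %.3f\n", r, q1_tag);
      }
      // Each queue independently produces tags 0.333, 0.667 and 1.000, so
      // all six requests become eligible within the first second: 6 IOPS
      // instead of the configured 3.
      return 0;
    }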

Proposed Solution

The fix is better understood by applying the solution to the above problem.
The fix is applicable to both DelayedTagCalc and ImmediateTagCalc.

Pre-conditions:

  • Dynamic tag info functionality (bool U2 in the pull constructor) is enabled (is_dynamic_tag_info_f).
  • The dmClock client maintains a mapping of clientId to ReqTagInfo for each client.
  • The dmClock client registers custom implementations for reqtag_updt_f and reqtag_info_f.

Solution:

The fix involves enabling the dynamic tag info for two operations by the following entities:

  1. the server - updates the dmClock client (using reqtag_updt_f) with the latest tag for a clientId on a given queue once it is calculated. This is called after tags are calculated in initial_tag() and update_next_tag().

  2. the dmClock client - calculates and updates the tick interval before adding a request to the mClock queue. For a given queue, the tick interval is the number of requests added to other mClock queues before the current request arrives on it. The latest tick interval is read by the server using reqtag_info_f before calculating the initial_tag(). See calc_interval_tag() for how the new tag is calculated (a hedged sketch of this calculation follows this list), and see the unit tests for how the client calculates the tick interval before adding the request to the queue. It's important to note that for DelayedTagCalc, the tag is calculated as part of initial_tag() ONLY for non-zero tick intervals.
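The interval-based calculation referred to in item 2 can be summarized with the following hedged C++ sketch; the function name and signature are illustrative assumptions, and the actual calc_interval_tag() in this PR may differ:

    #include <algorithm>

    // Hedged sketch of the interval-based calculation described above;
    // last_tag is the most recent tag calculated for this client on any
    // queue (obtained via reqtag_info_f), and tick_interval is the number
    // of requests the client added to other queues since that tag.
    double calc_interval_tag_sketch(double last_tag, unsigned tick_interval,
                                    double x,        // res, wgt or lim
                                    double arr_time) {
      // For DelayedTagCalc this path applies only when tick_interval > 0;
      // with tick_interval == 0 the tag is left to update_next_tag() as
      // usual.
      return std::max(last_tag + static_cast<double>(tick_interval) / x,
                      arr_time);
    }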

The tag calculation for each tag is shown below with the fix in place:

Tag calcs on Queue0                 | Tag calcs on Queue1
------------------------------------|---------------------------------
tick interval i = 0                 | tick interval i = 1 because req1
                                    | arrived after req0 on Queue0.
T0 = max((NO_TAG + (1/x)), arr_time)| T1 = max((T0 + (i/x)), arr_time)
                                    | [Note: If tick_interval > 0, the tag
                                    |  is calculated as part of
                                    |  initial_tag() for DelayedTagCalc]
--------------------------------------------------------------------------
tick interval i = 1                 | tick interval i = 1
T2 = max((T1 + (i/x)), arr_time)    | T3 = max((T2 + (i/x)), arr_time)
--------------------------------------------------------------------------
tick interval i = 2 because req5    | tick interval i = 0 because req4
arrived two requests after T2.      | follows immediately after req3 on
                                    | the same queue.
T5 = max((T3 + (i/x)), arr_time)    | T4 = max((T3 + (1/x)), arr_time)
                                    | [Note: If tick_interval == 0, the tag
                                    |  is calculated only when it gets to
                                    |  the front as part of
                                    |  update_next_tag(), as usual, when
                                    |  DelayedTagCalc is enabled.]

For simplicity, assume reservation and limit are set to 3 IOPS, AtLimit is set to 'Wait' and all requests arrive within a second on both queues.

With the fix in place, only T0, T1 and T2 will be scheduled during the first phase, as expected, since each request is spaced 1/res apart, i.e., 1/3rd of a second. The same pattern applies to the rest of the requests in the queues, thus ensuring accurate scheduling of requests across all the queues on the same server.
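A minimal C++ sketch of the corrected arithmetic (illustration only; the bases and tick intervals are taken directly from the table above, with arrival times near zero):

    #include <algorithm>
    #include <cstdio>

    int main() {
      const double inv_x = 1.0 / 3.0;              // 1/x, res = lim = 3 IOPS
      const double arr = 0.0;                      // arrivals near t = 0
      // Follow the table above: each tag is max(base + i/x, arr_time).
      double T0 = std::max(0.0 + 1 * inv_x, arr);  // NO_TAG base, 1/x
      double T1 = std::max(T0  + 1 * inv_x, arr);  // i = 1
      double T2 = std::max(T1  + 1 * inv_x, arr);  // i = 1
      double T3 = std::max(T2  + 1 * inv_x, arr);  // i = 1
      double T4 = std::max(T3  + 1 * inv_x, arr);  // i = 0, usual 1/x path
      double T5 = std::max(T3  + 2 * inv_x, arr);  // i = 2
      std::printf("%.3f %.3f %.3f %.3f %.3f %.3f\n", T0, T1, T2, T3, T4, T5);
      // Prints 0.333 0.667 1.000 1.333 1.667 2.000: one eligible request
      // per 1/3 s across both queues, i.e. 3 IOPS, matching the reservation.
      return 0;
    }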

Tests:
Several unit tests have been added to exercise the solution with both delayed and immediate tag calculation. Of particular importance are the tests that completely randomize queue additions and pulls from multiple (five) queues. The tests also provide insight into how the clients calculate the latest tick interval and how the latest tag is updated by the client via the dynamic functions. See:
- pull_reservation_randomize_delydtag
- pull_reservation_randomize_immtag
- pull_weight_randomize_delydtag
- pull_weight_randomize_immtag

Signed-off-by: Sridhar Seshasayee [email protected]

Include <cstdint> to resolve compilation failures such as:
dmclock/src/dmclock_recs.h:26:21: error: ‘uint64_t’ does not name a type
   26 |     using Counter = uint64_t;
      |                     ^~~~~~~~

Signed-off-by: Sridhar Seshasayee <[email protected]>
Add a couple of tests using immediate and delayed tag calculation to
demonstrate the inaccurate QoS provided to a client when multiple queues
are configured per server. The requests from the client are distributed
across multiple queues. Since the queues are independent, the tag
calculations are also independent, and this results in inaccurate QoS
provided to the client, as the tests show. The following scenario
outlines the problem with multiple queues per server.

Tag Calculation with Single Queue:
----------------------------------

 Enqueue--->|T5|T4|T3|T2|T1|T0|--->Dequeue
         Tags for items in Single Queue

  Where, Tn = Tag value at time = n

For simplicity, consider x to be any of the
client info parameters, i.e., reservation, weight or limit.

1. Tag calculation for T1 in the queue would look like:

    T1 = max((T0 + (1/x)), arr_time)
    where, T0 - Tag at time 0
           x  - client info [res|wgt|lim], and,
           arr_time - Arrival time of the request associated with the tag

2. Similarly, Tag calculation for T5:

    T5 = max((T4 + (1/x)), arr_time)

As seen above, there is no issue when a single mClock
queue is in operation. For all the requests, the tags are
calculated based on their arrival times and in sequence.
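The recurrence above can be written as a small illustrative helper (a sketch only, not dmClock source):

    #include <algorithm>

    // Sketch of the single-queue recurrence described above: each new tag
    // is spaced 1/x after the previous one, but never earlier than the
    // request's arrival time.
    double next_tag(double prev_tag, double x, double arr_time) {
      return std::max(prev_tag + 1.0 / x, arr_time);
    }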

Tag Calculation with Multiple Queues per Server:
-----------------------------------------------
Consider two queues configured on the same server with
items added in the following sequence:

   Enqueue--->|T5|T2|T0|--->Dequeue
                   Queue0

   Enqueue--->|T4|T3|T1|--->Dequeue
                   Queue1

Consider the request arrival times in each queue according
to the number associated with each tag, for example:

 - Req0 arrives on queue0 at time t0
 - Req1 arrives on queue1 at time t1
 - Req2 arrives on queue0 at time t2
and so on, with the final Req5 arriving on queue0 at time t5.

Tag calcs on Queue0                 | Tag calcs on Queue1
------------------------------------|---------------------------------
T0 = max((NO_TAG + (1/x)), arr_time)| T1 = max((NO_TAG + (1/x)), arr_time)
T2 = max((T0 + (1/x)), arr_time)    | T3 = max((T1 + (1/x)), arr_time)
T5 = max((T2 + (1/x)), arr_time)    | T4 = max((T3 + (1/x)), arr_time)

For simplicity, assume reservation and limit are set to 3 IOPS, AtLimit is
set to 'Wait' and all requests arrive well within a few milliseconds on
both queues.

It's clear that all three requests from each queue will be scheduled,
resulting in 6 IOPS, which is not correct.

Signed-off-by: Sridhar Seshasayee <[email protected]>
A function to verify that a tag is valid would be particularly useful
when handling delayed tag calculation involving requests from clients
distributed across multiple queues.

Signed-off-by: Sridhar Seshasayee <[email protected]>
A new structure called ReqTagInfo is introduced to enable clients of
dmClock to share the latest RequestTag and the latest tick interval among
the set of queues spawned per server. The following are the parameters
and their purpose:

 1. last_tag (read/write from the server's standpoint): This parameter
    holds the latest calculated tag across all the queues. It forms the
    base tag for a server to calculate the next tag when a new request
    is added to its queue.

 2. last_tick_interval: The total number of requests handled by other
    queues on a server before the request on a given queue. This is
    updated by dmClock clients and read by the server to calculate the
    next accurate tag.
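
A hedged sketch of what such a structure could look like is shown below; the field names follow this description, while the types are placeholders (the real structure carries dmClock's RequestTag):

    #include <cstdint>

    // Hedged sketch of the ReqTagInfo structure described above; field
    // names follow this commit message, but the exact types may differ.
    struct ReqTagInfoSketch {
      // Latest tag calculated for this client across all queues on the
      // server; stands in for the RequestTag the real structure carries.
      // Read and written by the server via the dynamic tag info functions.
      double last_tag = 0.0;

      // Number of requests handled by other queues on the server before
      // the request on a given queue; written by the dmClock client and
      // read by the server to calculate the next accurate tag.
      uint64_t last_tick_interval = 0;
    };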

Signed-off-by: Sridhar Seshasayee <[email protected]>
The groundwork to set up dynamic tag info is similar to that for dynamic
client info, with a key difference: dynamic client info relies on only
one function to enable clients to update the client info parameters,
whereas dynamic tag info (via is_dynamic_tag_info_f) relies on the two
functions outlined below to enable both the client and the server to
update and retrieve the information needed to calculate the next accurate
tag. The client of dmClock is expected to maintain a map of clientId to
ReqTagInfo, similar to ClientInfo.

The two dynamic tag info functions are:

 1) reqtag_updt_f: The server uses this function to update the latest
    calculated tag to the dmClock client. This is called whenever a new
    tag is calculated, either as part of initial_tag() or
    update_next_tag(), depending on the type of tag calculation employed.

    The client on the other hand updates the 'last_tick_interval' in the
    map just before adding a new request to the queue.

 2) reqtag_info_f: The server uses this function to get the latest tag
    for a clientId by reading the 'last_tag' and the latest tick interval
    (for DelayedTagCalc) via 'last_tick_interval'. This information is
    used to calculate the next accurate tag.
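
A hedged sketch of how these two callbacks could be shaped is shown below; the template parameters and exact signatures are illustrative assumptions and may differ from the actual PR code:

    #include <functional>

    // Hedged sketch of the two dynamic tag info callbacks for a client id
    // type C, a tag type TagT and a ReqTagInfo-like type ReqTagInfoT.
    template <typename C, typename TagT, typename ReqTagInfoT>
    struct DynamicTagInfoFuncsSketch {
      // Server -> client: publish the latest calculated tag for client_id,
      // called after initial_tag() / update_next_tag().
      std::function<void(const C& client_id, const TagT& tag)> reqtag_updt_f;

      // Server <- client: fetch the shared ReqTagInfo (last_tag and
      // last_tick_interval) for client_id before calculating the next tag.
      std::function<const ReqTagInfoT*(const C& client_id)> reqtag_info_f;
    };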

To enable the above functionality, additional PriorityQueueBase and
PullPriorityQueue constructors are defined for dmClock clients to
appropriately set up the dynamic functions and the associated data
structures. A bool U2 (false by default) is used to tell the server to
enable dynamic tag info functionality. This is propagated accordingly to
both the PullPriorityQueue and PushPriorityQueue constructors.

However, it is important to note that the dynamic tag info functionality
is currently only implemented for PullPriorityQueue.

Signed-off-by: Sridhar Seshasayee <[email protected]>
@sseshasa force-pushed the wip-fix-multiple-queue-tag-fix branch from d11cbbe to 9095b48 on November 25, 2024 at 13:48
@sseshasa changed the title from "Wip fix multiple queue tag fix" to "Implement dynamic tag info functionality for accurate tag calculation with multiple queues configured per server" on Nov 25, 2024
@sseshasa force-pushed the wip-fix-multiple-queue-tag-fix branch from 9095b48 to 4ef01f6 on November 25, 2024 at 15:36
@sseshasa marked this pull request as ready for review on November 25, 2024 at 15:37
@sseshasa (Contributor, Author) commented Nov 25, 2024

@athanatos @ivancich @rzarzynski @neha-ojha Please take time to review this PR. I know most of you will be out for Cephalocon. In the meantime, I am hoping to get the changes on the Ceph side ready and get some real tests going. Thanks!

@sseshasa force-pushed the wip-fix-multiple-queue-tag-fix branch from 4ef01f6 to 08bb47b on November 26, 2024 at 05:36
@sseshasa force-pushed the wip-fix-multiple-queue-tag-fix branch from 08bb47b to a68aa26 on December 3, 2024 at 07:51
Similar to ClientInfo, a pointer (tag_info) to ReqTagInfo is maintained
within the ClientRec structure and read/written as appropriate via the
dynamic tag info functions.

The description of the fix is better understood with the following
example. The fix is applicable to both DelayedTagCalc and ImmediateTagCalc.
Note that for ImmediateTagCalc, the tick_interval is not used, since the
tag is always calculated before the request is added to the queue and
therefore only the last_tag from ReqTagInfo is necessary. In the case of
ImmediateTagCalc, tick_interval will always be 0.
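
For the ImmediateTagCalc case just described, the calculation reduces to the following hedged sketch (illustrative only, not the actual PR code):

    #include <algorithm>

    // Hedged sketch of the ImmediateTagCalc case: the tag is computed at
    // enqueue time, so only the shared last_tag matters and the tick
    // interval is effectively always 0.
    double immediate_tag_sketch(double last_tag, double x, double arr_time) {
      return std::max(last_tag + 1.0 / x, arr_time);
    }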

Consider two queues configured on the same server with items added in the
following sequence. Consider that DelayedTagCalc is enabled:

   Enqueue--->|T5|T2|T0|--->Dequeue
                   Queue0

   Enqueue--->|T4|T3|T1|--->Dequeue
                   Queue1

Consider the request arrival times in each queue according
to the number associated with each tag, for example:

  - Req0 arrives on queue0 at time t0
  - Req1 arrives on queue1 at time t1
  - Req2 arrives on queue0 at time t2
and so on, with the final Req5 arriving on queue0 at time t5.

Pre-conditions:
---------------
- Dynamic tag info functionality (bool U2 in the pull constructor)
  is enabled (is_dynamic_tag_info_f).
- The dmClock client maintains a mapping of clientId to ReqTagInfo for
  each client.
- The dmClock client registers custom implementations for reqtag_updt_f
  and reqtag_info_f.

Solution:
---------
The fix involves enabling the dynamic tag info for two operations by the
following entities:
 - the server - to update (using reqtag_updt_f) the dmClock client with
   the latest tag for a clientId on a given queue once it is calculated.
   This is called after tags are calculated in initial_tag() and
   update_next_tag().

 - the dmClock client - to calculate and update the tick interval before
   adding a request to the mClock queue. For a given queue, the tick
   interval is the number of requests added to other mClock queues
   before the current request arrives on it. The latest tick interval
   is read by the server using reqtag_info_f before calculating the
   initial_tag. See calc_interval_tag() for how the new tag is
   calculated (a hedged sketch of this flow follows below) and see the
   unit tests for an example of how the client calculates the tick
   interval before adding the request.
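
The server-side flow referred to above can be outlined with the following hedged C++ sketch; all names other than reqtag_info_f, reqtag_updt_f and calc_interval_tag are placeholders, and the actual PR code may differ:

    #include <algorithm>

    // Hedged sketch of the server-side flow for DelayedTagCalc with
    // dynamic tag info enabled; TagInfoSketch stands in for ReqTagInfo.
    struct TagInfoSketch { double last_tag; unsigned last_tick_interval; };

    double initial_tag_sketch(const TagInfoSketch& info,  // via reqtag_info_f
                              double x, double arr_time) {
      if (info.last_tick_interval > 0) {
        // Non-zero tick interval: calculate the tag up front, even with
        // DelayedTagCalc, spaced last_tick_interval/x after last_tag.
        double tag = std::max(info.last_tag + info.last_tick_interval / x,
                              arr_time);
        // ... the server would then publish 'tag' back via reqtag_updt_f ...
        return tag;
      }
      // Zero tick interval: leave the tag to update_next_tag() when the
      // request reaches the front, as DelayedTagCalc normally does.
      return arr_time;  // placeholder only; the real code defers the tag
    }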

The calculation of each tag is shown below with the fix in place.
- Consider 'x' to be any of the client info parameters, i.e. reservation,
  weight or limit
- Consider arr_time to be the arrival time of a request

Tag calcs on Queue0                 | Tag calcs on Queue1
------------------------------------|---------------------------------
tick interval i = 0                 | tick interval i = 1 because req1
                                    | arrived after req0 on Queue0.
T0 = max((NO_TAG + (1/x)), arr_time)| T1 = max((T0 + (i/x)), arr_time)
                                    | [Note: If tick_interval > 0, the tag
                                    |  is calculated as part of
                                    |  initial_tag() for DelayedTagCalc]
--------------------------------------------------------------------------
tick interval i = 1                 | tick interval i = 1
T2 = max((T1 + (i/x)), arr_time)    | T3 = max((T2 + (i/x)), arr_time)
--------------------------------------------------------------------------
tick interval i = 2 because req5    | tick interval i = 0 because req4
arrived two requests after T2.      | follows immediately after req3 on
                                    | the same queue.
T5 = max((T3 + (i/x)), arr_time)    | T4 = max((T3 + (1/x)), arr_time)
                                    | [Note: If tick_interval == 0, the tag
                                    |  is calculated as part of
                                    |  update_next_tag() if DelayedTagCalc
                                    |  is enabled.]

For simplicity, assume reservation and limit are set to 3 IOPS, AtLimit is
set to 'Wait' and all requests arrive within a second on both queues.

With the fix in place, only T0, T1 and T2 will be scheduled during the
first phase, as expected, since each request is spaced 1/res apart, i.e.
1/3rd of a second. The same pattern applies to the rest of the requests
in the queues, thus ensuring accurate scheduling of requests across all
the queues on the same server.

Tests:
------
Several unit tests have been added to exercise the solution with both
delayed and immediate tag calculation. Of particular importance are the
tests that completely randomize queue additions and pulls from the
queues. The tests also provide insight into how the clients calculate
the latest tick interval and how the latest tag is updated by the client
via the dynamic functions. See:
 - pull_reservation_randomize_delydtag
 - pull_reservation_randomize_immtag
 - pull_weight_randomize_delydtag
 - pull_weight_randomize_immtag

Signed-off-by: Sridhar Seshasayee <[email protected]>
@sseshasa force-pushed the wip-fix-multiple-queue-tag-fix branch from a68aa26 to a619a46 on December 4, 2024 at 15:44
@athanatos (Contributor) commented:
In your first example above, why can't we avoid the problem by simply setting per-queue reservation/limit to 1/3 (1/<num_queues>) of the value we want for the whole OSD?

@athanatos (Contributor) commented:
reqtag_updt_f would be updating memory shared between queues, right?

@sseshasa (Contributor, Author) commented Dec 9, 2024

> In your first example above, why can't we avoid the problem by simply setting per-queue reservation/limit to 1/3 (1/<num_queues>) of the value we want for the whole OSD?

This may not work in all scenarios. For example, if only a subset of the queues is active for a workload, the realized IOPS may be lower than what is set by the client.

I recall having tried this on a cluster a while ago but don't remember what the result looked like. I will try it again and get back.

However, I tried this with the randomized unit test without dynamic tag info, and the number of requests dequeued from the client was lower than expected, though not by much. I suspect that on an actual cluster we may observe lower-than-expected IOPS.

Let me run a few tests on a cluster and get back with more details.

@sseshasa (Contributor, Author) commented Dec 9, 2024

> reqtag_updt_f would be updating memory shared between queues, right?

Yes, that's correct. The following block diagram should help clarify things with respect to which entity maintains and creates the associated data structures and dynamic functions. More details will be added to this diagram as things progress:

[Block diagram: dmclock_multiq_solution]

@sseshasa (Contributor, Author) commented Dec 9, 2024

> Let me run a few tests on a cluster and get back with more details.

With a single client, I ran a test using rados bench with your suggestion, i.e., dividing res and lim by the number of shards. The results show that the client is unable to achieve the set limit. Res was set to 125 IOPS, Limit to 625 IOPS, and wgt to 1. On each shard, res and lim were divided by the number of shards. For comparison, I ran the test with a single shard. Here's the outcome:

client QoS: [res:125 wgt:1 lim:625]

With 5 shards & res and limit reduced by num_shards

The average IOPS is off by around 70 IOPS as shown below:

Total time run:         120.086
Total writes made:      66737
Write size:             4096
Object size:            4096
Bandwidth (MB/sec):     2.17087
Stddev Bandwidth:       0.0778568
Max bandwidth (MB/sec): 2.32812
Min bandwidth (MB/sec): 1.92969
Average IOPS:           555
Stddev IOPS:            19.9313
Max IOPS:               596
Min IOPS:               494
Average Latency(s):     0.0287745
Stddev Latency(s):      0.0245614
Max latency(s):         0.137659
Min latency(s):         0.000275006

With 1 OSD shard

Total time run:         120.026
Total writes made:      74925
Write size:             4096
Object size:            4096
Bandwidth (MB/sec):     2.43845
Stddev Bandwidth:       0.0278175
Max bandwidth (MB/sec): 2.46875
Min bandwidth (MB/sec): 2.15625
Average IOPS:           624
Stddev IOPS:            7.12127
Max IOPS:               632
Min IOPS:               552
Average Latency(s):     0.0256237
Stddev Latency(s):      0.0040078
Max latency(s):         0.143611
Min latency(s):         0.000396659
