
Implement dynamic tag info functionality for accurate tag calculation with multiple queues configured per server #92

Open

sseshasa wants to merge 6 commits into master from wip-fix-multiple-queue-tag-fix

Conversation

@sseshasa (Contributor) commented Nov 19, 2024

To better understand the fix, the problem with multiple queues configured per server
needs to be understood first. This problem is currently faced by Ceph (specifically
Ceph OSDs), which are clients of the dmClock server. Ceph creates multiple op queues
per OSD (a.k.a. OSD op queue shards), and each op queue runs independently with the
same server id (the OSD ID). With this configuration, tags are calculated on each
queue independently. Therefore, a client could distribute requests across the multiple
mClock queues, but the server cannot meet the client's QoS settings due to the
following problem:

Problem Description

Consider two queues configured on the same server with
items added in the following sequence:

   Enqueue--->|T5|T2|T0|--->Dequeue
                   Queue0

   Enqueue--->|T4|T3|T1|--->Dequeue
                   Queue1

Consider the request arrival times (arr_time) in each queue according
to the number associated with each tag, for example:

 - Req0 arrives on queue0 at time t0
 - Req1 arrives on queue1 at time t1
 - Req2 arrives on queue0 at time t2

and so on, with the final Req5 arriving on queue0 at time t5.

Consider x to be any of the client info parameters, i.e., reservation, weight or limit.

Tag calcs on Queue0                 | Tag calcs on Queue1
------------------------------------|---------------------------------
T0 = max((NO_TAG + (1/x)), arr_time)| T1 = max((NO_TAG + (1/x)), arr_time)
T2 = max((T0 + (1/x)), arr_time)    | T3 = max((T1 + (1/x)), arr_time)
T5 = max((T2 + (1/x)), arr_time)    | T4 = max((T3 + (1/x)), arr_time)

For simplicity, assume reservation and limit are set to 3 IOPS, AtLimit is
set to 'Wait' and all requests arrive within a few milliseconds on
both queues.

It's clear that all three requests from each queue will be scheduled,
resulting in 6 IOPS, which is not correct.
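For concreteness, the arithmetic above can be reproduced with a minimal, self-contained C++ sketch (illustration only, not dmClock source); it shows each queue independently producing tags spaced 1/3 s apart, so all six requests become eligible within the first second:

    #include <algorithm>
    #include <cstdio>
    #include <initializer_list>

    int main() {
      const double inv_x = 1.0 / 3.0;          // 1/x for res = lim = 3 IOPS
      const double arr[6] = {0.000, 0.001, 0.002, 0.003, 0.004, 0.005};

      // Queue0 holds Req0, Req2, Req5; Queue1 holds Req1, Req3, Req4.
      double q0_tag = 0.0, q1_tag = 0.0;       // NO_TAG treated as 0 here
      for (int r : {0, 2, 5}) {
        q0_tag = std::max(q0_tag + inv_x, arr[r]);
        std::printf("Queue0 T%d = %.3f\n", r, q0_tag);
      }
      for (int r : {1, 3, 4}) {
        q1_tag = std::max(q1_tag + inv_x, arr[r]);
        std::printf("Queue1 T%d = %.3f\n", r, q1_tag);
      }
      // Each queue independently produces tags 0.333, 0.667 and 1.000, so
      // all six requests become eligible within the first second: 6 IOPS
      // instead of the configured 3.
      return 0;
    }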

Proposed Solution

The fix is better understood by applying the solution to the above problem.
The fix is applicable to both DelayedTagCalc and ImmediateTagCalc.

Pre-conditions:

  • Dynamic tag info functionality (bool U2 in the pull constructor) is enabled (is_dynamic_tag_info_f).
  • The dmClock client maintains a mapping of clientId to ReqTagInfo for each client.
  • The dmClock client registers custom implementations for reqtag_updt_f and reqtag_info_f.

Solution:

The fix involves enabling the dynamic tag info for two operations by the following entities:

  1. the server - updates the dmClock client (using reqtag_updt_f) with the latest tag for a clientId on a given queue once it is calculated. This is called after tags are calculated in initial_tag() and update_next_tag().

  2. the dmClock client - calculates and updates the tick interval before adding a request to the mClock queue. For a given queue, the tick interval is the number of requests added to other mClock queues before the current request arrives on it. The latest tick interval is read by the server using reqtag_info_f before calculating the initial_tag(). See calc_interval_tag() for how the new tag is calculated (a hedged sketch of this calculation follows this list), and see the unit tests for how the client calculates the tick interval before adding the request to the queue. It's important to note that for DelayedTagCalc, the tag is calculated as part of initial_tag() ONLY for non-zero tick intervals.
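The interval-based calculation referred to in item 2 can be summarized with the following hedged C++ sketch; the function name and signature are illustrative assumptions, and the actual calc_interval_tag() in this PR may differ:

    #include <algorithm>

    // Hedged sketch of the interval-based calculation described above;
    // last_tag is the most recent tag calculated for this client on any
    // queue (obtained via reqtag_info_f), and tick_interval is the number
    // of requests the client added to other queues since that tag.
    double calc_interval_tag_sketch(double last_tag, unsigned tick_interval,
                                    double x,        // res, wgt or lim
                                    double arr_time) {
      // For DelayedTagCalc this path applies only when tick_interval > 0;
      // with tick_interval == 0 the tag is left to update_next_tag() as
      // usual.
      return std::max(last_tag + static_cast<double>(tick_interval) / x,
                      arr_time);
    }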

The tag calculation for each tag is shown below with the fix in place:

Tag calcs on Queue0                 | Tag calcs on Queue1
------------------------------------|---------------------------------
tick interval i = 0                 | tick interval i = 1 because req1
                                    | arrived after req0 on Queue0.
T0 = max((NO_TAG + (1/x)), arr_time)| T1 = max((T0 + (i/x)), arr_time)
                                    | [Note: If tick_interval > 0, the tag
                                    |  is calculated as part of
                                    |  initial_tag() for DelayedTagCalc]
--------------------------------------------------------------------------
tick interval i = 1                 | tick interval i = 1
T2 = max((T1 + (i/x)), arr_time)    | T3 = max((T2 + (i/x)), arr_time)
--------------------------------------------------------------------------
tick interval i = 2 because req5    | tick interval i = 0 because req4
arrived two requests after T2.      | follows immediately after req3 on
                                    | the same queue.
T5 = max((T3 + (i/x)), arr_time)    | T4 = max((T3 + (1/x)), arr_time)
                                    | [Note: If tick_interval == 0, the tag
                                    |  is calculated only when it gets to
                                    |  the front as part of
                                    |  update_next_tag(), as usual, when
                                    |  DelayedTagCalc is enabled.]

For simplicity, assume reservation and limit are set to 3 IOPS, AtLimit is set to 'Wait' and all requests arrive within a second on both queues.

With the fix in place, only T0, T1 and T2 will be scheduled during the first phase, as expected, since each request is spaced 1/res apart, i.e., 1/3rd of a second. The same pattern applies to the rest of the requests in the queues, thus ensuring accurate scheduling of requests across all the queues on the same server.
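A minimal C++ sketch of the corrected arithmetic (illustration only; the bases and tick intervals are taken directly from the table above, with arrival times near zero):

    #include <algorithm>
    #include <cstdio>

    int main() {
      const double inv_x = 1.0 / 3.0;              // 1/x, res = lim = 3 IOPS
      const double arr = 0.0;                      // arrivals near t = 0
      // Follow the table above: each tag is max(base + i/x, arr_time).
      double T0 = std::max(0.0 + 1 * inv_x, arr);  // NO_TAG base, 1/x
      double T1 = std::max(T0  + 1 * inv_x, arr);  // i = 1
      double T2 = std::max(T1  + 1 * inv_x, arr);  // i = 1
      double T3 = std::max(T2  + 1 * inv_x, arr);  // i = 1
      double T4 = std::max(T3  + 1 * inv_x, arr);  // i = 0, usual 1/x path
      double T5 = std::max(T3  + 2 * inv_x, arr);  // i = 2
      std::printf("%.3f %.3f %.3f %.3f %.3f %.3f\n", T0, T1, T2, T3, T4, T5);
      // Prints 0.333 0.667 1.000 1.333 1.667 2.000: one eligible request
      // per 1/3 s across both queues, i.e. 3 IOPS, matching the reservation.
      return 0;
    }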

Tests:
Several unit tests have been added to exercise the solution with both delayed and immediate tag calculation. Of particular importance are the tests that completely randomize queue additions and pulls from multiple (five) queues. The tests also provide insight into how the clients calculate the latest tick interval and how the latest tag is updated by the client via the dynamic functions. See:
- pull_reservation_randomize_delydtag
- pull_reservation_randomize_immtag
- pull_weight_randomize_delydtag
- pull_weight_randomize_immtag

Signed-off-by: Sridhar Seshasayee [email protected]

Include <cstdint> to resolve compilation failures such as:
dmclock/src/dmclock_recs.h:26:21: error: ‘uint64_t’ does not name a type
   26 |     using Counter = uint64_t;
      |                     ^~~~~~~~

Signed-off-by: Sridhar Seshasayee <[email protected]>
Add a couple of tests using immediate and delayed tag calculation to
demonstrate the inaccurate QoS provided to a client when multiple queues
are configured per server. The requests from the client are distributed
across multiple queues. Since the queues are independent, the tag
calculations are also independent, and this results in inaccurate QoS
provided to the client, as the tests show. The following scenario
outlines the problem with multiple queues per server.

Tag Calculation with Single Queue:
----------------------------------

 Enqueue--->|T5|T4|T3|T2|T1|T0|--->Dequeue
         Tags for items in Single Queue

  Where, Tn = Tag value at time = n

For simplicity, consider x to be any of the
client info parameters, i.e., reservation, weight or limit.

1. Tag calculation for T1 in the queue would look like:

    T1 = max((T0 + (1/x)), arr_time)
    where, T0 - Tag at time 0
           x  - client info [res|wgt|lim], and,
           arr_time - Arrival time of the request associated with the tag

2. Similarly, Tag calculation for T5:

    T5 = max((T4 + (1/x)), arr_time)

As seen above, there is no issue when a single mClock
queue is in operation. For all the requests, the tags are
calculated based on their arrival times and in sequence.
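The recurrence above can be written as a small illustrative helper (a sketch only, not dmClock source):

    #include <algorithm>

    // Sketch of the single-queue recurrence described above: each new tag
    // is spaced 1/x after the previous one, but never earlier than the
    // request's arrival time.
    double next_tag(double prev_tag, double x, double arr_time) {
      return std::max(prev_tag + 1.0 / x, arr_time);
    }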

Tag Calculation with Multiple Queues per Server:
-----------------------------------------------
Consider two queues configured on the same server with
items added in the following sequence:

   Enqueue--->|T5|T2|T0|--->Dequeue
                   Queue0

   Enqueue--->|T4|T3|T1|--->Dequeue
                   Queue1

Consider the request arrival times in each queue according
to the number associated with each tag, for example:

 - Req0 arrives on queue0 at time t0
 - Req1 arrives on queue1 at time t1
 - Req2 arrives on queue0 at time t2
and so on, with the final Req5 arriving on queue0 at time t5.

Tag calcs on Queue0                 | Tag calcs on Queue1
------------------------------------|---------------------------------
T0 = max((NO_TAG + (1/x)), arr_time)| T1 = max((NO_TAG + (1/x)), arr_time)
T2 = max((T0 + (1/x)), arr_time)    | T3 = max((T1 + (1/x)), arr_time)
T5 = max((T2 + (1/x)), arr_time)    | T4 = max((T3 + (1/x)), arr_time)

For simplicity, assume reservation and limit are set to 3 IOPS, AtLimit is
set to 'Wait' and all requests arrive well within a few milliseconds on
both queues.

It's clear that all three requests from each queue will be scheduled,
resulting in 6 IOPS, which is not correct.

Signed-off-by: Sridhar Seshasayee <[email protected]>
A function to verify that a tag is valid would be particularly useful
when handling delayed tag calculation involving requests from clients
distributed across multiple queues.

Signed-off-by: Sridhar Seshasayee <[email protected]>
A new structure called ReqTagInfo is introduced to enable clients of
dmClock to share the latest RequestTag and the latest tick interval among
the set of queues spawned per server. The following are the parameters
and their purpose:

 1. last_tag (read/write from the server's standpoint): This parameter
    holds the latest calculated tag across all the queues. It forms the
    base tag for a server to calculate the next tag when a new request
    is added to its queue.

 2. last_tick_interval: The total number of requests handled by other
    queues on a server before the request on a given queue. This is
    updated by dmClock clients and read by the server to calculate the
    next accurate tag.
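
A hedged sketch of what such a structure could look like is shown below; the field names follow this description, while the types are placeholders (the real structure carries dmClock's RequestTag):

    #include <cstdint>

    // Hedged sketch of the ReqTagInfo structure described above; field
    // names follow this commit message, but the exact types may differ.
    struct ReqTagInfoSketch {
      // Latest tag calculated for this client across all queues on the
      // server; stands in for the RequestTag the real structure carries.
      // Read and written by the server via the dynamic tag info functions.
      double last_tag = 0.0;

      // Number of requests handled by other queues on the server before
      // the request on a given queue; written by the dmClock client and
      // read by the server to calculate the next accurate tag.
      uint64_t last_tick_interval = 0;
    };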

Signed-off-by: Sridhar Seshasayee <[email protected]>
The groundwork to set up dynamic tag info is similar to that for dynamic
client info, with a key difference: dynamic client info relies on only
one function to enable clients to update the client info parameters,
whereas dynamic tag info (via is_dynamic_tag_info_f) relies on the two
functions outlined below to enable both the client and the server to
update and retrieve the information needed to calculate the next accurate
tag. The client of dmClock is expected to maintain a map of clientId to
ReqTagInfo, similar to ClientInfo.

The two dynamic tag info functions are:

 1) reqtag_updt_f: The server uses this function to update the latest
    calculated tag to the dmClock client. This is called whenever a new
    tag is calculated, either as part of initial_tag() or
    update_next_tag(), depending on the type of tag calculation employed.

    The client on the other hand updates the 'last_tick_interval' in the
    map just before adding a new request to the queue.

 2) reqtag_info_f: The server uses this function to get the latest tag
    for a clientId by reading the 'last_tag' and the latest tick interval
    (for DelayedTagCalc) via 'last_tick_interval'. This information is
    used to calculate the next accurate tag.
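
A hedged sketch of how these two callbacks could be shaped is shown below; the template parameters and exact signatures are illustrative assumptions and may differ from the actual PR code:

    #include <functional>

    // Hedged sketch of the two dynamic tag info callbacks for a client id
    // type C, a tag type TagT and a ReqTagInfo-like type ReqTagInfoT.
    template <typename C, typename TagT, typename ReqTagInfoT>
    struct DynamicTagInfoFuncsSketch {
      // Server -> client: publish the latest calculated tag for client_id,
      // called after initial_tag() / update_next_tag().
      std::function<void(const C& client_id, const TagT& tag)> reqtag_updt_f;

      // Server <- client: fetch the shared ReqTagInfo (last_tag and
      // last_tick_interval) for client_id before calculating the next tag.
      std::function<const ReqTagInfoT*(const C& client_id)> reqtag_info_f;
    };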

To enable the above functionality, additional PriorityQueueBase and
PullPriorityQueue constructors are defined for dmClock clients to
appropriately set up the dynamic functions and the associated data
structures. A bool U2 (false by default) is used to tell the server to
enable dynamic tag info functionality. This is propagated accordingly to
both the PullPriorityQueue and PushPriorityQueue constructors.

However, it is important to note that the dynamic tag info functionality
is currently only implemented for PullPriorityQueue.

Signed-off-by: Sridhar Seshasayee <[email protected]>
@sseshasa force-pushed the wip-fix-multiple-queue-tag-fix branch from d11cbbe to 9095b48 on November 25, 2024 at 13:48
@sseshasa changed the title from "Wip fix multiple queue tag fix" to "Implement dynamic tag info functionality for accurate tag calculation with multiple queues configured per server" on Nov 25, 2024
@sseshasa force-pushed the wip-fix-multiple-queue-tag-fix branch from 9095b48 to 4ef01f6 on November 25, 2024 at 15:36
@sseshasa marked this pull request as ready for review on November 25, 2024 at 15:37
@sseshasa (Contributor, Author) commented Nov 25, 2024

@athanatos @ivancich @rzarzynski @neha-ojha Please take time to review this PR. I know most of you will be out for Cephalocon. In the meantime, I am hoping to get the changes on the Ceph side ready and get some real tests going. Thanks!

@sseshasa force-pushed the wip-fix-multiple-queue-tag-fix branch from 4ef01f6 to 08bb47b on November 26, 2024 at 05:36
@sseshasa force-pushed the wip-fix-multiple-queue-tag-fix branch from 08bb47b to a68aa26 on December 3, 2024 at 07:51
Similar to ClientInfo, a pointer (tag_info) to ReqTagInfo is maintained
within the ClientRec structure and read/written as appropriate via the
dynamic tag info functions.

The description of the fix is better understood with the following
example. The fix is applicable to both DelayedTagCalc and ImmediateTagCalc.
Note that for ImmediateTagCalc, the tick_interval is not used, since the
tag is always calculated before the request is added to the queue and
therefore only the last_tag from ReqTagInfo is necessary. In the case of
ImmediateTagCalc, tick_interval will always be 0.
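
For the ImmediateTagCalc case just described, the calculation reduces to the following hedged sketch (illustrative only, not the actual PR code):

    #include <algorithm>

    // Hedged sketch of the ImmediateTagCalc case: the tag is computed at
    // enqueue time, so only the shared last_tag matters and the tick
    // interval is effectively always 0.
    double immediate_tag_sketch(double last_tag, double x, double arr_time) {
      return std::max(last_tag + 1.0 / x, arr_time);
    }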

Consider two queues configured on the same server with items added in the
following sequence. Consider that DelayedTagCalc is enabled:

   Enqueue--->|T5|T2|T0|--->Dequeue
                   Queue0

   Enqueue--->|T4|T3|T1|--->Dequeue
                   Queue1

Consider the request arrival times in each queue according
to the number associated with each tag, for example:

  - Req0 arrives on queue0 at time t0
  - Req1 arrives on queue1 at time t1
  - Req2 arrives on queue0 at time t2
and so on, with the final Req5 arriving on queue0 at time t5.

Pre-conditions:
---------------
- Dynamic tag info functionality (bool U2 in the pull constructor)
  is enabled (is_dynamic_tag_info_f).
- The dmClock client maintains a mapping of clientId to ReqTagInfo for
  each client.
- The dmClock client registers custom implementations for reqtag_updt_f
  and reqtag_info_f.

Solution:
---------
The fix involves enabling the dynamic tag info for two operations by the
following entities:
 - the server - to update (using reqtag_updt_f) the dmClock client with
   the latest tag for a clientId on a given queue once it is calculated.
   This is called after tags are calculated in initial_tag() and
   update_next_tag().

 - the dmClock client - to calculate and update the tick interval before
   adding a request to the mClock queue. For a given queue, the tick
   interval is the number of requests added to other mClock queues
   before the current request arrives on it. The latest tick interval
   is read by the server using reqtag_info_f before calculating the
   initial_tag. See calc_interval_tag() for how the new tag is
   calculated (a hedged sketch of this flow follows below) and see the
   unit tests for an example of how the client calculates the tick
   interval before adding the request.
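
The server-side flow referred to above can be outlined with the following hedged C++ sketch; all names other than reqtag_info_f, reqtag_updt_f and calc_interval_tag are placeholders, and the actual PR code may differ:

    #include <algorithm>

    // Hedged sketch of the server-side flow for DelayedTagCalc with
    // dynamic tag info enabled; TagInfoSketch stands in for ReqTagInfo.
    struct TagInfoSketch { double last_tag; unsigned last_tick_interval; };

    double initial_tag_sketch(const TagInfoSketch& info,  // via reqtag_info_f
                              double x, double arr_time) {
      if (info.last_tick_interval > 0) {
        // Non-zero tick interval: calculate the tag up front, even with
        // DelayedTagCalc, spaced last_tick_interval/x after last_tag.
        double tag = std::max(info.last_tag + info.last_tick_interval / x,
                              arr_time);
        // ... the server would then publish 'tag' back via reqtag_updt_f ...
        return tag;
      }
      // Zero tick interval: leave the tag to update_next_tag() when the
      // request reaches the front, as DelayedTagCalc normally does.
      return arr_time;  // placeholder only; the real code defers the tag
    }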

The calculation of each tag is shown below with the fix in place.
- Consider 'x' to be any of the client info parameters, i.e. reservation,
  weight or limit
- Consider arr_time to be the arrival time of a request

Tag calcs on Queue0                 | Tag calcs on Queue1
------------------------------------|---------------------------------
tick interval i = 0                 | tick interval i = 1 because req1
                                    | arrived after req0 on Queue0.
T0 = max((NO_TAG + (1/x)), arr_time)| T1 = max((T0 + (i/x)), arr_time)
                                    | [Note: If tick_interval > 0, the tag
                                    |  is calculated as part of
                                    |  initial_tag() for DelayedTagCalc]
--------------------------------------------------------------------------
tick interval i = 1                 | tick interval i = 1
T2 = max((T1 + (i/x)), arr_time)    | T3 = max((T2 + (i/x)), arr_time)
--------------------------------------------------------------------------
tick interval i = 2 because req5    | tick interval i = 0 because req4
arrived two requests after T2.      | follows immediately after req3 on
                                    | the same queue.
T5 = max((T3 + (i/x)), arr_time)    | T4 = max((T3 + (1/x)), arr_time)
                                    | [Note: If tick_interval == 0, the tag
                                    |  is calculated as part of
                                    |  update_next_tag() if DelayedTagCalc
                                    |  is enabled.]

For simplicity, assume reservation and limit are set to 3 IOPS, AtLimit is
set to 'Wait' and all requests arrive within a second on both queues.

With the fix in place, only T0, T1 and T2 will be scheduled during the
first phase, as expected, since each request is spaced 1/res apart, i.e.
1/3rd of a second. The same pattern applies to the rest of the requests
in the queues, thus ensuring accurate scheduling of requests across all
the queues on the same server.

Tests:
------
Several unit tests have been added to exercise the solution with both
delayed and immediate tag calculation. Of particular importance are the
tests that completely randomize queue additions and pulls from the
queues. The tests also provide insight into how the clients calculate
the latest tick interval and how the latest tag is updated by the client
via the dynamic functions. See:
 - pull_reservation_randomize_delydtag
 - pull_reservation_randomize_immtag
 - pull_weight_randomize_delydtag
 - pull_weight_randomize_immtag

Signed-off-by: Sridhar Seshasayee <[email protected]>
@sseshasa force-pushed the wip-fix-multiple-queue-tag-fix branch from a68aa26 to a619a46 on December 4, 2024 at 15:44
@athanatos (Contributor) commented:
In your first example above, why can't we avoid the problem by simply setting per-queue reservation/limit to 1/3 (1/<num_queues>) of the value we want for the whole OSD?

@athanatos (Contributor) commented:
reqtag_updt_f would be updating memory shared between queues, right?

@sseshasa (Contributor, Author) commented Dec 9, 2024

> In your first example above, why can't we avoid the problem by simply setting per-queue reservation/limit to 1/3 (1/<num_queues>) of the value we want for the whole OSD?

This may not work in all scenarios. For example, if only a subset of the queues is active for a workload, the realized IOPS may be lower than what is set by the client.

I recall having tried this on a cluster a while ago but don't remember what the result looked like. I will try it again and get back.

However, I tried this with the randomized unit test without dynamic tag info, and the number of requests dequeued from the client was lower than expected, though not by much. I suspect that on an actual cluster we may observe lower-than-expected IOPS.

Let me run a few tests on a cluster and get back with more details.

@sseshasa (Contributor, Author) commented Dec 9, 2024

> reqtag_updt_f would be updating memory shared between queues, right?

Yes, that's correct. The following block diagram should help clarify things with respect to which entity maintains and creates the associated data structures and dynamic functions. More details will be added to this diagram as things progress:

[Block diagram: dmclock_multiq_solution]

@sseshasa (Contributor, Author) commented Dec 9, 2024

> Let me run a few tests on a cluster and get back with more details.

With a single client, I ran a test using rados bench with your suggestion, i.e., dividing res and lim by the number of shards. The results show that the client is unable to achieve the set limit. Res was set to 125 IOPS, Limit to 625 IOPS, and wgt to 1. On each shard, res and lim were divided by the number of shards. For comparison, I ran the test with a single shard. Here's the outcome:

client QoS: [res:125 wgt:1 lim:625]

With 5 shards & res and limit reduced by num_shards

The average IOPS is off by around 70 IOPS as shown below:

Total time run:         120.086
Total writes made:      66737
Write size:             4096
Object size:            4096
Bandwidth (MB/sec):     2.17087
Stddev Bandwidth:       0.0778568
Max bandwidth (MB/sec): 2.32812
Min bandwidth (MB/sec): 1.92969
Average IOPS:           555
Stddev IOPS:            19.9313
Max IOPS:               596
Min IOPS:               494
Average Latency(s):     0.0287745
Stddev Latency(s):      0.0245614
Max latency(s):         0.137659
Min latency(s):         0.000275006

With 1 OSD shard

Total time run:         120.026
Total writes made:      74925
Write size:             4096
Object size:            4096
Bandwidth (MB/sec):     2.43845
Stddev Bandwidth:       0.0278175
Max bandwidth (MB/sec): 2.46875
Min bandwidth (MB/sec): 2.15625
Average IOPS:           624
Stddev IOPS:            7.12127
Max IOPS:               632
Min IOPS:               552
Average Latency(s):     0.0256237
Stddev Latency(s):      0.0040078
Max latency(s):         0.143611
Min latency(s):         0.000396659
