Implement dynamic tag info functionality for accurate tag calculation with multiple queues configured per server #92
base: master
Conversation
Include <cstdint> to resolve compilation failures such as:

dmclock/src/dmclock_recs.h:26:21: error: 'uint64_t' does not name a type
   26 |   using Counter = uint64_t;
      |                   ^~~~~~~~

Signed-off-by: Sridhar Seshasayee <[email protected]>
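A minimal sketch of the kind of change this commit makes, based on the error above (the alias comes from dmclock_recs.h):

```cpp
// dmclock_recs.h (sketch): include <cstdint> explicitly so fixed-width integer
// aliases such as uint64_t resolve without relying on transitive includes.
#include <cstdint>

using Counter = uint64_t;
```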
Add a couple of tests using immediate and delayed tag calculation to demonstrate the inaccurate QoS provided to a client when multiple queues are configured per server. The requests from the client are distributed across multiple queues. Since the queues are independent, the tag calculations are also independent, and this results in inaccurate QoS provided to the client, as the tests show. The following scenario outlines the problem with multiple queues per server.

Tag Calculation with Single Queue:
----------------------------------

Enqueue--->|T5|T4|T3|T2|T1|T0|--->Dequeue
        Tags for items in Single Queue

Where, Tn = Tag value at time = n

For simplicity, consider x to be any of the client info parameters, i.e. reservation, weight or limit.

1. Tag calculation for T1 in the queue would look like:
   T1 = max((T0 + (1/x)), arr_time)
   where,
     T0       - Tag at time 0
     x        - client info [res|wgt|lim], and,
     arr_time - Arrival time of the request in the tag

2. Similarly, tag calculation for T5:
   T5 = max((T4 + (1/x)), arr_time)

As seen above, there is no issue when a single mClock queue is in operation. For all the requests, the tags are calculated based on their arrival times and in sequence.

Tag Calculation with Multiple Queues per Server:
------------------------------------------------

Consider two queues configured on the same server with items added in the following sequence:

Enqueue--->|T5|T2|T0|--->Dequeue   Queue0
Enqueue--->|T4|T3|T1|--->Dequeue   Queue1

Consider the request arrival times in each queue according to the number associated with each tag, for example:
- Req0 arrives on queue0 at time t0
- Req1 arrives on queue1 at time t1
- Req2 arrives on queue0 at time t2

and so on with the final Req5 arriving on queue0 at time t5.

Tag calcs on Queue0                 | Tag calcs on Queue1
------------------------------------|---------------------------------
T0 = max((NO_TAG + (1/x)), arr_time)| T1 = max((NO_TAG + (1/x)), arr_time)
T2 = max((T0 + (1/x)), arr_time)    | T3 = max((T1 + (1/x)), arr_time)
T5 = max((T2 + (1/x)), arr_time)    | T4 = max((T3 + (1/x)), arr_time)

For simplicity, assume reservation and limit are set to 3 IOPS, AtLimit is set to 'Wait' and all requests arrived well within a few milliseconds on both queues. It's clear that all 3 requests from each queue will be scheduled, resulting in 6 IOPS, which is not correct.

Signed-off-by: Sridhar Seshasayee <[email protected]>
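A minimal sketch (not the dmclock implementation) of the per-queue tag update described above. With a single queue each tag builds on the previous one; with independent queues, each queue starts from its own previous tag, which is what over-schedules the client:

```cpp
#include <algorithm>

// x is the relevant client info parameter: reservation, weight or limit.
// prev_tag is the previous tag on *this* queue (NO_TAG/0 for the first request).
inline double next_tag(double prev_tag, double x, double arr_time) {
  return std::max(prev_tag + 1.0 / x, arr_time);
}

// With two independent queues and res = lim = 3 IOPS, each queue spaces its own
// tags 1/3 s apart, so the client is effectively granted ~3 IOPS per queue
// (6 IOPS total) instead of 3 IOPS overall.
```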
A function to verify that a tag is valid would be particularly useful when handling delayed tag calculation involving requests from clients distributed across multiple queues.

Signed-off-by: Sridhar Seshasayee <[email protected]>
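A minimal sketch of what such a check could look like, assuming (as under delayed tag calculation) that an uncomputed tag is represented by a sentinel value; the helper name and sentinel are illustrative, not dmclock's actual API:

```cpp
#include <cmath>

// Illustrative sentinel for "tag not yet calculated" (an assumption here).
constexpr double uninitialized_tag = 0.0;

// True only when the tag has actually been calculated; useful under
// DelayedTagCalc, where a request may be queued before its tag is assigned.
inline bool tag_is_valid(double tag) {
  return std::isfinite(tag) && tag != uninitialized_tag;
}
```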
A new structure called ReqTagInfo is introduced to enable clients of dmClock to share the latest RequestTag and the latest tick interval among the set of queues spawned per server. The parameters and their purpose are:

1. last_tag: (read/write) This parameter holds the latest calculated tag across all the queues. It forms the base tag for a server to calculate the next tag when a new request is added to its queue. This parameter is both read and written from a server standpoint.

2. last_tick_interval: The total number of requests handled by other queues on a server before the request on a given queue. This is updated by dmClock clients and read by the server to calculate the next accurate tag.

Signed-off-by: Sridhar Seshasayee <[email protected]>
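A rough sketch of ReqTagInfo as described above; in the actual patch the tag is a dmclock RequestTag, simplified here to a double, and the field types are assumptions:

```cpp
#include <cstdint>

struct ReqTagInfo {
  // last_tag: latest calculated tag across all queues of a server; read and
  // written by the server and used as the base for the next tag calculation.
  double last_tag = 0.0;

  // last_tick_interval: number of requests handled by other queues of the same
  // server before the request on a given queue; updated by the dmClock client
  // and read by the server to calculate the next accurate tag.
  uint32_t last_tick_interval = 0;
};
```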
The groundwork to set up dynamic tag info is similar to dynamic client info, with a key difference. Dynamic client info relies on only one function to let clients update the client info parameters. Dynamic tag info, via is_dynamic_tag_info_f, relies on two functions outlined below to enable both the client and the server to update and fetch the information needed to calculate the next accurate tag. The dmClock client is expected to maintain a map of clientId to ReqTagInfo, similar to ClientInfo.

The two dynamic tag info functions are:

1) reqtag_updt_f: The server uses this function to report the latest calculated tag to the dmClock client. This is called whenever a new tag is calculated, either as part of initial_tag() or update_next_tag(), depending on the type of tag calculation employed. The client, on the other hand, updates 'last_tick_interval' in the map just before adding a new request to the queue.

2) reqtag_info_f: The server uses this function to get the latest tag for a clientId by reading 'last_tag', and the latest tick interval (for DelayedTagCalc) via 'last_tick_interval'. This information is used to calculate the next accurate tag.

To enable the above functionality, additional PriorityQueueBase and PullPriorityQueue constructors are defined for dmClock clients to appropriately set up the dynamic functions and the associated data structures. A bool U2 (false by default) is used to tell the server to enable the dynamic tag info functionality. This is propagated accordingly to both the PullPriorityQueue and PushPriorityQueue constructors. However, it is important to note that the dynamic tag info functionality is currently only implemented for PullPriorityQueue.

Signed-off-by: Sridhar Seshasayee <[email protected]>
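A hedged sketch of the shape of the two dynamic functions and the client-side map they operate on; the signatures below are modeled on the description and are assumptions, not the exact declarations in the patch:

```cpp
#include <cstdint>
#include <functional>
#include <map>
#include <string>

using ClientId = std::string;  // illustrative; dmclock templates the client id

struct ReqTagInfo {            // as sketched above (tag simplified to double)
  double last_tag = 0.0;
  uint32_t last_tick_interval = 0;
};

// reqtag_updt_f: invoked by the server after initial_tag()/update_next_tag()
// to publish the latest calculated tag for a client back to the dmClock client.
using ReqTagUpdateFunc = std::function<void(const ClientId&, double new_tag)>;

// reqtag_info_f: invoked by the server to read the latest tag and, for
// DelayedTagCalc, the latest tick interval before calculating the next tag.
using ReqTagInfoFunc = std::function<ReqTagInfo(const ClientId&)>;

// The dmClock client keeps one such map per server, shared by all of that
// server's queues, similar to how dynamic client info is handled.
std::map<ClientId, ReqTagInfo> client_req_tag_map;
```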
@athanatos @ivancich @rzarzynski @neha-ojha Please take time to review this PR. I know most of you will be out for Cephalocon. In the meantime, I am hoping to get the changes on the Ceph side ready and get some real tests going. Thanks!
Similar to ClientInfo, a pointer (tag_info) to ReqTagInfo is maintained within the ClientRec structure and read/written as appropriate via the dynamic tag info functions.

The fix is better understood with the following example. The fix is applicable to both DelayedTagCalc and ImmediateTagCalc. Note that for ImmediateTagCalc, the tick_interval is not used since the tag is always calculated before the request is added to the queue, and therefore only the last_tag from ReqTagInfo is necessary. Therefore, in the case of ImmediateTagCalc, tick_interval will always be 0.

Consider two queues configured on the same server with items added in the following sequence. Consider that DelayedTagCalc is enabled:

Enqueue--->|T5|T2|T0|--->Dequeue   Queue0
Enqueue--->|T4|T3|T1|--->Dequeue   Queue1

Consider the request arrival times in each queue according to the number associated with each tag, for example:
- Req0 arrives on queue0 at time t0
- Req1 arrives on queue1 at time t1
- Req2 arrives on queue0 at time t2

and so on with the final Req5 arriving on queue0 at time t5.

Pre-conditions:
---------------
- Dynamic tag info functionality (bool U2 in the pull constructor) is enabled (is_dynamic_tag_info_f).
- The dmClock client maintains a mapping of clientId to ReqTagInfo for each client.
- The dmClock client registers custom implementations for reqtag_updt_f and reqtag_info_f.

Solution:
---------
The fix involves enabling the dynamic tag info for two operations by the following entities:
- the server - to update (using reqtag_updt_f) the dmClock client with the latest tag for a clientId on a given queue once it's calculated. This is called after tags are calculated in initial_tag() and update_next_tag().
- the dmClock client - to calculate and update the tick interval before adding a request to the mClock queue. For a given queue, the tick interval is the number of request(s) added to other mClock queues before the current one arrives on it. The latest tick interval is read by the server using reqtag_info_f before calculating the initial tag.

See calc_interval_tag() for how the new tag is calculated, and see the unit tests for an example of how the client calculates the tick interval before adding the request.

The calculation of each tag is shown below with the fix in place.
- Consider 'x' to be any of the client info parameters, i.e. reservation, weight or limit
- Consider arr_time to be the arrival time of a request

Tag calcs on Queue0                 | Tag calcs on Queue1
------------------------------------|---------------------------------
tick interval i = 0                 | tick interval i = 1 because req1
                                    | arrived after req0 on Queue0.
T0 = max((NO_TAG + (1/x)), arr_time)| T1 = max((T0 + (i/x)), arr_time)
                                    | [Note: If tick_interval > 0, the tag
                                    | is calculated as part of
                                    | initial_tag() for DelayedTagCalc]
--------------------------------------------------------------------------
tick interval i = 1                 | tick interval i = 1
T2 = max((T1 + (i/x)), arr_time)    | T3 = max((T2 + (i/x)), arr_time)
--------------------------------------------------------------------------
tick interval i = 2 because req5    | tick interval i = 0 because req4
arrived two requests after T2.      | follows immediately after req3 on
                                    | the same queue.
T5 = max((T3 + (i/x)), arr_time)    | T4 = max((T3 + (1/x)), arr_time)
                                    | [Note: If tick_interval == 0, the tag
                                    | is calculated as part of
                                    | update_next_tag() if DelayedTagCalc
                                    | is enabled.]
For simplicity, assume reservation and limit are set to 3 IOPS, AtLimit is set to 'Wait' and all requests arrived within a second on both queues. With the fix in place, only T0, T1 and T2 will be scheduled as expected during the first phase, since each request is spaced 1/res apart, i.e. 1/3rd of a second. The same pattern applies for the rest of the requests in the queues, thus ensuring accurate scheduling of requests across all the queues with the same server.

Tests:
------
A number of unit tests have been added to exercise the solution with both delayed and immediate tag calculation. Of particular importance are the tests that completely randomize queue additions and pulls from the queues. The tests also provide insight into how the clients calculate the latest tick interval and how the latest tag is updated by the client via the dynamic functions. See:
- pull_reservation_randomize_delydtag
- pull_reservation_randomize_immtag
- pull_weight_randomize_delydtag
- pull_weight_randomize_immtag

Signed-off-by: Sridhar Seshasayee <[email protected]>
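A minimal sketch of the mechanics this commit describes, assuming a plain double tag and illustrative helper names (the real calc_interval_tag() and the client-side bookkeeping live in the patch and unit tests):

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Server side: with the fix, the next tag builds on the latest tag shared via
// ReqTagInfo (last_tag) and the tick interval i, i.e.
//   T = max(last_tag + i/x, arr_time)
// where x is the relevant client info parameter (res, wgt or lim).
inline double interval_tag(double last_tag, uint32_t tick_interval,
                           double x, double arr_time) {
  return std::max(last_tag + static_cast<double>(tick_interval) / x, arr_time);
}

// Client side: before adding a request to queue q, the tick interval is the
// number of requests added to the *other* queues of the same server since the
// last request on q. One simple way to track this is a server-wide request
// counter plus a per-queue snapshot of it.
struct TickTracker {
  uint64_t total_reqs = 0;          // requests added across all queues
  std::vector<uint64_t> last_seen;  // per-queue snapshot of total_reqs

  explicit TickTracker(size_t num_queues) : last_seen(num_queues, 0) {}

  uint32_t next_interval(size_t queue_idx) {
    auto interval = static_cast<uint32_t>(total_reqs - last_seen[queue_idx]);
    ++total_reqs;
    last_seen[queue_idx] = total_reqs;
    return interval;
  }
};
```

Running the example above through next_interval() reproduces the intervals in the table: 0 for req0, 1 for req1, 1 for req2 and req3, 0 for req4, and 2 for req5.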
In your first example above, why can't we avoid the problem by simply setting per-queue reservation/limit to 1/3 (1/<num_queues>) of the value we want for the whole OSD?
reqtag_updt_f would be updating memory shared between queues, right?
This may not work in all scenarios. For example, if only a subset of queues is active for a workload, the realized IOPS may be lower than what is set by the client. I recall having tried this on a cluster a while ago but don't remember what the result looked like. I will try it again and get back. But I tried this with the randomized unit test without dynamic tag info, and the number of requests dequeued from the client was lower than expected, though not by much. I suspect that on an actual cluster we may observe lower than expected IOPS. Let me run a few tests on a cluster and get back with more details.
Yes, that's correct. The following block diagram should help clarify things with respect to which entity maintains and creates the associated data structures and dynamic functions. More details will be added to this diagram as things progress:
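A minimal sketch of what "memory shared between queues" could look like on the dmClock client side, assuming the clientId to ReqTagInfo map is guarded by a mutex (the synchronization choice and the names here are assumptions, not something this PR specifies):

```cpp
#include <cstdint>
#include <map>
#include <mutex>
#include <string>

using ClientId = std::string;  // illustrative

struct ReqTagInfo {
  double last_tag = 0.0;
  uint32_t last_tick_interval = 0;
};

// One shared map per server; every queue's reqtag_updt_f / reqtag_info_f
// callbacks read and write the same entries, hence the shared-memory question.
class SharedTagInfo {
  std::mutex mtx;
  std::map<ClientId, ReqTagInfo> info;

public:
  void update_tag(const ClientId& c, double tag) {  // used by reqtag_updt_f
    std::lock_guard<std::mutex> l(mtx);
    info[c].last_tag = tag;
  }

  ReqTagInfo get(const ClientId& c) {               // used by reqtag_info_f
    std::lock_guard<std::mutex> l(mtx);
    return info[c];
  }
};
```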
With a single client, I ran a test using rados bench with your suggestion, i.e., divide res and lim by the number of shards. The results show that the client is unable to achieve the set limit. Res was set to 125 IOPS, Limit to 625 IOPS and wgt to 1. On each shard, res and lim were divided by the number of shards. For comparison, I ran the test with a single shard. Here's the outcome:

client QoS: [res:125 wgt:1 lim:625]

With 5 shards and res and limit reduced by num_shards, the average IOPS is off by around 70 IOPS as shown below:

With 1 OSD shard:
To better understand the fix, the problem with multiple queues configured per server needs to be understood first. This problem is currently faced by Ceph (or Ceph OSDs), which are clients of the dmClock server. Ceph creates multiple op queues per OSD (a.k.a. OSD op queue shards), and each op queue runs independently with the same server id (OSD ID). With this configuration, tags are calculated on each queue independently. Therefore, a client could distribute requests across the multiple mClock queues, but the server cannot meet the client's QoS settings due to the following problem:
Problem Description

Consider two queues configured on the same server with items added in the following sequence:

Enqueue--->|T5|T2|T0|--->Dequeue   Queue0
Enqueue--->|T4|T3|T1|--->Dequeue   Queue1

Consider the request arrival times (arr_time) in each queue according to the number associated with each tag, for example:
- Req0 arrives on queue0 at time t0
- Req1 arrives on queue1 at time t1
- Req2 arrives on queue0 at time t2

and so on with the final Req5 arriving on queue0 at time t5.

Consider x to be any of the client info parameters, i.e. reservation, weight or limit. The tags are then calculated independently on each queue:

Tag calcs on Queue0                 | Tag calcs on Queue1
------------------------------------|---------------------------------
T0 = max((NO_TAG + (1/x)), arr_time)| T1 = max((NO_TAG + (1/x)), arr_time)
T2 = max((T0 + (1/x)), arr_time)    | T3 = max((T1 + (1/x)), arr_time)
T5 = max((T2 + (1/x)), arr_time)    | T4 = max((T3 + (1/x)), arr_time)

For simplicity, assume reservation and limit are set to 3 IOPS, AtLimit is set to 'Wait' and all requests arrived within a few milliseconds on both queues. It's clear that all 3 requests from each queue will be scheduled, resulting in 6 IOPS, which is not correct.
Proposed Solution

The fix is better understood by applying the solution to the above problem. The fix is applicable to both DelayedTagCalc and ImmediateTagCalc.

Pre-conditions:
- Dynamic tag info functionality (bool U2 in the pull constructor) is enabled (is_dynamic_tag_info_f).
- The dmClock client maintains a mapping of clientId to ReqTagInfo for each client.
- The dmClock client registers custom implementations for reqtag_updt_f and reqtag_info_f.

Solution:

The fix involves enabling the dynamic tag info for two operations by the following entities:

- the server - to update (using reqtag_updt_f) the dmClock client with the latest tag for a clientId on a given queue once it's calculated. This is called after tags are calculated in initial_tag() and update_next_tag().
- the dmClock client - to calculate and update the tick interval before adding a request to the mClock queue. For a given queue, the tick interval is the number of request(s) added to other mClock queues before the current one arrives on it. The latest tick interval is read by the server using reqtag_info_f before calculating the initial_tag(). See calc_interval_tag() for how the new tag is calculated, and see the unit tests for how the client calculates the tick interval before adding the request to the queue. It's important to note that for DelayedTagCalc, the tag will be calculated as part of initial_tag() ONLY for non-zero tick intervals.

The tag calculation for each tag is shown below with the fix in place:

Tag calcs on Queue0                 | Tag calcs on Queue1
------------------------------------|---------------------------------
tick interval i = 0                 | tick interval i = 1 because req1
                                    | arrived after req0 on Queue0.
T0 = max((NO_TAG + (1/x)), arr_time)| T1 = max((T0 + (i/x)), arr_time)
                                    | [Note: If tick_interval > 0, the tag
                                    | is calculated as part of
                                    | initial_tag() for DelayedTagCalc]
--------------------------------------------------------------------------
tick interval i = 1                 | tick interval i = 1
T2 = max((T1 + (i/x)), arr_time)    | T3 = max((T2 + (i/x)), arr_time)
--------------------------------------------------------------------------
tick interval i = 2 because req5    | tick interval i = 0 because req4
arrived two requests after T2.      | follows immediately after req3 on
                                    | the same queue.
T5 = max((T3 + (i/x)), arr_time)    | T4 = max((T3 + (1/x)), arr_time)
                                    | [Note: If tick_interval == 0, the tag
                                    | is calculated via update_next_tag()
                                    | as usual for DelayedTagCalc.]
For simplicity, assume reservation and limit are set to 3 IOPS, AtLimit is set to 'Wait' and all requests arrived within a second on both queues. With the fix in place, only T0, T1 and T2 will be scheduled as expected during the first phase, since each request is spaced 1/res apart, i.e. 1/3rd of a second. The same pattern applies for the rest of the requests in the queues, thus ensuring accurate scheduling of requests across all the queues with the same server.

Tests:
A number of unit tests have been added to exercise the solution with both delayed and immediate tag calculation. Of particular importance are the tests that completely randomize queue additions and pulls from multiple (5) queues. The tests also provide insight into how the clients calculate the latest tick interval and how the latest tag is updated by the client via the dynamic functions. See:
- pull_reservation_randomize_delydtag
- pull_reservation_randomize_immtag
- pull_weight_randomize_delydtag
- pull_weight_randomize_immtag
Signed-off-by: Sridhar Seshasayee <[email protected]>