[RFC] Dynamically enable concurrent search on search requests at runtime #14781
Comments
Thanks @Gankris96 for the RFC! The way I see it we're trying to address a few different things at once here:
I think something like #9491 will be a great first step in terms of adoption to let users more easily try out concurrent segment search on their existing workloads. Some specific questions:
- Are you thinking this decision layer would be provided as a plugin or as part of core? Or could it even be implemented by something like adding plugin hooks to the existing search backpressure interfaces?
- It would be great to get a list of the tuning knobs that you are thinking would be available to the user. As it is, search backpressure has so many tuning knobs that it isn't clear what each knob even does, so it would be best to be as concise and precise as possible with those.

@reta I know you had some thoughts you shared during the weekly search meeting as well, it would be great if you could share them again here too.
Thanks @jed326 and @reta. In terms of the decision layer itself, I envisioned it to be within core, probably similar to the search backpressure service for example, because I don't know that there would be use for this outside of the search flow. We can provide the capability to have pluggable deciders that add onto the existing deciders.
+1 to this, the goal here is to reduce user involvement in deciding whether to use the concurrent vs. sequential execution path. The OpenSearch execution layer should be able to make that choice for the user instead of the user driving it (which is what the index or cluster level option provides).
We will need to provide a pluggable mechanism so that, for different use cases, other decision logic can be integrated by plugins. We can keep the generic decision making components, like CPU utilization, in core deciders. For example, in the case of k-NN, a plugin decider could look at native memory usage to decide when not to enable concurrent search on more requests. It will be separate from SearchBackPressureService, as this is specific to choosing a query execution mechanism.
I still think that building the capability to work independently of user intervention, or with minimal user intervention, is what we should aim for. Providing a request level capability means we still rely on the user sending in requests with this parameter, which defeats the purpose. I am going to be looking into the low level design for this, so please let me know if you have further thoughts or concerns.
@Gankris96 I don't think anyone is arguing with that. I think the concern is that there are no clear metrics / measurements / signals to build that right now. For example, I see only CPU mentioned here, but I think that is far from sufficient: we don't look at the utilization of the thread pool that concurrent search uses (it may just queue all the work). This is just one example. What I think is missed here is:
I am sure there is much more to that.
Thanks @reta. I did think about the thread pool utilization too, but I don't think there is a direct correlation with performance, or at least a visible one. Do we need to care about the thread pool queue growing if search latency is still improving anyway? If there really is a performance hit, I guess it would consequently show up as the concurrent search average latency becoming worse, and the decider would have a way to detect that and reduce the number of requests run with concurrent segment search, so that there is no performance degradation due to queuing. Do you see other issues because of this? Currently, the major limiting factor for concurrent search, based on our benchmarking, is resource availability, specifically CPU.

Anyway, I was thinking of this as an initial step toward having concurrent search dynamically enabled under certain conditions, to increase feature adoption. Additional decision parameters can be added to make the decision making more comprehensive. Let me think more about the computational model for concurrent search and the index geometry points you brought up as well.
I think this is fine for now since it's called out that we will expose pluggable deciders so we can keep refining this system with additional parameters in the future. |
Created a sub-issue that tracks part of the changes described in this issue: #15259
Is your feature request related to a problem? Please describe
TL;DR
I am proposing a mechanism to increase the adoption of concurrent segment search by dynamically enabling it for users at runtime, on a per request basis, based on certain factors such as request type and resource availability; currently starting off with CPU utilization, with the ability to add more deciding parameters in the future.
Background Overview
With the introduction of Concurrent Segment Search, we have seen good improvements in search performance.
However, the feature is currently disabled by default, and users have to enable it manually via either the cluster level setting or the index level setting (ref).
This means that users need to take manual action to enable concurrent search for their workloads in OpenSearch.
However, based on the performance benchmarking done, we have the following observations:
Given these observations, we can enable concurrent segment search whenever possible to efficiently utilize resources and get performance benefits.
Any overrides that the user has set will be honored. This feature caters more to the cases where concurrent search has not been explicitly configured by the user.
Describe the solution you'd like
At a high level, introduce an additional decision layer that runs as a separate component and monitors the relevant metrics, so that the decision can be retrieved as a simple in-memory lookup at runtime during a search request. This follows the tenet that dynamic decision making needs to be fast and must not add latency to the search request.
Each search request will query this decision layer to decide whether to execute via the concurrent search path or not. The decision making happens at the node level, which means each shard search request gets an independent decision.
The decision layer would be a composite decider, similar to `AllocationDecider` for example, with multiple factors each providing their decision. The initial version would probably have a request type decider and a resource utilization based decider. The request type decider initially targets specific types of requests, such as aggregations, with the possibility of expanding to more request types in the future. This gives us some initial control over enabling the feature. However, for the rest of this proposal, I will focus on the non-trivial logic of resource based decision making.
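The composite decider pattern described above could be sketched as follows. This is a minimal illustration in Python, not OpenSearch code; all class and function names (`ConcurrentSearchDecider`, `composite_decision`, etc.) are hypothetical, and the real implementation would live in Java core alongside abstractions like `AllocationDecider`.

```python
from enum import Enum

class Decision(Enum):
    YES = "yes"      # vote to use the concurrent path
    NO = "no"        # veto the concurrent path
    NO_OP = "no_op"  # no opinion, defer to other deciders

class ConcurrentSearchDecider:
    """Base class for pluggable deciders (hypothetical interface)."""
    def decide(self, request_type, node_stats):
        return Decision.NO_OP

class RequestTypeDecider(ConcurrentSearchDecider):
    """Only opt in targeted request types (e.g. aggregations) initially."""
    def __init__(self, allowed_types):
        self.allowed_types = set(allowed_types)

    def decide(self, request_type, node_stats):
        return Decision.YES if request_type in self.allowed_types else Decision.NO

class CpuDecider(ConcurrentSearchDecider):
    """Veto concurrent search when CPU_max over the window reaches CPU_high."""
    def __init__(self, cpu_high=0.75):
        self.cpu_high = cpu_high

    def decide(self, request_type, node_stats):
        return Decision.NO if node_stats["cpu_max"] >= self.cpu_high else Decision.YES

def composite_decision(deciders, request_type, node_stats):
    """Any NO vetoes; otherwise at least one YES enables concurrent search."""
    votes = [d.decide(request_type, node_stats) for d in deciders]
    if Decision.NO in votes:
        return False
    return Decision.YES in votes
```

Each decider can veto, opt in, or abstain, mirroring how allocation deciders combine their votes; the node stats would come from the in-memory lookup mentioned above.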
Given this abstraction, I am proposing the following algorithm for dynamic decision making.

The decider monitors CPU utilization as well as overall search latency and concurrent search latency over a sliding window of size `N`. Along with this, the decider will also have an initial percentage `X%` to control how many requests in a time window will be allowed to execute with concurrent segment search. This initial percentage can be set based on the `SearchRate` metric on the cluster, with a starting value close to the minimum `SearchRate`.

The decision algorithm would look like this:

1. If `CPU_max` over the window is below a threshold `CPU_high`, then there is sufficient resource availability to enable concurrent search. By default, `slice_count` will be set to 2.
2. We can start by enabling concurrent search for `X%` of the requests and continue monitoring CPU as well as search latencies (both overall search latency and concurrent search latency).
3. The decider will continue to monitor the metrics, and if `CPU_max` is still lower than `CPU_high`, it will increase `X%` by an enabling factor `E`. However, if `CPU_max` is greater than or equal to `CPU_high`, it will decrease `X%` by a decreasing factor `D`. To react faster to resource constraints, we can set `D = 1.5*E`.
4. The decider also monitors concurrent search latency and overall search latency over the window `N`, and if the average concurrent search latency is greater than the overall average search latency by a difference threshold `diff_threshold`, then concurrent search might not be providing performance benefits, so we can again reduce the percentage `X%` by the factor `D`.

In this way, we penalize both performance degradation and resource constraints, and we enable concurrent search in scenarios where CPU is available and performance improvements are seen.

The algorithm can also be smart about increasing the slice count (up to a max value `MAX_SLICE`) once most of the queries are being run in concurrent mode, determined by `X` going over a threshold percentage while we still have available CPU to further improve performance. `MAX_SLICE` can be set based on the instance type. Similarly, `slice_count` can be decreased (down to a min value `MIN_SLICE=2`) if `X` goes below a certain threshold.

All of the variables in the above algorithm can be made configurable via cluster settings.
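The sliding-window algorithm above can be sketched as a small controller that adjusts `X%` once per window. This is an illustrative Python sketch, assuming the window metrics (`CPU_max` and average latencies) are aggregated elsewhere; the class name, default values, and update cadence are assumptions, not proposed settings.

```python
class ConcurrentSearchController:
    """Hypothetical sketch of the sliding-window decider described above."""

    def __init__(self, x_percent=10.0, e_factor=5.0, cpu_high=0.75,
                 diff_threshold_ms=10.0, min_x=0.0, max_x=100.0):
        self.x = x_percent            # X%: share of requests run concurrently
        self.e = e_factor             # E: additive enabling factor
        self.d = 1.5 * e_factor       # D = 1.5*E: react faster to constraints
        self.cpu_high = cpu_high
        self.diff_threshold = diff_threshold_ms
        self.min_x, self.max_x = min_x, max_x

    def on_window(self, cpu_max, avg_latency_ms, avg_concurrent_latency_ms):
        """Called once per sliding window N with aggregated metrics."""
        if cpu_max >= self.cpu_high:
            self.x -= self.d  # penalize resource pressure
        elif avg_concurrent_latency_ms - avg_latency_ms > self.diff_threshold:
            self.x -= self.d  # penalize latency regression vs. overall latency
        else:
            self.x += self.e  # headroom available, expand concurrent share
        self.x = max(self.min_x, min(self.max_x, self.x))
        return self.x
```

The slice-count adjustment between `MIN_SLICE` and `MAX_SLICE` would layer on top of this, triggered when `X` crosses its upper or lower thresholds.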
Related component
Search:Performance
Describe alternatives you've considered
Multi-armed bandit approach
https://en.wikipedia.org/wiki/Multi-armed_bandit
In the multi-armed bandit approach (in our case a 2-armed bandit), we try to balance exploration (sometimes explore the concurrent search path) and exploitation (exploit scenarios where we see good benefit from enabling concurrent search):

1. Maintain a sliding window of size `N` to keep track of `CPU_max` for the window, compared against the threshold `CPU_high`.
2. Maintain two running rewards:
   a. Concurrent search reward: total reward received for concurrent search over total requests executed on the concurrent path.
   b. Non-concurrent search reward: total reward received for non-concurrent search over total requests executed on the non-concurrent path.
3. If `CPU_max` < `CPU_high`, then grant a reward `E`. However, if `CPU_max` >= `CPU_high`, then grant a negative reward `D`. The non-concurrent path always gets the positive reward `E`. Once again, we can have `D = 1.5*E`. Also, penalize the concurrent search reward if the average concurrent search latency becomes greater than the overall average latency by `diff_threshold`.

On comparing the proposed approach with the current version of the multi-armed bandit approach, it seems to me that the two are quite similar, but the multi-armed bandit approach has the ability to take the concurrent path even at times when CPU is under duress. Also, in the first approach, we can think of looking at `CPU_max` as exploration, and increasing the request percentage `X%` by `E` or decreasing it by the factor `D` as a form of reward and, by extension, a form of exploitation. Reinforcement learning approaches work better when the state of the system is well represented, and in the current version we don't have more fine-grained representations of the state, one example being a detailed representation of the request. Furthermore, the efficiency of the multi-armed bandit strongly comes down to how robust the reward function is.
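For comparison, the reward scheme of the 2-armed bandit alternative can be sketched with a simple epsilon-greedy policy. This is an illustrative Python sketch; the epsilon-greedy exploration strategy, class name, and default values are assumptions I am adding for concreteness, since the proposal does not fix a concrete policy. The latency penalty term is omitted for brevity.

```python
import random

class TwoArmedBandit:
    """Epsilon-greedy sketch of the 2-armed bandit alternative.
    Rewards follow the scheme above: the concurrent arm earns E when
    CPU_max < CPU_high and -D otherwise; the sequential arm always earns E."""

    def __init__(self, e_reward=1.0, cpu_high=0.75, epsilon=0.1, seed=None):
        self.e = e_reward
        self.d = 1.5 * e_reward       # D = 1.5*E, as in the first approach
        self.cpu_high = cpu_high
        self.epsilon = epsilon
        self.rng = random.Random(seed)
        # per-arm cumulative reward and pull count
        self.totals = {"concurrent": 0.0, "sequential": 0.0}
        self.counts = {"concurrent": 0, "sequential": 0}

    def avg(self, arm):
        n = self.counts[arm]
        return self.totals[arm] / n if n else 0.0

    def choose(self):
        """Explore with probability epsilon, otherwise exploit the best arm."""
        if self.rng.random() < self.epsilon:
            return self.rng.choice(["concurrent", "sequential"])
        if self.avg("concurrent") >= self.avg("sequential"):
            return "concurrent"
        return "sequential"

    def record(self, arm, cpu_max):
        """Grant the reward observed for one request executed on `arm`."""
        if arm == "sequential":
            reward = self.e
        else:
            reward = self.e if cpu_max < self.cpu_high else -self.d
        self.totals[arm] += reward
        self.counts[arm] += 1
```

With `epsilon > 0`, the bandit occasionally tries the concurrent path even under CPU duress, which is exactly the behavioral difference from the first approach noted above.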
Additional context
Co-Author: @sohami