[RFC] Dynamically enable concurrent search on search requests at runtime #14781

Open
Gankris96 opened this issue Jul 17, 2024 · 9 comments
Labels: enhancement, Roadmap:Cost/Performance/Scale, Search:Performance, Search

Gankris96 (Contributor) commented Jul 17, 2024

Is your feature request related to a problem? Please describe

TL;DR

I am proposing a mechanism to increase the adoption of concurrent segment search by dynamically enabling it for users at runtime, on a per-request basis, based on certain factors such as request type and resource availability. We are starting with CPU utilization, with the ability to add more deciding parameters in the future.

Background Overview

With the introduction of Concurrent Segment Search we have seen good improvements in search performance.
However, the feature is currently disabled by default, and users have to enable it manually via either a cluster-level or an index-level setting (ref).
This means that users need to take manual action to enable concurrent search for their workloads in OpenSearch.

However, based on the performance benchmarking done, we have the following observations:

  1. For certain types of queries, such as aggregation queries, we generally see an improvement in search latency.
  2. For short-running queries with latency in the single-digit milliseconds, concurrent search doesn't show much improvement.
  3. CPU utilization increases with concurrent search, and increases linearly with slice count.

Given these observations, we can enable concurrent segment search whenever possible to efficiently utilize resources and gain performance benefits.
Any overrides that the user has set will be honored; this feature is catered more towards cases where the concurrent search feature has not been explicitly set by the user.

Describe the solution you'd like

At a high level, introduce an additional decision layer that runs as a separate component and monitors the relevant metrics, so that a decision can be retrieved as a simple in-memory lookup at runtime during a search request. This follows the tenet that dynamic decision making needs to be fast and must not add latency to the search request.
Each search request will query this decision layer to determine whether to execute via the concurrent search path or not. The decision is made at the node level, which means each shard search request gets an independent decision.

The decision layer would be a composite decider, similar to AllocationDecider for example, with multiple factors each contributing a decision. The initial version would likely have a request-type decider and a resource-utilization-based decider. The request-type decider initially targets specific types of requests, such as aggregations, with the possibility of expanding to more request types in the future; this gives us some initial control over enabling the feature (a minimal sketch of this composite structure is shown below).
However, for the rest of this proposal, I will focus on the non-trivial logic of resource-based decision making.
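
To make the composite decider concrete, here is a minimal sketch in Python (all class and method names are hypothetical, only loosely mirroring how AllocationDecider composes per-decider verdicts):

from enum import Enum

class Decision(Enum):
    YES = 1          # decider votes for concurrent search
    NO = 2           # decider vetoes concurrent search
    NO_OPINION = 3   # decider abstains

class ConcurrentSearchDecider:
    # Base class; concrete deciders override decide().
    def decide(self, request_context):
        return Decision.NO_OPINION

class RequestTypeDecider(ConcurrentSearchDecider):
    # Initially target only aggregation requests.
    def decide(self, request_context):
        if request_context.has_aggregations:
            return Decision.YES
        return Decision.NO_OPINION

class CompositeDecider:
    def __init__(self, deciders):
        self.deciders = deciders

    def should_use_concurrent_search(self, request_context):
        saw_yes = False
        for decider in self.deciders:
            decision = decider.decide(request_context)
            if decision is Decision.NO:     # any veto wins
                return False
            saw_yes = saw_yes or decision is Decision.YES
        return saw_yes                      # off by default unless a decider opts in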

Given this abstraction, I am proposing the following algorithm for dynamic decision making:

The decider monitors CPU utilization as well as overall search latency and concurrent search latency over a sliding window of size N.
Along with this, the decider will have an initial percentage X% that controls how many requests in a time window are allowed to execute with concurrent segment search. This initial percentage can be set based on the SearchRate metric of the cluster, with a starting value close to the minimum SearchRate.

The decision algorithm would look like this:

If CPU_max over the window is below a threshold CPU_high, then there is sufficient resource availability to enable concurrent search. By default, slice_count will be set to 2.
We can start by enabling concurrent search for X% of the requests and continue monitoring CPU as well as search latencies (both overall search latency and concurrent search latency).

The decider will continue to monitor the metrics; if CPU_max is still lower than CPU_high, it increases X% by an enabling factor E. However, if CPU_max is greater than or equal to CPU_high, it decreases X% by a decreasing factor D. To react faster to resource constraints, we can set D = 1.5 * E.

The decider also monitors concurrent search latency and overall search latency over the window N. If the average concurrent search latency exceeds the average overall search latency by more than a threshold diff_threshold, then concurrent search may not be providing performance benefits, so we reduce X% by the factor D again.
In this way, we penalize both performance degradation and resource constraints, and we enable concurrent search in scenarios where CPU is available and performance improvements are seen.

The algorithm can also be smart about increasing the slice count (up to a maximum MAX_SLICE) once most queries are running in concurrent mode, determined by X exceeding a threshold percentage while CPU is still available to further improve performance. MAX_SLICE can be set based on the instance type. Similarly, slice_count can be decreased (down to a minimum MIN_SLICE = 2) if X goes below a certain threshold.

Decision loop:
    if CPU_max < CPU_high:
        # increase percentage by enabling factor E
        X = X + (X * E)
    else:
        # decrease percentage by decreasing factor D
        X = X - (X * D)

    # penalize if concurrent search latency regresses past the threshold
    diff = avg_concurrent_search_latency - avg_overall_search_latency
    if diff > diff_threshold:
        X = X - (X * D)

    if X > T:
        # increase slice count, capped at MAX_SLICE
        slice_count = min(MAX_SLICE, slice_count * 2)
    if X < B:
        # decrease slice count, floored at MIN_SLICE (= 2)
        slice_count = max(MIN_SLICE, slice_count / 2)

All of the variables in the above algorithm can be configured via cluster settings.
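
As an illustration, here is a minimal runnable sketch of the resource-based decider loop above (the default values, the sliding-window mechanics, and the admission check are assumptions for the example, not final design):

import random
from collections import deque

class ResourceBasedDecider:
    def __init__(self, window_size=100, cpu_high=0.75, E=0.05,
                 diff_threshold_ms=5.0, T=0.90, B=0.10,
                 min_slice=2, max_slice=8):
        self.cpu_window = deque(maxlen=window_size)  # sliding window of CPU samples
        self.cpu_high = cpu_high
        self.E = E                # enabling factor
        self.D = 1.5 * E          # decreasing factor; reacts faster to constraints
        self.diff_threshold_ms = diff_threshold_ms
        self.T, self.B = T, B     # thresholds for slice-count adjustment
        self.min_slice, self.max_slice = min_slice, max_slice
        self.X = 0.10             # fraction of requests admitted to the concurrent path
        self.slice_count = min_slice

    def record_cpu(self, utilization):
        self.cpu_window.append(utilization)

    def update(self, avg_concurrent_latency_ms, avg_overall_latency_ms):
        if not self.cpu_window:
            return
        cpu_max = max(self.cpu_window)
        if cpu_max < self.cpu_high:
            self.X = min(1.0, self.X * (1 + self.E))  # CPU headroom: grow X%
        else:
            self.X = max(0.0, self.X * (1 - self.D))  # CPU pressure: shrink X% faster
        # Penalize when concurrent search is not actually helping latency.
        if avg_concurrent_latency_ms - avg_overall_latency_ms > self.diff_threshold_ms:
            self.X = max(0.0, self.X * (1 - self.D))
        # Adapt slice count, capped at max_slice and floored at min_slice.
        if self.X > self.T:
            self.slice_count = min(self.max_slice, self.slice_count * 2)
        elif self.X < self.B:
            self.slice_count = max(self.min_slice, self.slice_count // 2)

    def should_use_concurrent_search(self):
        return random.random() < self.X  # admit roughly X% of requests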

Related component

Search:Performance

Describe alternatives you've considered

Multi-armed bandit approach

https://en.wikipedia.org/wiki/Multi-armed_bandit

In the multi-armed bandit approach (in our case a 2-armed bandit), we try to balance exploration (sometimes trying the concurrent_search path) and exploitation (exploiting scenarios where we see good benefit from enabling concurrent search):

  1. Monitor CPU and latency metrics, as in the approach above, over a sliding window of size N to keep track of CPU_max for the window.
  2. Have a threshold CPU_high for CPU utilization.
  3. The bandit algorithm makes its decision based on two factors, using the ε-greedy approach to balance exploration vs. exploitation:
    1. Exploration: randomly route some requests to the concurrent_search path with a small probability.
    2. Exploitation: use a reward trade-off comparing concurrent search efficiency vs. non-concurrent search efficiency.
  4. The initial reward is 0 for both concurrent and non-concurrent search.
  5. Then, based on how search requests perform, we grant rewards to either of the two paths (concurrent vs. non-concurrent).
  6. The reward is calculated as follows:
    a. Concurrent search reward: total reward received for concurrent search divided by total requests executed on the concurrent path.
    b. Non-concurrent search reward: total reward received for non-concurrent search divided by total requests executed on the non-concurrent path.
  7. Upon request completion, based on the monitored metrics, the rewards are updated as follows: if CPU_max < CPU_high, grant a reward E; if CPU_max >= CPU_high, grant a negative reward D. The non-concurrent path always gets the positive reward E. Once again, we can set D = 1.5 * E. Also, penalize the concurrent search reward if the average concurrent search latency exceeds the overall average latency by more than diff_threshold.
  8. The slice count can also be adaptively modified based on rewards.
# Initial rewards will be 0 for both
total_reward_non_concurrent = 0
total_reward_concurrent = 0

function select_path():
    # exploration
    if random < epsilon:
        return concurrent_path
    # exploitation: take whichever path has been earning the higher average
    # reward, i.e. showing better performance
    if (total_reward_concurrent / concurrent_count) > (total_reward_non_concurrent / non_concurrent_count):
        return concurrent_path
    return non_concurrent_path

Decider logic:
    path = select_path()
    return path

Post decision update:
# post path selection, update variables for subsequent requests
if path == concurrent_path:
    concurrent_count += 1
    if CPU_max < CPU_high:
        total_reward_concurrent += E
    else:
        total_reward_concurrent -= D
    diff = avg_concurrent_search_latency - avg_overall_search_latency
    if diff > diff_threshold:
        total_reward_concurrent -= D
else:
    non_concurrent_count += 1
    total_reward_non_concurrent += E
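
For comparison, here is a self-contained runnable ε-greedy sketch of the bandit alternative (the epsilon value, reward constants, and metric arguments are illustrative assumptions); it also guards against division by zero before either path has been tried:

import random

EPSILON = 0.1  # exploration probability
E = 1.0        # positive reward
D = 1.5 * E    # penalty; reacts faster to regressions

stats = {
    "concurrent":     {"reward": 0.0, "count": 0},
    "non_concurrent": {"reward": 0.0, "count": 0},
}

def avg_reward(path):
    s = stats[path]
    return s["reward"] / s["count"] if s["count"] else 0.0

def select_path():
    if random.random() < EPSILON:  # exploration
        return "concurrent"
    # Exploitation: take whichever path has earned the higher average reward.
    if avg_reward("concurrent") > avg_reward("non_concurrent"):
        return "concurrent"
    return "non_concurrent"

def update(path, cpu_max, cpu_high,
           avg_concurrent_ms, avg_overall_ms, diff_threshold_ms):
    stats[path]["count"] += 1
    if path == "non_concurrent":
        stats[path]["reward"] += E  # non-concurrent path always earns E
        return
    stats[path]["reward"] += E if cpu_max < cpu_high else -D
    if avg_concurrent_ms - avg_overall_ms > diff_threshold_ms:
        stats[path]["reward"] -= D  # also penalize latency regressions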

Comparing the proposed approach with the current version of the multi-armed bandit approach, the two seem quite similar, but the multi-armed bandit approach has the ability to take the concurrent path even when the CPU is under duress. Also, in the first approach, we can think of monitoring CPU_max as exploration, and increasing the request percentage X% by E or decreasing it by D as a form of reward and, by extension, a form of exploitation.
Reinforcement learning approaches work better when the state of the system is well represented, and in the current version we don't have fine-grained representations of the state, one example being a detailed representation of the request. Furthermore, the efficiency of the multi-armed bandit comes down strongly to how robust the reward function is.

Additional context

Co-Author: @sohami

jed326 (Collaborator) commented Jul 18, 2024

Thanks @Gankris96 for the RFC!

The way I see it we're trying to address a few different things at once here:

  1. Increase adoption of concurrent segment search (Elasticsearch, for example, comes with their version enabled by default: https://www.elastic.co/search-labs/blog/elasticsearch-opensearch-vector-search-performance-comparison#out-of-the-box-and-concurrent-segment-search)
  2. Do so in a way that does not lead to performance regressions for users, whether it be due to query type or due to resource utilization.

I think something like #9491 will be a great first step in terms of adoption to let users more easily try out concurrent segment search on their existing workloads.

Some specific questions:

At a high level, introduce an additional decision layer that runs as a separate component and monitors the relevant metrics, so that a decision can be retrieved as a simple in-memory lookup at runtime during a search request. This follows the tenet that dynamic decision making needs to be fast and must not add latency to the search request.

Are you thinking this decision layer would be provided as a plugin or as a part of core? Or could it even be implemented by something like adding plugin hooks to the existing search backpressure interfaces?

It would be great to get a list of the tuning knobs you are thinking would be available to the user. As it is, search backpressure has so many tuning knobs that it isn't clear what each knob even does, so it would be best to be as concise and precise as possible with these.


@reta I know you had some thoughts you shared during the weekly search meeting as well; it would be great if you could share them here too.

reta (Collaborator) commented Jul 19, 2024

@reta I know you had some thoughts you shared during the weekly search meeting as well; it would be great if you could share them here too.

Thanks @jed326, you have perfectly outlined the same concerns I have; +1 to starting with #9491 as the low-hanging fruit.

Gankris96 (Contributor Author) commented

Thanks @jed326 and @reta.
I agree that the request-level parameter #9491 would be helpful as a first step. However, my thinking here is to minimize user-level involvement as much as possible.
I see it helping in cases where the user doesn't have access to modify cluster settings, for example. However, the user has to identify the right category of requests to use this parameter with; if not, it could potentially lead to resource utilization spiking.

In terms of the decision layer itself, I envisioned it to be within core, probably similar to the search backpressure service, because I don't know that there would be use for this outside of the search flow. We can provide the capability to have pluggable deciders that add onto the existing deciders.

sohami (Collaborator) commented Jul 23, 2024

I agree that the request-level parameter #9491 would be helpful as a first step. However, my thinking here is to minimize user-level involvement as much as possible.

+1 to this. The goal here is to reduce user involvement in deciding between the concurrent and sequential execution paths. The OpenSearch execution layer should be able to make that choice for the user instead of the user driving it (which is what the index- or cluster-level option provides).

We can provide the capability to have pluggable deciders that add onto the existing deciders.

We will need to provide a pluggable mechanism such that other decision logic can be integrated by plugins for different use cases. We can keep the generic decision-making components, like CPU utilization, in core deciders. For example, in the case of KNN, a plugin could look at native memory usage to decide when not to enable concurrent search on more requests. It will be separate from SearchBackPressureService, as this is specific to choosing a query execution mechanism.

Gankris96 (Contributor Author) commented

I still think we should aim to build this capability to work independently of user intervention, or with minimal user intervention. Providing a request-level capability means we still rely on the user sending requests with that parameter, which defeats the purpose. I am going to look into the low-level design for this, so please let me know if you have further thoughts/concerns.
The decision layer would be within core, with a pluggable interface for external plugins to add their own deciders, and the combination of these deciders will dynamically toggle concurrent search on/off.

@jed326 @reta @sohami

reta (Collaborator) commented Jul 30, 2024

I still think we should aim to build this capability to work independently of user intervention, or with minimal user intervention.

@Gankris96 I don't think anyone is arguing with that. I think the concern is that there are no clear metrics / measurements / signals to build that right now. E.g., I see only CPU mentioned here, but I think it is far from sufficient: we don't look at the utilization of the thread pool that concurrent search uses (it may just queue all the work). This is just one example.

What I think is missed here is:

  1. a computational model for concurrent search
  2. index geometry

I am sure there is much more to that.

Gankris96 (Contributor Author) commented

Thanks @reta. I did think about thread pool utilization too, but I don't think there is a direct correlation with performance, or at least a visible one. Do we need to care about the thread pool queue growing if search latency is still improving anyway? If there really is a performance hit, I'd expect it to show up as the average concurrent search latency getting worse, and the decider would detect that and reduce the number of requests run with concurrent segment search, so there would be no performance degradation due to queuing. Do you see other issues with this?

Currently, the major limiting factor for concurrent search, based on our benchmarking, is resource availability, and specifically CPU.

Anyway, I was thinking of this as an initial step towards dynamically enabling concurrent search under certain conditions to increase the feature's adoption. Additional decision parameters can be added later to make the decision making more comprehensive.

Let me think more about the computational model for concurrent search and the index geometry points you brought up as well.

jed326 (Collaborator) commented Jul 31, 2024

E.g., I see only CPU mentioned here, but I think it is far from sufficient: we don't look at the utilization of the thread pool that concurrent search uses (it may just queue all the work).

I think this is fine for now: since it's called out that we will expose pluggable deciders, we can keep refining this system with additional parameters in the future.

Gankris96 (Contributor Author) commented

Created a sub-issue that tracks part of the changes described in this issue: #15259
