[Feature Request] Improve performance of range queries #13566

Closed
harshavamsi opened this issue May 6, 2024 · 9 comments
Labels: enhancement, RFC, Roadmap:Search, Search:Performance, v2.17.0, v3.0.0

Comments

@harshavamsi
Contributor

Is your feature request related to a problem? Please describe

I've been thinking about how we can improve the performance of certain types of range queries (non-scoring, to start with). Consider https://github.com/opensearch-project/opensearch-benchmark-workloads/blob/main/big5/operations/default.json#L32C1-L61C7. These are the types of queries a user might issue when they first load up a dashboard: give me all the events that took place in the last 30 days, the last 24 hours, etc.

Range queries on timestamps are a common use case for such dashboard events. Typically these are non-scoring queries -- queries that don't have another clause that forces scoring of documents. For example, if this range query were used in conjunction/disjunction with a text query, we would need to score all the documents in order of relevance. Scoring plus filtering on a range is time consuming, but since the introduction of Lucene's IndexOrDocValuesQuery it has become much faster thanks to the use of doc_values in certain cases.
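
For context, here is a minimal sketch of the IndexOrDocValuesQuery shape mentioned above, assuming a numeric @timestamp field (the field name and the helper class/method names are illustrative, not from the workload):

```java
import org.apache.lucene.document.LongPoint;
import org.apache.lucene.document.SortedNumericDocValuesField;
import org.apache.lucene.search.IndexOrDocValuesQuery;
import org.apache.lucene.search.Query;

class RangeQueryShape {
    // Both clauses express the same range; per segment, Lucene picks whichever
    // is cheaper: the BKD-backed point query when the range drives iteration,
    // or the doc_values query when another clause supplies a small candidate set.
    static Query timestampRange(long lower, long upper) {
        Query pointQuery = LongPoint.newRangeQuery("@timestamp", lower, upper);
        Query dvQuery = SortedNumericDocValuesField.newSlowRangeQuery("@timestamp", lower, upper);
        return new IndexOrDocValuesQuery(pointQuery, dvQuery);
    }
}
```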

For the non-scoring cases, I feel we can do better. Today Lucene scores all documents in a segment but collects only 10. OpenSearch enforces that we collect 10,000 documents. By default, a search without the size attribute returns 10 hits but can be scrolled to get up to 10,000 hits. What if we only score 10,000 hits instead of all the documents in the segment? This could significantly speed up these non-scoring queries.

Describe the solution you'd like

Similar to how we use other Lucene classes in OpenSearch, we could override the PointRangeQuery class. At search time, we use the searchContext to figure out the query shape and whether it is a simple range query. If it fits the description, we use the size attribute to determine the number of documents to collect: if size > 10,000 we collect size documents, otherwise we collect 10,000 -- i.e., max(size, 10,000).
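
A tiny sketch of that sizing rule (docsToCollect is a hypothetical helper name; 10,000 stands in for the default hit-count bound):

```java
// Hypothetical helper: honor a larger explicit size, otherwise fall back to
// the 10,000-document bound mentioned above.
static int docsToCollect(int size) {
    final int DEFAULT_BOUND = 10_000;
    return Math.max(size, DEFAULT_BOUND);
}
```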

We override Lucene's intersectVisitor to stop intersecting the BKD tree once it has collected the number of hits we want. Then we can override the range queries in the field mappers to point to this new query type, as sketched below.
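
A minimal sketch of that visitor idea, assuming we abort the BKD traversal with Lucene's CollectionTerminatedException once enough documents have been collected (TruncatedRangeCollector, collect, and maxHits are hypothetical names, not the actual OpenSearch implementation):

```java
import java.io.IOException;

import org.apache.lucene.index.PointValues;
import org.apache.lucene.index.PointValues.IntersectVisitor;
import org.apache.lucene.index.PointValues.Relation;
import org.apache.lucene.search.CollectionTerminatedException;
import org.apache.lucene.search.DocIdSet;
import org.apache.lucene.util.ArrayUtil;
import org.apache.lucene.util.DocIdSetBuilder;

final class TruncatedRangeCollector {

    /**
     * Collects docs whose single-dimension point value falls in
     * [lowerPoint, upperPoint], but stops walking the BKD tree once roughly
     * maxHits documents have been collected instead of visiting every
     * matching leaf.
     */
    static DocIdSet collect(PointValues values, int maxDoc,
                            byte[] lowerPoint, byte[] upperPoint,
                            int maxHits) throws IOException {
        DocIdSetBuilder builder = new DocIdSetBuilder(maxDoc);
        ArrayUtil.ByteArrayComparator cmp =
            ArrayUtil.getUnsignedComparator(values.getBytesPerDimension());
        int[] collected = new int[1];

        IntersectVisitor visitor = new IntersectVisitor() {
            DocIdSetBuilder.BulkAdder adder;

            @Override
            public void grow(int count) {
                adder = builder.grow(count);
            }

            @Override
            public void visit(int docID) {
                // Cell is fully inside the range: just collect the doc.
                adder.add(docID);
                if (++collected[0] >= maxHits) {
                    // Enough hits for the requested size: abort the traversal.
                    throw new CollectionTerminatedException();
                }
            }

            @Override
            public void visit(int docID, byte[] packedValue) {
                // Leaf crosses the range boundary: check the value first.
                if (cmp.compare(packedValue, 0, lowerPoint, 0) >= 0
                        && cmp.compare(packedValue, 0, upperPoint, 0) <= 0) {
                    visit(docID);
                }
            }

            @Override
            public Relation compare(byte[] minPackedValue, byte[] maxPackedValue) {
                if (cmp.compare(minPackedValue, 0, upperPoint, 0) > 0
                        || cmp.compare(maxPackedValue, 0, lowerPoint, 0) < 0) {
                    return Relation.CELL_OUTSIDE_QUERY;
                }
                if (cmp.compare(minPackedValue, 0, lowerPoint, 0) >= 0
                        && cmp.compare(maxPackedValue, 0, upperPoint, 0) <= 0) {
                    return Relation.CELL_INSIDE_QUERY;
                }
                return Relation.CELL_CROSSES_QUERY;
            }
        };

        try {
            values.intersect(visitor);
        } catch (CollectionTerminatedException e) {
            // Expected: we deliberately cut the intersection short.
        }
        return builder.build();
    }
}
```

Throwing CollectionTerminatedException is just one way to short-circuit the traversal; a real implementation could equally track the remaining budget inside a custom intersect and stop descending once it is exhausted.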

Related component

Search:Performance

Describe alternatives you've considered

The alternative is to do nothing, but this optimization could yield promising results.

Additional context

No response

@harshavamsi harshavamsi added enhancement Enhancement or improvement to existing feature or request untriaged labels May 6, 2024
@harshavamsi harshavamsi moved this from Todo to In Progress in Performance Roadmap May 6, 2024
@harshavamsi harshavamsi added the v2.15.0 Issues and PRs related to version 2.15.0 label May 6, 2024
@andrross
Member

andrross commented May 8, 2024

[Triage - attendees 1 2 3 4]
@harshavamsi Thanks for filing. Can you share any benchmark numbers if you have a rough prototype of this optimization?

@andrross andrross removed the untriaged label May 8, 2024
@harshavamsi
Contributor Author

Ran some benchmarks on the range query on the big5 workload.

m5.12xlarge instance with 1 shard and 3 segments

Before, with default scoring and termination:

|                                        50th percentile latency |  range |     243.858 |     ms |
|                                        90th percentile latency |  range |     251.206 |     ms |
|                                        99th percentile latency |  range |     252.733 |     ms |
|                                       100th percentile latency |  range |      257.93 |     ms |
|                                   50th percentile service time |  range |     242.703 |     ms |
|                                   90th percentile service time |  range |     249.837 |     ms |
|                                   99th percentile service time |  range |     251.789 |     ms |
|                                  100th percentile service time |  range |     256.552 |     ms |
|                                                     error rate |  range |           0 |      % |

After applying the optimization and terminating early after scoring 10,000 docs:

|                                        50th percentile latency |  range |     66.1852 |     ms |
|                                        90th percentile latency |  range |     71.4432 |     ms |
|                                        99th percentile latency |  range |     75.3114 |     ms |
|                                       100th percentile latency |  range |     75.6549 |     ms |
|                                   50th percentile service time |  range |     64.9487 |     ms |
|                                   90th percentile service time |  range |     70.0652 |     ms |
|                                   99th percentile service time |  range |     73.6459 |     ms |
|                                  100th percentile service time |  range |     74.5425 |     ms |
|                                                     error rate |  range |           0 |      % |

@jainankitk
Collaborator

> By default, a search without the size attribute returns 10 hits but can be scrolled to get up to 10,000 hits. What if we only score 10,000 hits instead of all the documents in the segment? This could significantly speed up these non-scoring queries.

Is the optimization applicable to queries without the size parameter? If not, how many documents are you planning to score if the size is, say, 10k?

Also, while it might be okay to make this the default behavior, should we have a parameter for high-precision use cases?

@harshavamsi
Contributor Author

Some updated numbers after further optimization:

|                                        50th percentile latency |  range |     9.73964 |     ms |
|                                        90th percentile latency |  range |     10.3399 |     ms |
|                                        99th percentile latency |  range |     15.5541 |     ms |
|                                       100th percentile latency |  range |     18.8279 |     ms |
|                                   50th percentile service time |  range |     8.44937 |     ms |
|                                   90th percentile service time |  range |     8.85817 |     ms |
|                                   99th percentile service time |  range |     14.5879 |     ms |
|                                  100th percentile service time |  range |     17.0375 |     ms |

@jainankitk
Collaborator

@harshavamsi - Thank you for sharing these numbers. They look amazing! Can you provide more details on the further optimizations?

@harshavamsi
Contributor Author

@jainankitk I realized while writing the intersect function that I could do better by returning 0 here instead of returning pointValues.size(). This, I believe, causes us to terminate faster, although I am still wrapping my head around why it is that much faster.

@reta reta added v3.0.0 Issues and PRs related to version 3.0.0 v2.16.0 Issues and PRs related to version 2.16.0 and removed v2.15.0 Issues and PRs related to version 2.15.0 labels Jun 11, 2024
@reta
Collaborator

reta commented Jun 11, 2024

@harshavamsi I am moving this off to 2.16.0 since we are well over the deadline for the release cut (please feel free to discuss that with the release team / maintainers).

@getsaurabh02 getsaurabh02 moved this from In Progress to In-Review in Performance Roadmap Aug 5, 2024
@mch2 mch2 removed the v2.16.0 Issues and PRs related to version 2.16.0 label Aug 5, 2024
@getsaurabh02 getsaurabh02 added the Roadmap:Search Project-wide roadmap label label Aug 12, 2024
@github-project-automation github-project-automation bot moved this to Issues and PR's in OpenSearch Roadmap Aug 12, 2024
@getsaurabh02 getsaurabh02 added the RFC Issues requesting major changes label Aug 12, 2024
@getsaurabh02 getsaurabh02 moved this from Now(This Quarter) to 🏗 In progress in Search Project Board Aug 19, 2024
@github-project-automation github-project-automation bot moved this to 2.17 (First RC 09/03, Release 09/17) in OpenSearch Project Roadmap Aug 30, 2024
@getsaurabh02 getsaurabh02 moved this from In-Review to Done in Performance Roadmap Sep 3, 2024
@getsaurabh02
Member

@harshavamsi are we good to close this issue now?

@harshavamsi
Contributor Author

@getsaurabh02 yes!

@github-project-automation github-project-automation bot moved this from 🏗 In progress to ✅ Done in Search Project Board Sep 9, 2024