[Feature Request] Improve performance of range queries #13566
Comments
[Triage]
Ran some benchmarks on the range query from the big5 workload, on an m5.12xlarge instance with 1 shard and 3 segments.

Before, with default scoring and termination:
[benchmark results table]

After applying the optimization and terminating early after scoring 10,000 docs:
[benchmark results table]
Is the optimization applicable for queries without ...? Also, it might be okay to make this the default behavior; should we have a parameter for high-precision use cases?
Some updated numbers after more optimization:
[benchmark results table]
@harshavamsi - Thank you for sharing these numbers. They look amazing! Can you provide more details on the further optimizations?
@jainankitk I realized while writing the intersect function that I could do better by returning 0 here instead of returning ...
@harshavamsi I am moving it off to ...
@harshavamsi are we good to close this issue now?
@getsaurabh02 yes! |
Is your feature request related to a problem? Please describe
I've been thinking about how we can improve the performance of certain types of range queries (non-scoring, to start with). Consider https://github.com/opensearch-project/opensearch-benchmark-workloads/blob/main/big5/operations/default.json#L32C1-L61C7. These are the types of queries a user might run when they first load up a dashboard: give me all the events that took place in the last 30 days, the last 24 hours, and so on.
Range queries on timestamps are a common use case for such dashboard events. Typically these are non-scoring queries -- queries that don't have another clause in them that forces scoring of documents. For example, if this range query were used in conjunction/disjunction with a text query, we would need to score all the documents in their order of relevance. Scoring plus filtering on a range is time consuming, and since the introduction of Lucene's `IndexOrDocValuesQuery`, these queries have become much faster thanks to the use of `doc_values` in certain cases.

For the non-scoring cases, I feel we can do better. Today Lucene scores all documents in a segment but collects only 10. OpenSearch enforces that we collect 10,000 documents. By default, a search without the `size` attribute returns 10 hits but can be scrolled to get up to 10,000 hits. What if we only score 10,000 hits instead of all the documents in the segment? This could significantly speed up these non-scoring queries.
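For context, here is a rough Lucene-level sketch of the kind of non-scoring timestamp range this describes; the `@timestamp` field name and the "last 24 hours" bounds are illustrative, not taken from the workload:

```java
import org.apache.lucene.document.LongPoint;
import org.apache.lucene.document.SortedNumericDocValuesField;
import org.apache.lucene.search.IndexOrDocValuesQuery;
import org.apache.lucene.search.Query;

class TimestampRangeExample {
    // Builds a "last 24 hours" range over a hypothetical @timestamp field.
    static Query last24Hours() {
        long upper = System.currentTimeMillis();
        long lower = upper - 24L * 60 * 60 * 1000;
        // IndexOrDocValuesQuery lets Lucene choose per segment between the
        // points-based range and the doc_values-based range, whichever is cheaper.
        return new IndexOrDocValuesQuery(
            LongPoint.newRangeQuery("@timestamp", lower, upper),
            SortedNumericDocValuesField.newSlowRangeQuery("@timestamp", lower, upper));
    }
}
```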
Describe the solution you'd like
Similar to how we use other Lucene classes in OpenSearch, we could override the `PointRangeQuery` class. At search time, we use the `searchContext` to figure out the query shape and whether it is a simple range query. If it fits the description, we use the `size` attribute to determine the number of documents to collect: if `size > 10,000`, we collect `size`; otherwise we collect `min(size, 10,000)`. We override Lucene's `IntersectVisitor` to stop intersecting once we have collected the number of hits we want. Then we can override the range queries in the field mappers to point to this new query type.
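As a rough illustration of the early-termination idea (not the actual implementation), a wrapper around the existing intersect visitor could count collected hits and report remaining cells as outside the query once the threshold is reached. The class and parameter names below (`EarlyTerminatingVisitor`, `maxHits`) are hypothetical:

```java
import java.io.IOException;
import org.apache.lucene.index.PointValues.IntersectVisitor;
import org.apache.lucene.index.PointValues.Relation;

// Hypothetical wrapper: delegates to the real visitor until roughly `maxHits`
// documents have been visited, then tells the BKD traversal that every
// remaining cell is outside the query so the intersection stops early.
class EarlyTerminatingVisitor implements IntersectVisitor {
    private final IntersectVisitor delegate;
    private final long maxHits;   // e.g. min(size, 10_000)
    private long collected = 0;

    EarlyTerminatingVisitor(IntersectVisitor delegate, long maxHits) {
        this.delegate = delegate;
        this.maxHits = maxHits;
    }

    @Override
    public void visit(int docID) throws IOException {
        // Called for docs in cells fully contained by the query.
        if (collected++ < maxHits) {
            delegate.visit(docID);
        }
    }

    @Override
    public void visit(int docID, byte[] packedValue) throws IOException {
        // Called for docs in cells that only partially match the query.
        if (collected < maxHits) {
            delegate.visit(docID, packedValue);
            collected++;
        }
    }

    @Override
    public Relation compare(byte[] minPackedValue, byte[] maxPackedValue) {
        // Once the budget is spent, claim the cell is outside the query so the
        // tree walk does not descend any further.
        if (collected >= maxHits) {
            return Relation.CELL_OUTSIDE_QUERY;
        }
        return delegate.compare(minPackedValue, maxPackedValue);
    }
}
```

Returning `Relation.CELL_OUTSIDE_QUERY` once the budget is reached means the BKD tree walk stops descending into further cells, which is what would make the early termination cheap.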
Related component
Search:Performance
Describe alternatives you've considered
The alternative is to do nothing, but this optimization could yield promising results.
Additional context
No response