Optimize 2 keyword multi-terms aggregation #13929
base: main
❌ Gradle check result for bbd49c6: FAILURE. Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?
For the POC, I ran the multi-term query below against the big5 workload and saw a 50% reduction in service time.
Benchmark with big5 workload: Total docs in index: 116000000 (11.6*10^7)
null indicates that the request timed out.
I tried to benchmark this against the eventdata workload, since the query time in the big5 workload above was too high and I needed a smaller dataset to establish gains, but sadly it doesn't look like the change improves the results. It may actually worsen performance.
Total docs in index: 20000000 (2*10^7)
while (postings1.docID() != PostingsEnum.NO_MORE_DOCS && postings2.docID() != PostingsEnum.NO_MORE_DOCS) {

    // Count of intersecting docs to get number of docs in each bucket
    if (postings1.docID() == postings2.docID()) {
        bucketCount++;
        postings1.nextDoc();
        postings2.nextDoc();
    } else if (postings1.docID() < postings2.docID()) {
        postings1.advance(postings2.docID());
    } else {
        postings2.advance(postings1.docID());
    }
}
We need to optimize this method.
Could you create a FixedBitSet for each posting list and use intersectionCount?
https://github.com/apache/lucene/blob/ebea2e1492c95b5d6b1e1032485598f901bda286/lucene/core/src/java/org/apache/lucene/util/FixedBitSet.java#L74
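To make the suggestion concrete, here is a minimal sketch of counting an intersection with bitsets. It uses `java.util.BitSet` from the JDK as a stand-in for Lucene's `FixedBitSet` so that it runs without Lucene on the classpath; the class name, the helper names, and the sample doc IDs are all hypothetical. With Lucene available, the equivalent call would be the static `FixedBitSet.intersectionCount(a, b)` from the linked file.

```java
import java.util.BitSet;

// Illustrative only: java.util.BitSet stands in for Lucene's FixedBitSet.
final class BitSetIntersectionSketch {

    /** Builds a bitset with one bit set per doc ID in a posting list. */
    static BitSet toBitSet(int[] docIds, int maxDoc) {
        BitSet bits = new BitSet(maxDoc);
        for (int docId : docIds) {
            bits.set(docId);
        }
        return bits;
    }

    /** Counts the doc IDs present in both sets without mutating either input. */
    static int intersectionCount(BitSet a, BitSet b) {
        BitSet copy = (BitSet) a.clone();
        copy.and(b);
        return copy.cardinality();
    }

    public static void main(String[] args) {
        int maxDoc = 16; // assumed number of docs in the segment
        BitSet a = toBitSet(new int[] {1, 3, 5, 8, 13}, maxDoc);
        BitSet b = toBitSet(new int[] {2, 3, 8, 9, 13, 15}, maxDoc);
        System.out.println(intersectionCount(a, b)); // shared doc IDs: 3, 8, 13
    }
}
```

The trade-off is that each bitset costs one bit per document in the segment regardless of how sparse the posting list is, so whether this beats the leapfrog loop would depend on term density and would need to be benchmarked.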
Agreed. The complexity of intersection logic is highly dependent on the documents in the posting lists. With larger datasets and higher cardinality, the leapfrogging method for intersection evaluation would require more frequent iterations over these lists, which can be expensive.
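As a rough illustration of that data dependence, the sketch below runs the same leapfrog loop over plain sorted `int[]` arrays and counts loop iterations. Everything here is an assumption made for illustration: the class and method names are hypothetical, and the binary-search `advance` is only a stand-in for `PostingsEnum.advance`, whose real cost in Lucene depends on the postings format.

```java
// Illustrative only: sorted int arrays stand in for Lucene posting lists.
final class LeapfrogCostSketch {

    /** Number of matching doc IDs plus the number of loop iterations it took. */
    static final class Result {
        final int matches;
        final int steps;

        Result(int matches, int steps) {
            this.matches = matches;
            this.steps = steps;
        }
    }

    /** Returns the first index at or after {@code from} whose value is >= target. */
    static int advance(int[] docs, int from, int target) {
        int lo = from;
        int hi = docs.length;
        while (lo < hi) {
            int mid = (lo + hi) >>> 1;
            if (docs[mid] < target) {
                lo = mid + 1;
            } else {
                hi = mid;
            }
        }
        return lo;
    }

    /** Leapfrog intersection: always advance the list that is behind. */
    static Result intersect(int[] a, int[] b) {
        int i = 0, j = 0, matches = 0, steps = 0;
        while (i < a.length && j < b.length) {
            steps++;
            if (a[i] == b[j]) {
                matches++;
                i++;
                j++;
            } else if (a[i] < b[j]) {
                i = advance(a, i, b[j]);
            } else {
                j = advance(b, j, a[i]);
            }
        }
        return new Result(matches, steps);
    }

    public static void main(String[] args) {
        // Interleaved lists: every advance moves a single position.
        int[] evens = new int[10];
        int[] odds = new int[10];
        for (int k = 0; k < 10; k++) {
            evens[k] = 2 * k;
            odds[k] = 2 * k + 1;
        }
        Result interleaved = intersect(evens, odds);

        // Skewed lists: one sparse list lets the dense one skip far ahead.
        int[] dense = new int[1000];
        for (int k = 0; k < 1000; k++) {
            dense[k] = k;
        }
        Result skewed = intersect(new int[] {5, 1000}, dense);

        System.out.println("interleaved steps=" + interleaved.steps + " matches=" + interleaved.matches);
        System.out.println("skewed steps=" + skewed.steps + " matches=" + skewed.matches);
    }
}
```

Interleaved lists of 10 entries each need far more iterations than a 2-entry list intersected with a 1000-entry one, which is the sensitivity to posting-list shape described above.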
This PR is stalled because it has been open for 30 days with no activity.
Description
Optimize multi-terms aggregation for case:
The optimization changes how buckets are collected for a segment. For the above cases, the aggregator currently reads the values of each aggregated field for every document, computes the composite key, and then updates the bucket count. The optimization instead reads the posting enums directly, so composite keys are not computed per document: each composite key is created only once, and its document count is obtained by intersecting the posting lists of the terms that make up that composite bucket.
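The difference between the two collection strategies can be sketched with plain Java collections. This is not the PR's implementation: it assumes two single-valued keyword fields, models posting lists as sorted `int[]` arrays, and every name in it (`CompositeBucketSketch`, `invert`, the `|` key separator) is hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Illustrative only: plain arrays and maps stand in for doc values and posting lists.
final class CompositeBucketSketch {

    /** Current approach: visit every doc, build its composite key, bump the bucket. */
    static Map<String, Integer> countPerDocument(String[] field1, String[] field2) {
        Map<String, Integer> buckets = new TreeMap<>();
        for (int doc = 0; doc < field1.length; doc++) {
            buckets.merge(field1[doc] + "|" + field2[doc], 1, Integer::sum);
        }
        return buckets;
    }

    /** Builds term -> sorted doc IDs, a toy version of a field's posting lists. */
    static Map<String, int[]> invert(String[] values) {
        Map<String, List<Integer>> lists = new TreeMap<>();
        for (int doc = 0; doc < values.length; doc++) {
            lists.computeIfAbsent(values[doc], term -> new ArrayList<>()).add(doc);
        }
        Map<String, int[]> postings = new TreeMap<>();
        lists.forEach((term, docs) ->
            postings.put(term, docs.stream().mapToInt(Integer::intValue).toArray()));
        return postings;
    }

    /** Two-pointer count of doc IDs present in both sorted lists. */
    static int intersectionSize(int[] a, int[] b) {
        int i = 0, j = 0, count = 0;
        while (i < a.length && j < b.length) {
            if (a[i] == b[j]) {
                count++;
                i++;
                j++;
            } else if (a[i] < b[j]) {
                i++;
            } else {
                j++;
            }
        }
        return count;
    }

    /** Posting-list approach: one composite key per term pair, count = intersection size. */
    static Map<String, Integer> countByPostings(Map<String, int[]> postings1,
                                                Map<String, int[]> postings2) {
        Map<String, Integer> buckets = new TreeMap<>();
        for (Map.Entry<String, int[]> t1 : postings1.entrySet()) {
            for (Map.Entry<String, int[]> t2 : postings2.entrySet()) {
                int count = intersectionSize(t1.getValue(), t2.getValue());
                if (count > 0) {
                    buckets.put(t1.getKey() + "|" + t2.getKey(), count);
                }
            }
        }
        return buckets;
    }

    public static void main(String[] args) {
        String[] field1 = {"a", "a", "b", "a", "b"};
        String[] field2 = {"x", "y", "x", "x", "x"};
        System.out.println(countPerDocument(field1, field2));
        System.out.println(countByPostings(invert(field1), invert(field2)));
    }
}
```

Both methods produce the same buckets, but the second builds each key once per term pair rather than once per document. The number of pairs grows with the product of the two fields' cardinalities, so the approach may pay off only when that product is small relative to the document count.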
Related Issues
Resolves #13120
Check List
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.