-
Notifications
You must be signed in to change notification settings - Fork 1.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Multi-terms Aggregation Performance Optimization #13120
Comments
So I started thinking of some ideas. One of the ideas which came was to take intersection of document set for postings data for 2 fields (in case there are 2 fields involved in a multi-term aggregation), but when doing some basic math around time complexity, it turns out that the resultant time complexity might be greater than the present approach of iterating though all documents in a a match-set. Also, taking intersection of 2 postings data only works for match-all top level query with no document deletes. Some math I brainstormed with @msfroh offline. (Expand for details)Assume D document, field1 & field2 with with both n cardinality for simplicity. Whereas if we are using value source (in present code) It seems initially that (4) < (3) I was thinking if if time to find each bucket intersection - if we can make it substantially less than O(D/n) - then we might have a chance. Although the time complexities of 2 algorithms is not an entirely apples to apples comparison, but it does looks like that the approach might not work, but again there are some gaps which we may have not yet discovered. As an extension to the above strategy, we also thought on the lines of if somehow we can cut-short some of the intersections looking at the terms frequency. The idea was to get rid of buckets with low cardinality, but then the problem was that those quick terminations can be made only at a segment level and if the fields values are not so uniformly distributed, then we might get rid of buckets which may have high cardinality in other segments. Let me see if I can find more ways to see possible optimizations. |
Linking #14993 here as it contributes to improving multi-terms aggregation. |
Starting this thread to discuss ideas for optimizing multi-terms aggregation.
Sample query:
Current flow overview:
For each document, increment the count of composite (formed using multiple fields) bucket.
Initial ideas for optimization:
Trying out to see if for certain scenarios, will it make sense to start the execution from the postings data instead. For example, taking into account the possible buckets and then finding intersection among different buckets to find intersection of documents. Finding doc intersections for different fields is something which we can experiment out to find if it makes any advantage than the current workflow in terms of performance.
The text was updated successfully, but these errors were encountered: