Multi-terms Aggregation Performance Optimization #13120

Open
sandeshkr419 opened this issue Apr 8, 2024 · 2 comments · May be fixed by #13929

Comments

@sandeshkr419
Contributor

sandeshkr419 commented Apr 8, 2024

Starting this thread to discuss ideas for optimizing multi-terms aggregation.

Sample query:

```json
{
  "size": 0,
  "aggs": {
    "important_terms": {
      "multi_terms": {
        "terms": [
          {
            "field": "process.name"
          },
          {
            "field": "cloud.region"
          }
        ]
      }
    }
  }
}
```

Current flow overview:
For each document, increment the count of the composite bucket (the bucket keyed by the combination of values from the multiple fields).
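For illustration, here is a minimal sketch of that per-document flow (hypothetical names, not the actual OpenSearch MultiTermsAggregator code, and simplified to a single value per field):

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Per-document collection: read each field's value for the doc, form the
// composite key, and increment that bucket's count.
class CompositeBucketSketch {
    private final Map<List<Object>, Long> bucketCounts = new HashMap<>();

    // valuesPerField.get(i) is the value of the i-th field for the current doc.
    void collect(List<Object> valuesPerField) {
        bucketCounts.merge(List.copyOf(valuesPerField), 1L, Long::sum);
    }

    Map<List<Object>, Long> buckets() {
        return bucketCounts;
    }
}
```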

Initial ideas for optimization:
Explore whether, for certain scenarios, it makes sense to start the execution from the postings data instead: take the possible buckets into account and then intersect the postings of the different field values to find the documents belonging to each bucket. We can experiment with finding doc intersections across fields to see whether this offers any performance advantage over the current workflow.
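As a rough illustration of that postings-driven direction (my own sketch, assuming each field value's postings is available as a sorted array of doc IDs), the doc count of a composite bucket would come from intersecting two sorted lists:

```java
class PostingsIntersectionSketch {
    // Count docs containing both termA (field1) and termB (field2) by walking
    // the two sorted postings (doc-ID) arrays in lockstep; the resulting count
    // is the size of the composite bucket (termA, termB).
    static long intersectionCount(int[] postingsA, int[] postingsB) {
        int i = 0, j = 0;
        long count = 0;
        while (i < postingsA.length && j < postingsB.length) {
            if (postingsA[i] == postingsB[j]) {
                count++;
                i++;
                j++;
            } else if (postingsA[i] < postingsB[j]) {
                i++;
            } else {
                j++;
            }
        }
        return count;
    }
}
```

Repeating this for every candidate pair of field values yields one count per composite bucket.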

@sandeshkr419 sandeshkr419 converted this from a draft issue Apr 8, 2024
@getsaurabh02 getsaurabh02 moved this from Todo to Now (This Quarter) in Performance Roadmap Apr 8, 2024
@getsaurabh02 getsaurabh02 moved this from Now (This Quarter) to In Progress in Performance Roadmap Apr 15, 2024
@sandeshkr419 sandeshkr419 changed the title Mulit-terms Aggregation Performance Optimization Multi-terms Aggregation Performance Optimization Apr 15, 2024
@sandeshkr419
Contributor Author

sandeshkr419 commented May 8, 2024

So I started thinking through some ideas.

One idea was to take the intersection of the postings document sets for the 2 fields (in the case where 2 fields are involved in a multi-terms aggregation). However, doing some basic math around time complexity suggests that the resulting complexity might be greater than the present approach of iterating through all documents in the match set. Also, intersecting the 2 postings lists only works for a match-all top-level query with no document deletes.

Some math I brainstormed with @msfroh offline:

Assume D documents, and field1 & field2 each with cardinality n, for simplicity.
Assume a uniform distribution of values across each field.
=> the number of docs per field value will be D/n
(1) A multi-terms aggregation on field1 & field2 => n^2 candidate buckets
(2) Finding each bucket intersection takes O(D/n) when using postings data, since the document sets are sorted and a linear traversal is required
(3) From (1) and (2), the total time complexity is n^2 * O(D/n) = O(n*D)

Whereas if we are using the values source (as in the present code),
the complexity is (cost of fetching the values source) * O(D). --- (4)

It initially seems that (4) < (3),
since, I would guess, the cost of fetching the values source will not exceed n.
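
To make the comparison concrete, here are illustrative numbers of my own (not taken from the discussion above):

```
D = 10^7 docs, n = 10^3 distinct values per field
(3) postings intersections: n^2 buckets * O(D/n) per bucket = O(n*D) ~ 10^10 operations
(4) values-source scan:     c * O(D) with c <= n             ~ c * 10^7 operations (at most ~10^10)
```

So (4) only reaches (3) when the values-source fetch cost is at its worst; otherwise the current approach does less work.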

I was thinking that if we can make the time to find each bucket intersection substantially less than O(D/n), then we might have a chance.

Although comparing the time complexities of the two algorithms is not entirely an apples-to-apples comparison, it does look like the approach might not work; then again, there may be gaps we have not yet discovered.

As an extension of the above strategy, we also considered whether we could short-circuit some of the intersections by looking at term frequencies. The idea was to drop low-cardinality buckets early, but the problem is that those quick terminations can only be made at the segment level; if field values are not uniformly distributed, we might drop buckets that have high cardinality in other segments.
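For what it's worth, a tiny sketch of what that segment-level cutoff would look like (a hypothetical helper, just to illustrate why it is unsafe as a global decision):

```java
class SegmentPruningSketch {
    // Frequency-based pruning idea: skip a candidate bucket when the term's doc
    // frequency within the current segment is below a threshold. The pitfall
    // described above: this doc frequency is per segment, so a term that is
    // rare here may be frequent in another segment, and pruning on it can drop
    // a bucket that is large overall.
    static boolean skipBucketInSegment(int segmentDocFreq, int minDocFreqThreshold) {
        return segmentDocFreq < minDocFreqThreshold;
    }
}
```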

Let me see if I can find more possible optimizations.

@getsaurabh02 getsaurabh02 added the v2.15.0 Issues and PRs related to version 2.15.0 label May 28, 2024
@sandeshkr419 sandeshkr419 linked a pull request Jun 3, 2024 that will close this issue
@mch2 mch2 added v2.17.0 and removed v2.15.0 Issues and PRs related to version 2.15.0 labels Jul 22, 2024
@getsaurabh02 getsaurabh02 moved this from In Progress to In-Review in Performance Roadmap Aug 5, 2024
@sandeshkr419
Contributor Author

sandeshkr419 commented Aug 5, 2024

Linking #14993 here as it contributes to improving multi-terms aggregation.
