Multi-terms Aggregation Performance Optimization #13120

Open
sandeshkr419 opened this issue Apr 8, 2024 · 2 comments · May be fixed by #13929

Comments

@sandeshkr419
Contributor

sandeshkr419 commented Apr 8, 2024

Starting this thread to discuss ideas for optimizing multi-terms aggregation.

Sample query:

```json
{
  "size": 0,
  "aggs": {
    "important_terms": {
      "multi_terms": {
        "terms": [
          {
            "field": "process.name"
          },
          {
            "field": "cloud.region"
          }
        ]
      }
    }
  }
}
```

Current flow overview:
For each document, increment the count of the composite bucket (the bucket keyed by the combination of values from the multiple fields).
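For illustration, here is a minimal sketch of that per-document flow (hypothetical names, not the actual OpenSearch MultiTermsAggregator code, and simplified to a single value per field):

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Per-document collection: read each field's value for the doc, form the
// composite key, and increment that bucket's count.
class CompositeBucketSketch {
    private final Map<List<Object>, Long> bucketCounts = new HashMap<>();

    // valuesPerField.get(i) is the value of the i-th field for the current doc.
    void collect(List<Object> valuesPerField) {
        bucketCounts.merge(List.copyOf(valuesPerField), 1L, Long::sum);
    }

    Map<List<Object>, Long> buckets() {
        return bucketCounts;
    }
}
```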

Initial ideas for optimization:
Explore whether, for certain scenarios, it makes sense to start the execution from the postings data instead: take the possible buckets into account and then intersect the postings of the different field values to find the documents belonging to each bucket. We can experiment with finding doc intersections across fields to see whether this offers any performance advantage over the current workflow.
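As a rough illustration of that postings-driven direction (my own sketch, assuming each field value's postings is available as a sorted array of doc IDs), the doc count of a composite bucket would come from intersecting two sorted lists:

```java
class PostingsIntersectionSketch {
    // Count docs containing both termA (field1) and termB (field2) by walking
    // the two sorted postings (doc-ID) arrays in lockstep; the resulting count
    // is the size of the composite bucket (termA, termB).
    static long intersectionCount(int[] postingsA, int[] postingsB) {
        int i = 0, j = 0;
        long count = 0;
        while (i < postingsA.length && j < postingsB.length) {
            if (postingsA[i] == postingsB[j]) {
                count++;
                i++;
                j++;
            } else if (postingsA[i] < postingsB[j]) {
                i++;
            } else {
                j++;
            }
        }
        return count;
    }
}
```

Repeating this for every candidate pair of field values yields one count per composite bucket.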

@sandeshkr419 sandeshkr419 converted this from a draft issue Apr 8, 2024
@getsaurabh02 getsaurabh02 moved this from Todo to Now (This Quarter) in Performance Roadmap Apr 8, 2024
@getsaurabh02 getsaurabh02 moved this from Now (This Quarter) to In Progress in Performance Roadmap Apr 15, 2024
@sandeshkr419 sandeshkr419 changed the title Mulit-terms Aggregation Performance Optimization Multi-terms Aggregation Performance Optimization Apr 15, 2024
@sandeshkr419
Contributor Author

sandeshkr419 commented May 8, 2024

So I started thinking through some ideas.

One idea was to take the intersection of the postings document sets for the 2 fields (in the case where 2 fields are involved in a multi-terms aggregation). However, doing some basic math around time complexity suggests that the resulting complexity might be greater than the present approach of iterating through all documents in the match set. Also, intersecting the 2 postings lists only works for a match-all top-level query with no document deletes.

Some math I brainstormed with @msfroh offline:

Assume D documents, and field1 & field2 each with cardinality n, for simplicity.
Assume a uniform distribution of values across each field.
=> the number of docs per field value will be D/n
(1) A multi-terms aggregation on field1 & field2 => n^2 candidate buckets
(2) Finding each bucket intersection takes O(D/n) when using postings data, since the document sets are sorted and a linear traversal is required
(3) From (1) and (2), the total time complexity is n^2 * O(D/n) = O(n*D)

Whereas if we are using the values source (as in the present code),
the complexity is (cost of fetching the values source) * O(D). --- (4)

It initially seems that (4) < (3),
since, I would guess, the cost of fetching the values source will not exceed n.
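
To make the comparison concrete, here are illustrative numbers of my own (not taken from the discussion above):

```
D = 10^7 docs, n = 10^3 distinct values per field
(3) postings intersections: n^2 buckets * O(D/n) per bucket = O(n*D) ~ 10^10 operations
(4) values-source scan:     c * O(D) with c <= n             ~ c * 10^7 operations (at most ~10^10)
```

So (4) only reaches (3) when the values-source fetch cost is at its worst; otherwise the current approach does less work.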

I was thinking that if we can make the time to find each bucket intersection substantially less than O(D/n), then we might have a chance.

Although comparing the time complexities of the two algorithms is not entirely an apples-to-apples comparison, it does look like the approach might not work; then again, there may be gaps we have not yet discovered.

As an extension of the above strategy, we also considered whether we could short-circuit some of the intersections by looking at term frequencies. The idea was to drop low-cardinality buckets early, but the problem is that those quick terminations can only be made at the segment level; if field values are not uniformly distributed, we might drop buckets that have high cardinality in other segments.
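For what it's worth, a tiny sketch of what that segment-level cutoff would look like (a hypothetical helper, just to illustrate why it is unsafe as a global decision):

```java
class SegmentPruningSketch {
    // Frequency-based pruning idea: skip a candidate bucket when the term's doc
    // frequency within the current segment is below a threshold. The pitfall
    // described above: this doc frequency is per segment, so a term that is
    // rare here may be frequent in another segment, and pruning on it can drop
    // a bucket that is large overall.
    static boolean skipBucketInSegment(int segmentDocFreq, int minDocFreqThreshold) {
        return segmentDocFreq < minDocFreqThreshold;
    }
}
```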

Let me see if I can find more possible optimizations.

@getsaurabh02 getsaurabh02 added the v2.15.0 Issues and PRs related to version 2.15.0 label May 28, 2024
@sandeshkr419 sandeshkr419 linked a pull request Jun 3, 2024 that will close this issue
@mch2 mch2 added v2.17.0 and removed v2.15.0 Issues and PRs related to version 2.15.0 labels Jul 22, 2024
@getsaurabh02 getsaurabh02 moved this from In Progress to In-Review in Performance Roadmap Aug 5, 2024
@sandeshkr419
Contributor Author

sandeshkr419 commented Aug 5, 2024

Linking #14993 here as it contributes to improving multi-terms aggregation.
