
Query rewrite to improve caching of co-occurring required clauses #10818

Open
msfroh opened this issue Oct 21, 2023 · 0 comments
Labels: enhancement (Enhancement or improvement to existing feature or request), Search (Search query, autocomplete ...etc)

Comments

@msfroh
Collaborator

msfroh commented Oct 21, 2023

Is your feature request related to a problem? Please describe.
Lucene has a built-in LRU query cache that saves doc ID bitsets of queries (and subqueries). Unfortunately, it's relatively naive insofar as it does exactly what it is asked to do.

Suppose you have a query like:

"bool": {
  "must": [
    { 
      "match" : {
        "message" : "serious error"
      }
    }
  ],
  "filter": [
    {
      "term" : { "log_type" : "application_log" }
    },
    {
      "terms": { "status_code" : ["400","401","403","429","500","503"] }
    },
    {
      "range": { "timestamp" : { "gte": <24 hours ago in epoch milliseconds> } }
    }
  ],
  "must_not": [
    {
      "term" : { "classification" : "super_secret" }
    }
  ]
}

That whole Boolean query could be cached, but because of the "range": { "timestamp" : { "gte": <24 hours ago in epoch milliseconds> } } clause, an application sending "the same" query even a millisecond later will technically send a different query.
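
To see why the millisecond matters, here is a minimal Python sketch. It is a hypothetical stand-in for Lucene's actual cache key (which is based on Query equals()/hashCode(), not JSON serialization), but it models the same behavior: the dynamic range endpoint leaks into the key, so "the same" query never hits the cache.

```python
import json

def cache_key(query: dict) -> int:
    # Hypothetical stand-in for Lucene's cache key; we just hash the
    # canonically serialized query.
    return hash(json.dumps(query, sort_keys=True))

def build_query(now_ms: int) -> dict:
    # A trimmed version of the query above, with the dynamic
    # 24-hour range filled in from the current clock.
    return {
        "bool": {
            "must": [{"match": {"message": "serious error"}}],
            "filter": [
                {"term": {"log_type": "application_log"}},
                {"range": {"timestamp": {"gte": now_ms - 24 * 3600 * 1000}}},
            ],
        }
    }

q1 = build_query(1_697_900_000_000)
q2 = build_query(1_697_900_000_001)  # "the same" query, one millisecond later
assert cache_key(q1) != cache_key(q2)  # different key, so a guaranteed cache miss
```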

Imagine if instead we rewrite the query like:

"bool": {
  "must": [
    { 
      "match" : {
        "message" : "serious error"
      }
    }
  ],
  "filter": [
    {
      "range": { "timestamp" : { "gte": <24 hours ago in epoch milliseconds> } }
    },
    {
      "bool": {
        "filter": [
          {
            "term" : { "log_type" : "application_log" }
          },
          {
            "terms": { "status_code" : ["400","401","403","429","500","503"] }
          }
        ],
        "must_not": [
          {
            "term" : { "classification" : "super_secret" }
          }
        ]
      }
    }
  ]
}

In this case, we've isolated the "static" constraints on log_type, status_code, and classification from the "dynamic" clauses (the user input and time range). The documents that satisfy the static constraints can be computed once (per segment) and cached as a bitset without recomputing their intersection.
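
The effect can be sketched with plain Python sets standing in for per-segment doc-ID bitsets (the doc IDs and cache key below are made up for illustration):

```python
# Hypothetical per-segment doc-ID sets for each static clause.
log_type_docs = {1, 2, 3, 5, 8, 9}    # log_type = application_log
status_code_docs = {2, 3, 4, 5, 9}    # status_code in the error list
not_secret_docs = {1, 2, 3, 5, 9}     # NOT classification = super_secret

cache = {}

def static_bitset():
    # Compute the intersection of the static clauses once and cache it;
    # subsequent queries reuse it regardless of their dynamic clauses.
    key = "log_type+status_code+!classification"
    if key not in cache:
        cache[key] = log_type_docs & status_code_docs & not_secret_docs
    return cache[key]

# Two requests with different time windows reuse the same cached set.
hits_now = static_bitset() & {3, 5, 9, 11}       # this request's 24h window
hits_later = static_bitset() & {2, 3, 5, 9, 11}  # a later request's window
```

Only the final intersection with the dynamic window is recomputed per request; the three-way static intersection is paid once per segment.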

Describe the solution you'd like
Ideally, I would like OpenSearch to see a few queries like the first version above and start rewriting them to the second form.

The approach I used previously (see the mailing list in the additional context below) involved training offline against query logs, using the FP-Growth algorithm to find frequently co-occurring required clauses. We could save a window of queries (on heap? in cluster state?), probably collected at the coordinator node level, and periodically check them for frequently co-occurring required/forbidden clauses. If we see any of them in a query, we could pull them out into a separate Boolean query.
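
As a rough illustration of the mining step, the sketch below counts co-occurring clause subsets directly with `itertools.combinations`; FP-Growth reaches the same frequent sets without materializing every combination, which matters on large logs. The clause strings, window, and support threshold are all hypothetical:

```python
from collections import Counter
from itertools import combinations

# A window of recent queries, each reduced to the set of its
# required/forbidden clauses (canonical strings, invented for illustration).
query_window = [
    frozenset({"log_type=application_log", "class!=super_secret", "range:ts"}),
    frozenset({"log_type=application_log", "class!=super_secret", "match:error"}),
    frozenset({"log_type=application_log", "class!=super_secret"}),
    frozenset({"match:warning", "range:ts"}),
]

def frequent_clause_sets(window, min_support=3, max_size=3):
    # Count every clause subset of size 2..max_size across the window and
    # keep the ones that appear at least min_support times.
    counts = Counter()
    for clauses in window:
        for size in range(2, max_size + 1):
            for combo in combinations(sorted(clauses), size):
                counts[combo] += 1
    return {combo for combo, n in counts.items() if n >= min_support}

frequent = frequent_clause_sets(query_window)
# The (log_type, classification) pair shows up in 3 of 4 queries, so it is
# a candidate to factor out into its own cached sub-query.
```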

We may also benefit from putting these queries in their own dedicated LRU cache that gets eagerly warmed on refresh.
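
A dedicated cache along those lines might look like the following sketch: an ordinary LRU built on `OrderedDict`, plus an eager rewarm hook called on refresh. The class and method names are invented for illustration, not an existing OpenSearch API:

```python
from collections import OrderedDict

class WarmedQueryCache:
    """Hypothetical LRU cache for static-clause bitsets."""

    def __init__(self, capacity=128):
        self.capacity = capacity
        self.entries = OrderedDict()

    def get(self, key, compute):
        if key in self.entries:
            self.entries.move_to_end(key)     # mark as most recently used
            return self.entries[key]
        value = compute()
        self.entries[key] = value
        if len(self.entries) > self.capacity:
            self.entries.popitem(last=False)  # evict least recently used
        return value

    def warm_on_refresh(self, recompute_fns):
        # On refresh, eagerly recompute every tracked static bitset so the
        # first search after a refresh does not pay the intersection cost.
        for key, fn in recompute_fns.items():
            self.entries[key] = fn()
            self.entries.move_to_end(key)

cache = WarmedQueryCache(capacity=2)
cache.warm_on_refresh({"static_filters": lambda: {2, 3, 5, 9}})
# A later search finds the bitset already present.
hits = cache.get("static_filters", compute=lambda: set()) & {3, 5, 11}
```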

Describe alternatives you've considered
In theory, we could argue that "it's the user's fault", since the application developers are the ones most likely to understand which constraints are likely to be static (and frequently co-occurring) and which ones are going to be dynamic (relative to user input/context, the current time, etc.). The application developer "should have" sent the query in the second form above. Of course, that's not a very user-friendly attitude to take. If we can notice the patterns and rewrite the query automatically, it's a much better user experience.

See also the mailing list thread in the "additional context" below for some discussion on offline training and index-time precomputation of intersections of required/forbidden clauses. That's a really nice way to do it, but is probably not useful to most OpenSearch users.

Additional context
Here is a thread I started on the Lucene mailing list a few years back, discussing the implementation that we had on Amazon Product Search: https://lists.apache.org/thread/0lmgrhbnkx8kd3omgy1ccrybvlnq941y. Our existing implementation worked well for that specific use-case (with regular reindexing and access to large volumes of query logs), such that we could precompute the intersection of required clauses in an offline training job then index that intersection as a special term on matching documents. @rmuir had the excellent suggestion of treating it as a query caching problem instead, which is much more broadly applicable.

@msfroh msfroh added enhancement Enhancement or improvement to existing feature or request untriaged labels Oct 21, 2023
@ankitkala ankitkala added Search Search query, autocomplete ...etc and removed untriaged Indexing & Search labels Feb 15, 2024
@getsaurabh02 getsaurabh02 moved this from 🆕 New to Later (6 months plus) in Search Project Board Aug 15, 2024