
Query rewrite to improve caching of co-occurring required clauses #10818

Open
msfroh opened this issue Oct 21, 2023 · 0 comments
Labels: enhancement (Enhancement or improvement to existing feature or request), Search (Search query, autocomplete ...etc)

Comments

@msfroh
Collaborator

msfroh commented Oct 21, 2023

Is your feature request related to a problem? Please describe.
Lucene has a built-in LRU query cache that saves doc ID bitsets of queries (and subqueries). Unfortunately, it's relatively naive insofar as it does exactly what it is asked to do.

Suppose you have a query like:

"bool": {
  "must": [
    { 
      "match" : {
        "message" : "serious error"
      }
    }
  ],
  "filter": [
    {
      "term" : { "log_type" : "application_log" }
    },
    {
      "terms": { "status_code" : ["400","401","403","429","500","503"] }
    },
    {
      "range": { "timestamp" : { "gte": <24 hours ago in epoch milliseconds> } }
    }
  ],
  "must_not": [
    {
      "term" : { "classification" : "super_secret" }
    }
  ]
}

That whole Boolean query could be cached, but because of the "range": { "timestamp" : { "gte": <24 hours ago in epoch milliseconds> } } clause, an application sending "the same" query even a millisecond later will technically send a different query.
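
To see why the millisecond matters, here is a minimal Python sketch. It is a hypothetical stand-in for Lucene's actual cache key (which is based on Query equals()/hashCode(), not JSON serialization), but it models the same behavior: the dynamic range endpoint leaks into the key, so "the same" query never hits the cache.

```python
import json

def cache_key(query: dict) -> int:
    # Hypothetical stand-in for Lucene's cache key; we just hash the
    # canonically serialized query.
    return hash(json.dumps(query, sort_keys=True))

def build_query(now_ms: int) -> dict:
    # A trimmed version of the query above, with the dynamic
    # 24-hour range filled in from the current clock.
    return {
        "bool": {
            "must": [{"match": {"message": "serious error"}}],
            "filter": [
                {"term": {"log_type": "application_log"}},
                {"range": {"timestamp": {"gte": now_ms - 24 * 3600 * 1000}}},
            ],
        }
    }

q1 = build_query(1_697_900_000_000)
q2 = build_query(1_697_900_000_001)  # "the same" query, one millisecond later
assert cache_key(q1) != cache_key(q2)  # different key, so a guaranteed cache miss
```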

Imagine if instead we rewrite the query like:

"bool": {
  "must": [
    { 
      "match" : {
        "message" : "serious error"
      }
    }
  ],
  "filter": [
    {
      "range": { "timestamp" : { "gte": <24 hours ago in epoch milliseconds> } }
    },
    {
      "bool": {
        "filter": [
          {
            "term" : { "log_type" : "application_log" }
          },
          {
            "terms": { "status_code" : ["400","401","403","429","500","503"] }
          }
        ],
        "must_not": [
          {
            "term" : { "classification" : "super_secret" }
          }
        ]
      }
    }
  ]
}

In this case, we've isolated the "static" constraints on log_type, status_code, and classification from the "dynamic" clauses (the user input and time range). The documents that satisfy the static constraints can be computed once (per segment) and cached as a bitset without recomputing their intersection.
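
The effect can be sketched with plain Python sets standing in for per-segment doc-ID bitsets (the doc IDs and cache key below are made up for illustration):

```python
# Hypothetical per-segment doc-ID sets for each static clause.
log_type_docs = {1, 2, 3, 5, 8, 9}    # log_type = application_log
status_code_docs = {2, 3, 4, 5, 9}    # status_code in the error list
not_secret_docs = {1, 2, 3, 5, 9}     # NOT classification = super_secret

cache = {}

def static_bitset():
    # Compute the intersection of the static clauses once and cache it;
    # subsequent queries reuse it regardless of their dynamic clauses.
    key = "log_type+status_code+!classification"
    if key not in cache:
        cache[key] = log_type_docs & status_code_docs & not_secret_docs
    return cache[key]

# Two requests with different time windows reuse the same cached set.
hits_now = static_bitset() & {3, 5, 9, 11}       # this request's 24h window
hits_later = static_bitset() & {2, 3, 5, 9, 11}  # a later request's window
```

Only the final intersection with the dynamic window is recomputed per request; the three-way static intersection is paid once per segment.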

Describe the solution you'd like
Ideally, I would like OpenSearch to see a few queries like the first version above and start rewriting them to the second form.

The approach I used previously (see the mailing list in the additional context below) involved training offline against query logs, using the FP-Growth algorithm to find frequently co-occurring required clauses. We could save a window of queries (on heap? in cluster state?), probably collected at the coordinator node level, and periodically check them for frequently co-occurring required/forbidden clauses. If we see any of them in a query, we could pull them out into a separate Boolean query.
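
As a rough illustration of the mining step, the sketch below counts co-occurring clause subsets directly with `itertools.combinations`; FP-Growth reaches the same frequent sets without materializing every combination, which matters on large logs. The clause strings, window, and support threshold are all hypothetical:

```python
from collections import Counter
from itertools import combinations

# A window of recent queries, each reduced to the set of its
# required/forbidden clauses (canonical strings, invented for illustration).
query_window = [
    frozenset({"log_type=application_log", "class!=super_secret", "range:ts"}),
    frozenset({"log_type=application_log", "class!=super_secret", "match:error"}),
    frozenset({"log_type=application_log", "class!=super_secret"}),
    frozenset({"match:warning", "range:ts"}),
]

def frequent_clause_sets(window, min_support=3, max_size=3):
    # Count every clause subset of size 2..max_size across the window and
    # keep the ones that appear at least min_support times.
    counts = Counter()
    for clauses in window:
        for size in range(2, max_size + 1):
            for combo in combinations(sorted(clauses), size):
                counts[combo] += 1
    return {combo for combo, n in counts.items() if n >= min_support}

frequent = frequent_clause_sets(query_window)
# The (log_type, classification) pair shows up in 3 of 4 queries, so it is
# a candidate to factor out into its own cached sub-query.
```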

We may also benefit from putting these queries in their own dedicated LRU cache that gets eagerly warmed on refresh.
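
A dedicated cache along those lines might look like the following sketch: an ordinary LRU built on `OrderedDict`, plus an eager rewarm hook called on refresh. The class and method names are invented for illustration, not an existing OpenSearch API:

```python
from collections import OrderedDict

class WarmedQueryCache:
    """Hypothetical LRU cache for static-clause bitsets."""

    def __init__(self, capacity=128):
        self.capacity = capacity
        self.entries = OrderedDict()

    def get(self, key, compute):
        if key in self.entries:
            self.entries.move_to_end(key)     # mark as most recently used
            return self.entries[key]
        value = compute()
        self.entries[key] = value
        if len(self.entries) > self.capacity:
            self.entries.popitem(last=False)  # evict least recently used
        return value

    def warm_on_refresh(self, recompute_fns):
        # On refresh, eagerly recompute every tracked static bitset so the
        # first search after a refresh does not pay the intersection cost.
        for key, fn in recompute_fns.items():
            self.entries[key] = fn()
            self.entries.move_to_end(key)

cache = WarmedQueryCache(capacity=2)
cache.warm_on_refresh({"static_filters": lambda: {2, 3, 5, 9}})
# A later search finds the bitset already present.
hits = cache.get("static_filters", compute=lambda: set()) & {3, 5, 11}
```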

Describe alternatives you've considered
In theory, we could argue that "it's the user's fault", since the application developers are the ones most likely to understand which constraints are likely to be static (and frequently co-occurring) and which ones are going to be dynamic (relative to user input/context, the current time, etc.). The application developer "should have" sent the query in the second form above. Of course, that's not a very user-friendly attitude to take. If we can notice the patterns and rewrite the query automatically, it's a much better user experience.

See also the mailing list thread in the "additional context" below for some discussion on offline training and index-time precomputation of intersections of required/forbidden clauses. That's a really nice way to do it, but is probably not useful to most OpenSearch users.

Additional context
Here is a thread I started on the Lucene mailing list a few years back, discussing the implementation that we had on Amazon Product Search: https://lists.apache.org/thread/0lmgrhbnkx8kd3omgy1ccrybvlnq941y. Our existing implementation worked well for that specific use-case (with regular reindexing and access to large volumes of query logs), such that we could precompute the intersection of required clauses in an offline training job then index that intersection as a special term on matching documents. @rmuir had the excellent suggestion of treating it as a query caching problem instead, which is much more broadly applicable.

@msfroh msfroh added enhancement Enhancement or improvement to existing feature or request untriaged labels Oct 21, 2023
@ankitkala ankitkala added Search Search query, autocomplete ...etc and removed untriaged Indexing & Search labels Feb 15, 2024
@getsaurabh02 getsaurabh02 moved this from 🆕 New to Later (6 months plus) in Search Project Board Aug 15, 2024