Query rewrite to improve caching of co-occurring required clauses #10818
Labels
enhancement
Enhancement or improvement to existing feature or request
Search
Search query, autocomplete ...etc
Is your feature request related to a problem? Please describe.
Lucene has a built-in LRU query cache that saves doc ID bitsets of queries (and subqueries). Unfortunately, it's relatively naive insofar as it does exactly what it is asked to do.
Suppose you have a query like:
That whole Boolean query is could be cached, but because of the
"range": { "timestamp" : { "gte": <24 hours ago in epoch milliseconds> } }
, an application sending "the same" query even a millisecond later will technically send a different query.Imagine if instead we rewrite the query like:
I this case, we've isolated the "static" constraints on
log_type
,status_code
, andclassification
from the "dynamic" clauses (the user input and time range). The documents that satisfy the static constraints can be computed once (per segment) and cached as a bitset without recomputing their intersection.Describe the solution you'd like
Ideally, I would like OpenSearch to see a few queries like the first version above and start rewriting them to the second form.
The approach I used previously (see the mailing list in the additional context below) involved training offline against query logs, using the FP-Growth algorithm to find frequently cooccurring required clauses. We could probably save a window of queries (on heap? in cluster state?), probably collected at the coordinator node level and periodically check them for frequently-occurring required/forbidden clauses. If we see any of them in a query, we could pull them out into a separate boolean query.
We may also benefit from putting these queries in their own dedicated LRU cache that gets eagerly warmed on refresh.
Describe alternatives you've considered
In theory, we could argue that "it's the user's fault", since the application developers are the ones most likely to understand what constraints are likely to be static (and frequently co-occurring) and which ones are going to be dynamic (relative to user input/context, the current time, etc). The application developer "should have" sent the query in the second form above. Of course, that's not a very user-friendly attitude to take. If we can notice the patterns and rewrite the query automatically, it's a much better user experience.
See also the mailing list thread in the "additional context" below for some discussion on offline training and index-time precomputation of intersections of required/forbidden clauses. That's a really nice way to do it, but is probably not useful to most OpenSearch users.
Additional context
Here is a thread I started on the Lucene mailing list a few years back, discussing the implementation that we had on Amazon Product Search: https://lists.apache.org/thread/0lmgrhbnkx8kd3omgy1ccrybvlnq941y. Our existing implementation worked well for that specific use-case (with regular reindexing and access to large volumes of query logs), such that we could precompute the intersection of required clauses in an offline training job then index that intersection as a special term on matching documents. @rmuir had the excellent suggestion of treating it as a query caching problem instead, which is much more broadly applicable.
The text was updated successfully, but these errors were encountered: