[RFC] Explore the use of document BPReorderer introduced in lucene #12257
Labels: enhancement, RFC, Roadmap:Cost/Performance/Scale, Search:Performance
Is your feature request related to a problem? Please describe
I'm exploring the use of apache/lucene#12489 to reorder documents at segment-merge time in order to optimize postings storage. I would like to understand the gains in terms of storage cost as well as the improvements in query runtime cost and efficiency, analyze the tradeoffs in resource consumption during segment merges and the impact on indexing, and possibly optimize the algorithm for specialized workloads. Once we have a fair understanding of the gains and tradeoffs, we can integrate it into OpenSearch by exposing a `reorder` API against an index, along with an index management action. If we can integrate it smartly as part of segment merges, which the Lucene community is exploring in apache/lucene#12665, that would be our best bet, since it avoids exposing the mechanism to users.

Describe the solution you'd like
This is still in an early phase, and I'm thinking of breaking the work down into the following tracks -
Understand and measure the impact of running the reorderer - For this I'm thinking of running it against the various workloads available in OpenSearch Benchmark. Measuring system resource utilization while the reorderer runs, and comparing it against the baseline, should help us understand the impact. We can also configure Lucene's throttling mechanisms for segment merges and compare against the baseline, and we can try limiting the number of CPU cores available to the reorderer and comparing the runtime against the baseline. The reorderer algorithm implemented in Lucene operates within an assigned RAM budget, so we can tune that budget to understand the memory requirements.
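For the merge-throttling comparison, a hedged starting point, assuming OpenSearch still exposes Lucene's ConcurrentMergeScheduler knobs as dynamic index settings the way its Elasticsearch lineage did (verify the setting names against your version's docs):

```
PUT /my-index/_settings
{
  "index": {
    "merge.scheduler.max_thread_count": 1,
    "merge.scheduler.auto_throttle": true
  }
}
```

Comparing runs with throttling on and off (and with `max_thread_count` varied) against the unreordered baseline should isolate how much merge headroom the reorderer consumes.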
Measuring query runtime cost: Reordering documents by how similar they are has a direct correlation with query results too: the more similar two documents are, the more likely they are to appear in the same query's results. This can translate into better performance and lower cost for a given query, as seen in "Enable recursive graph bisection out of the box?" apache/lucene#12665 (comment). We also expect queries to perform better in resource-constrained environments in particular, so I'm looking for ideas on how to simulate such an environment. Maybe limit the number of search threads, or use an instance with so little memory that it can barely fit the complete index?
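To simulate a resource-constrained environment along the lines suggested above, one knob is the search thread pool size, a static setting in `opensearch.yml` (the values here are illustrative, not recommendations):

```yaml
# opensearch.yml: shrink the search thread pool so query-time effects
# of document ordering show up under CPU contention.
thread_pool.search.size: 2
thread_pool.search.queue_size: 100
```

Pairing this with a small-memory instance whose page cache cannot hold the full index should make any locality benefit from reordering more visible.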
Optimize the algorithm for specialized use cases - like time series data or security analytics?
The bipartite graph partitioning algorithm used in the Lucene implementation for document reordering has a runtime of `O(M logN + N log^2 N)`, where N is the number of documents and M is the total number of postings. The `O(M logN)` term is the long pole for most use cases because M >> N; it comprises the part where every document's score is computed using the log-gap method described in the paper (https://arxiv.org/pdf/1602.08820.pdf). Each document is assigned a score capturing its affinity toward the left/right partition, which needs to be recomputed every time documents are shuffled between partitions, and again at every recursive call on sub-partitions. jpountz and gf2121 have already optimized the algorithm significantly by using the forward index to compute this score and by not recomputing the complete score of all documents after each shuffle iteration. Can we optimize it further, for specialized use cases or in general?

Let me know your thoughts or ideas to improve upon any of these tracks.
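To make the log-gap objective concrete, here is a toy, Lucene-free sketch (plain Java; the cost model, bits ≈ sum of log2(gap) over consecutive doc-ID deltas, is a simplification of the objective in the paper):

```java
public class LogGapCost {
    // Approximate encoded size of a delta-coded postings list:
    // sum of log2(gap) bits over consecutive doc-ID gaps.
    public static double cost(int[] sortedDocIds) {
        double bits = 0;
        for (int i = 1; i < sortedDocIds.length; i++) {
            bits += Math.log(sortedDocIds[i] - sortedDocIds[i - 1]) / Math.log(2);
        }
        return bits;
    }

    public static void main(String[] args) {
        // Same four matching docs, clustered vs. scattered across the segment.
        double clustered = cost(new int[]{10, 11, 12, 13});  // gaps of 1 -> 0 bits each
        double scattered = cost(new int[]{10, 40, 70, 100}); // gaps of 30 -> ~4.9 bits each
        System.out.printf("clustered=%.1f scattered=%.1f%n", clustered, scattered);
    }
}
```

Clustering the documents that a term matches shrinks the gaps, and hence the estimated bits, which is what the bisection tries to minimize globally across all postings lists.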
Related component
Search:Performance
Describe alternatives you've considered
No response
Additional context
No response