[RFC] Explore the use of document BPReorderer introduced in lucene #12257

Open
rishabhmaurya opened this issue Feb 8, 2024 · 1 comment
Assignees
Labels
enhancement Enhancement or improvement to existing feature or request RFC Issues requesting major changes Roadmap:Cost/Performance/Scale Project-wide roadmap label Search:Performance

Comments

@rishabhmaurya
Contributor

Is your feature request related to a problem? Please describe

I'm exploring the use of apache/lucene#12489 for document reordering at segment-merge time to reduce postings storage. I would like to understand the gains in storage cost and in query runtime cost and efficiency, and to analyze the tradeoffs in segment-merge resource consumption and the impact on indexing. It may also be possible to optimize the algorithm for specialized workloads. Once we have a fair understanding of the gains and tradeoffs, we can integrate it into OpenSearch by exposing a reorder API against an index, along with an index management action. If it can be integrated smartly into segment merges, which the Lucene community is exploring in apache/lucene#12665, that would be our best bet, since it avoids exposing it to users.

Describe the solution you'd like

This is still at an early phase, and I'm thinking of breaking it down into the following tracks:

  1. Understand and measure the impact of running the reorderer: I'm thinking of running it against the various workloads available in OpenSearch Benchmark. Measuring system resource utilization while the reorderer runs and comparing it against a baseline should help us understand the impact. We can also configure Lucene's throttling mechanisms for segment merges and compare against the baseline, and try limiting the number of CPU cores available to the reorderer to see how its runtime changes. The reorderer implemented in Lucene has a RAM budget assigned, so we can tune that budget to understand the memory requirements.

  2. Measure query runtime cost: Reordering documents by similarity also affects query execution. The more similar two documents are, the more likely they are to appear in the same query results together. This can translate into better query performance and a lower cost to run a given query, as seen here - Enable recursive graph bisection out of the box? apache/lucene#12665 (comment). We also expect queries to perform better especially in resource-constrained environments, so I'm looking for ideas on how to simulate such an environment. Maybe limit the number of search threads, or use an instance with limited memory that can barely fit the complete index?

  3. Optimize the algorithm for specialized use cases, such as time series data or security analytics.
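For the storage side of these tracks, a rough way to reason about the effect of a reordering is to sum log2 of the doc ID gaps in a postings list before and after applying a doc ID mapping. The sketch below is only a toy estimate under that assumption. It is not Lucene's postings codec, and the class name `GapCostSketch`, the method names, and the sample mapping are all hypothetical.

```java
import java.util.Arrays;

/** Toy estimate of delta-encoded postings size. Hypothetical helper, not Lucene code. */
final class GapCostSketch {

    /** Sum of log2(gap) over a sorted postings list; the first gap is measured from doc ID -1. */
    static double logGapBits(int[] sortedDocIds) {
        double bits = 0;
        int prev = -1;
        for (int doc : sortedDocIds) {
            bits += Math.log(doc - prev) / Math.log(2);
            prev = doc;
        }
        return bits;
    }

    /** Applies an old-to-new doc ID mapping to a postings list and sorts the result. */
    static int[] remap(int[] postings, int[] oldToNew) {
        int[] out = new int[postings.length];
        for (int i = 0; i < postings.length; i++) {
            out[i] = oldToNew[postings[i]];
        }
        Arrays.sort(out);
        return out;
    }

    public static void main(String[] args) {
        // A term that occurs in every other document of an 8-document segment.
        int[] postings = {0, 2, 4, 6};
        // Assumed reordering that places the matching documents next to each other.
        int[] oldToNew = {0, 4, 1, 5, 2, 6, 3, 7};

        System.out.println(logGapBits(postings));                  // 3.0
        System.out.println(logGapBits(remap(postings, oldToNew))); // 0.0
    }
}
```

Summing this estimate over all terms of a field, before and after a merge with reordering, would give a codec-independent number to compare alongside the actual on-disk index size.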

The bipartite graph partitioning algorithm used in the Lucene implementation for document reordering has a runtime of O(M log N + N log^2 N), where N is the number of documents and M is the total number of postings. The O(M log N) term is the long pole for most use cases because M >> N. This term covers the part where the score of every document is computed using the log-gap method described in the paper (https://arxiv.org/pdf/1602.08820.pdf). Each document is assigned a score that measures its affinity toward the left or right partition, and this score needs to be computed every time documents are shuffled between partitions and at every recursive call on the sub-partitions. jpountz and gf2121 have optimized the algorithm significantly by using a forward index to compute this score and by not recomputing the complete score of all documents after each shuffle iteration. Can we optimize it further, either for specialized use cases or in general?
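To make the per-document score concrete, here is a minimal sketch of a log-gap style move gain computed from a forward index, following my reading of the cost function in the paper. It is a simplified illustration and not the Lucene implementation. The class `MoveGainSketch`, its method names, and the array layout (`leftDf` and `rightDf` holding per-term document counts in each partition) are assumptions made for this example.

```java
/** Simplified log-gap move gain for one bisection step. Hypothetical helper, not Lucene code. */
final class MoveGainSketch {

    static double log2(double x) {
        return Math.log(x) / Math.log(2);
    }

    /** Estimated cost of one term with dL and dR postings in partitions of size nL and nR. */
    static double termCost(int dL, int dR, int nL, int nR) {
        return dL * log2((double) nL / (dL + 1)) + dR * log2((double) nR / (dR + 1));
    }

    /**
     * Estimated cost reduction from moving one document from the left partition to the right.
     * docTerms is the forward-index entry for the document (its term IDs); leftDf and rightDf
     * hold, per term ID, how many documents in each partition contain that term.
     * Partition sizes are held fixed, assuming documents are exchanged in pairs.
     */
    static double moveGain(int[] docTerms, int[] leftDf, int[] rightDf, int nL, int nR) {
        double gain = 0;
        for (int t : docTerms) {
            double before = termCost(leftDf[t], rightDf[t], nL, nR);
            double after = termCost(leftDf[t] - 1, rightDf[t] + 1, nL, nR);
            gain += before - after;
        }
        return gain;
    }

    public static void main(String[] args) {
        int[] leftDf = {1, 3};   // term 0 is rare on the left, term 1 is common on the left
        int[] rightDf = {3, 0};  // term 0 is common on the right, term 1 is absent there
        // A left document containing only term 0 is pulled toward the right partition.
        System.out.println(moveGain(new int[] {0}, leftDf, rightDf, 4, 4) > 0); // true
        // A left document containing only term 1 is better off staying on the left.
        System.out.println(moveGain(new int[] {1}, leftDf, rightDf, 4, 4) < 0); // true
    }
}
```

Because the loop visits each term of a document once, scoring every document in one pass touches each posting once, which is roughly where the M factor per recursion level comes from. Any specialization for time series or security analytics data would presumably need to cut either the number of passes or the number of terms visited per document.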

Let me know your thoughts or ideas for improving any of these tracks.

Related component

Search:Performance

Describe alternatives you've considered

No response

Additional context

No response

@rishabhmaurya rishabhmaurya added enhancement Enhancement or improvement to existing feature or request untriaged labels Feb 8, 2024
@getsaurabh02 getsaurabh02 moved this from Todo to In Progress in Performance Roadmap Feb 8, 2024
@rishabhmaurya rishabhmaurya self-assigned this Feb 9, 2024
@peternied peternied added the RFC Issues requesting major changes label Feb 14, 2024
@peternied
Member

[Triage - attendees 1 2 3 4 5 6 7 8]
@rishabhmaurya Thanks for filing, we look forward to seeing where this goes.

@sohami sohami added the Roadmap:Cost/Performance/Scale Project-wide roadmap label label May 14, 2024
@github-project-automation github-project-automation bot moved this to Planned work items in OpenSearch Roadmap May 31, 2024
@getsaurabh02 getsaurabh02 moved this from In Progress to Now (This Quarter) in Performance Roadmap Aug 5, 2024
@getsaurabh02 getsaurabh02 moved this from 🆕 New to Later (6 months plus) in Search Project Board Aug 15, 2024
Projects
Status: New
Status: Now (This Quarter)
Status: Later (6 months plus)
Development

No branches or pull requests

3 participants