Reduce created segments when there is low traffic on a shard #618

Open
2 tasks
itiyamas opened this issue Apr 27, 2021 · 8 comments
Labels
distributed framework enhancement Enhancement or improvement to existing feature or request Indexing:Performance Priority-Low RFC Issues requesting major changes

Comments

@itiyamas
Contributor

itiyamas commented Apr 27, 2021

Lucene creates one segment per active concurrent thread per shard. The number of active concurrent threads per shard is determined by the bulk thread pool in ES, whose workers consume the queue holding sub-bulk requests. Each sub-bulk request ends up being picked up by a different thread, so multiple segments are created during refresh, resulting in better performance during reads. When there is low traffic on a particular shard on a node, we can potentially reduce the number of created segments by coalescing bulk requests.

When there are a lot of shards on a single node for different indices, this problem may be aggravated further.

The proposal is to optimize this entire process via the following tasks:

  • Pipeline the CPU work further into different tasks (parsing, translog, Lucene add) per shard.
  • Coalesce the Lucene documents across different bulk requests by adding a separate queue for Lucene adds.
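
A rough sketch of what the second task could look like, as a toy model only. `CoalescingQueue`, `submit`, `drain`, and `max_batch_docs` are hypothetical names invented for illustration; none of them exist in OpenSearch or Lucene. The idea shown is that parsed documents from several bulk requests land in one per-shard queue, and a single writer drains them in batches regardless of which request they came from.

```python
from collections import deque


class CoalescingQueue:
    """Hypothetical per-shard queue between the parsing stage and the Lucene add."""

    def __init__(self, max_batch_docs):
        self.max_batch_docs = max_batch_docs
        self.pending = deque()

    def submit(self, bulk_docs):
        # The parsing/translog stages would enqueue already-parsed documents here.
        self.pending.extend(bulk_docs)

    def drain(self):
        # A single writer takes up to max_batch_docs documents, ignoring
        # which bulk request each document originally belonged to.
        batch = []
        while self.pending and len(batch) < self.max_batch_docs:
            batch.append(self.pending.popleft())
        return batch


queue = CoalescingQueue(max_batch_docs=5)
for bulk in (["a1", "a2"], ["b1"], ["c1", "c2", "c3"]):
    queue.submit(bulk)

batches = []
while True:
    batch = queue.drain()
    if not batch:
        break
    batches.append(batch)
print(batches)
```

Three small bulk requests end up as two writer batches here. In a real implementation the drain would also need a time bound so that a low-traffic shard does not hold documents indefinitely; that is left out of this sketch.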
@itiyamas itiyamas added the enhancement Enhancement or improvement to existing feature or request label Apr 27, 2021
@andrross
Member

Can you more clearly define the problem that is solved by reducing the number of segments that are created on low traffic shards? What is the use case that optimization is targeting?

@anasalkouz
Member

anasalkouz commented Mar 31, 2022

Closing it since we didn't receive a response for a while. @itiyamas feel free to reopen once you have more details about it.

@itiyama

itiyama commented Sep 20, 2022

@anasalkouz I do not have permissions to re-open the thread. Can you please open this?

> Can you more clearly define the problem that is solved by reducing the number of segments that are created on low traffic shards? What is the use case that optimization is targeting?

In order to understand the use case, you need to understand how Lucene parallelizes data writes across multiple segments within the same shard. Whenever data is to be written to a Lucene buffer, the writing thread tries to get a buffer object from a list of pending buffer objects and ties it to itself. If it does not find one, a new buffer object is created and documents are added there. These buffer objects are then independently flushed to a segment during refresh. If there are multiple concurrent threads writing data to Lucene, the system ends up creating more segments. These segments are then merged later, which means that we end up spending more compute. Searches also do not work well on a lot of small segments.
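
To make the effect concrete, here is a toy simulation of the behaviour described above. It is a simplified model, not Lucene code: `WriteBuffer`, `ShardWriter`, and `index_bulks` are made-up names, and real threads are replaced by "waves" of writers that hold their buffers at the same time.

```python
from dataclasses import dataclass, field


@dataclass(eq=False)
class WriteBuffer:
    """Stand-in for one in-memory indexing buffer (hypothetical name)."""
    docs: list = field(default_factory=list)


class ShardWriter:
    """Toy model of the buffer checkout/return behaviour."""

    def __init__(self):
        self.free = []  # buffers not currently held by any thread

    def checkout(self):
        # Reuse a pending buffer if one is free, otherwise create a new one.
        return self.free.pop() if self.free else WriteBuffer()

    def release(self, buf):
        self.free.append(buf)

    def refresh(self):
        # Each non-empty buffer is flushed independently to its own segment.
        segments = [buf.docs for buf in self.free if buf.docs]
        self.free = []
        return segments


def index_bulks(writer, bulks, concurrency):
    """Index bulks in waves of `concurrency` writers that overlap in time."""
    for start in range(0, len(bulks), concurrency):
        wave = bulks[start:start + concurrency]
        held = [writer.checkout() for _ in wave]  # all writers active at once
        for buf, docs in zip(held, wave):
            buf.docs.extend(docs)
        for buf in held:
            writer.release(buf)
    return writer.refresh()


bulks = [[f"doc-{i}-{j}" for j in range(3)] for i in range(4)]
segments_concurrent = index_bulks(ShardWriter(), bulks, concurrency=4)
segments_serial = index_bulks(ShardWriter(), bulks, concurrency=1)
print(len(segments_concurrent), len(segments_serial))
```

With four overlapping writers the same twelve documents are flushed as four small segments; with one writer at a time they end up in a single segment, which is the outcome this proposal is after for low-traffic shards.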

@itiyama

itiyama commented Sep 25, 2022

@dreamer-89 Can you help re-open this?

@dreamer-89 dreamer-89 reopened this Sep 25, 2022
@dreamer-89
Member

dreamer-89 commented Sep 25, 2022

> @dreamer-89 Can you help re-open this?

Issue re-opened.
@itiyama: Can you please share more details about the issue, as requested previously?

@msfroh
Collaborator

msfroh commented Jun 5, 2024

Fixed by apache/lucene#921

@msfroh msfroh closed this as completed Jun 5, 2024
@itiyama

itiyama commented Jun 13, 2024

@msfroh The issue that you linked solves the problem of searching through more segments but utilizes more resources to do merging. The proposal here is to create fewer segments in the first place so that you do not spend resources on merging later.

@msfroh
Collaborator

msfroh commented Jun 13, 2024

Ahh... okay. I guess we can reopen it.

It feels like a pretty low priority, though, since writing and then merging small segments will (or at least should?) have negligible impact on overall performance.

If/when we move to predominantly pull-based indexing, we can allocate indexing threads per node based on the number of pending documents. (If each shard writes with one thread or less, then each shard will only ever write one segment per flush.)
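
One possible allocation policy, sketched purely as an illustration. The function `allocate_indexing_threads`, its parameters, and the shard names are hypothetical; nothing here reflects an existing OpenSearch API. The assumption is that each shard with pending documents gets at most one thread, and that the busiest shards are served first when the pool is smaller than the number of busy shards.

```python
def allocate_indexing_threads(pending_docs, pool_size):
    """Give at most one thread to each shard that has pending documents.

    pending_docs maps a shard name to its number of pending documents.
    Returns a mapping of shard name to assigned thread count (always 1).
    """
    busy = sorted(
        (shard for shard, count in pending_docs.items() if count > 0),
        key=lambda shard: pending_docs[shard],
        reverse=True,  # busiest shards first when the pool is too small
    )
    return {shard: 1 for shard in busy[:pool_size]}


pending = {"idx-a[0]": 1200, "idx-a[1]": 0, "idx-b[0]": 15, "idx-c[0]": 300}
assignment = allocate_indexing_threads(pending, pool_size=2)
print(assignment)
```

Because no shard ever holds more than one thread under this policy, each shard would write at most one segment per flush, at the cost of capping per-shard indexing throughput.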

@msfroh msfroh reopened this Jun 13, 2024
@vikasvb90 vikasvb90 added the RFC Issues requesting major changes label Jul 8, 2024