Reduce created segments when there is low traffic on a shard #618

itiyamas · 2021-04-27T19:51:50Z

Lucene creates one segment per active concurrent thread per shard. The number of active concurrent threads per shard is determined by the bulk thread pool in ES, whose workers pull sub-bulk requests off a queue. Each sub-bulk request ends up being picked up by a different thread, which results in multiple segments during refresh. When there is low traffic on a particular shard on a node, we can potentially reduce the number of created segments by coalescing bulk requests, resulting in better performance during reads.

When there are a lot of shards on a single node for different indices, this problem may get worse.

The proposal is to optimize this entire process via the following tasks:

Pipeline the CPU work further into separate tasks (parsing, translog, Lucene add) per shard.
Coalesce the Lucene documents across different bulk requests by adding a separate queue for Lucene adds.
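
To illustrate the second task, here is a minimal sketch of what a per-shard queue for Lucene adds could look like. It is not OpenSearch or Lucene code: `ShardAddQueue`, `submit`, and the in-memory `indexed` list are hypothetical names standing in for the real write path. The only point is that many bulk threads can hand their documents to one writer thread per shard.

```python
import queue
import threading


class ShardAddQueue:
    """Hypothetical per-shard queue: many bulk threads submit, one thread writes."""

    def __init__(self):
        self._queue = queue.Queue()
        self.indexed = []            # stand-in for documents added to Lucene
        self.writer_threads = set()  # ids of threads that performed adds
        self._worker = threading.Thread(target=self._drain, daemon=True)
        self._worker.start()

    def submit(self, docs):
        """Called by a bulk thread; returns an event set once the docs are added."""
        done = threading.Event()
        self._queue.put((docs, done))
        return done

    def _drain(self):
        # Single consumer: every add for this shard happens on this one thread,
        # so documents from different bulk requests share one write buffer.
        while True:
            item = self._queue.get()
            if item is None:  # shutdown sentinel
                break
            docs, done = item
            self.writer_threads.add(threading.get_ident())
            self.indexed.extend(docs)
            done.set()

    def close(self):
        self._queue.put(None)
        self._worker.join()
```

With this shape, four concurrent bulk requests against the same shard still result in a single thread doing the adds, at the cost of an extra hand-off per request.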

andrross · 2021-11-30T01:36:39Z

Can you more clearly define the problem that is solved by reducing the number of segments that are created on low traffic shards? What is the use case that optimization is targeting?

anasalkouz · 2022-03-31T20:40:54Z

Closing it since we didn't receive a response for a while. @itiyamas feel free to reopen once you have more details about it.

itiyama · 2022-09-20T04:23:53Z

@anasalkouz I do not have permissions to re-open the thread. Can you please open this?

> Can you more clearly define the problem that is solved by reducing the number of segments that are created on low traffic shards? What is the use case that optimization is targeting?

To understand the use case, you need to understand how Lucene parallelizes data writes across multiple segments within the same shard. Whenever data is to be written to a Lucene buffer, the writing thread tries to get a buffer object from a list of pending buffer objects and ties it to the thread. If it does not find one, a new buffer object is created and the documents are added there. These buffer objects are then independently flushed to segments during refresh. If there are multiple concurrent threads writing data to Lucene, the system ends up creating more segments. These segments are then merged later, which means that we end up spending more compute. Searches also do not perform well on a lot of small segments.
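
The buffer checkout described above can be modelled with a small simulation. This is a toy sketch, not Lucene's actual implementation: `BufferPool`, `checkout`, `release`, and `index` are hypothetical names, and concurrency is modelled as "waves" of writers that hold buffers at the same time rather than with real threads.

```python
class BufferPool:
    """Toy model of the buffer checkout described above (not Lucene code)."""

    def __init__(self):
        self.free = []     # buffers not currently held by any writer
        self.created = []  # every buffer created since the last refresh

    def checkout(self):
        # Reuse a free buffer if one exists, otherwise create a new one.
        if self.free:
            return self.free.pop()
        buf = []
        self.created.append(buf)
        return buf

    def release(self, buf):
        self.free.append(buf)

    def refresh(self):
        # Each non-empty buffer is flushed independently to its own segment.
        segments = [list(buf) for buf in self.created if buf]
        self.free.clear()
        self.created.clear()
        return segments


def index(pool, batches, concurrent):
    """Index all batches with `concurrent` writers overlapping at a time."""
    for start in range(0, len(batches), concurrent):
        wave = batches[start:start + concurrent]
        held = [pool.checkout() for _ in wave]  # all writers in a wave overlap
        for buf, docs in zip(held, wave):
            buf.extend(docs)
        for buf in held:
            pool.release(buf)
    return pool.refresh()


batches = [[f"doc-{i}-{j}" for j in range(2)] for i in range(4)]
print(len(index(BufferPool(), batches, concurrent=1)))  # 1 segment
print(len(index(BufferPool(), batches, concurrent=4)))  # 4 segments
```

The same eight documents end up in one segment when the writes are serialized and in four segments when four writers overlap, which is the effect the proposal tries to avoid on low-traffic shards.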

itiyama · 2022-09-25T01:08:05Z

@dreamer-89 Can you help re-open this?

dreamer-89 · 2022-09-25T01:15:10Z

> @dreamer-89 Can you help re-open this?

Issue re-opened.
@itiyama: Can you please share more details about the issue, as requested previously?

msfroh · 2024-06-05T22:42:47Z

Fixed by apache/lucene#921

itiyama · 2024-06-13T15:15:05Z

@msfroh The issue that you linked solves the problem of searching through more segments but utilizes more resources to do merging. The proposal here is to create fewer segments in the first place so that you do not spend resources on merging later.

msfroh · 2024-06-13T20:47:01Z

Ahh... okay. I guess we can reopen it.

It feels like a pretty low priority, though, since writing and then merging small segments will (or at least should?) have negligible impact on overall performance.

If/when we move to predominantly pull-based indexing, we can allocate indexing threads per node based on the number of pending documents. (If each shard writes with one thread or less, then each shard will only ever write one segment per flush.)
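
As a rough illustration of that idea, the sketch below hands out at most one indexing thread per shard, based on pending document counts. The policy (largest backlog first) and the names `allocate_threads`, `pending_docs`, and `max_threads` are assumptions made for the example; nothing in this thread specifies how the allocation would actually work.

```python
def allocate_threads(pending_docs, max_threads):
    """Assign at most one indexing thread to each shard with pending documents.

    pending_docs: dict mapping shard id -> number of pending documents.
    max_threads:  indexing threads available on the node.
    Shards with the largest backlog are served first when threads are scarce.
    """
    busy = sorted(
        (shard for shard, count in pending_docs.items() if count > 0),
        key=lambda shard: pending_docs[shard],
        reverse=True,
    )
    # One thread per shard means one write buffer, hence one segment per flush.
    return {shard: 1 for shard in busy[:max_threads]}


print(allocate_threads({"a": 100, "b": 0, "c": 5, "d": 50}, max_threads=2))
# {'a': 1, 'd': 1}
```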

itiyamas added the enhancement label Apr 27, 2021

minalsha added the distributed framework label Sep 7, 2021

anasalkouz added the Priority-Low label Nov 16, 2021

anasalkouz closed this as completed Mar 31, 2022

dreamer-89 reopened this Sep 25, 2022

anasalkouz added Migration:Pending Input and removed Migration:Pending Input labels Mar 16, 2023

msfroh closed this as completed Jun 5, 2024

msfroh reopened this Jun 13, 2024

github-actions bot added the untriaged label Jun 13, 2024

msfroh added Indexing:Performance and removed untriaged labels Jun 13, 2024

vikasvb90 added the RFC label Jul 8, 2024
