Support for criteria based DWPT selection inside DocumentWriter #13387
Comments
I like this idea! I hope we can find a simple enough API exposed through IWC to enable the optional grouping. This also has nice mechanical sympathy / symmetry with the distributed search engine analog. A distributed search engine like OpenSearch indexes and searches into N shards across multiple servers, and this is nearly precisely the same logical problem that Lucene tackles on a single multi-core server when indexing and searching into N segments, especially as Lucene's intra-query concurrency becomes the norm/default and improves (e.g. allowing intra-segment per-query concurrency as well). We should cross-fertilize more based on this analogy: the two problems are nearly the same. A shard, a segment, same thing heh (nearly). So this proposal brings OpenSearch's custom document routing feature down into Lucene's segments.
This is an interesting idea! You do not mention it explicitly in the issue description, but presumably this only makes sense if an index sort is configured, otherwise merges may break the clustering that you are trying to create in the first place?
I'm a bit uncomfortable with this approach. It is so heavy that it wouldn't perform much better than maintaining a separate IndexWriter per group.
Thanks Mike and Adrien for the feedback.
Not exactly. As mentioned, in order to ensure that the grouping criteria invariant is maintained even during segment merges, we are introducing a new merge policy that acts as a decorator over the existing TieredMergePolicy. During a segment merge, this policy would categorize segments according to their grouping function outcomes before merging segments within the same category, thus maintaining the grouping criteria’s integrity throughout the merge process.
I believe even if we use a single DWPT pool with rendezvous hashing to distribute DWPTs, we would end up creating the same number of DWPTs as with different DWPT pools for different groups. Consider an example where we are grouping logs based on status code for an index and 8 concurrent indexing threads are indexing 2xx status code logs. This will create 8 DWPTs. Now 4 threads start indexing 4xx status code logs concurrently; this will require 4 extra DWPTs if we want to maintain status-code-based grouping. Instead of creating new DWPTs, we could try reusing 4 of the existing DWPTs created for 2xx status code logs on a best-effort basis, but this would again mix 4xx status code logs with 2xx status code logs, defeating the purpose of status-code-based grouping of logs. Also, to ensure that the number of DWPTs created stays in check, we will be adding guardrails on the number of groups that can be generated from the grouping function. Let me know if my understanding is correct.
Thanks for explaining. The concern I have, given that we're planning on only flushing/merging segments from the same group together, is that this would essentially perform the same as maintaining one IndexWriter per group. To get similar benefits from clustering, but without incurring the overhead of segments, I feel like we should rather improve our support for clustering at the doc ID level, i.e. index sorting. And maybe ideas like this criteria-based selection of DWPTs could help speed up the creation of sorted indexes?
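For reference, here is a minimal sketch of configuring such an index sort on a status code field (the field name status_code and the directory path are illustrative assumptions):

```java
import java.nio.file.Paths;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.IntPoint;
import org.apache.lucene.document.NumericDocValuesField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.search.Sort;
import org.apache.lucene.search.SortField;
import org.apache.lucene.store.FSDirectory;

public class IndexSortExample {
  public static void main(String[] args) throws Exception {
    IndexWriterConfig iwc = new IndexWriterConfig();
    // Cluster documents by status code within each segment via index sorting.
    iwc.setIndexSort(new Sort(new SortField("status_code", SortField.Type.INT)));
    try (IndexWriter writer = new IndexWriter(FSDirectory.open(Paths.get("/tmp/logs-index")), iwc)) {
      Document doc = new Document();
      doc.add(new NumericDocValuesField("status_code", 200)); // doc values are required for the sort field
      doc.add(new IntPoint("status_code", 200));              // point field for range queries
      writer.addDocument(doc);
    }
  }
}
```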
Thanks for the suggestion. The above suggestion of clustering within the segment does improve skipping of documents (especially when combined with the BKD optimisation to skip non-competitive documents). But it still prevents us from building multiple optimisations which could be enabled by having separate DWPT pools for different groups:
Actually, we won't be able to build multiple optimizations on top of the segment topology if we store them together. Let me know if this makes sense.
I agree that better organizing data across segments yields significant benefits, I'm only advocating for doing this by maintaining a separate IndexWriter per group.
Sorry, missed answering this part in my earlier response. We did explore this approach of creating an IndexWriter/Lucene index (or OpenSearch shard) for each group. However, implementing this approach would lead to significant overhead on the client side (such as OpenSearch), both in terms of code changes and operational overhead like metadata management. On the other hand, maintaining separate DWPT pools for different groups would require minimal changes inside Lucene. The overhead is lower here, as the Lucene shard will still be maintained as a single physical unit. Let me know if this makes sense.
Attaching a preliminary PR for the POC related to the above issue to share my understanding. Please note that this is not the final PR.
Can you give more details? The main difference that comes to mind is that using multiple
I like @jpountz's idea of just using separate IndexWriters, and the idea of using a single underlying multi-tenant Directory. You would also need a clean-ish way to manage a single total allowed RAM bytes across the N IndexWriters. Searching across the N separate shards as if they were a single index is also possible via MultiReader.
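For context, a minimal sketch of searching N group-level indexes as one logical index via MultiReader (the directory paths are illustrative assumptions):

```java
import java.nio.file.Paths;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.MultiReader;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.store.FSDirectory;

public class MultiReaderExample {
  public static void main(String[] args) throws Exception {
    // One reader per group-level index (e.g. one per status-code group).
    DirectoryReader r2xx = DirectoryReader.open(FSDirectory.open(Paths.get("/tmp/logs-2xx")));
    DirectoryReader r5xx = DirectoryReader.open(FSDirectory.open(Paths.get("/tmp/logs-5xx")));
    // MultiReader presents the group-level indexes as a single logical index for search;
    // closing it also closes the sub-readers.
    try (MultiReader combined = new MultiReader(r2xx, r5xx)) {
      IndexSearcher searcher = new IndexSearcher(combined);
      System.out.println("total docs: " + searcher.getIndexReader().numDocs());
    }
  }
}
```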
I don't think we do. +1 to exploring this separately. I like that we then wouldn't need to tune the merge policy because it would naturally only see segments that belong to its group.
Right,
Indeed, I'd expect it to work just fine.
Thanks a lot for the suggestions @jpountz and @mikemccand. As suggested above, we worked on a POC to explore using a separate IndexWriter for each group. Each IndexWriter is associated with a distinct logical filter directory, which attaches a filename prefix according to the group. These directories are backed by a single multi-tenant directory. However, this approach presents several challenges on the client (OpenSearch) side. Each IndexWriter now generates its own sequence numbers. In a service like OpenSearch, the translog operates on sequence numbers at the Lucene index level; when the same sequence number is generated across different IndexWriters for the same Lucene index, conflicts can occur during operations like translog replay. Additionally, local and global checkpoints maintained during recovery operations in a service like OpenSearch require the sequence numbers to be continuously increasing, which no longer holds with multiple IndexWriters. We did not face these issues when different groups were represented by different DWPT pools, because there was only a single IndexWriter writing to the Lucene index, generating continuously increasing sequence numbers. The complexity of handling different segments for different groups is managed internally at the Lucene level, rather than propagating it to the client side. Feel free to share any further suggestions you may have on this.
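Not the actual POC code, but a rough sketch of the kind of per-group filter directory described above, which prefixes file names before delegating to a single shared directory (the "$" separator is an assumption for illustration):

```java
import java.io.IOException;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FilterDirectory;
import org.apache.lucene.store.IOContext;
import org.apache.lucene.store.IndexInput;
import org.apache.lucene.store.IndexOutput;

// Sketch: each group gets its own logical view over one shared physical directory
// by prefixing file names with the group id (e.g. "5xx$_0.cfs").
public class GroupFilterDirectory extends FilterDirectory {
  private final String prefix;

  public GroupFilterDirectory(Directory shared, String group) {
    super(shared);
    this.prefix = group + "$";
  }

  @Override
  public IndexOutput createOutput(String name, IOContext context) throws IOException {
    return in.createOutput(prefix + name, context);
  }

  @Override
  public IndexInput openInput(String name, IOContext context) throws IOException {
    return in.openInput(prefix + name, context);
  }

  // A complete implementation would also prefix/strip names in listAll(),
  // deleteFile(), fileLength(), rename(), sync(), createTempOutput(), etc.
}
```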
This would indeed get somewhat tricky. But is OpenSearch really using Lucene's returned sequence numbers? I had thought Elasticsearch's sequence number implementation predated the Lucene change adding sequence numbers to every low-level Lucene operation that mutates the index. Under the hood,
I wonder if we can leverage IndexWriter's addIndexes API. This could mean that each shard for an OpenSearch/Elasticsearch index would maintain internal indexes for each desired category, and use the API to combine them into a common "shard" index at every flush? We'd still need a way to maintain category labels for a segment during merging, but that's a common problem for any approach we take.
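For context, a minimal sketch of what combining per-category indexes with the existing addIndexes(Directory...) API looks like in use (directory paths are illustrative assumptions):

```java
import java.nio.file.Paths;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

public class AddIndexesExample {
  public static void main(String[] args) throws Exception {
    Directory shardDir = FSDirectory.open(Paths.get("/tmp/shard"));
    Directory dir2xx = FSDirectory.open(Paths.get("/tmp/group-2xx"));
    Directory dir5xx = FSDirectory.open(Paths.get("/tmp/group-5xx"));
    try (IndexWriter shardWriter = new IndexWriter(shardDir, new IndexWriterConfig())) {
      // Copies the segments of the category-level indexes into the "shard" index.
      // Note: the source indexes must not be open by another IndexWriter while this runs.
      shardWriter.addIndexes(dir2xx, dir5xx);
      shardWriter.commit();
    }
  }
}
```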
Thanks @mikemccand and @vigyasharma for the suggestions. We evaluated different approaches to using a separate IndexWriter for each group:
Approach 1: Using a filter directory for each group
In this approach, each group (in the above example, the grouping criterion is the status code) has its own filter directory, which attaches a group-specific prefix to file names and is backed by a single physical directory. A mechanism is also needed to address the sequence number conflict between the different group-level IndexWriters.
Pros
Cons
Approach 2: Using a physical directory for each group
To segregate segments belonging to different groups and avoid attaching a prefix to segment names, we associated each group-level IndexWriter with a physical directory instead of a filter directory.
Pros
Cons
Approach 3: Combining group-level IndexWriters with addIndexes
In this approach, in order to make multiple group-level IndexWriters function as a unified entity, we use Lucene’s addIndexes API to combine them. This ensures that the top-level IndexWriter presents a common view over the group-level segments.
Pros
Cons
Summary
In summary, the problem can be broken down into three sub-problems.
None of the approaches we investigated solves the above three sub-problems cleanly with reasonable complexity. That leaves us with the originally suggested approach of using different DWPTs to represent different groups. The original approach:
Exploring
In parallel, we are still exploring whether we can introduce a suitable API for this. Open for thoughts and suggestions.
How do background index merges work with the original, separate DWPT based approach? Don't you need to ensure that you only merge segments that belong to a single group?
We will be introducing a new merge policy in this case as well, to ensure the grouping criteria invariant is maintained even during segment merges. The original proposal was the DWPT-side changes along with a new merge policy which ensures that only segments from the same group are merged together.
On some more analysis, we figured out an approach which addresses all the above comments and obtains the same improvement with a different IndexWriter per group as we got with using different DWPTs per group.
Using separate IndexWriters for maintaining different tenants with a combined view
Current Issue
Maintaining a separate IndexWriter for each group (tenant) presents a significant problem, as they do not function as a single unified entity. Although distinct IndexWriters and directories for each group ensure that data belonging to different groups is kept in separate segments and segments within the same group are merged, a unified read-only view for the client (OpenSearch) to interact with these multiple group-level IndexWriters is still needed. Lucene’s addIndexes API offers a way to combine group-level IndexWriters into a single parent-level IndexWriter, but this approach has multiple drawbacks:
Proposal
To address this issue, we propose introducing a mechanism that combines group-level IndexWriters as a soft reference to a parent IndexWriter. This will be achieved by creating a new variant of the addIndexes API within IndexWriter, which will only combine the SegmentInfos of the group-level IndexWriters without requiring an external lock or copying files across directories. Group-level segments will be maintained in separate directories associated with their respective group-level IndexWriters. The client will periodically call this addIndexes API on the parent IndexWriter (on the OpenSearch side this corresponds to the index refresh interval of 1 sec), passing the SegmentInfos of the child-level IndexWriters as parameters to sync the latest SegmentInfos with the parent IndexWriter. While combining the SegmentInfos of child-level IndexWriters, the addIndexes API will attach a prefix to the segment names to identify the group each segment belongs to, avoiding name conflicts between segments of different group-level IndexWriters. The parent IndexWriter will be associated with a filter directory that distinguishes the tenant using the file name prefix, redirecting any read/write operation on a file to the correct group-level directory based on the segment file name prefix (a rough sketch of this routing rule is included below).
Reason for choosing the common view as an IndexWriter
Most interactions of Lucene with the client (OpenSearch), such as opening a reader, getting the latest commit info, reopening a Lucene index, etc., occur via IndexWriter itself. Thus selecting IndexWriter as the common view made more sense.
Improvements with multiple IndexWriters with a combined view
We were able to observe around 50%-60% improvements with the multiple IndexWriters with a combined view approach, similar to what we observed by having different DWPTs for different tenants (the initial proposal).
Considerations
Open for thoughts and suggestions.
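To make the prefix-based routing rule in the proposal above a bit more concrete, here is a rough illustrative sketch; the "$" separator and the helper names are assumptions for illustration, not part of any existing or proposed Lucene API:

```java
import java.util.Map;
import org.apache.lucene.store.Directory;

// Sketch of the routing rule assumed in the proposal: the parent view resolves a
// prefixed segment file name (e.g. "5xx$_3.cfs") to the group-level directory that
// actually owns the file, plus the unprefixed file name within that directory.
final class GroupFileRouter {
  private final Map<String, Directory> groupDirs; // e.g. "2xx" -> dir2xx, "5xx" -> dir5xx
  private final Directory defaultDir;             // for parent-level files such as segments_N

  GroupFileRouter(Map<String, Directory> groupDirs, Directory defaultDir) {
    this.groupDirs = groupDirs;
    this.defaultDir = defaultDir;
  }

  // Which group-level directory should serve this file?
  Directory resolveDirectory(String fileName) {
    int sep = fileName.indexOf('$');
    if (sep > 0) {
      Directory d = groupDirs.get(fileName.substring(0, sep));
      if (d != null) {
        return d;
      }
    }
    return defaultDir;
  }

  // What is the file called inside the owning group-level directory?
  String resolveFileName(String fileName) {
    int sep = fileName.indexOf('$');
    return sep > 0 && groupDirs.containsKey(fileName.substring(0, sep))
        ? fileName.substring(sep + 1)
        : fileName;
  }
}
```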
@vigyasharma @jpountz @mikemccand Any thoughts on the above approach of using multiple IndexWriters for different groups (tenants) with a read-only combined view?
There is a lot of good work here @RS146BIJAY. Some preliminary questions to understand this better:
Thanks @vigyasharma for the feedback.
While writing logs, OpenSearch will interact with n different log-group-specific IndexWriters. For example, if logs are grouped by status codes, a 5xx log entry will be written using the 5xx-specific IndexWriter. Conversely, for read flows, like creating a reader or retrieving the latest commit (or SegmentInfos state) associated with a directory (or IndexWriter) (for uploading to a snapshot, or syncing the state of a replica from the primary during a checkpoint in SegRep, etc.), OpenSearch will interact with Lucene via the combined view (the parent IndexWriter). This parent IndexWriter internally references the segments of the group-level IndexWriters (200_0, 300_0 etc). Having separate IndexWriters for different groups ensures that logs of different groups are maintained in different segments. Meanwhile, the combined view over the group-level segments of a Lucene index, in the form of a parent IndexWriter, provides a common view for operations like opening readers, syncing replicas, uploading the SegmentInfos of an index to a remote snapshot, etc.
Number of groups (IndexWriters) will be fixed and will be determined via a setting during Index creation.
Having a MultiReader on all the child log-group directories still won't provide a unified view of all group-level segments associated with a Lucene index. Even now, OpenSearch interacts with a Lucene index not only to index documents or to open a reader over the indexed docs, but also to retrieve the SegmentInfos associated with the latest commit of an IndexWriter's directory (e.g., for storing snapshots of an index on a remote store) or to obtain the file list associated with a past commit (for deleting unreferenced files inside a commit deletion policy). Having a common view of multiple group-level segments as an IndexWriter associated with a single Lucene index ensures that the Lucene index still behaves as a single entity (the parent IndexWriter can be used to get a common commit for the group-level IndexWriters). Another approach is to use a SegmentInfos instance instead of an IndexWriter as the common view for the group-level IndexWriters. Since, in the above approach, the parent IndexWriter periodically syncs and combines only the SegmentInfos of the group-level IndexWriters, we can replace the parent IndexWriter with a SegmentInfos as the combined view. This parent SegmentInfos will reference the group-level segments similar to what a parent IndexWriter does, and can be used for opening readers, getting the latest commit of a Lucene index, etc. Let me know if this makes sense.
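For reference, the per-directory commit metadata access mentioned above maps to existing Lucene APIs; a minimal sketch (the path is an illustrative assumption and an existing index is assumed at that location):

```java
import java.nio.file.Paths;
import java.util.Collection;
import java.util.List;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexCommit;
import org.apache.lucene.index.SegmentInfos;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

public class CommitMetadataExample {
  public static void main(String[] args) throws Exception {
    try (Directory dir = FSDirectory.open(Paths.get("/tmp/logs-2xx"))) {
      // Latest committed SegmentInfos of this (group-level) directory,
      // e.g. for uploading a snapshot of the index state.
      SegmentInfos latest = SegmentInfos.readLatestCommit(dir);
      System.out.println("segments in latest commit: " + latest.size());

      // Files referenced by each commit point, e.g. for a deletion policy
      // that cleans up unreferenced files.
      List<IndexCommit> commits = DirectoryReader.listCommits(dir);
      for (IndexCommit commit : commits) {
        Collection<String> files = commit.getFileNames();
        System.out.println(commit.getSegmentsFileName() + " -> " + files.size() + " files");
      }
    }
  }
}
```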
@vigyasharma @jpountz @mikemccand Any thoughts on the above approach of using multiple IndexWriters for different groups (tenants) with a read-only combined view?
Description
Issue
Today, Lucene internally creates multiple DocumentWriterPerThread (DWPT) instances per index to facilitate concurrent indexing across different ingestion threads. When documents are indexed by the same DWPT, they are grouped into the same segment post flush. As DWPT assignment to documents is only concurrency based, it’s not possible to predict or control the distribution of documents within the segments. For instance, during the indexing of time series logs, it’s possible for a single DWPT to index logs with both 5xx and 2xx status codes, leading to segments that contain a heterogeneous mix of documents.
Typically, in scenarios like log analytics, users are more interested in a certain subset of data (error (4xx) and/or fault (5xx) request logs). Randomly assigning DWPTs to index documents can disperse these relevant documents across multiple segments. Furthermore, if these documents are sparse, they will be thinly spread out even within the segments, necessitating iteration over many less relevant documents for search queries. While the optimisation of using the BKD tree to let collectors skip non-competitive documents significantly improves query performance, the actual number of documents iterated still depends on the arrangement of data in the segment and how the underlying BKD tree gets constructed.
Storing relevant log documents separately from relatively less relevant ones, such as 2xx logs, can prevent their scattering across multiple segments. This model can markedly enhance query performance by streamlining searches to involve fewer segments and omitting documents that are less relevant. Moreover, clustering related data allows for the pre-computation of aggregations for frequently executed queries (e.g., count, minimum, maximum) and storing them as separate metadata. Corresponding queries can be served from the metadata itself, thus optimizing both latency and compute.
Proposal
In this proposal, we suggest adding support for a DWPT selection mechanism based on specific criteria within the DocumentWriter. Users can define these criteria through a grouping function supplied as a new IndexWriterConfig configuration. The grouping criteria can be based on the anticipated query pattern of the workload, so that frequently queried data is stored together. During indexing, this function would be evaluated for each document, ensuring that documents with differing criteria are indexed using separate DWPTs. For instance, in the context of http request logs, the grouping function could be tailored to assign DWPTs according to the status code in the log entry.
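To make the proposed configuration concrete, here is an illustrative sketch of one possible shape for such a grouping function; how it would be registered on IndexWriterConfig is part of this proposal and not an existing Lucene API, and the status_code field name is an assumption:

```java
import java.util.function.Function;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;

// Illustrative only: one possible shape for the user-supplied grouping function.
public class StatusCodeGrouping {
  // Maps a document to a group key; documents with different keys would be
  // routed to different DWPT pools under this proposal.
  static final Function<Document, String> GROUP_BY_STATUS_CLASS = doc -> {
    String status = doc.get("status_code"); // assumes a "status_code" field on the document
    if (status == null || status.isEmpty()) {
      return "default";
    }
    return status.charAt(0) + "xx"; // e.g. "200" -> "2xx", "503" -> "5xx"
  };

  public static void main(String[] args) {
    Document doc = new Document();
    doc.add(new StringField("status_code", "503", Field.Store.YES));
    System.out.println(GROUP_BY_STATUS_CLASS.apply(doc)); // prints "5xx"
  }
}
```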
Associated OpenSearch RFC
opensearch-project/OpenSearch#13183
Improvements with new DWPT distribution strategy
We worked on a POC in Lucene and tried integrating it with OpenSearch. We validated DWPT distribution based on different criteria such as status code, timestamp, etc., against different types of workloads. We observed 50%-60% improvements in the performance of range, aggregation, and sort queries with the proposed DWPT selection approach.
Implementation Details
The user-defined grouping criteria function will be passed to the DocumentWriter as a new IndexWriterConfig configuration. During indexing of a document, the DocumentWriter will evaluate this grouping function and pass the outcome to the DocumentWriterFlushControl and DocumentWriterThreadPool when requesting a DWPT for indexing the document. The DocumentWriterThreadPool will now maintain a distinct pool of DWPTs for each possible outcome. The specific pool selected for indexing a document will depend on the document’s outcome for the grouping function. Should the relevant pool be empty, a new DWPT will be created and added to this pool. Connecting with the above example for http request logs, having distinct pools for 2xx and 5xx status code logs would ensure that 2xx logs are indexed using a separate set of DWPTs from the 5xx status code logs. Once a DWPT is designated for flushing, it is checked out of the thread pool and won't be reused for indexing.
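A simplified model (an assumption-laden sketch, not Lucene's actual internals) of the per-group pool selection described above:

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Simplified model of the proposed per-group DWPT pools; "Dwpt" here is a stand-in
// for Lucene's internal DocumentWriterPerThread, not the real class.
public class GroupedDwptPool {
  static final class Dwpt {
    final String group;
    Dwpt(String group) { this.group = group; }
  }

  // One free list of DWPTs per grouping-function outcome.
  private final Map<String, Deque<Dwpt>> pools = new ConcurrentHashMap<>();

  // Obtain a DWPT for the given group outcome, creating one if the pool is empty.
  synchronized Dwpt obtain(String group) {
    Deque<Dwpt> pool = pools.computeIfAbsent(group, g -> new ArrayDeque<>());
    Dwpt dwpt = pool.poll();
    return dwpt != null ? dwpt : new Dwpt(group);
  }

  // Return a DWPT to its group's pool after indexing, unless it was checked out for flush.
  synchronized void release(Dwpt dwpt, boolean markedForFlush) {
    if (!markedForFlush) {
      pools.get(dwpt.group).push(dwpt);
    }
  }
}
```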
Further, in order to ensure that the grouping criteria invariant is maintained even during segment merges, we propose a new merge policy that acts as a decorator over the existing TieredMergePolicy. During a segment merge, this policy would categorize segments according to their grouping function outcomes before merging segments within the same category, thus maintaining the grouping criteria’s integrity throughout the merge process.
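A rough sketch of such a decorator merge policy, assuming the group label can be read back per segment (here via a hypothetical "group" diagnostics key; how the label is actually persisted is an open question in this thread):

```java
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;
import org.apache.lucene.index.FilterMergePolicy;
import org.apache.lucene.index.MergeTrigger;
import org.apache.lucene.index.SegmentCommitInfo;
import org.apache.lucene.index.SegmentInfos;
import org.apache.lucene.index.TieredMergePolicy;

// Sketch only: wraps TieredMergePolicy and asks it for merges separately per group,
// so segments from different groups are never merged together.
public class GroupAwareMergePolicy extends FilterMergePolicy {

  public GroupAwareMergePolicy() {
    super(new TieredMergePolicy());
  }

  @Override
  public MergeSpecification findMerges(MergeTrigger trigger, SegmentInfos infos, MergeContext context)
      throws IOException {
    // Partition the live segments by their group label.
    Map<String, SegmentInfos> byGroup = new HashMap<>();
    for (SegmentCommitInfo sci : infos) {
      String group = sci.info.getDiagnostics().getOrDefault("group", "default");
      byGroup.computeIfAbsent(group, g -> new SegmentInfos(infos.getIndexCreatedVersionMajor()))
          .add(sci);
    }
    // Let the wrapped TieredMergePolicy pick merges within each group independently.
    MergeSpecification combined = null;
    for (SegmentInfos groupInfos : byGroup.values()) {
      MergeSpecification spec = in.findMerges(trigger, groupInfos, context);
      if (spec != null) {
        if (combined == null) {
          combined = new MergeSpecification();
        }
        combined.merges.addAll(spec.merges);
      }
    }
    return combined;
  }

  // A complete implementation would also override findForcedMerges(...) and
  // findForcedDeletesMerges(...) with the same per-group partitioning.
}
```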
Guardrails
To manage the system’s resources effectively, guardrails will be implemented to limit the number of groups that can be generated from the grouping function. Users will need to provide a predefined list of acceptable outcomes for the grouping function, along with the function itself. Documents whose grouping function outcome is not within this list will be indexed using a default pool of DWPTs. This limits the number of DWPTs created during indexing, preventing the formation of numerous small segments that could lead to frequent segment merges. Additionally, a cap on the DWPT count keeps JVM utilization and garbage collection in check.
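An illustrative sketch of the guardrail check (the group names and the default-group sentinel are assumptions):

```java
import java.util.Set;

// Illustrative guardrail: outcomes outside the user-provided allow-list fall back
// to a shared default group, capping the number of DWPT pools that can be created.
public final class GroupGuardrail {
  private static final String DEFAULT_GROUP = "__default__";
  private final Set<String> allowedGroups;

  public GroupGuardrail(Set<String> allowedGroups) {
    this.allowedGroups = allowedGroups;
  }

  public String resolve(String groupingOutcome) {
    return allowedGroups.contains(groupingOutcome) ? groupingOutcome : DEFAULT_GROUP;
  }

  public static void main(String[] args) {
    GroupGuardrail guardrail = new GroupGuardrail(Set.of("2xx", "4xx", "5xx"));
    System.out.println(guardrail.resolve("5xx")); // 5xx
    System.out.println(guardrail.resolve("1xx")); // __default__ (not in the allow-list)
  }
}
```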