[RFC] Context Aware Segments #13183
Comments
This is a great strategy for reducing data volume with sparse data while also enhancing query performance. In the case of highly sparse data, such as the http-error-codes example, very sparse fields could even be treated as constants within the metadata, eliminating the need to store any field values (e.g., store only the '200' metadata for segments that contain only the very common 200 status code).
@RS146BIJAY Could you share benchmarks with the POCs you have linked?
I think I have too many questions, but the one that probably stands out right now is regarding the grouping predicate. This is essentially an arbitrary piece of code, right? What means would a user use to supply this grouping predicate (arbitrary code) for a particular index / indices + merge policy to OpenSearch?
@reta The grouping predicate will be configurable for an index and won't be standalone code (the above POC is for when the grouping criteria is day based grouping with threshold = 300 MB). This grouping criteria passed by the user will determine which sets of data will be grouped together. The details around the exact structure of the predicate will be published in a separate GitHub issue.
Let's link this with #12683
For HTTP logs workload and daily grouping
Bulk Indexing Client = 8
Grouping criteria
In this scenario we use day based grouping with threshold = 300 MB.
Performance Benchmark Metrics
Indexing
Size of index
Compression works more effectively for context aware segments, resulting in an improvement of approximately 19% in the size of indices over the Tiered and LogByteSize merge policies. This happens because data is nearly ordered in context aware segments.
Latency
There is a minor improvement (3% - 6%) in indexing latency when we group hourly logs together at flush time and merge them into daily segments in increasing order of hour inside the DataAware merge policy. This is because hourly segments are merged into daily segments only when the total size of segments within that particular day is above the 300 MB limit.
Segment Count
With the DataAware merge policy (day based grouping) and hourly flushing, segment count tends to increase (about 4 to 5 times) for the http_logs workload.
Search
For context aware segments, we see a significant improvement in performance for both range and ascending sort queries. This is because data is sorted in near-ascending order within a context aware segment. On the flip side, the efficiency of descending sort queries regresses with this method. To fix this regression in desc sort queries, a potential solution is to traverse segment data in reverse order (will create a separate Lucene issue for this and link it here).
Segment merges
The intermediate number of segments is higher with the DataAware merge policy as compared to the Tiered and LogByteSize merge policies. The merge-size metrics below (order: Tiered, LogByteSize and DataAware from top to bottom) show that while the Tiered merge policy does allow merging larger segments during indexing, the DataAware merge policy initially merges smaller hourly segments (keeping segment count high), and only once the segment size for a day exceeds 300 MB does it shift to merging larger segments.
CPU, JVM and GC count
CPU and JVM usage remain the same for all the merge policies during the entire indexing operation. GC count for the DataAware merge policy is slightly higher due to the fact that we are allocating different DWPTs for logs with different timestamps (hour) in DocumentWriter. (order: Tiered, LogByteSize and DataAware from top to bottom)
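To make the day based grouping with hourly sub-grouping concrete, here is a minimal sketch of what such a grouping key derivation could look like. This is illustrative only and not the POC code; the class and method names are hypothetical.

```java
import java.time.Instant;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;

/**
 * Hypothetical sketch (not the actual POC code): derives the hourly
 * sub-group key used at flush time and the daily group key used by the
 * merge policy to decide which segments belong to the same day.
 */
public class TimeGroupingCriteria {

    private static final DateTimeFormatter HOUR_KEY =
            DateTimeFormatter.ofPattern("yyyy-MM-dd-HH").withZone(ZoneOffset.UTC);
    private static final DateTimeFormatter DAY_KEY =
            DateTimeFormatter.ofPattern("yyyy-MM-dd").withZone(ZoneOffset.UTC);

    /** Sub-grouping key used at flush time: one group per hour. */
    public static String hourGroup(long timestampMillis) {
        return HOUR_KEY.format(Instant.ofEpochMilli(timestampMillis));
    }

    /** Grouping key used by the merge policy: one group per day. */
    public static String dayGroup(long timestampMillis) {
        return DAY_KEY.format(Instant.ofEpochMilli(timestampMillis));
    }

    public static void main(String[] args) {
        long ts = Instant.parse("1998-06-14T13:45:00Z").toEpochMilli();
        System.out.println(hourGroup(ts)); // 1998-06-14-13 -> flushed into the hourly segment group
        System.out.println(dayGroup(ts));  // 1998-06-14    -> merged into the daily segment group
    }
}
```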
For HTTP logs workload and grouping by status code
Bulk Indexing Client = 8
Grouping criteria
In this setup, we implemented grouping based on status code for the http_logs workload. Here we separated logs of successful requests from those with error (4xx) and fault (5xx) status codes. Segments are initially flushed at the per status code level (sub grouping criteria). The Context Aware merge policy will then start merging error and fault status segments together.
Performance Benchmark Metrics
Indexing
Size of index
Since with just status based grouping the data is not as ordered as in the previous case of day based grouping with hourly sub-grouping, we observe around a 3% improvement in index size with the DataAware merge policy.
Latency
Indexing latency remains almost the same as with the Tiered and LogByteSize merge policies.
Segment Count
Since there are only two groups of successful (2xx) and failed status segments (4xx and 5xx), the number of segments is almost the same as with the Tiered and LogByteSize merge policies.
Search
With the status code based grouping strategy, we see a considerable improvement in the performance of range, aggregation and sort queries (ordered by timestamp) for fault/error logs within a specific time range. This efficiency is attributed to the smaller number of iterations needed to find fault/error logs, as they are spread across fewer segments compared to the Tiered and LogByteSize merge policies.
With DataAware we iterate fewer times (documents + segments) across segments to locate error/fault logs.
Would appreciate any initial feedback from @shwetathareja, @nknize, @Bukhtawar, @sachinpkale, @muralikpbhat, @reta, @msfroh and @backslasht as well, so tagging more folks for visibility. Thanks!
For HTTP logs workload and daily grouping
Bulk Indexing Client = 1
Performance Benchmark Metrics
Indexing
Size of index
Since data is completely sorted by timestamp, with the DataAware merge policy there is no considerable improvement in the size of the index.
Latency
Indexing latency remained almost the same with the DataAware merge policy as with the Tiered and LogByteSize merge policies.
Segment Count
Similar to the case when bulk_indexing_client = 8, with the DataAware merge policy (day based grouping) and hourly flushing, segment count tends to increase (about 4 to 5 times) for the http_logs workload.
Search
Since logs are already ordered in increasing order of timestamp, there is no significant difference in search performance. For asc sort and desc sort queries with search_after on timestamp, DataAware performs better than the LogByteSize merge policy. This is because of a bug in the skipping logic while scoring documents inside Lucene; the improvement is not specifically related to DataAware segments.
Thanks @RS146BIJAY, the numbers look promising, but don't we violate the DWPT design by having a man-in-the-middle (grouping) here? To elaborate on that, my understanding is that DW uses DWPT to associate the ingest threads with writer threads, with the intention of eliminating synchronization (this is also highlighted in the DW javadocs). With grouping, this won't be the case anymore - the grouping layer adds an indirection where multiple threads would be routed to the same DWPT. Is that right, or is my understanding incorrect? Thank you.
I had a similar concern @reta, but the lock-free model of DWPT can still be preserved/matched by creating just enough DWPTs that can write concurrently to minimise contention, or by creating more instances on demand if the lock is already acquired. So yes, there needs to be some coordination, but it shouldn't directly require synchronisation.
Not exactly. To add to what @Bukhtawar mentioned, before this change the DWPT pool maintains a list of free DWPTs on which no lock is held and no write is happening. In case all DWPTs are locked, a new DWPT instance is created and the write happens on it. With our change, this free list is maintained at the individual group level inside the DWPT pool. If there is no free DWPT for a group, a new DWPT instance for that group will be created and the write will happen on it. So the number of active DWPTs in our case will be higher than it was earlier, but each thread will still be routed to a different DWPT.
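To illustrate the per-group free list described above, here is a simplified sketch with hypothetical types; it is not Lucene's actual DocumentsWriterPerThreadPool, only a model of the routing behaviour.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/**
 * Simplified illustration with hypothetical types: the free list of
 * writers is keyed by group, so an indexing thread either takes an idle
 * writer belonging to its group or creates a new one. A writer is only
 * ever held by one thread at a time, so per-document writes still need
 * no synchronization; only the short check-out/check-in on the pool is
 * coordinated.
 */
public class GroupedWriterPool {

    /** Stand-in for a DWPT: owned by exactly one thread between obtain() and release(). */
    public static class Writer {
        final String group;
        Writer(String group) { this.group = group; }
    }

    private final Map<String, Deque<Writer>> freePerGroup = new ConcurrentHashMap<>();

    /** Check out an idle writer for the group, creating a new one if none is free. */
    public Writer obtain(String group) {
        Deque<Writer> free = freePerGroup.computeIfAbsent(group, g -> new ArrayDeque<>());
        synchronized (free) {
            Writer writer = free.poll();
            return writer != null ? writer : new Writer(group);
        }
    }

    /** Return the writer to its group's free list once the indexing thread is done. */
    public void release(Writer writer) {
        Deque<Writer> free = freePerGroup.computeIfAbsent(writer.group, g -> new ArrayDeque<>());
        synchronized (free) {
            free.push(writer);
        }
    }
}
```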
Adding a few more scenarios.
For HTTP logs workload and status grouping (bulk indexing client = 1)
Bulk Indexing Client = 1
Grouping criteria
In this setup, we implemented grouping based on status code for the http_logs workload with bulk_indexing_client = 1. Here we separated logs of successful requests from those with error (4xx) and fault (5xx) status codes. Segments are initially flushed at the per status code level (sub grouping criteria). The Context Aware merge policy will then start merging error and fault status segments together.
Performance Benchmark Metrics
Indexing
Size of index
Since data is completely sorted by timestamp, with the DataAware merge policy there is no considerable improvement in the size of the index.
Latency
Indexing latency remained almost the same with the DataAware merge policy as with the Tiered and LogByteSize merge policies.
Segment Count
Since there are only two groups of successful (2xx) and failed status segments (4xx and 5xx), the number of segments is almost the same as with the Tiered and LogByteSize merge policies.
Search
Similar to the scenario when bulk_indexing_client > 1, with a single bulk indexing client and the status code based grouping strategy we see a considerable improvement in the performance of range, aggregation and sort queries (ordered by timestamp) for fault/error logs within a specific time range. This efficiency is again attributed to the smaller number of iterations needed to find fault/error logs, as they are spread across fewer segments compared to the Tiered and LogByteSize merge policies.
@RS146BIJAY: We should also explore whether the predicate can be deduced from the user's frequent queries.
As part of the first phase, we will ask for this as an input from the user. We will eventually explore how we can automatically determine the grouping criteria based on the customer's workload.
@RS146BIJAY so it becomes a function of the group? And in this case, if the cardinality of the group is high, we could easily OOM the JVM, right? We need guardrails in place.
@reta Yes. We will be implementing proper guardrails on the grouping criteria so as not to allow groups that are too small or too large.
Abstract
This RFC proposes a new context aware/based segment creation and merging strategy for OpenSearch to improve query performance by co-locating related data within the same physical segments, particularly benefiting log analytics and metrics use cases.
Motivation
In OpenSearch, a typical workload involves log analytics and metrics data, where for the majority of search queries only a subset of the data is relevant. For instance, when analyzing application logs, users are often more interested in error (4XX) and/or fault (5XX) requests, which generally constitute only a minor portion of the logs. Current segment creation (via flush) and merging policies/strategies (both Tiered and LogByteSize) do not incorporate the anticipated query context while grouping data into segments.
This leads to segments containing a mix of relevant and less relevant documents. We can improve query performance by grouping relevant data together and removing this dispersion.
Proposal
The solution introduces context aware/based segments that group related data in the same physical segments. This grouping is achieved through a user-defined predicate which can be specified as a configuration. Predicate evaluation occurs during both flush and segment merge flows, ensuring that related data is consistently co-located in the same segments.
Example Use case
For application request logs, if the anticipated queries will be on status codes, the user can define a predicate, as a configuration, to group data based on status codes (for example, group all successful (2xx), error (4xx) and fault (5xx) status codes separately). This ensures that during indexing the same DWPT gets assigned to log entries with the same status code group, and the ContextAware merge policy ensures that segments with the same status codes get merged together. Consequently, search queries like “number of faults in the last hour” or “number of errors in the last three hours” will be more efficient, as they will only need to process segments with 4xx or 5xx status codes, which is a much smaller dataset, improving query performance.
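As an illustration only (the exact predicate/configuration surface will be covered in a follow-up issue), a status code grouping criterion could look roughly like the following hypothetical sketch:

```java
import java.util.function.IntFunction;

/**
 * Illustrative sketch only: a grouping criterion that separates
 * successful requests from error/fault requests, so 4xx/5xx log entries
 * end up in their own, much smaller set of segments. The merge policy
 * may later merge the error and fault groups together.
 */
public class StatusCodeGroupingCriteria implements IntFunction<String> {

    @Override
    public String apply(int statusCode) {
        if (statusCode >= 200 && statusCode < 300) {
            return "success";   // 2xx: the bulk of the traffic
        } else if (statusCode >= 400 && statusCode < 500) {
            return "error";     // 4xx: flushed into its own status group
        } else if (statusCode >= 500 && statusCode < 600) {
            return "fault";     // 5xx: flushed into its own status group
        }
        return "other";         // informational, redirects, etc.
    }

    public static void main(String[] args) {
        StatusCodeGroupingCriteria grouping = new StatusCodeGroupingCriteria();
        System.out.println(grouping.apply(200)); // success
        System.out.println(grouping.apply(404)); // error
        System.out.println(grouping.apply(503)); // fault
    }
}
```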
Merging segment groups
The context aware/based merge policy employs a hierarchical merging strategy to merge segments evaluated into the same group (based on the configured predicate). This strategy orchestrates the merging process across multiple hierarchical levels, in a way that reflects the natural hierarchy of data attributes. In this approach, segments are first created at the lowest level of the hierarchy (for example, hourly), and lower-level segments are merged into a higher-level segment (for example, daily) only once their combined size within that group crosses a configured threshold (300 MB in the POC).
This approach ensures data within segments is nearly ordered, improving query performance, as skipping non-competitive documents via BKD works best when data is sorted. Moreover, this strategy reduces the frequency of merges, as merges to a higher level are only executed upon reaching the threshold, thereby enhancing indexing performance. The trade-off, however, is an increased number of initial segments.
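As a rough, hypothetical sketch of the day-level merge decision described above (not the actual merge policy implementation), assuming hourly segments tagged with a day group and a 300 MB day-level threshold:

```java
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

/**
 * Hypothetical sketch: hourly segments belonging to the same day are only
 * selected for a daily merge once their combined size crosses a threshold
 * (300 MB in the POC), and they are merged in increasing order of hour so
 * the resulting daily segment stays nearly sorted by timestamp.
 */
public class DayLevelMergeDecision {

    /** Minimal stand-in for the per-segment metadata a merge policy would track. */
    public record HourlySegment(String dayGroup, int hourOfDay, long sizeInBytes) {}

    private static final long DAY_MERGE_THRESHOLD_BYTES = 300L * 1024 * 1024;

    /** Returns, per day group that crossed the threshold, the hourly segments to merge, ordered by hour. */
    public static Map<String, List<HourlySegment>> findDailyMerges(List<HourlySegment> segments) {
        return segments.stream()
                .collect(Collectors.groupingBy(HourlySegment::dayGroup))
                .entrySet().stream()
                .filter(e -> e.getValue().stream().mapToLong(HourlySegment::sizeInBytes).sum() >= DAY_MERGE_THRESHOLD_BYTES)
                .collect(Collectors.toMap(
                        Map.Entry::getKey,
                        e -> e.getValue().stream()
                                .sorted(Comparator.comparingInt(HourlySegment::hourOfDay))
                                .collect(Collectors.toList())));
    }
}
```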
Considerations
POC Links
grouping criteria: day based grouping for threshold = 300 MB
OpenSearch changes: main...RS146BIJAY:OpenSearch:data-aware-merge-policy
Lucene changes: apache/lucene@main...RS146BIJAY:lucene:grouping-segments
Next Steps
Would appreciate any feedback on the overall idea and proposal. We are in the process of assessing benchmarks for memory usage, disk space usage, throughput and latency with this optimization. We will compile the results and publish them soon.