[Performance Idea] Granular Choice for Stored Fields Compression Algorithm #11605
Comments
@shwetathareja @backslasht @msfroh Looking forward to your thoughts on this.
Thanks @mgodwan for the proposal.
Trying to understand the use case better. What would be a real-world scenario for this?
One of the examples I can think of is around eCommerce order/payment transaction data, where clients may start by creating a document, and as the lifecycle of the order/transaction changes (e.g. payment gets authorized/captured), updates are applied to the same document within a few seconds.
Merging is a continuous process. Along with immediate-update use cases (where updates land within a very short period, say seconds, after document creation), this should also benefit the merging of newly created segments as they go through multiple merge cycles, saving on decompression/compression overhead.
@shwetathareja - Interesting thought. Are you suggesting we compress only when the segment under creation is greater than X bytes in size? @mgodwan - is this only applicable for [...]? Trying to gauge the advantage it provides, considering the complexity it brings to the decision logic of whether to compress or not (and/or possibly compress using a faster compression algorithm).
@backslasht This was the intended solution in this proposal, since over time the older, smaller segments will be merged into larger segments. The approach can be to either skip compression or use a faster compression for smaller/recent segments, and storage-optimizing compression for older/larger segments.
I'll need to measure the trade-off between disk latency/IOPS and CPU usage, but characteristically speaking, we will need to analyse how this behaves across all compression algorithms. LZ4 usually takes less CPU for compression/decompression, but the overall indexing throughput observed with it is lower as well.
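To make the size-based lever concrete, below is a minimal, illustrative sketch of the decision logic being discussed. The enum names, the 64 MB threshold, and the merge flag are assumptions for illustration only and do not correspond to any existing OpenSearch/Lucene API.

```java
// Illustrative sketch only: models the size-based compression choice discussed above.
// The threshold name/value and the enum are assumptions, not part of the proposal.
public final class CompressionPolicySketch {

    enum StoredFieldsCompression { NONE, FAST_LZ4, BEST_COMPRESSION }

    // Hypothetical threshold: segments smaller than this skip (or lighten) compression.
    private static final long SMALL_SEGMENT_BYTES = 64L * 1024 * 1024; // 64 MB

    static StoredFieldsCompression choose(long estimatedSegmentSizeBytes, boolean isMerge) {
        if (!isMerge && estimatedSegmentSizeBytes < SMALL_SEGMENT_BYTES) {
            // Freshly flushed, small segment: likely to be read/updated soon and merged
            // away shortly, so avoid paying compression/decompression cost.
            return StoredFieldsCompression.NONE;
        }
        if (estimatedSegmentSizeBytes < SMALL_SEGMENT_BYTES) {
            // Small segment produced by a merge: a cheap codec keeps CPU usage low.
            return StoredFieldsCompression.FAST_LZ4;
        }
        // Large, long-lived segment: optimize for storage.
        return StoredFieldsCompression.BEST_COMPRESSION;
    }

    public static void main(String[] args) {
        System.out.println(choose(8L * 1024 * 1024, false));   // NONE
        System.out.println(choose(32L * 1024 * 1024, true));   // FAST_LZ4
        System.out.println(choose(512L * 1024 * 1024, true));  // BEST_COMPRESSION
    }
}
```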
Thanks @mgodwan for the explanation. I agree this idea is worth exploring. I would lean towards the size of the segment (instead of the recency of the document) to decide on the compression logic.
Coming here from #13110. Sharing some numbers around the experiments. Benchmarks:
Note: +ve means improvement, -ve means degradation from the current behaviour.

Variance in storage during the indexing of the NYC Taxis workload: there are steeper dips in disk storage, but it quickly recovers as we reach the 64 MB segment-size threshold.

Benchmarking setup:
Thanks @sarthakaggarwal97 for the experiments. It is interesting to see the impact on read and write IOPS when compression is disabled for smaller segments. A couple of follow-up questions:
Using a similar benchmarking setup as here, I employed a custom workload to trigger indexing and search simultaneously. Sharing the benchmarks around it: we are introducing hybrid compression in stored fields, and most of the queries available across the different workloads (nyc_taxis, http_logs, etc.) do not directly search on the .fdt files. Moreover, search would be performant for the more recently indexed data, since we save on decompression compute. For the data that was already indexed and compressed during merges, we did not see any regression.
Explored introducing the idea of hybrid compression during flushes, similar to merges. Broadly, there are two reasons a stored-fields flush can be triggered: either a flush triggered by external factors like a refresh or translog flush, or a flush triggered internally by the stored fields writer when the chunk-size and number-of-documents thresholds are met. Here, whenever the flush was triggered by internal factors, the data was not compressed, and the chunk-size threshold was increased to a few MBs from the present 8 KB. If the flush happened due to external factors, we checked whether data is already written in the .fdt file of the segment; if yes, we would pick the compression type already chosen (codec compression or no compression), else we would compress using the codec as default. Sharing the POC implementation for hybrid flushes and merges:
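As a reading aid for the flush-trigger decision described above, here is a minimal sketch. The enum names, the 4 MB chunk threshold, and the method signature are illustrative assumptions and are not taken from the actual POC.

```java
import java.util.Optional;

// Illustrative sketch of the hybrid-flush decision described above.
// FlushTrigger, ChunkCompression, and the 4 MB threshold are assumed names/values.
public final class HybridFlushSketch {

    enum FlushTrigger { INTERNAL_CHUNK_FULL, EXTERNAL_REFRESH_OR_TRANSLOG }
    enum ChunkCompression { NONE, CODEC_DEFAULT }

    // Hypothetical raised chunk threshold (a few MB instead of the usual ~8 KB).
    static final long CHUNK_SIZE_BYTES = 4L * 1024 * 1024;

    static ChunkCompression chooseForChunk(FlushTrigger trigger,
                                           Optional<ChunkCompression> alreadyChosenForSegment) {
        if (trigger == FlushTrigger.INTERNAL_CHUNK_FULL) {
            // Chunk filled up while documents are still being indexed: skip compression.
            return ChunkCompression.NONE;
        }
        // External flush (refresh, translog flush, ...): if the segment's .fdt file already
        // has data, stay consistent with the compression chosen for it; otherwise fall back
        // to the codec's default compression.
        return alreadyChosenForSegment.orElse(ChunkCompression.CODEC_DEFAULT);
    }

    public static void main(String[] args) {
        System.out.println(chooseForChunk(FlushTrigger.INTERNAL_CHUNK_FULL, Optional.empty()));          // NONE
        System.out.println(chooseForChunk(FlushTrigger.EXTERNAL_REFRESH_OR_TRANSLOG, Optional.empty())); // CODEC_DEFAULT
        System.out.println(chooseForChunk(FlushTrigger.EXTERNAL_REFRESH_OR_TRANSLOG,
                                          Optional.of(ChunkCompression.NONE)));                          // NONE
    }
}
```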
Is your feature request related to a problem? Please describe.
Today, an operator can set the `index.codec` setting, which applies the required compression technique based on the value set by the operator (e.g. zstd, deflate, lz4). Whenever a segment is written, the configured compression algorithm is applied by Lucene to the stored fields.

For use cases where the customer is retrieving recently ingested documents in order to update them, it may be better not to compress, as we need to retrieve the stored `_source` field, which has to be decompressed and may require additional CPU usage. Once the segments are merged into larger segments and the documents may no longer be frequently accessed, we can compress to take advantage of the reduced storage size and to reduce write amplification due to the smaller size.

The lever can be based on different parameters such as FieldInfo, segment size, etc., and can be incorporated into the merge policy or a per-field codec configuration, rather than being based only on temporal locality.
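For the FieldInfo-based variant of the lever mentioned above, here is a minimal, purely illustrative sketch of a per-field choice. The class, enum, and the HOT_FIELDS set are hypothetical and do not reflect any existing codec configuration.

```java
import java.util.Set;

// Illustrative only: a lever keyed on the field rather than the segment.
// The field names and the HOT_FIELDS set are hypothetical examples.
public final class PerFieldCompressionSketch {

    enum StoredFieldsCompression { NONE, CODEC_DEFAULT }

    // Hypothetical: stored fields fetched on the hot update path stay uncompressed.
    private static final Set<String> HOT_FIELDS = Set.of("_source");

    static StoredFieldsCompression chooseForField(String fieldName) {
        return HOT_FIELDS.contains(fieldName)
                ? StoredFieldsCompression.NONE
                : StoredFieldsCompression.CODEC_DEFAULT;
    }

    public static void main(String[] args) {
        System.out.println(chooseForField("_source"));      // NONE
        System.out.println(chooseForField("description"));  // CODEC_DEFAULT
    }
}
```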
Describe the solution you'd like
This is a rough idea; I'm looking for feedback on whether it may be a good idea to explore.
Additional context
I was analyzing the performance of an update benchmark workload, where I observed that recently ingested documents were being retrieved for updates and CPU cycles were being spent on decompression.