[BUG] Index size on OpenSearch is bigger than Elasticsearch #3769
Comments
Can you confirm the index codec being used in both cases? I'm not sure if it is related, but there was a reduction in block size in Lucene, introduced in 8.10, that improved performance at the cost of a worse compression ratio. The upshot is that Elasticsearch 7.10/OpenSearch 1.0 would have the better compression ratio but slower queries compared to newer versions of OpenSearch, if using the "best_speed" (or "default") index codec.
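For reference, a quick way to confirm the codec in effect on an index in either cluster is something like the following sketch (Python with `requests`; the endpoint and index name are placeholders). `index.codec` defaults to "default" (LZ4/BEST_SPEED), so it only shows up when `include_defaults=true` is passed:

```python
import requests

# Hypothetical endpoint and index name; adjust to your cluster.
CLUSTER = "http://localhost:9200"
INDEX = "index_with_source_16"

# filter_path trims the response down to just the codec setting.
resp = requests.get(
    f"{CLUSTER}/{INDEX}/_settings",
    params={"include_defaults": "true", "filter_path": "**.codec"},
)
print(resp.json())
```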
+1. We observed the same issue: there is a 10-15% increase in size when OpenSearch is used for the same dataset indexed to ES. This can easily be reproduced by indexing a common dataset into both. We also noticed a comparable regression in indexing time; when it comes to GBs of data, the time aspect is critical as well.
Tagging @sarthakaggarwal97 to take a look.
Took a dive into Lucene to understand what has changed between ES 7.10 and OS 2.x for the stored fields (.fdt files). The prominent difference in the .fdt files can be explained by a few changes Lucene made to the block size, starting with Lucene v8.7.0. In the Lucene 8.7.0 release, the block size used for compression appears to be 60KB. Since Lucene 8.10, the block size has been 8KB for the BEST_SPEED codec. As mentioned by @andrross, the change was made in the interest of performance, with a trade-off in storage. Some time back, we benchmarked the performance of different block sizes: #7475 (comment). We observed improvements in both performance and store size with a 16K block size. I'll take a stab at it again with different block sizes to find the optimal configuration.
I am able to reproduce roughly a 7-10% difference in the size of the stored fields between ES 7.10 and OS 2.x using the nyc_taxis dataset.
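For anyone wanting to reproduce the measurement, a small sketch that sums the stored-fields (.fdt) file sizes under each node's data directory; the paths are assumptions for a default package install:

```python
import os

def fdt_bytes(data_dir: str) -> int:
    """Sum the sizes of all stored-fields (.fdt) files under a node's data dir."""
    total = 0
    for root, _dirs, files in os.walk(data_dir):
        for name in files:
            if name.endswith(".fdt"):
                total += os.path.getsize(os.path.join(root, name))
    return total

# Hypothetical data directories for each node under test.
es_size = fdt_bytes("/var/lib/elasticsearch/nodes/0")
os_size = fdt_bytes("/var/lib/opensearch/nodes/0")
print(f"ES .fdt: {es_size} bytes, OS .fdt: {os_size} bytes, "
      f"delta: {(os_size - es_size) / es_size:.1%}")
```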
I ran some more benchmarks and would like to share the results. With the mappings shared by @oded-dd, I ran two types of experiments, and the results are quite interesting.

Experiment #A
Using the index mappings, I took a sample document and ingested duplicate documents into ES 7.10 (60K block size), OS 2.11 (8K block size), and OS 2.11 (60K block size). Post ingestion, each index was force merged into a single segment for an accurate comparison (see the sketch after Experiment #B).
Since the field values across the documents were exactly the same, a much higher compression ratio was expected. With this, we were able to root-cause the increase in the .fdt files (using the mappings of @oded-dd) between ES 7.10 and OS 2.11 to the block size change in Lucene.

Experiment #B
In this experiment, I took the same index mappings but ingested unique field values in the documents. This time, I ingested 30M unique documents with the same mappings into both ES 7.10 and OS 2.11.
With unique documents, the compression ratio does not vary significantly with just the switch of block size from 60K to 8K.
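A minimal sketch of the ingest-then-force-merge setup used in both experiments, run once per cluster (the endpoint, index name, sample document, and document count are placeholders):

```python
import json
import requests

CLUSTER = "http://localhost:9200"  # repeat for each cluster under test
INDEX = "dup_docs_test"
DOC = {"field_a": "same value", "field_b": 42}  # one sample document, repeated

# Bulk-ingest the same document many times (duplicate values compress well).
lines = []
for _ in range(10_000):
    lines.append(json.dumps({"index": {"_index": INDEX}}))
    lines.append(json.dumps(DOC))
requests.post(
    f"{CLUSTER}/_bulk",
    data="\n".join(lines) + "\n",
    headers={"Content-Type": "application/x-ndjson"},
)

# Force merge down to a single segment so the store sizes are comparable.
requests.post(f"{CLUSTER}/{INDEX}/_forcemerge", params={"max_num_segments": 1})
```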
Summary
Why did Lucene change the block size from 60K to 8K?
I ran the experiments using the top_hits aggregation, which queries over stored fields, and found that 8K outperforms the other block sizes. Note: this is specific to queries over stored fields; in general, queries don't rely on stored fields much. This confirms the block size change was attributed to improved query performance.
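For context, a top_hits aggregation of the kind used here fetches the _source/stored fields of the top documents per bucket, so it exercises stored-field (.fdt) decompression. A sketch of such a query (the endpoint and the `vendor_id` field are assumptions for a nyc_taxis-style index):

```python
import requests

query = {
    "size": 0,
    "aggs": {
        "by_vendor": {
            "terms": {"field": "vendor_id"},          # hypothetical keyword field
            "aggs": {"top": {"top_hits": {"size": 3}}},  # forces stored-field reads
        }
    },
}
resp = requests.post("http://localhost:9200/nyc_taxis/_search", json=query)
print(resp.json()["took"], "ms")
```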
I re-ran the experiments for the similar documents, using the mappings in the issue, with a larger dataset (7GB in ES 7.10).
In summary, Zstandard (level 3) saves 24% on storage when compared to ES 7.10. Currently, OpenSearch supports six compression levels; as the compression level increases, we tend to see a higher compression ratio at a trade-off in speed.
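For anyone wanting to try this, a sketch of creating an index with the Zstandard codec, assuming the custom-codecs plugin is installed (the endpoint and index name are placeholders; "zstd" and "zstd_no_dict" are the codec values, with compression levels 1-6):

```python
import requests

settings = {
    "settings": {
        "index.codec": "zstd",               # or "zstd_no_dict"
        "index.codec.compression_level": 3,  # level 3 was used in the runs above
    }
}
resp = requests.put("http://localhost:9200/zstd_test", json=settings)
print(resp.json())
```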
Thanks @sarthakaggarwal97 for sharing the storage numbers. What is the performance impact with respect to throughput of zstd level 3 and zstd_no_dict level 3 in the benchmarks you executed recently? Is it in line with https://opensearch.org/docs/latest/im-plugin/index-codecs/#benchmarking for the workload you tested as well?
Indexing Performance
During the indexing of these duplicate documents, the CPU of the data node hovered around 20-25% for all the runs. Average indexing rate during the runs:
Query Performance
Sharing numbers for the query focused on stored fields for the nyc_taxis dataset. Zstandard compression provides optimal query performance alongside an improved compression ratio over LZ4 with the 60K block size (ES 7.10).
Thanks @sarthakaggarwal97 for the detailed analysis. While Zstandard is a good fit for this use case (duplicate values), what are your thoughts on exposing block size as a configurable parameter?
Currently, there is no easy way to expose the block size as a configurable parameter. Supporting this would add maintenance overhead on every Lucene minor/major version release. Given that we have the Zstandard compression codecs in the custom-codecs plugin as an alternative for such cases, we may not need to expose this configuration as a setting.
Considering an alternative (Zstandard) is already available to meet this use case, I agree that introducing a new setting via a custom codec would add maintenance overhead. We can revisit the decision if a pressing need arises.
In another experiment, I varied two fields, id and time, in order to replicate real-world scenarios (a sketch of such a generator is below).
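A minimal sketch of a generator along those lines (the base document and field names besides `id` and `time` are hypothetical):

```python
import itertools
import json
from datetime import datetime, timedelta, timezone

def docs(base: dict, n: int):
    """Yield copies of a base document where only `id` and `time` vary."""
    start = datetime(2023, 1, 1, tzinfo=timezone.utc)
    for i in range(n):
        doc = dict(base)
        doc["id"] = f"doc-{i}"                            # unique per document
        doc["time"] = (start + timedelta(seconds=i)).isoformat()
        yield doc

# Print the first two generated documents as a sanity check.
for d in itertools.islice(docs({"field_a": "same value"}, 1_000_000), 2):
    print(json.dumps(d))
```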
Describe the bug
OpenSearch: the .fdt Lucene file storage is much bigger than for the same index on Elasticsearch.
To Reproduce
While comparing document sizes on OpenSearch (2.0.1) vs. Elasticsearch (7.10.2), we came up with the following results:
Scenario: indexing an identical dataset.
The dataset comprises the same data, the same index mapping, and the same settings.
In this scenario there was a big increase in index size on OpenSearch (2.0.1) compared to Elasticsearch (7.10.2).
We noticed that the .fdt file is larger in OpenSearch 2.0.1.
See attached filesystem statistics and file sizes.
I have also attached my script that ingests data into Elasticsearch and OpenSearch; a sketch of the approach follows.
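The attached script isn't reproduced here; a minimal sketch of the idea is to send the same bulk payload to both clusters so the on-disk comparison uses byte-identical input (hosts, index name, and dataset file are placeholders):

```python
import json
import requests

ES = "http://localhost:9200"   # Elasticsearch 7.10.2
OS = "http://localhost:9201"   # OpenSearch 2.0.1
INDEX = "index_with_source_16"

# Build one bulk payload from a hypothetical newline-delimited JSON dataset.
with open("dataset.ndjson") as f:
    docs = [json.loads(line) for line in f]

lines = []
for doc in docs:
    lines.append(json.dumps({"index": {"_index": INDEX}}))
    lines.append(json.dumps(doc))
payload = "\n".join(lines) + "\n"

# Ingest the identical payload into both clusters.
for host in (ES, OS):
    requests.post(f"{host}/_bulk", data=payload,
                  headers={"Content-Type": "application/x-ndjson"})
```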
We also took a snapshot from Elasticsearch 7.10.2 and restored it to OpenSearch 2.0.1; in that case the sizes are close to each other.
Index with source stats on Elasticsearch 7.10.2:
_cat/indices/index_with_source_16?format=json&bytes=b
Index with source stats on OpenSearch 2.0.1:
_cat/indices/index_with_source_16?format=json&bytes=b
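A sketch of pulling and diffing the store sizes from both clusters using the same _cat call quoted above (`bytes=b` returns raw byte counts for easy comparison; the hosts are placeholders):

```python
import requests

def store_bytes(host: str, index: str) -> int:
    """Return the store size in bytes for an index via _cat/indices."""
    resp = requests.get(f"{host}/_cat/indices/{index}",
                        params={"format": "json", "bytes": "b"})
    return int(resp.json()[0]["store.size"])

es = store_bytes("http://localhost:9200", "index_with_source_16")
os_ = store_bytes("http://localhost:9201", "index_with_source_16")
print(f"ES: {es} b, OS: {os_} b, delta: {(os_ - es) / es:.1%}")
```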
Expected behavior
The index on OpenSearch should be identical in size to the same index on Elasticsearch.
Additional Files
indicesstats.zip
Host/Environment:
Ubuntu 20.04