[FEATURE] Add bloom filter skipping index type #206
Comments
BloomFilter Build PoC (Global Parameters)

The basic idea is to reuse Spark's http_logs data set for the test. There are 1045 files in different partitions on an S3 bucket.
Use a UDF to convert the IP address to a Long. Test the BloomFilter with 100K expected items and the default 0.03 FPP:
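A minimal sketch of this step, assuming Spark's built-in sketch `BloomFilter` via `DataFrameStatFunctions` (the S3 path and column names are illustrative):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.udf

val spark = SparkSession.builder().appName("bloom-filter-build-poc").getOrCreate()
import spark.implicits._

// Convert a dotted IPv4 string (e.g. "66.249.66.1") to a Long
val ipToLong = udf { ip: String =>
  ip.split('.').foldLeft(0L)((acc, octet) => (acc << 8) | octet.toLong)
}

val logs = spark.read.json("s3://my-bucket/http_logs/")   // illustrative path
  .withColumn("clientip_long", ipToLong($"clientip"))

// Build a bloom filter over the converted column: 100K expected items, 3% FPP
val bf = logs.stat.bloomFilter("clientip_long", 100000L, 0.03)
println(s"Serialized size ~ ${bf.bitSize() / 8} bytes")
```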
Bloom filter size is ~0.1MB per file (a BF with 1000K expected items takes 0.6MB, but the Flint data source has a problem with long binary values):
Tested BF with 1000K expected items:
Query Rewrite PoC

By default, doc_values is disabled for the OpenSearch binary field type. In this PoC, we create a new index with doc_values enabled:
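A sketch of what such a mapping could look like (field names are illustrative): one doc per source file, with the serialized bloom filter stored in a binary field that has doc_values enabled so the might-contain check can read it at query time.

```scala
val skippingIndexMapping =
  """{
    |  "mappings": {
    |    "properties": {
    |      "file_path": { "type": "keyword" },
    |      "clientip":  { "type": "binary", "doc_values": true }
    |    }
    |  }
    |}""".stripMargin
```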
The index size for a BF with 1M expected items increases to 1.6GB (1045 docs in total => ~1.6MB BF per file).
Push down Spark's BloomFilterMightContain function to OpenSearch:
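Conceptually, the rewritten query asks, per source file, whether that file's stored bloom filter might contain the looked-up value. A rough sketch of the check using Spark's sketch `BloomFilter` (deserialization details are an assumption; the PoC pushes this evaluation down to OpenSearch rather than running it in Spark):

```scala
import java.io.ByteArrayInputStream
import org.apache.spark.util.sketch.BloomFilter

// Keep a source file only if its serialized bloom filter might contain the value
def mightContain(serializedBf: Array[Byte], ipAsLong: Long): Boolean =
  BloomFilter.readFrom(new ByteArrayInputStream(serializedBf)).mightContainLong(ipAsLong)
```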
Result:
Verify:
More tests:
BloomFilter Storage Optimization in OpenSearch
Quickly tested the following index settings but did not find much difference:
BloomFilter Algorithm Parameter Selection Strategy

Here are different strategies for determining Bloom filter algorithm parameters:
Adaptive Bloom Filter

Pseudocode:
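A minimal Scala sketch of the idea, assuming 10 candidate filters whose expected NDV doubles from 1K to 512K, built on Spark's sketch `BloomFilter`; the cardinality tracking here is simplified to an insertion count:

```scala
import org.apache.spark.util.sketch.BloomFilter

class AdaptiveBloomFilter(fpp: Double = 0.01) {
  // 10 candidates: expected NDV 1K, 2K, 4K, ..., 512K
  private val candidates: Seq[(Long, BloomFilter)] =
    (0 until 10).map { i =>
      val ndv = 1024L << i
      (ndv, BloomFilter.create(ndv, fpp))
    }
  private var itemsSeen = 0L   // upper bound on the column's cardinality

  def put(item: Long): Unit = {
    candidates.foreach { case (_, bf) => bf.putLong(item) }
    itemsSeen += 1
  }

  // Keep the smallest candidate whose expected NDV still covers the observed count;
  // smaller, saturated candidates are effectively discarded.
  def best: BloomFilter =
    candidates.collectFirst { case (ndv, bf) if ndv >= itemsSeen => bf }
      .getOrElse(candidates.last._2)
}
```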
Is your feature request related to a problem?
What solution would you like?
User Experience
Here is an example (see more details in the comments below):
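A sketch of what the user-facing DDL could look like; the table and column names, the BLOOM_FILTER skip type keyword, and the WITH options are assumptions:

```scala
// Create a skipping index with a bloom filter skip type on the high-cardinality column
spark.sql("""
  CREATE SKIPPING INDEX ON myglue.default.http_logs
  ( clientip BLOOM_FILTER )
  WITH ( auto_refresh = true )
""")

// A point lookup that can skip files whose bloom filter says "definitely not present"
spark.sql("SELECT * FROM myglue.default.http_logs WHERE clientip = '66.249.66.1'")
```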
Skipping index data in OpenSearch:
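Illustrative shape of a single skipping index document: one doc per source file, with the serialized bloom filter base64-encoded in the binary field (the path and value below are made up):

```scala
val exampleSkippingDoc =
  """{
    |  "file_path": "s3://my-bucket/http_logs/year=1998/part-00042.json.gz",
    |  "clientip": "CgAAAAEAAAAD6AAA..."
    |}""".stripMargin
```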
Proposed Solution
Design decisions from problem space to solution space:
Reuse Spark's BloomFilterAggregate and BloomFilterMightContain functions and map the bloom filter to the OpenSearch binary field type.

Proof of Concept
PoC branch: https://github.com/dai-chen/opensearch-spark/tree/bloom-filter-skipping-poc
- BloomFilterAggregate and BloomFilterMightContain expressions in the Flint Spark layer
- Extend FlintClient to support the binary type and might-contain function pushdown (cannot delegate to the SQL plugin because it is not a Flint dependency at the moment)

Benchmark
Preparation
Setup
a. OpenSearch cluster
b. Index setting/mapping: binary field with doc_values enabled
c. Query cache disabled
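For reference, a sketch of index settings matching item c, assuming the standard per-index query and request caches are what was disabled:

```scala
val benchmarkIndexSettings =
  """{
    |  "settings": {
    |    "index.queries.cache.enabled": false,
    |    "index.requests.cache.enable": false
    |  }
    |}""".stripMargin
```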
Test Data
a. http_logs log files in compressed JSON format
b. Total number of files = 1045, in different partitions on an S3 bucket
c. File size: 100K -> 25MB (compressed)
d. clientip column cardinality per file: 1 -> 180K
Test Cases (FPP is always 1%):
a. Static NDV 512K
b. Static NDV 256K: assumes the user has prior knowledge of the max clientip cardinality
c. Adaptive NDV with 10 BloomFilters from 1K to 512K
Test Result

| Configuration | Q2 |
| --- | --- |
| Static NDV 512K | 4.3 - 5.2 |
| Static NDV 256K (prior knowledge of column cardinality) | 3.8 - 4.3 |
| Adaptive NDV (10 BFs from 1K to 512K) | 6.7 - 7.6 |
Test Result Analysis
a. The adaptive config generates the smallest OS index, but its build latency and Q2 query time are slower than the other 2 configurations.
b. For build latency, this is expected because insertion happens on 10 BFs internally. This can be optimized with a variant BF that discards a candidate once it is saturated.
c. For Q2, this is mainly because the adaptive config picks a BF NDV just bigger than the number of unique values. With the same expected FPP and the same unique values inserted, a bigger NDV decreases the actual FPP (see the sketch below). In reality, we may choose a lower expected FPP, which consumes more space but yields a lower actual FPP. For example, if we choose 0.1% as the expected FPP, the OS index size doubles to 400M, but the FPP is much lower and may achieve the same performance as the Static 256K NDV config.
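A back-of-the-envelope sketch of why an oversized NDV lowers the actual FPP, using the standard bloom filter formulas (the numbers are illustrative, not taken from the benchmark):

```scala
// bits:   m = -n * ln(p) / (ln 2)^2      sized for expected NDV n and target FPP p
// hashes: k = (m / n) * ln 2
// actual: p' ~ (1 - e^(-k * n' / m))^k   after inserting n' distinct values
def bits(expectedNdv: Long, fpp: Double): Double =
  -expectedNdv * math.log(fpp) / math.pow(math.log(2), 2)

def actualFpp(expectedNdv: Long, fpp: Double, inserted: Long): Double = {
  val m = bits(expectedNdv, fpp)
  val k = math.max(1L, math.round(m / expectedNdv * math.log(2)))
  math.pow(1 - math.exp(-k * inserted / m), k.toDouble)
}

// A filter sized for 512K NDV but holding only ~180K values has a much lower
// actual FPP than one sized for 256K, at the cost of roughly double the bits.
println(actualFpp(512 * 1024, 0.01, 180 * 1024))
println(actualFpp(256 * 1024, 0.01, 180 * 1024))
```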
More Tests on FPP
FPP impact on size: