Restrict the maximum size of value set by default limit #208

Merged

@dai-chen dai-chen commented Dec 27, 2023

Description

  1. Index building: added a check on the output size of the collect set function. Specifically, SELECT COLLECT_SET(col) ... is changed to SELECT IF(ARRAY_SIZE(COLLECT_SET(col)) > limit, null, COLLECT_SET(col)), where the default limit is 100.
  2. Query rewrite: the WHERE condition is rewritten so that a file is kept if its value set contains the given value or its value set is NULL.
  3. Updated the user manual.
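
The first two items can be modeled outside Spark. The following is a minimal Python sketch of the semantics described above, not the actual Flint implementation; the function names `build_value_set` and `may_contain` are hypothetical and chosen only for illustration.

```python
from typing import Iterable, List, Optional

DEFAULT_LIMIT = 100  # default limit stated in the description above


def build_value_set(values: Iterable[Optional[str]],
                    limit: int = DEFAULT_LIMIT) -> Optional[List[str]]:
    """Model IF(ARRAY_SIZE(COLLECT_SET(col)) > limit, NULL, COLLECT_SET(col))."""
    # COLLECT_SET keeps distinct non-null values only.
    distinct = {v for v in values if v is not None}
    if len(distinct) > limit:
        return None  # too many distinct values: store no value set for this file
    return sorted(distinct)  # sorted only to make this sketch deterministic


def may_contain(value_set: Optional[List[str]], value: str) -> bool:
    """Model the rewritten condition: value set is NULL OR value set contains value."""
    return value_set is None or value in value_set
```

A NULL value set means "unknown", so the file must always be scanned; only a non-NULL set that lacks the value allows the file to be skipped.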

TODO

  • 2nd PR: Make limit configurable in SQL DDL statement
  • 3rd PR: Add new CollectSetLimit function to reduce memory consumption [TBD]

Testing

Max size hardcoded to 2 for this test:

spark-sql>
... CREATE TABLE stream.value_set_test
... (name STRING)
... USING CSV
... LOCATION 's3://.../value_set_test/';

spark-sql>
... INSERT INTO stream.value_set_test 
... SELECT /*+ COALESCE(1) */ *  -- make sure all values are inserted into a single file
... FROM VALUES ('hello'), ('world');

spark-sql>
... INSERT INTO stream.value_set_test
... SELECT /*+ COALESCE(1) */ *
... FROM VALUES ('hello'), ('world'), ('test');

spark-sql>
... CREATE SKIPPING INDEX ON stream.value_set_test
... (name VALUE_SET)
... WITH (auto_refresh = true);

Check skipping index data in OpenSearch:

POST flint_myglue_stream_value_set_test_skipping_index/_search
    ...
    "hits": [
      {
        "_index": "flint_myglue_stream_value_set_test_skipping_index",
        "_id": "5ca850fcaa03bc42a39edb7d3fed56ada5ac2083",
        "_score": 1,
        "_source": { # This doc doesn't have value set field due to limit reached
          "file_path": "s3://.../value_set_test/part-00000-e9a73f50-69f7-4a16-bdef-bd401d294e04-c000.csv"
        }
      },
      {
        "_index": "flint_myglue_stream_value_set_test_skipping_index",
        "_id": "02d492e74dfbde571d9dfdbf7552e0180a17ddc6",
        "_score": 1,
        "_source": {
          "file_path": "s3://.../value_set_test/part-00000-8bfcd46b-2975-4074-ba9b-e9de27804bbf-c000.csv",
          "name": [
            "world",
            "hello"
          ]
        }
      }
    ]

Query rewrite:

spark-sql> SELECT input_file_name() FROM stream.value_set_test WHERE name = 'hello';
s3://.../value_set_test/part-00000-e9a73f50-69f7-4a16-bdef-bd401d294e04-c000.csv
s3://.../value_set_test/part-00000-8bfcd46b-2975-4074-ba9b-e9de27804bbf-c000.csv

spark-sql> SELECT input_file_name() FROM stream.value_set_test WHERE name = 'test';
s3://.../value_set_test/part-00000-e9a73f50-69f7-4a16-bdef-bd401d294e04-c000.csv

spark-sql> EXPLAIN SELECT input_file_name() FROM stream.value_set_test WHERE name = 'test';
== Physical Plan ==
*(1) Project [input_file_name() AS input_file_name()#17]
+- *(1) Filter (isnotnull(name#0) AND (name#0 = test))
   +- FileScan csv stream.value_set_test[name#0] Batched: false,
            DataFilters: [isnotnull(name#0), (name#0 = test)], Format: CSV,
            Location: FlintSparkSkippingFileIndex(1 paths)[s3://.../value_set_test],
            PartitionFilters: [], PushedFilters: [IsNotNull(name), EqualTo(name,test)],
            ReadSchema: struct
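
The file selection seen in the two queries above can be reproduced with a small simulation. This is a sketch under assumptions: the file names `over_limit.csv` and `within_limit.csv` are hypothetical stand-ins for the two index documents shown earlier, and `files_to_scan` is an illustrative helper, not a Flint API.

```python
from typing import Any, Dict, List

# Hypothetical, shortened stand-ins for the two skipping index documents.
index_docs: List[Dict[str, Any]] = [
    # 3 distinct values exceed the test limit of 2, so no "name" field is stored.
    {"file_path": "over_limit.csv"},
    # 2 distinct values fit within the limit, so the value set is stored.
    {"file_path": "within_limit.csv", "name": ["world", "hello"]},
]


def files_to_scan(docs: List[Dict[str, Any]], value: str) -> List[str]:
    """Keep a file if its value set is missing (NULL) or contains the value."""
    return [
        doc["file_path"]
        for doc in docs
        if "name" not in doc or value in doc["name"]
    ]


print(files_to_scan(index_docs, "hello"))  # ['over_limit.csv', 'within_limit.csv']
print(files_to_scan(index_docs, "test"))   # ['over_limit.csv']
```

This matches the results above: 'hello' reads both files, while 'test' reads only the file whose value set was dropped because the limit was exceeded.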

Issues Resolved

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

@dai-chen dai-chen added the bug label Dec 27, 2023
@dai-chen dai-chen self-assigned this Dec 27, 2023
@dai-chen dai-chen marked this pull request as ready for review December 28, 2023 01:59
@dai-chen dai-chen added the 0.2 label Jan 5, 2024
@dai-chen dai-chen merged commit c72d773 into opensearch-project:main Jan 5, 2024
4 checks passed
@dai-chen dai-chen deleted the restrict-collect-set-size branch January 5, 2024 01:13