
[FEATURE] Support partial indexing for skipping and covering index #89

Open

dai-chen opened this issue Oct 21, 2023 · 4 comments

Labels: enhancement (New feature or request)

dai-chen (Collaborator) commented Oct 21, 2023

Is your feature request related to a problem?

Currently there is no way to provide a start timestamp or a WHERE clause in a CREATE INDEX statement, so skipping and covering indexes have to refresh data from the very beginning of the source table. This can cause unnecessary computation and wasted storage.

What solution would you like?

Support partial indexing by either:

  1. An index option such as startTime, or
  2. A generic WHERE clause that accepts any filtering condition, e.g. CREATE INDEX ... WHERE status != 200 WITH (...)

Note that one challenge here is the correctness of query rewrite: the skipping index query rewriter has to compare the index's filtering condition with the one in the query and decide whether the query can still be accelerated.
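For illustration, here is a minimal sketch of option 2 and the subsumption check the rewriter would need. The WHERE clause on CREATE is the proposed syntax, not existing Flint syntax, and the table and column names are made up:

```sql
-- Proposed (hypothetical) partial skipping index: only files containing rows
-- with status != 200 are reflected in the index metadata.
CREATE SKIPPING INDEX ON alb_logs ( status VALUE_SET )
WHERE status != 200
WITH ( auto_refresh = true );

-- Rewrite is safe: 'status = 404' implies 'status != 200', so every file the
-- query needs is covered by the index.
SELECT * FROM alb_logs WHERE status = 404;

-- Rewrite is unsafe: rows with status = 200 were never indexed, so this query
-- must fall back to a full table scan.
SELECT * FROM alb_logs WHERE status = 200;
```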

penghuo (Collaborator) commented Feb 20, 2024

One limitation of #124 is that the filter does not work if the table has no partition column.

Two use cases I can think of for adding a filter when creating a skipping index:

  1. Selective indexing for large datasets: Users dealing with extensive datasets may wish to index only a subset of the data. For example, a user with three years of data might opt to index only the most recent year.
  2. Backlog processing: When users create a skipping index over an entire large dataset, they often prefer the most recent data to be indexed immediately while historical data is processed incrementally. This ensures timely access to the latest information while gradually incorporating historical data into the index.

Proposal 1

  1. The user could create a table over a subset of the data with the modifiedAfter option, then create a skipping index on that table.
  2. Alternatively, the user could:
    • for indexing new files, create the skipping index with auto_refresh=true and modifiedAfter, so auto refresh only picks up new files;
    • for processing backlog files, call REFRESH SKIPPING INDEX on the table with modifiedBefore.

However, Spark Structured Streaming does not support modifiedAfter / modifiedBefore: https://issues.apache.org/jira/browse/SPARK-31962
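Because the streaming source lacks these options, only the on-demand (batch) refresh path could honor them today. A rough sketch of the statements this proposal implies; the modifiedAfter / modifiedBefore index options and the REFRESH ... WITH form are hypothetical, modeled on Spark's batch file-source read options:

```sql
-- Hypothetical: auto refresh covers only files newer than the cutoff.
-- (Blocked today: Structured Streaming does not support modifiedAfter.)
CREATE SKIPPING INDEX ON alb_logs ( status VALUE_SET )
WITH ( auto_refresh = true, modifiedAfter = '2024-01-01T00:00:00' );

-- Hypothetical: backfill the backlog with an on-demand batch refresh.
REFRESH SKIPPING INDEX ON alb_logs
WITH ( modifiedBefore = '2024-01-01T00:00:00' );
```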

penghuo (Collaborator) commented Feb 21, 2024

Proposal 2 - leverage file metadata (Preferred)

  • For indexing new files:

```sql
CREATE SKIPPING INDEX on alb_logs
WITH ( auto_refresh = true, notification = SNS://? )
```

  • For processing backlog files, for instance when the indexed data on S3 has the prefix 2023/12:

```sql
REFRESH SKIPPING INDEX on alb_logs
WHERE _metadata.file_path LIKE '%2023/%'
```

More reading: https://docs.databricks.com/en/ingestion/file-metadata-column.html
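For reference, _metadata is a hidden struct column on file-based tables that must be selected explicitly; it exposes per-file fields such as file_path, file_name, file_size, and file_modification_time. A small usage sketch (the alb_logs column name is illustrative):

```sql
SELECT
  _metadata.file_path,
  _metadata.file_modification_time,
  elb_status_code                      -- illustrative table column
FROM alb_logs
WHERE _metadata.file_path LIKE '%2023/12/%';
```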

penghuo (Collaborator) commented Feb 21, 2024

CREATE SKIPPING INDEX on table_name

  • refresh_on_create: whether to automatically refresh the skipping index after it is created.
    • TRUE (default); only takes effect when auto_refresh is TRUE.
    • FALSE
  • auto_refresh: whether automatic refresh can be scheduled on the index.
    • TRUE
    • FALSE (default)
  • notification: the notification mechanism.
    • NONE (default)
    • AWS SQS
  • interval: the refresh interval.
    • NONE: schedule immediately after the previous batch (not recommended for the external scheduler case).
    • Cron-based interval
  • incremental_refresh: incrementally refresh the index if set to TRUE; otherwise, fully refresh the entire index. Only applicable when auto_refresh = false.
    • TRUE
    • FALSE (default)
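Putting the options together, a hypothetical fully specified statement under this proposal (the queue URI and cron expression are made up for illustration):

```sql
CREATE SKIPPING INDEX ON alb_logs ( elb_status_code VALUE_SET )
WITH (
  refresh_on_create = true,
  auto_refresh = true,
  notification = 'sqs://flint-file-events',  -- hypothetical queue URI
  interval = '*/15 * * * *'                  -- hypothetical cron expression
);
```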

REFRESH SKIPPING INDEX on table_name [ WHERE <metadata predicate | partition predicate> ]

On-demand refresh of the skipping index:

  • By default, read files from the source table.
  • If notification is configured, read new data from the notification queue.
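For example, the two predicate styles might look as follows (the year/month partition columns are assumed for illustration):

```sql
-- Metadata predicate: limit the refresh to files under a path prefix.
REFRESH SKIPPING INDEX ON alb_logs
WHERE _metadata.file_path LIKE '%2023/12/%';

-- Partition predicate: limit the refresh to selected partitions.
REFRESH SKIPPING INDEX ON alb_logs
WHERE year = 2023 AND month = 12;
```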

ALTER SKIPPING INDEX on table_name SET auto_refresh = true/false

Enables or disables auto refresh of the skipping index on the table. Note: setting auto_refresh = false does not stop a currently running refresh job.

Limitation

  • If there are over 1M objects on S3, the user should configure notification and use REFRESH SKIPPING INDEX on table_name to backfill the old data. For instance, the user would execute:
    • CREATE SKIPPING INDEX on table_name WITH (auto_refresh = true, notification = sqs)
    • REFRESH SKIPPING INDEX on table_name WHERE _metadata.file_path LIKE '%2023%'

penghuo (Collaborator) commented Feb 26, 2024

Found a bug with the _metadata column in Spark 3.3.1/3.3.2; it is fixed in Spark 3.4 (apache/spark#39870). The following query:

```sql
SELECT _metadata, *
FROM alb_logs
WHERE _metadata.file_path LIKE '%2023/11/09%'
LIMIT 1;
```

fails with:

```
org.apache.spark.sql.AnalysisException: Column '_metadata' does not exist.
```
