
[FEATURE] Support partial indexing for skipping and covering index #89

Open

dai-chen opened this issue Oct 21, 2023 · 4 comments

Labels: enhancement (New feature or request)

dai-chen (Collaborator) commented Oct 21, 2023

Is your feature request related to a problem?

Currently there is no way to provide a start timestamp or a WHERE clause in a CREATE INDEX statement, so skipping and covering indexes have to refresh data from the very beginning of the source table. This can cause unnecessary computation and wasted storage.

What solution would you like?

Support partial indexing by either:

  1. An index option such as startTime, or
  2. A generic WHERE clause that accepts any filtering condition, e.g. CREATE INDEX ... WHERE status != 200 WITH (...)

Note that one challenge here is the correctness of query rewrite: the skipping index query rewriter has to compare the index's filtering condition with the one in the query and decide whether the query can still be accelerated.
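For illustration, here is a minimal sketch of option 2 and the subsumption check the rewriter would need. The WHERE clause on CREATE is the proposed syntax, not existing Flint syntax, and the table and column names are made up:

```sql
-- Proposed (hypothetical) partial skipping index: only files containing rows
-- with status != 200 are reflected in the index metadata.
CREATE SKIPPING INDEX ON alb_logs ( status VALUE_SET )
WHERE status != 200
WITH ( auto_refresh = true );

-- Rewrite is safe: 'status = 404' implies 'status != 200', so every file the
-- query needs is covered by the index.
SELECT * FROM alb_logs WHERE status = 404;

-- Rewrite is unsafe: rows with status = 200 were never indexed, so this query
-- must fall back to a full table scan.
SELECT * FROM alb_logs WHERE status = 200;
```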

penghuo (Collaborator) commented Feb 20, 2024

One limitation of #124 is that the filter does not work if the table has no partition column.

Two use cases I can think of for adding a filter when creating a skipping index:

  1. Selective indexing for large datasets: Users dealing with extensive datasets may wish to index only a subset of the data. For example, a user with three years of data might opt to index only the most recent year.
  2. Backlog processing: When users create a skipping index over an entire large dataset, they often prefer the most recent data to be indexed immediately while historical data is processed incrementally. This ensures timely access to the latest information while gradually incorporating historical data into the index.

Proposal 1

  1. The user could create a table over a subset of the data with the modifiedAfter option, then create a skipping index on that table.
  2. Alternatively, the user could:
    • for indexing new files, create the skipping index with auto_refresh=true and modifiedAfter, so auto refresh only picks up new files;
    • for processing backlog files, call REFRESH SKIPPING INDEX on the table with modifiedBefore.

However, Spark Structured Streaming does not support modifiedAfter / modifiedBefore: https://issues.apache.org/jira/browse/SPARK-31962
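Because the streaming source lacks these options, only the on-demand (batch) refresh path could honor them today. A rough sketch of the statements this proposal implies; the modifiedAfter / modifiedBefore index options and the REFRESH ... WITH form are hypothetical, modeled on Spark's batch file-source read options:

```sql
-- Hypothetical: auto refresh covers only files newer than the cutoff.
-- (Blocked today: Structured Streaming does not support modifiedAfter.)
CREATE SKIPPING INDEX ON alb_logs ( status VALUE_SET )
WITH ( auto_refresh = true, modifiedAfter = '2024-01-01T00:00:00' );

-- Hypothetical: backfill the backlog with an on-demand batch refresh.
REFRESH SKIPPING INDEX ON alb_logs
WITH ( modifiedBefore = '2024-01-01T00:00:00' );
```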

penghuo (Collaborator) commented Feb 21, 2024

Proposal 2 - leverage file metadata (Preferred)

  • For indexing new files:

```sql
CREATE SKIPPING INDEX on alb_logs
WITH ( auto_refresh = true, notification = SNS://? )
```

  • For processing backlog files, for instance when the indexed data on S3 has the prefix 2023/12:

```sql
REFRESH SKIPPING INDEX on alb_logs
WHERE _metadata.file_path LIKE '%2023/%'
```

More reading: https://docs.databricks.com/en/ingestion/file-metadata-column.html
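For reference, _metadata is a hidden struct column on file-based tables that must be selected explicitly; it exposes per-file fields such as file_path, file_name, file_size, and file_modification_time. A small usage sketch (the alb_logs column name is illustrative):

```sql
SELECT
  _metadata.file_path,
  _metadata.file_modification_time,
  elb_status_code                      -- illustrative table column
FROM alb_logs
WHERE _metadata.file_path LIKE '%2023/12/%';
```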

penghuo (Collaborator) commented Feb 21, 2024

CREATE SKIPPING INDEX on table_name

  • refresh_on_create: whether to automatically refresh the skipping index after it is created.
    • TRUE (default); only takes effect when auto_refresh is TRUE.
    • FALSE
  • auto_refresh: whether automatic refresh can be scheduled on the index.
    • TRUE
    • FALSE (default)
  • notification: the notification mechanism.
    • NONE (default)
    • AWS SQS
  • interval: the refresh interval.
    • NONE: schedule immediately after the previous batch (not recommended for the external scheduler case).
    • Cron-based interval
  • incremental_refresh: incrementally refresh the index if set to TRUE; otherwise, fully refresh the entire index. Only applicable when auto_refresh = false.
    • TRUE
    • FALSE (default)
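Putting the options together, a hypothetical fully specified statement under this proposal (the queue URI and cron expression are made up for illustration):

```sql
CREATE SKIPPING INDEX ON alb_logs ( elb_status_code VALUE_SET )
WITH (
  refresh_on_create = true,
  auto_refresh = true,
  notification = 'sqs://flint-file-events',  -- hypothetical queue URI
  interval = '*/15 * * * *'                  -- hypothetical cron expression
);
```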

REFRESH SKIPPING INDEX on table_name [ WHERE <metadata predicate | partition predicate> ]

On-demand refresh of the skipping index:

  • By default, read files from the source table.
  • If notification is configured, read new data from the notification queue.
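For example, the two predicate styles might look as follows (the year/month partition columns are assumed for illustration):

```sql
-- Metadata predicate: limit the refresh to files under a path prefix.
REFRESH SKIPPING INDEX ON alb_logs
WHERE _metadata.file_path LIKE '%2023/12/%';

-- Partition predicate: limit the refresh to selected partitions.
REFRESH SKIPPING INDEX ON alb_logs
WHERE year = 2023 AND month = 12;
```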

ALTER SKIPPING INDEX on table_name SET auto_refresh = true/false

Enables or disables auto refresh of the skipping index on the table. Note: setting auto_refresh = false does not stop a currently running refresh job.

Limitation

  • If there are over 1M objects on S3, the user should configure notification and use REFRESH SKIPPING INDEX on table_name to backfill the old data. For instance, the user would execute:
    • CREATE SKIPPING INDEX on table_name WITH (auto_refresh = true, notification = sqs)
    • REFRESH SKIPPING INDEX on table_name WHERE _metadata.file_path LIKE '%2023%'

penghuo (Collaborator) commented Feb 26, 2024

Found a bug with the _metadata column in Spark 3.3.1/3.3.2; it is fixed in Spark 3.4 (apache/spark#39870). The following query:

```sql
SELECT _metadata, *
FROM alb_logs
WHERE _metadata.file_path LIKE '%2023/11/09%'
LIMIT 1;
```

fails with:

```
org.apache.spark.sql.AnalysisException: Column '_metadata' does not exist.
```
