[FEATURE] Support hybrid scan for Flint covering index #386

dai-chen · 2024-06-18T20:54:05Z

Is your feature request related to a problem?

The Limitation section in this comment describes how stale index data can lead to incorrect query results. Flint's skipping index addresses this with a hybrid scan mode, but the same approach is not feasible for a covering index on raw datasets because they lack versioning.

What solution would you like?

For data lake tables such as Delta, Iceberg, and Hudi, hybrid scans can be implemented using the version information in table metadata. For example, consider the following query on an Iceberg table http_logs whose primary key is the timestamp column:

SELECT timestamp, request, status
FROM glue.iceberg.http_logs;

This query can be rewritten to a hybrid scan leveraging Iceberg's time travel capabilities:

1. During query rewriting, retrieve the Iceberg snapshot ID X from the committed log in the checkpoint. Ref: [FEATURE] Enhance show Flint index statement to include refresh status #385
2. Rewrite the query to a hybrid scan as shown below:

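// Rows already covered by the index, i.e. data up to snapshot X
val indexed = spark.read
  .format("flint")
  .load("flint_glue_iceberg_http_logs_index")
  .select("timestamp", "request", "status")

// Rows appended after snapshot X ("start-snapshot-id" is exclusive)
val appended = spark.read
  .format("iceberg")
  .option("start-snapshot-id", "X")
  .load("glue.iceberg.http_logs")
  .select("timestamp", "request", "status")

indexed.union(appended)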
What alternatives have you considered?

Alternatively, users can disable the covering index optimization by setting spark.flint.optimizer.covering.enabled to false. Queries then always read the source table and return the most current data, at the cost of losing the performance benefit of the index.

Do you have any additional context?

N/A

dai-chen added enhancement New feature or request untriaged labels Jun 18, 2024

dai-chen mentioned this issue Jun 18, 2024

[EPIC] Zero-ETL - Apache Iceberg Table Support #372

Open

dai-chen removed the untriaged label Jun 18, 2024
