
[FEATURE] Avoid expensive S3 listing by using file list in skipping index #218

Open

dai-chen (Collaborator) opened this issue Jan 10, 2024 · 0 comments

Labels: DataSource:File, enhancement (New feature or request)

Is your feature request related to a problem?

Currently, any batch or streaming job on a Spark table incurs an expensive S3 listing to generate the input file list. A Hive table keeps its own partition information in the catalog, but that information must be refreshed with MSCK manually and on a regular basis.
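As a concrete illustration of the manual refresh mentioned above (assuming a partitioned Hive table named `my_table`; the table name is hypothetical), the standard Spark SQL / Hive command is:

```sql
-- Re-scan the table's storage location and register any newly
-- discovered partitions in the metastore; this must be re-run
-- whenever new data files land on S3.
MSCK REPAIR TABLE my_table;
```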

What solution would you like?

In fact, only the S3 listing performed while building the skipping index is unavoidable. A direct query or a streaming refresh of a covering index or materialized view does not need to repeat it (unless strong consistency with the latest source files is required).

In that case, the set of source files seen so far can be read from the file path column of the Flint skipping index. The remaining challenge is to figure out whether we can reconstruct FileStatus for Spark from those paths.
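A minimal sketch of the idea, in Python for illustration only (the real implementation would rebuild Hadoop `org.apache.hadoop.fs.FileStatus` objects in Scala). The row schema, the `file_path` column name aside, and the presence of size/timestamp metadata in the index are all assumptions here:

```python
from dataclasses import dataclass

# Hypothetical stand-in for Hadoop's FileStatus: the minimal fields
# Spark needs to plan a file scan without re-listing the S3 prefix.
@dataclass(frozen=True)
class FileStatus:
    path: str
    length: int              # file size in bytes
    modification_time: int   # epoch milliseconds

def files_from_skipping_index(rows):
    """Rebuild FileStatus records from rows of the skipping index.

    `rows` is assumed to be the result of querying the Flint skipping
    index, where each row carries the source file path plus (assumed)
    size and timestamp metadata, so no S3 LIST call is required.
    """
    return [
        FileStatus(r["file_path"], r["length"], r["modification_time"])
        for r in rows
    ]

# Example rows as they might come back from the index query (assumed schema).
rows = [
    {"file_path": "s3://bucket/tbl/part-0.parquet",
     "length": 1024, "modification_time": 1704844800000},
    {"file_path": "s3://bucket/tbl/part-1.parquet",
     "length": 2048, "modification_time": 1704931200000},
]

statuses = files_from_skipping_index(rows)
print([s.path for s in statuses])
```

If the index stores only paths and not size/timestamp, reconstruction would instead need per-file HEAD requests, which is still far cheaper than a full prefix listing.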
