
[RFC] Proposal: introduce new Flint Table concept #217

Open
dai-chen opened this issue Jan 10, 2024 · 1 comment
Labels
feature New feature

Comments


dai-chen commented Jan 10, 2024

Introduction

Currently, the fundamental component accessible to users is the Flint Index. In essence, a Flint index represents a derived dataset built from a source table, together with a metadata log used for index state management. Physically, the data associated with each Flint index resides in an individual OpenSearch index, e.g. flint_{table_name}_skipping_index, while its metadata log is stored as a single document in the shared .query_request_index_{datasource} index.
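As an illustration of the naming scheme described above, the index names can be derived mechanically from the table and datasource names. This is a hypothetical sketch; the exact sanitization of qualified table names is an assumption, not something this RFC specifies:

```python
# Hypothetical helpers reflecting the naming scheme described above;
# the dot-to-underscore sanitization is an assumption.

def skipping_index_name(qualified_table: str) -> str:
    """OpenSearch index that stores the skipping index data for a table."""
    return f"flint_{qualified_table.replace('.', '_')}_skipping_index"

def metadata_log_index(datasource: str) -> str:
    """Shared OpenSearch index holding one metadata-log document per Flint index."""
    return f".query_request_index_{datasource}"

print(skipping_index_name("myglue.default.alb_logs"))
print(metadata_log_index("myglue"))
```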

The current user workflow involves initially creating a Spark table or utilizing existing ones. Subsequently, a Flint index is created on this table. Taking the example of a skipping index, the process is illustrated below:

CREATE TABLE myglue.default.alb_logs
(
    type string,
    time string,
    elb string,
    client_ip string,
    ...
)
USING JSON
OPTIONS (
  path 's3://alb_logs/'
);

CREATE SKIPPING INDEX ON myglue.default.alb_logs
(
  year PARTITION,
  month PARTITION,
  day PARTITION,
  request_processing_time MIN_MAX,
  client_ip VALUE_SET
)
WITH ( auto_refresh = true );
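The skipping strategies named above (MIN_MAX, VALUE_SET) can be pictured as simple file-pruning rules over per-file metadata: a source file is skipped when its metadata proves a predicate cannot match. The following is an illustrative toy model, not Flint's actual implementation; the metadata shape and file paths are made up:

```python
# Toy per-file skipping metadata: MIN_MAX for request_processing_time,
# VALUE_SET for client_ip (shapes are illustrative assumptions).
files = [
    {"path": "s3://alb_logs/f1.json",
     "request_processing_time": {"min": 0.1, "max": 0.9},
     "client_ip": {"values": {"10.0.0.1", "10.0.0.2"}}},
    {"path": "s3://alb_logs/f2.json",
     "request_processing_time": {"min": 5.0, "max": 9.0},
     "client_ip": {"values": {"10.0.0.3"}}},
]

def files_for_query(files, max_time=None, ip=None):
    """Return only the files whose metadata may satisfy the predicates."""
    kept = []
    for f in files:
        if max_time is not None and f["request_processing_time"]["min"] > max_time:
            continue  # MIN_MAX: the whole file lies above the requested range
        if ip is not None and ip not in f["client_ip"]["values"]:
            continue  # VALUE_SET: this IP never appears in the file
        kept.append(f["path"])
    return kept

print(files_for_query(files, max_time=1.0))
```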

Flint indexes let users apply various optimization techniques, such as skipping indexes, covering indexes, and materialized views, all aimed at speeding up data retrieval from the source table.

Problem Statement

The existing Flint index and user workflow exhibit the following challenges:

Proposed Solution

In this document, we propose the Flint Table, a higher-level abstraction than the Flint Index. Beyond the table schema and data found in other Spark tables, a Flint table incorporates a new Table Metadata concept, enhanced by leveraging the Flint skipping index.

TBD: 1) register metadata as SQL table; 2) covering index/MV under same Flint table concept or not

[Image: architecture diagram (screenshot, 2024-01-10)]

SQL Interface [TBD]

CREATE TABLE flint_table_name
( column_list )
USING Flint
OPTIONS (
  flint_table_options
);

Examples

Now users can simply create a Flint table and immediately query it:

CREATE TABLE myglue.default.alb_logs
(
    type string,
    time string,
    elb string,
    client_ip string,
    ...
)
USING Flint
OPTIONS (
  path='s3://alb_logs/',
  projection='s3://alb_logs/year=${year}/month=${month}/day=${day}' -- partition projection
)
PARTITIONED BY (year, month, day);
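The `projection` option above suggests that partition paths can be computed from a template rather than discovered by listing S3. A minimal sketch of that expansion, assuming plain string substitution of partition values into the template (the option semantics are still TBD in this proposal):

```python
# Sketch: expand the hypothetical `projection` template into concrete
# partition paths without any S3 listing.
from string import Template
from itertools import product

template = Template("s3://alb_logs/year=${year}/month=${month}/day=${day}")

def project_paths(years, months, days):
    """Enumerate partition paths for the cross product of partition values."""
    return [template.substitute(year=y, month=m, day=d)
            for y, m, d in product(years, months, days)]

paths = project_paths(["2023"], ["12"], ["01", "02"])
print(paths)
```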

For semi-structured data, users can effortlessly create a Flint table, specifying only the columns they intend to use in their subsequent queries:

CREATE TABLE app_logs
(
  @timestamp datetime,
  @message string -- field capturing the entire log line
)
...

SELECT ... FROM alb_logs WHERE status = 200 -- status implicitly parsed from @message
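One way the implicit @message behavior could work is to parse the referenced field out of the raw log line at read time. A hypothetical sketch; the `status=` log format, field name, and regex are all illustrative assumptions, not part of the proposal:

```python
# Sketch: resolve a column referenced in a query (here `status`) by
# extracting it from the raw @message line on the fly.
import re

def extract_status(message: str):
    """Pull a numeric status code out of a raw log line; None if absent."""
    m = re.search(r"\bstatus=(\d+)\b", message)
    return int(m.group(1)) if m else None

line = "GET /index.html status=200 client=10.0.0.1"
print(extract_status(line))
```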

For now, assume there are no changes on the covering index and materialized view side:

CREATE INDEX elb_status_and_clientip_idx
ON myglue.default.alb_logs
( elb_status_code, client_ip )
WHERE year = 2023 AND elb_status_code = 400
WITH (auto_refresh = true);

CREATE MATERIALIZED VIEW alb_logs_mv
AS
  SELECT window.start_time, COUNT(*)
  FROM myglue.default.alb_logs
  WHERE year = 2023 AND elb_status_code = 400
  GROUP BY TUMBLE(timestamp, 'interval')
WITH (auto_refresh = true);

While the immediate enhancement is a simplified process that eliminates the step of explicitly creating a Flint index, the actual impact extends far beyond that. As discussed in the Functionalities section below, the semantics of the proposed Flint Table introduce a range of features that not only streamline the workflow but also address several of the issues highlighted in the Problem Statement section.

Functionalities

The interface of the Flint table above provides the following functionalities and semantics:

  1. Automated Metadata Maintenance: With minimal user input, a Flint table maintains essential information in its metadata, including source file lists, partition information, and column statistics. [Solves usability issue]
  2. Semi-Structured Data Handling: Users can provide minimal schema information for semi-structured data, addressing usability concerns related to schema definition. [Solves usability issue]
  3. Validation and Compatibility: A Flint table performs comprehensive validation and ensures the provided table options can function without issues. [Solves compatibility issue]
  4. Efficient Metadata Refresh: A Flint table uses efficient approaches to refresh its metadata, such as efficient S3 listing or SNS push mode. [Solves performance issue]
  5. Automatic Acceleration: Any direct query or covering index/MV refresh involving a Flint table benefits from automatic acceleration by leveraging metadata. This includes file listing, partition pruning, and data skipping. [Solves performance issue]
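The acceleration idea in item 5 can be illustrated with a toy model: once the table metadata holds the source file list keyed by partition path, a partition-pruned file listing becomes a pure metadata lookup with no S3 call. The metadata shape below is an assumption for illustration only:

```python
# Toy Flint table metadata: file lists keyed by partition path
# (shape is an illustrative assumption).
metadata = {
    "files": {
        "year=2023/month=12/day=01": ["s3://alb_logs/year=2023/month=12/day=01/a.json"],
        "year=2023/month=12/day=02": ["s3://alb_logs/year=2023/month=12/day=02/b.json"],
    }
}

def prune(metadata, **partition_filters):
    """Return files whose partition values match every filter, without touching S3."""
    out = []
    for part, files in metadata["files"].items():
        kv = dict(p.split("=") for p in part.split("/"))
        if all(kv.get(k) == v for k, v in partition_filters.items()):
            out.extend(files)
    return out

print(prune(metadata, day="01"))
```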

Future Scope

TODO

@dai-chen dai-chen added the feature New feature label Jan 10, 2024
@dai-chen dai-chen changed the title [RFC] A Proposal for introducing new Flint Table concept [RFC] Proposal: introducing new Flint Table concept Jan 10, 2024
@dai-chen dai-chen changed the title [RFC] Proposal: introducing new Flint Table concept [RFC] Proposal: introduce new Flint Table concept Jan 11, 2024

penghuo commented Jan 24, 2024

Related to the semi-structured datatype: Spark 4.0 feature "Support Variant data type", https://issues.apache.org/jira/browse/SPARK-45891
