
[RFC] Proposal: introduce new Flint Table concept #217

Open
dai-chen opened this issue Jan 10, 2024 · 1 comment
Labels
feature New feature

Comments


dai-chen commented Jan 10, 2024

Introduction

Currently, the fundamental component accessible to users is the Flint Index. In essence, a Flint index represents a derived dataset built from a source table, together with a metadata log used for index state management. Physically, the data associated with each Flint index resides in an individual OpenSearch index, e.g. flint_{table_name}_skipping_index, while its metadata log is stored as a single document in the shared .query_request_index_{datasource} index.
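As an illustration of the naming scheme described above, the index names can be derived mechanically from the table and datasource names. This is a hypothetical sketch; the exact sanitization of qualified table names is an assumption, not something this RFC specifies:

```python
# Hypothetical helpers reflecting the naming scheme described above;
# the dot-to-underscore sanitization is an assumption.

def skipping_index_name(qualified_table: str) -> str:
    """OpenSearch index that stores the skipping index data for a table."""
    return f"flint_{qualified_table.replace('.', '_')}_skipping_index"

def metadata_log_index(datasource: str) -> str:
    """Shared OpenSearch index holding one metadata-log document per Flint index."""
    return f".query_request_index_{datasource}"

print(skipping_index_name("myglue.default.alb_logs"))
print(metadata_log_index("myglue"))
```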

The current user workflow involves initially creating a Spark table or utilizing existing ones. Subsequently, a Flint index is created on this table. Taking the example of a skipping index, the process is illustrated below:

CREATE TABLE myglue.default.alb_logs
(
    type string,
    time string,
    elb string,
    client_ip string,
    ...
)
USING JSON
OPTIONS (
  path 's3://alb_logs/'
);

CREATE SKIPPING INDEX ON myglue.default.alb_logs
(
  year PARTITION,
  month PARTITION,
  day PARTITION,
  request_processing_time MIN_MAX,
  client_ip VALUE_SET
)
WITH ( auto_refresh = true );
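The skipping strategies named above (MIN_MAX, VALUE_SET) can be pictured as simple file-pruning rules over per-file metadata: a source file is skipped when its metadata proves a predicate cannot match. The following is an illustrative toy model, not Flint's actual implementation; the metadata shape and file paths are made up:

```python
# Toy per-file skipping metadata: MIN_MAX for request_processing_time,
# VALUE_SET for client_ip (shapes are illustrative assumptions).
files = [
    {"path": "s3://alb_logs/f1.json",
     "request_processing_time": {"min": 0.1, "max": 0.9},
     "client_ip": {"values": {"10.0.0.1", "10.0.0.2"}}},
    {"path": "s3://alb_logs/f2.json",
     "request_processing_time": {"min": 5.0, "max": 9.0},
     "client_ip": {"values": {"10.0.0.3"}}},
]

def files_for_query(files, max_time=None, ip=None):
    """Return only the files whose metadata may satisfy the predicates."""
    kept = []
    for f in files:
        if max_time is not None and f["request_processing_time"]["min"] > max_time:
            continue  # MIN_MAX: the whole file lies above the requested range
        if ip is not None and ip not in f["client_ip"]["values"]:
            continue  # VALUE_SET: this IP never appears in the file
        kept.append(f["path"])
    return kept

print(files_for_query(files, max_time=1.0))
```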

Flint indexes let users apply various optimization techniques, such as skipping indexes, covering indexes, and materialized views, all aimed at speeding up data retrieval from the source table.

Problem Statement

The existing Flint index and user workflow exhibit the following challenges:

Proposed Solution

In this document, we propose the Flint Table, a higher-level abstraction than the Flint Index. Beyond the table schema and data found in other Spark tables, a Flint table incorporates a new Table Metadata concept, enhanced by leveraging the Flint skipping index.

TBD: 1) register metadata as SQL table; 2) covering index/MV under same Flint table concept or not

[Image: architecture diagram (screenshot, 2024-01-10)]

SQL Interface [TBD]

CREATE TABLE flint_table_name
( column_list )
USING Flint
OPTIONS (
  flint_table_options
);

Examples

Now users can simply create a Flint table and immediately query it:

CREATE TABLE myglue.default.alb_logs
(
    type string,
    time string,
    elb string,
    client_ip string,
    ...
)
USING Flint
OPTIONS (
  path='s3://alb_logs/',
  projection='s3://alb_logs/year=${year}/month=${month}/day=${day}' -- partition projection
)
PARTITIONED BY (year, month, day);
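The `projection` option above suggests that partition paths can be computed from a template rather than discovered by listing S3. A minimal sketch of that expansion, assuming plain string substitution of partition values into the template (the option semantics are still TBD in this proposal):

```python
# Sketch: expand the hypothetical `projection` template into concrete
# partition paths without any S3 listing.
from string import Template
from itertools import product

template = Template("s3://alb_logs/year=${year}/month=${month}/day=${day}")

def project_paths(years, months, days):
    """Enumerate partition paths for the cross product of partition values."""
    return [template.substitute(year=y, month=m, day=d)
            for y, m, d in product(years, months, days)]

paths = project_paths(["2023"], ["12"], ["01", "02"])
print(paths)
```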

For semi-structured data, users can effortlessly create a Flint table, specifying only the columns they intend to use in their subsequent queries:

CREATE TABLE app_logs
(
  @timestamp datetime,
  @message string -- field capturing the entire log line
)
...

SELECT ... FROM alb_logs WHERE status = 200 -- status implicitly parsed from @message
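One way the implicit @message behavior could work is to parse the referenced field out of the raw log line at read time. A hypothetical sketch; the `status=` log format, field name, and regex are all illustrative assumptions, not part of the proposal:

```python
# Sketch: resolve a column referenced in a query (here `status`) by
# extracting it from the raw @message line on the fly.
import re

def extract_status(message: str):
    """Pull a numeric status code out of a raw log line; None if absent."""
    m = re.search(r"\bstatus=(\d+)\b", message)
    return int(m.group(1)) if m else None

line = "GET /index.html status=200 client=10.0.0.1"
print(extract_status(line))
```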

For now, assume there are no changes on the covering index and materialized view side:

CREATE INDEX elb_status_and_clientip_idx
ON myglue.default.alb_logs
( elb_status_code, client_ip )
WHERE year = 2023 AND elb_status_code = 400
WITH (auto_refresh = true);

CREATE MATERIALIZED VIEW alb_logs_mv
AS
  SELECT window.start_time, COUNT(*)
  FROM myglue.default.alb_logs
  WHERE year = 2023 AND elb_status_code = 400
  GROUP BY TUMBLE(timestamp, 'interval')
WITH (auto_refresh = true);

While the immediate enhancement is a simplified process that eliminates the step of explicitly creating a Flint index, the actual impact extends far beyond that. As discussed in the Functionalities section below, the semantics of the proposed Flint Table introduce a range of features that not only streamline the workflow but also address several of the issues highlighted in the Problem Statement section.

Functionalities

The interface of the Flint table above provides the following functionalities and semantics:

  1. Automated Metadata Maintenance: With minimal user input, a Flint table maintains essential information in its metadata, including source file lists, partition information, and column statistics. [Solves usability issue]
  2. Semi-Structured Data Handling: Users can provide minimal schema information for semi-structured data, addressing usability concerns related to schema definition. [Solves usability issue]
  3. Validation and Compatibility: A Flint table performs comprehensive validation and ensures the provided table options can function without issues. [Solves compatibility issue]
  4. Efficient Metadata Refresh: A Flint table uses efficient approaches to refresh its metadata, such as efficient S3 listing or SNS push mode. [Solves performance issue]
  5. Automatic Acceleration: Any direct query or covering index/MV refresh involving a Flint table benefits from automatic acceleration by leveraging metadata. This includes file listing, partition pruning, and data skipping. [Solves performance issue]
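The acceleration idea in item 5 can be illustrated with a toy model: once the table metadata holds the source file list keyed by partition path, a partition-pruned file listing becomes a pure metadata lookup with no S3 call. The metadata shape below is an assumption for illustration only:

```python
# Toy Flint table metadata: file lists keyed by partition path
# (shape is an illustrative assumption).
metadata = {
    "files": {
        "year=2023/month=12/day=01": ["s3://alb_logs/year=2023/month=12/day=01/a.json"],
        "year=2023/month=12/day=02": ["s3://alb_logs/year=2023/month=12/day=02/b.json"],
    }
}

def prune(metadata, **partition_filters):
    """Return files whose partition values match every filter, without touching S3."""
    out = []
    for part, files in metadata["files"].items():
        kv = dict(p.split("=") for p in part.split("/"))
        if all(kv.get(k) == v for k, v in partition_filters.items()):
            out.extend(files)
    return out

print(prune(metadata, day="01"))
```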

Future Scope

TODO

@dai-chen dai-chen added the feature New feature label Jan 10, 2024
@dai-chen dai-chen changed the title [RFC] A Proposal for introducing new Flint Table concept [RFC] Proposal: introducing new Flint Table concept Jan 10, 2024
@dai-chen dai-chen changed the title [RFC] Proposal: introducing new Flint Table concept [RFC] Proposal: introduce new Flint Table concept Jan 11, 2024

penghuo commented Jan 24, 2024

Related to the semi-structured datatype: Spark 4.0 feature "Support Variant data type", https://issues.apache.org/jira/browse/SPARK-45891
