Introduction
Currently the fundamental component accessible to users is the Flint Index. In essence, a Flint index is a derived dataset built from a source table, along with a metadata log used for index state management. Physically, the data of each Flint index resides in an individual OpenSearch index, e.g. flint_{table_name}_skipping_index, while its metadata log is stored as a single document in the shared .query_request_index_{datasource} index.
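The physical layout described above can be sketched as follows. This is an illustrative sketch only: the helper functions are hypothetical (not part of the Flint API), and the dot-to-underscore flattening of qualified table names is an assumption.

```python
# Hypothetical helpers illustrating the physical naming convention above;
# not part of the Flint API.
def flint_skipping_index_name(table_name: str) -> str:
    # Each Flint skipping index lives in its own OpenSearch index.
    # Assumption: dots in the qualified table name are flattened to underscores.
    return f"flint_{table_name.replace('.', '_')}_skipping_index"

def metadata_log_index_name(datasource: str) -> str:
    # Metadata logs of all Flint indexes in a datasource share one index,
    # each index's log being a single document inside it.
    return f".query_request_index_{datasource}"

print(flint_skipping_index_name("myglue.default.alb_logs"))
# flint_myglue_default_alb_logs_skipping_index
print(metadata_log_index_name("myglue"))
# .query_request_index_myglue
```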
The current user workflow involves initially creating a Spark table or utilizing existing ones. Subsequently, a Flint index is created on this table. Taking the example of a skipping index, the process is illustrated below:
CREATE TABLE myglue.default.alb_logs
(
type string,
time string,
elb string,
client_ip string,
...,
)
USING JSON
OPTIONS (
path 's3://alb_logs/'
);
CREATE SKIPPING INDEX ON myglue.default.alb_logs
(
year PARTITION,
month PARTITION,
day PARTITION,
request_processing_time MIN_MAX,
client_ip VALUE_SET
)
WITH ( auto_refresh = true );
Flint indexes let users choose among optimization techniques such as skipping indexes, covering indexes, and materialized views, all aimed at speeding up data retrieval from the source table.
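To make the acceleration concrete, here is a minimal Python sketch of how the MIN_MAX and VALUE_SET metadata from the skipping index example could let a planner skip source files entirely. The per-file metadata shape and function names are hypothetical, not the Flint implementation.

```python
# Hypothetical per-file skipping metadata, gathered at index refresh time.
files = [
    {"path": "s3://alb_logs/f1.json",
     "request_processing_time": {"min": 0.1, "max": 0.5},   # MIN_MAX column
     "client_ip": {"values": {"10.0.0.1", "10.0.0.2"}}},    # VALUE_SET column
    {"path": "s3://alb_logs/f2.json",
     "request_processing_time": {"min": 2.0, "max": 9.0},
     "client_ip": {"values": {"10.0.0.3"}}},
]

def prune_by_min_max(files, column, value):
    """Keep only files whose [min, max] range can contain `value`."""
    return [f for f in files if f[column]["min"] <= value <= f[column]["max"]]

def prune_by_value_set(files, column, value):
    """Keep only files whose distinct-value set contains `value`."""
    return [f for f in files if value in f[column]["values"]]

# WHERE request_processing_time = 3.0 only needs to scan f2:
print([f["path"] for f in prune_by_min_max(files, "request_processing_time", 3.0)])
# ['s3://alb_logs/f2.json']
```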
Problem Statement
The existing Flint index and user workflow exhibit challenges in three areas: performance, compatibility, and usability. In particular:
Users often lack guidance in selecting the appropriate type of Flint index and skipping algorithm to create.
For semi-structured data, users may not know the schema until query time, so the current requirement to declare a table with all of its columns up front is inconvenient.
Proposed Solution
In this document, we propose the Flint Table, a higher-level abstraction than the Flint Index. In addition to the table schema and data found in other Spark tables, a Flint table incorporates a new Table Metadata concept, built on top of the Flint skipping index.
TBD: 1) register metadata as SQL table; 2) covering index/MV under same Flint table concept or not
SQL Interface [TBD]
Examples
Now a user can simply create a Flint table and immediately query it:
CREATE TABLE myglue.default.alb_logs
(
type string,
time string,
elb string,
client_ip string,
...,
)
USING Flint
OPTIONS (
path='s3://alb_logs/',
projection='s3://alb_logs/year=${year}/month=${month}/day=${day}' -- partition projection
)
PARTITIONED BY (year, month, day);
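The projection option suggests that candidate partition paths can be computed from the template rather than discovered by listing S3. A minimal Python sketch of that expansion, assuming daily partitions over a date range (the helper is illustrative, not Flint code):

```python
from datetime import date, timedelta
from string import Template

def project_paths(template: str, start: date, end: date):
    """Expand ${year}/${month}/${day} placeholders for each day in [start, end],
    instead of listing the bucket to discover partitions."""
    d = start
    while d <= end:
        yield Template(template).substitute(
            year=d.year, month=f"{d.month:02d}", day=f"{d.day:02d}")
        d += timedelta(days=1)

paths = list(project_paths(
    "s3://alb_logs/year=${year}/month=${month}/day=${day}",
    date(2023, 12, 30), date(2023, 12, 31)))
print(paths)
# ['s3://alb_logs/year=2023/month=12/day=30',
#  's3://alb_logs/year=2023/month=12/day=31']
```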
For semi-structured data, users can effortlessly create a Flint table, specifying only the columns they intend to use in their subsequent queries:
CREATE TABLE app_logs
(
@timestamp datetime,
@message string -- field capturing the entire raw log line
)
...
SELECT ... FROM alb_logs WHERE status = 200 -- status parsed implicitly from @message
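One way the implicit @message semantics could work is schema-on-read extraction: a column referenced in a query but absent from the declared schema is parsed out of @message at query time. The sketch below assumes key=value log lines and hypothetical extraction logic; it is not the actual implementation.

```python
import re

# Rows with only the declared minimal schema: @timestamp and @message.
rows = [
    {"@timestamp": "2023-12-01T00:00:00Z", "@message": "GET /index.html status=200"},
    {"@timestamp": "2023-12-01T00:00:01Z", "@message": "GET /missing.html status=404"},
]

def extract_field(message: str, field: str):
    """Schema-on-read: pull an undeclared field out of the raw log line.
    Assumes key=value formatting; real logs would need a proper parser."""
    m = re.search(rf"{re.escape(field)}=(\S+)", message)
    return m.group(1) if m else None

# SELECT ... WHERE status = 200  (status resolved against @message)
hits = [r for r in rows if extract_field(r["@message"], "status") == "200"]
print(len(hits))  # 1
```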
For now, assume there are no changes on the covering index and materialized view side:
CREATE INDEX elb_status_and_clientip_idx
ON myglue.default.alb_logs
( elb_status_code, client_ip )
WHERE year = 2023 AND elb_status_code = 400
WITH (auto_refresh = true);
CREATE MATERIALIZED VIEW alb_logs_mv
AS
SELECT window.start_time, COUNT(*)
FROM myglue.default.alb_logs
WHERE year = 2023 AND elb_status_code = 400
GROUP BY TUMBLE(timestamp, 'interval')
WITH (auto_refresh = true);
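The TUMBLE(timestamp, 'interval') grouping in the MV above buckets events into fixed, non-overlapping windows. A small Python sketch of that windowing semantics (epoch-aligned windows are an assumption; the function is illustrative, not Flint code):

```python
from collections import Counter
from datetime import datetime, timedelta

def tumble_start(ts: datetime, interval: timedelta) -> datetime:
    """Return the start of the fixed, non-overlapping window containing `ts`.
    Assumes windows are aligned to the Unix epoch."""
    epoch = datetime(1970, 1, 1)
    n = int((ts - epoch) / interval)  # index of the window containing ts
    return epoch + n * interval

# COUNT(*) grouped by 30-minute tumbling windows:
events = [datetime(2023, 1, 1, 0, 5), datetime(2023, 1, 1, 0, 20),
          datetime(2023, 1, 1, 0, 40)]
counts = Counter(tumble_start(t, timedelta(minutes=30)) for t in events)
print(counts[datetime(2023, 1, 1, 0, 0)])   # 2
print(counts[datetime(2023, 1, 1, 0, 30)])  # 1
```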
While the immediate improvement is a simplified workflow that eliminates the explicit Flint index creation step, the actual impact extends much further. As discussed in the Functionalities section below, the semantics of the proposed Flint Table introduce a range of features that not only streamline the workflow but also address the issues highlighted in the Problem Statement section.
Functionalities
The interface of the Flint table above provides the following functionalities and semantics:
Automated Metadata Maintenance: With minimal user input, a Flint table maintains essential information in its metadata, including source file lists, partition information, and column statistics. [Solves usability issue]
Semi-Structured Data Handling: Users can provide minimal schema information for semi-structured data, addressing usability concerns around schema definition. [Solves usability issue]
Validation and Compatibility: A Flint table performs comprehensive validation and ensures the provided table options can function without issues. [Solves compatibility issue]
Efficient Metadata Refresh: A Flint table refreshes its metadata efficiently, for example via optimized S3 listing or SNS push mode. [Solves performance issue]
Automatic Acceleration: Any direct query, or covering index/MV refresh, involving a Flint table is automatically accelerated using this metadata, including file listing, partition pruning, and data skipping. [Solves performance issue]
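The efficient-refresh idea can be sketched as applying push-style change notifications to a maintained file list, instead of re-listing the whole bucket on every refresh. The event shape below is illustrative only, not an actual SNS/S3 notification payload:

```python
# Push-mode metadata refresh sketch: apply incremental add/remove events to
# the table's source file list. Event shape is hypothetical, not a real
# S3-over-SNS payload.
def apply_events(file_list: set, events):
    for e in events:
        if e["type"] == "ObjectCreated":
            file_list.add(e["key"])
        elif e["type"] == "ObjectRemoved":
            file_list.discard(e["key"])
    return file_list

files = {"s3://alb_logs/day=01/part-0.json"}
files = apply_events(files, [
    {"type": "ObjectCreated", "key": "s3://alb_logs/day=02/part-0.json"},
    {"type": "ObjectRemoved", "key": "s3://alb_logs/day=01/part-0.json"},
])
print(sorted(files))  # ['s3://alb_logs/day=02/part-0.json']
```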
Future Scope
TODO