[EPIC] Zero-ETL - Apache Iceberg Table Support #372
Proof of Concept 1: Basic Integration of Flint Core Functionality with Iceberg Tables

Objectives

This PoC aims to evaluate the integration of Flint's core functionality, including skipping indexes, covering indexes, and materialized views, with Apache Iceberg tables. The key questions to answer:
Demo Query

This query finds the top IP address pairs (source -> target) whose connections were rejected in the past hour:

```sql
-- Identify the top IP address pairs with rejected connections in the last hour
SELECT
  src_endpoint.ip || '->' || dst_endpoint.ip AS ip_pair,
  action,
  COUNT(*) AS count
FROM vpc_flow_logs
WHERE action = 'REJECT'
  AND time_dt > (current_timestamp - interval '1' hour)
GROUP BY 1, 2
ORDER BY count DESC
LIMIT 25;
```

Indexing Support [2d]
Index Maintenance [2d]
Performance Benchmark [1d]
Proof of Concept 2: Advanced Integration of Flint with Iceberg Tables

Enabling Advanced Search Capability

This part of the PoC explores implementing advanced search capabilities in Flint, integrated with Iceberg tables. Taking full-text search as an example, below is a demonstration query:

```sql
-- Identify the number of HTTP status occurrences with requests containing 'Chrome' in the past hour
SELECT
  status,
  COUNT(*) AS count
FROM http_logs
WHERE MATCH(request, 'Chrome')
  AND timestamp > (current_timestamp - interval '1' hour)
GROUP BY status;
```

Changes required:
Is a skipping index helpful in this case? Depending on latency requirements, it is possible to build a skipping index such as:
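A hedged sketch of what such a skipping index could look like, using Flint's skipping-index DDL. The choice of skipping algorithms below is an assumption for illustration, not the original author's elided example; only the `timestamp MIN_MAX` entry directly serves the demo query's time-range filter.

```sql
-- Sketch: let the query planner skip data files that cannot contain matching rows
CREATE SKIPPING INDEX ON http_logs (
  status VALUE_SET,    -- small set of distinct HTTP status codes per data file
  timestamp MIN_MAX    -- prune data files entirely outside the queried time window
)
WITH (
  auto_refresh = true
);
```

Note that file skipping only prunes I/O for the time-range predicate; the `MATCH(request, 'Chrome')` full-text filter still requires the advanced search capability discussed above.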
Alternative to the current enhanced covering index solution? Covering indexes provide full search and dashboard capabilities, but they require indexing all involved columns. An alternative is a non-covering (filtering) index, which indexes only the columns used in filters, with each index entry pointing back to the row ID in the source table. To implement this, we need to answer the following questions:
Accelerating Iceberg Metadata Queries

The following SQL examples illustrate how Flint can be leveraged to accelerate queries against Iceberg's metadata, which is essential for schema management and data governance:

-- Example query to fetch historical metadata from an Iceberg table
TODO
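The concrete example above is still marked TODO. As an illustrative sketch only, Iceberg exposes metadata tables (e.g. `history`, `snapshots`) that can be queried directly through Spark; the table name is reused from the earlier examples:

```sql
-- Sketch: fetch snapshot history metadata from an Iceberg table
SELECT made_current_at, snapshot_id, parent_id, is_current_ancestor
FROM glue.iceberg.vpc_flow_logs.history;
```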
Proof of Concept Conclusion [WIP]

Conclusion
High-Level Design and Task Breakdown
User Experience

Here we outline the end-to-end user experience, from initial data exploration through advanced query optimization and table management.

```sql
-- Step 1: Data exploration
SELECT src_endpoint, dst_endpoint, action
FROM glue.iceberg.vpc_flow_logs -- limit size or sampling
LIMIT 10;

-- Step 2: Zero-ETL by Flint index
CREATE INDEX src_dst_action
ON glue.iceberg.vpc_flow_logs (
  src_endpoint, dst_endpoint, action
)
WHERE timestamp > (current_timestamp - interval '1' hour) -- partial indexing
WITH (
  auto_refresh = true
);

-- Step 3a: Dashboard / DSL query Flint index directly
-- POST flint_glue_iceberg_vpc_flow_logs_src_dst_action_index { ... }

-- Step 3b: SparkSQL query acceleration
-- Identify the top IP address pairs with rejected connections in the last hour
SELECT
  src_endpoint.ip || '->' || dst_endpoint.ip AS ip_pair,
  action,
  COUNT(*) AS count
FROM glue.iceberg.vpc_flow_logs
WHERE action = 'REJECT'
  AND time_dt > (current_timestamp - interval '1' hour)
GROUP BY 1, 2
ORDER BY count DESC
LIMIT 25;

-- Step 4: Iceberg table management
-- Data compaction on a regular basis, triggered manually or by Glue
CALL local.system.rewrite_data_files(
  table => 'glue.iceberg.vpc_flow_logs',
  options => map('rewrite-all', true)
);

-- Step 5: Clean up
-- User deletes unused covering index after analytics
DELETE INDEX src_dst_action ON glue.iceberg.vpc_flow_logs;
VACUUM INDEX src_dst_action ON glue.iceberg.vpc_flow_logs;
```

Architecture

Here is the architecture diagram that provides a comprehensive overview of the high-level design and key components:

Task Breakdown

Here is the high-level task breakdown, with a description of each task and its components. Please find more detailed task descriptions in the following sections:
Data Exploration

Users can execute common DDL statements and direct SQL queries on Iceberg tables for ad-hoc data analytics. Flint must support the Iceberg catalog and fully accommodate Iceberg data types, ensuring seamless integration and comprehensive data analysis capabilities.

Iceberg Catalog

Configure Spark job parameters to activate the Iceberg catalog.

Iceberg Data Types

Flint must fully support all Iceberg data types, including complex structures like Struct, List, and Map, to ensure comprehensive data handling capabilities. Ref: https://iceberg.apache.org/docs/latest/spark-getting-started/#type-compatibility
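A minimal sketch of the Spark job parameters involved, based on Iceberg's documented Spark configuration. The catalog name `glue` (matching the earlier `glue.iceberg.*` examples) and the warehouse path are assumptions for illustration:

```properties
# Enable Iceberg's SQL extensions in the Spark session
spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions
# Register a catalog named "glue" backed by the AWS Glue Data Catalog
spark.sql.catalog.glue=org.apache.iceberg.spark.SparkCatalog
spark.sql.catalog.glue.catalog-impl=org.apache.iceberg.aws.glue.GlueCatalog
spark.sql.catalog.glue.io-impl=org.apache.iceberg.aws.s3.S3FileIO
spark.sql.catalog.glue.warehouse=s3://my-bucket/warehouse
```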
Zero-ETL Support

Users can load raw or aggregated data directly into OpenSearch via covering indexes and materialized views, enabling full-text search and dashboard capabilities without a separate Extract, Transform, Load (ETL) process.

Covering Index Enhancement

Address limitations and improve the performance of covering indexes in the issues below:
Materialized View Enhancement

Address limitations of materialized views:

SparkSQL Query Acceleration

Users continue to use the familiar SparkSQL interface and leverage OpenSearch's indexing capabilities to accelerate SparkSQL queries.

Data Skipping
Query Optimization

Support query rewriting for covering indexes (full or partial) and materialized views:
Index Maintenance

Index Data Freshness

Provide tooling for users to inspect index data freshness and ensure up-to-date query results:
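As one possible illustration, such an inspection could build on Flint's existing index-management statements; the exact statement shapes and output columns below are assumptions, not confirmed by this issue:

```sql
-- Hypothetical sketch: list Flint indexes in a database to check their refresh state
SHOW FLINT INDEX IN glue.iceberg;
-- Inspect a specific covering index (name taken from the user-experience example)
DESC INDEX src_dst_action ON glue.iceberg.vpc_flow_logs;
```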
Index Management
Testing

Functional Testing

Functional testing ensures Iceberg support works with all existing components and features, and that newly added features perform correctly and meet the specified requirements.

Performance Benchmark

Benchmark performance for data exploration queries, zero-ETL ingestion, and SparkSQL query acceleration:
Issues related:
Is your feature request related to a problem?
Apache Iceberg is designed for managing large analytic tables in a scalable, performant way, using features like schema evolution, partitioning, and metadata management to optimize query performance. Despite these optimizations, the inherent latency of querying large datasets directly from S3 can be a pain point when running complex or frequently accessed queries on large Iceberg tables, especially in real-time analytics and interactive querying scenarios.
TODO: the current problem statement is rather technical; more feedback from real Iceberg customers is needed.
What solution would you like?
Integrate Flint's current query acceleration features with Iceberg to enhance performance:
TODO: evaluate missing features in #367
What alternatives have you considered?
N/A
Do you have any additional context?
Known issues related: