[RFC] High Level Vision of OpenSearch SQL #2674

penghuo · 2024-05-15T15:16:29Z

1.1. Current State

OpenSearch SQL currently faces several challenges. There is no comprehensive OpenSearch SQL specification, which leads to discrepancies with the widely adopted Spark SQL. This lack of standardization hampers integration and interoperability between SQL dialects and data processing engines. There is also no unified interface, which complicates application development and maintenance: plugins/sql serves OpenSearch SQL requests on indexes, while async_sql serves Spark SQL.
Key modern data analytics features such as search, vector_search, and geo_search are missing, which limits OpenSearch SQL's ability to process text, vector, and spatial data. The lack of a high-throughput interface for database integration (e.g., with Spark, Presto) can severely affect query performance and scalability, especially in data-intensive environments.
Furthermore, user-created derived datasets, such as skipping indexes and covering indexes, are not consolidated within the same catalog, leading to inconsistencies and potential data silos. Raw data sources also vary significantly in format, which can degrade query performance and make it unpredictable. Optimizations specific to data formats or structures are essential to address these issues.

1.2. Future Vision

1.2.1. OpenSearch SQL

Looking ahead, OpenSearch SQL should be a first-class query language for OpenSearch. It will feature a unified _sql REST interface. The current ANSI SQL grammar will be extended with new capabilities such as search, vector_search, and geo_search, allowing users to fully leverage OpenSearch's search functionality within a familiar SQL syntax.
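As a rough illustration of what a single entry point could look like from a client, the sketch below builds the method, path, and JSON body for a SQL request without sending anything. The `{"query": ...}` body shape follows the existing SQL plugin, and the `_sql` path is taken from the text above. The `vector_search(...)` signature, the `products` index, and its fields are hypothetical, since this RFC names the function but does not define its syntax.

```python
import json

# Endpoint name taken from the RFC text; the final path is not specified there.
SQL_ENDPOINT = "_sql"


def build_sql_request(query: str) -> dict:
    """Return the method, path, and JSON body for a SQL request (nothing is sent)."""
    if not query.strip():
        raise ValueError("query must not be empty")
    return {
        "method": "POST",
        "path": f"/{SQL_ENDPOINT}",
        "body": json.dumps({"query": query}),
    }


# Hypothetical syntax: vector_search() is named in the RFC, its arguments are assumed.
example = build_sql_request(
    "SELECT title FROM products "
    "WHERE vector_search(embedding, [0.1, 0.2, 0.3], 5)"
)
print(example["path"])
print(example["body"])
```

Keeping request construction separate from transport means the same body could be reused against whichever endpoint the final design settles on.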

1.2.2. OpenSearch SQL on DataLake

OpenSearch aims to offer a unified and robust OpenSearch SQL language that will enable seamless querying and searching across data stored in OpenSearch and external data sources. The SQL client library will be upgraded to use more efficient data formats (Protobuf, Apache Arrow), providing a high-performance, columnar, and language-agnostic standard for database integrations. This ensures rapid data processing and broad interoperability across various systems and programming languages.
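To make the motivation for a columnar format concrete, here is a minimal pure-Python sketch of pivoting row-oriented results into a column-oriented layout. It is not Apache Arrow and uses no Arrow API; the function name and sample fields are assumptions chosen only to show why scanning a single column is cheaper when values are stored together.

```python
from typing import Any


def rows_to_columns(rows: list[dict[str, Any]]) -> dict[str, list[Any]]:
    """Pivot row-oriented records into one list per column."""
    columns: dict[str, list[Any]] = {}
    # First pass: collect every column name in first-seen order.
    for row in rows:
        for name in row:
            columns.setdefault(name, [])
    # Second pass: append one value per row to every column.
    for row in rows:
        for name, values in columns.items():
            values.append(row.get(name))  # missing fields become None
    return columns


# Illustrative sample data (not from the RFC).
rows = [
    {"status": 200, "bytes": 512},
    {"status": 404, "bytes": 128},
    {"status": 200},
]
columns = rows_to_columns(rows)

# An aggregate over one column touches only that column's list.
total_bytes = sum(b for b in columns["bytes"] if b is not None)
print(columns)
print(total_bytes)
```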
An OpenSearch-enhanced, Iceberg-compatible table format, termed OpenSearch Table, is also proposed. It extends Iceberg with index capabilities, enabling the creation of search indexes on text fields, vector indexes, and geo indexes. During query execution, these indexes will be used automatically to optimize full-text, neural, and geo searches. Initially, this feature may be based on a customized Iceberg version that remains fully compatible with Iceberg. Once the feature is integrated into mainstream Iceberg, it will become available to all query engines.
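One way to picture automatic index use is a lookup from a search function and column to an index recorded in the table's metadata. The sketch below assumes a hypothetical metadata model (`TableMetadata`, `TableIndex`) and a hypothetical mapping from function name to index kind. None of these names come from Iceberg or from an OpenSearch Table specification.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass(frozen=True)
class TableIndex:
    column: str
    kind: str  # "text", "vector", or "geo"


@dataclass
class TableMetadata:
    name: str
    indexes: list[TableIndex] = field(default_factory=list)


# Assumed mapping from a search function to the index kind that can serve it.
FUNCTION_TO_INDEX_KIND = {
    "search": "text",
    "vector_search": "vector",
    "geo_search": "geo",
}


def pick_index(table: TableMetadata, function: str, column: str) -> Optional[TableIndex]:
    """Return a matching index, or None to signal a fallback to a full scan."""
    kind = FUNCTION_TO_INDEX_KIND.get(function)
    for index in table.indexes:
        if index.column == column and index.kind == kind:
            return index
    return None


# Hypothetical table with a text index and a vector index.
docs = TableMetadata(
    name="docs",
    indexes=[TableIndex("body", "text"), TableIndex("embedding", "vector")],
)
print(pick_index(docs, "search", "body"))
print(pick_index(docs, "geo_search", "location"))
```

Returning `None` rather than raising keeps queries valid on tables that have no suitable index; the engine would simply scan the data instead.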

1.3. High level architecture

Overview

Client: Represents the user initiating SQL queries.
Databases: Includes a variety of systems such as Flink, Presto, and Redshift, reflecting the diverse data sources available for querying.

Detailed Flow

SQL Interface: Clients send SQL queries to the databases (Flink, Presto, Redshift). These queries are subsequently routed to OpenSearch integrated with Spark for processing.
Data Storage:
- Managed Dataset OpenSearch Table S3: Data is written to and read from a managed dataset stored in the OpenSearch Table format on S3.
- Unmanaged Dataset S3: Serves as an alternative storage option, where data is directly accessed from an unmanaged dataset on S3.
Data Ingestion:
- DataPrepper: Prepares and processes data before it is written to OpenSearch.
- External: Existing engines can directly write data to S3 by leveraging the OpenSearch Table library, which is compatible with Iceberg.
Data Querying:
- OpenSearch with Spark: The SQL query is processed on the OpenSearch coordinating node. The parser module parses the SQL query and translates it into an Abstract Syntax Tree (AST). We can continue to use the current SQL parser (MySQL dialect) and enhance it as needed. There are multiple choices for the LogicalPlan, including Substrait, the existing SQL Plugin, and Calcite. A runtime selector should then choose the appropriate runtime to execute the logical plan. This selector could use a rule-based solution to determine the optimal runtime.
- Interaction with Databases: The OpenSearch Table can be used or stored back across the various databases, maintaining a cycle of data use and optimization.
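The rule-based runtime selector described under Data Querying could be sketched roughly as follows. The `LogicalPlan` stand-in, the two rules, and the runtime names `"spark"` and `"opensearch"` are all assumptions made for illustration; the RFC does not specify the rules or the plan representation.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class LogicalPlan:
    """Minimal stand-in for a logical plan; a real plan carries an operator tree."""
    sources: tuple[str, ...]  # e.g. ("opensearch",) or ("s3",)
    has_join: bool = False


@dataclass(frozen=True)
class Rule:
    name: str
    matches: Callable[[LogicalPlan], bool]
    runtime: str


# Hypothetical rules, evaluated in order; the first match wins.
RULES = [
    Rule("external-data", lambda p: any(s != "opensearch" for s in p.sources), "spark"),
    Rule("join", lambda p: p.has_join, "spark"),
]
DEFAULT_RUNTIME = "opensearch"


def select_runtime(plan: LogicalPlan) -> str:
    """Return the runtime of the first matching rule, else the default."""
    for rule in RULES:
        if rule.matches(plan):
            return rule.runtime
    return DEFAULT_RUNTIME


print(select_runtime(LogicalPlan(sources=("opensearch",))))
print(select_runtime(LogicalPlan(sources=("s3",))))
```

An ordered rule list keeps the selection logic easy to audit and extend; a cost-based selector could later replace it behind the same `select_runtime` signature.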

penghuo added the enhancement and untriaged labels and removed the untriaged label May 15, 2024

github-actions bot added the untriaged label May 15, 2024

penghuo removed the untriaged label May 15, 2024

penghuo changed the title ~~[RFC] - High Level Vision of OpenSearch SQL~~ [RFC] High Level Vision of OpenSearch SQL Sep 6, 2024
