[RFC] High Level Vision of OpenSearch SQL #2674

Open
penghuo opened this issue May 15, 2024 · 0 comments
Labels
enhancement New feature or request


1.1. Current State

OpenSearch SQL currently faces several challenges. There is no comprehensive OpenSearch SQL specification, so its behavior diverges from the widely adopted Spark SQL. This lack of standardization hampers integration and interoperability between SQL dialects and data processing engines. There is also no unified interface, which complicates application development and maintenance: plugins/sql serves OpenSearch SQL requests on indexes, while async_sql serves Spark SQL.
Key modern data analytics features such as search, vector_search, and geo_search are missing, which limits OpenSearch SQL's ability to process text, vector, and spatial data. There is no high-throughput interface for integration with engines such as Spark and Presto, which can severely limit query performance and scalability in data-intensive environments.
Furthermore, user-created derived datasets, such as skipping indexes and covering indexes, are not consolidated in the same catalog, leading to inconsistencies and potential data silos. Raw data sources also vary significantly in format, which can degrade query performance and make it unpredictable. Addressing this requires optimizations specific to each data format or structure.

1.2. Future Vision

1.2.1. OpenSearch SQL

Looking ahead, OpenSearch SQL should be a first-class query language for OpenSearch. It will feature a unified _sql REST interface. The current ANSI SQL grammar will be extended with new capabilities such as search, vector_search, and geo_search, allowing users to take full advantage of OpenSearch's search functionality from familiar SQL syntax.
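
As a rough illustration of what this could look like from a client's point of view, the sketch below builds request bodies for the proposed endpoint. Only the `_sql` path and the function names `search`, `vector_search`, and `geo_search` come from this RFC. The JSON body shape (a single `query` field), the table and column names, and the argument lists of the three functions are assumptions made for the example, since none of them are specified here.

```python
import json

# Hypothetical queries: the function names come from the RFC, but their
# argument lists and the table/column names are assumptions.
EXAMPLE_QUERIES = {
    "full_text": "SELECT title FROM products WHERE search(description, 'wireless headphones')",
    "vector": "SELECT title FROM products WHERE vector_search(embedding, [0.1, 0.2, 0.3], 5)",
    "geo": "SELECT name FROM stores WHERE geo_search(location, 47.6, -122.3, '10km')",
}


def build_sql_request(query: str) -> tuple[str, str]:
    """Return (path, JSON body) for a request to the proposed _sql endpoint.

    The body shape {"query": ...} is an assumption, not a defined contract.
    """
    return "/_sql", json.dumps({"query": query})


path, body = build_sql_request(EXAMPLE_QUERIES["full_text"])
print(path)
print(body)
```

The point of the sketch is that all three search styles would travel through one endpoint as ordinary SQL text, rather than through separate APIs.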

1.2.2. OpenSearch SQL on DataLake

OpenSearch aims to offer a unified and robust OpenSearch SQL language that will enable seamless querying and searching across data stored in OpenSearch and external data sources. The SQL client library will be upgraded to use more efficient data formats (Protobuf, Apache Arrow), providing a high-performance, columnar, and language-agnostic standard for database integrations. This ensures rapid data processing and broad interoperability across various systems and programming languages.
An Iceberg-compatible table format enhanced for OpenSearch, termed OpenSearch Table, is also proposed. It extends Iceberg with index capabilities: search indexes on text fields, vector indexes, and geo indexes. During query execution, these indexes will be used automatically to optimize full-text, neural, and geo searches. Initially, OpenSearch Table may be based on a customized version of Iceberg that remains fully compatible with Iceberg. As the feature is integrated into mainstream Iceberg, it will become available to all query engines.
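
The automatic use of indexes during query execution can be pictured as a lookup from predicate type to index kind. The following is only a minimal sketch under stated assumptions, not the proposed design: `Predicate`, `TableMetadata`, and `choose_access_path` are hypothetical names, and falling back to a full scan when no index matches is assumed behavior.

```python
from dataclasses import dataclass, field

# Which kind of index can serve which kind of predicate (assumed mapping).
INDEX_FOR_PREDICATE = {"full_text": "search", "neural": "vector", "geo": "geo"}


@dataclass(frozen=True)
class Predicate:
    column: str
    kind: str  # "full_text", "neural", "geo", or anything else (e.g. "equals")


@dataclass
class TableMetadata:
    name: str
    # Maps a column name to the set of index kinds built on that column.
    indexes: dict[str, set[str]] = field(default_factory=dict)


def choose_access_path(table: TableMetadata, predicate: Predicate) -> str:
    """Pick an index if one matches the predicate, otherwise scan the table."""
    wanted = INDEX_FOR_PREDICATE.get(predicate.kind)
    if wanted is not None and wanted in table.indexes.get(predicate.column, set()):
        return f"{wanted}_index({predicate.column})"
    return "full_scan"


logs = TableMetadata("logs", {"message": {"search"}, "embedding": {"vector"}})
print(choose_access_path(logs, Predicate("message", "full_text")))
print(choose_access_path(logs, Predicate("location", "geo")))
```

Here the first predicate is served by the search index on `message`, while the second falls back to a scan because no geo index exists on `location`.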

1.3. High-Level Architecture

Overview

  • Client: Represents the user initiating SQL queries.
  • Databases: A variety of databases, including Flink, Presto, and Redshift, reflecting the diverse data sources available for querying.

Detailed Flow

  • SQL Interface: Clients send SQL queries to the databases (Flink, Presto, Redshift). These queries are subsequently routed to OpenSearch integrated with Spark for processing.
  • Data Storage:
    • Managed Dataset OpenSearch Table S3: Data is written to and read from a managed dataset stored in the OpenSearch Table format on S3.
    • Unmanaged Dataset S3: Serves as an alternative storage option, where data is directly accessed from an unmanaged dataset on S3.
  • Data Ingestion:
    • DataPrepper: Prepares and processes data before it is written to OpenSearch.
    • External: Existing engines can directly write data to S3 by leveraging the OpenSearch Table library, which is compatible with Iceberg.
  • Data Querying:
    • OpenSearch with Spark: The SQL query is processed on the OpenSearch coordinating node. The parser module parses the query into an Abstract Syntax Tree (AST). The current SQL parser (MySQL dialect) can continue to be used and enhanced as needed. There are several candidates for the logical plan representation, including Substrait, the existing SQL plugin, and Calcite. A runtime selector then chooses the appropriate runtime to execute the logical plan; it could use a rule-based approach to determine the optimal runtime.
    • Interaction with Databases: The OpenSearch Table can be used or stored back across the various databases, maintaining a cycle of data use and optimization.
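
The rule-based runtime selector mentioned under Data Querying could be sketched roughly as follows. This is an illustration under assumptions, not a design: `LogicalPlan`, `RULES`, and `select_runtime` are hypothetical names, the plan is reduced to a single `source` field, and the specific rules (index-backed queries run in OpenSearch, S3-backed queries run in Spark) are made up for the example.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class LogicalPlan:
    """Toy stand-in for a logical plan; the real representation is undecided."""
    source: str  # assumed values: "opensearch_index" or "s3"


# A rule pairs a predicate over the plan with the runtime to use if it matches.
Rule = tuple[Callable[[LogicalPlan], bool], str]

# Rules are evaluated in order; the first match wins (assumed policy).
RULES: list[Rule] = [
    (lambda plan: plan.source == "opensearch_index", "opensearch"),
    (lambda plan: plan.source == "s3", "spark"),
]

DEFAULT_RUNTIME = "opensearch"  # assumed fallback when no rule matches


def select_runtime(plan: LogicalPlan) -> str:
    """Return the name of the runtime that should execute the plan."""
    for matches, runtime in RULES:
        if matches(plan):
            return runtime
    return DEFAULT_RUNTIME


print(select_runtime(LogicalPlan(source="opensearch_index")))
print(select_runtime(LogicalPlan(source="s3")))
```

Keeping the rules as an ordered list of data makes it possible to add or reorder rules without touching the selection loop, which is one reason a rule-based selector is a reasonable starting point before any cost-based approach.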

[Image: Screenshot 2024-05-13 at 10 05 28 AM]
