[FEATURE] Object Storage (S3) Data Ingestion through Streaming Query #948

dai-chen · 2022-10-21T17:04:05Z

Is your feature request related to a problem?
One of the key technical challenges in #719 is how to maintain consistency between the base table (S3 data) and the derived table (OpenSearch index/materialized view).

What solution would you like?
One solution to the problem is to refresh new data from S3 to OpenSearch incrementally. We propose enhancing our query engine by unifying batch processing and stream processing capabilities in a single architecture, as existing solutions such as Apache Flink and Spark do. In particular, the enhancement includes changes to query planning, the query execution engine, and the query plan itself.

PoC branch: https://github.com/opensearch-project/sql/tree/poc/maximus-m1. A user manual and a detailed design doc will be published later, as planned below.

What alternatives have you considered?
The alternative solution is to rebuild the derived table (full refresh) on user demand or on a regular basis. This can be done with the current batch processing architecture; however, it would introduce significant overhead for large S3 datasets.
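To make the trade-off between full and incremental refresh concrete, here is a minimal sketch in Python. It is not the PoC implementation: the in-memory `base` dict standing in for S3, the `DerivedTable` class standing in for an OpenSearch index, and the `seen` set used as a checkpoint are all hypothetical names chosen for illustration. It also assumes S3 objects are immutable and only ever added, which a real design would need to verify.

```python
from dataclasses import dataclass, field


@dataclass
class DerivedTable:
    """Hypothetical stand-in for an OpenSearch index or materialized view."""
    rows: list = field(default_factory=list)
    objects_read: int = 0  # total number of base-table objects scanned so far


def full_refresh(base: dict, derived: DerivedTable) -> None:
    """Rebuild the derived table by rescanning every object in the base table."""
    derived.rows = []
    for key in sorted(base):
        derived.rows.extend(base[key])
        derived.objects_read += 1


def incremental_refresh(base: dict, derived: DerivedTable, seen: set) -> None:
    """Append rows only from objects that no earlier refresh has processed."""
    for key in sorted(set(base) - seen):
        derived.rows.extend(base[key])
        derived.objects_read += 1
        seen.add(key)  # checkpoint: remember this object as processed


# Hypothetical base table: S3 object key -> rows stored in that object.
base = {"logs/part-0": [1, 2], "logs/part-1": [3]}

full, incr, seen = DerivedTable(), DerivedTable(), set()
full_refresh(base, full)
incremental_refresh(base, incr, seen)

# A new object arrives in S3; refresh both derived tables again.
base["logs/part-2"] = [4, 5]
full_refresh(base, full)
incremental_refresh(base, incr, seen)

# Both strategies end with the same data, but scan different amounts of input.
assert full.rows == incr.rows == [1, 2, 3, 4, 5]
print(full.objects_read, incr.objects_read)  # prints "5 3"
```

In this toy run the full refresh scans every object on each pass, while the incremental refresh scans each object once. That gap grows with the size of the S3 dataset and the refresh frequency, which is the overhead described above.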

Do you have any additional context?

Phase 1

Goal:

Ready for performance evaluation
Ready for feature evaluation
Missing
- Failure recovery
- Security

Tasks

Phase 2

Goal:

Ready for experimental release
Missing
- Pipeline Execution
- Distributed Execution

Tasks

Phase 3

Goal:

Ready for production deployment

Tasks

Pipeline Execution
Distributed Execution

dai-chen · 2023-01-24T18:01:55Z

The work here is suspended due to ongoing research in opensearch-project/opensearch-spark#4.

dai-chen added the enhancement New feature or request label Oct 21, 2022

dai-chen self-assigned this Oct 21, 2022

dai-chen added the Meta Meta issue, not directly linked to a PR label Oct 21, 2022

dai-chen added this to the Maximus - M1 milestone Oct 21, 2022

penghuo changed the title ~~[FEATURE] Query plan enhancement for stream processing~~ [FEATURE] Object Storage (S3) Data Ingestion through Streaming Query Oct 21, 2022

penghuo added this to Object Storage (S3) Data Ingestion through Streaming Query Oct 21, 2022

penghuo removed this from the Maximus - M1 milestone Oct 21, 2022

penghuo assigned penghuo, joshuali925, dai-chen and vmmusings and unassigned penghuo, joshuali925 and dai-chen Oct 21, 2022

dai-chen added feature and removed enhancement New feature or request labels Oct 24, 2022

dai-chen mentioned this issue Nov 19, 2022

Improve pushdown optimization and logical to physical transformation #1091

Merged

6 tasks

dai-chen moved this to In Progress in Object Storage (S3) Data Ingestion through Streaming Query Nov 22, 2022

penghuo mentioned this issue Nov 29, 2022

[RFC] OpenSearch and Apache Spark Integration opensearch-project/opensearch-spark#4

Open

penghuo mentioned this issue Nov 4, 2023

[FEATURE] Support COPY operation opensearch-project/opensearch-spark#129

Open

penghuo mentioned this issue Jul 17, 2023

[RFC] OpenSearch and Apache Spark Integration #1875

Open

salyh mentioned this issue Sep 11, 2024

[DOC] Misleading and unclear documentation for the Spark Connector in the SQL/PPL docs opensearch-project/documentation-website#8212

Open

4 tasks
