[FEATURE] Object Storage (S3) Data Ingestion through Streaming Query #948

Open
17 of 43 tasks
dai-chen opened this issue Oct 21, 2022 · 1 comment
Labels
feature Meta Meta issue, not directly linked to a PR

Comments

@dai-chen
Collaborator

dai-chen commented Oct 21, 2022

Is your feature request related to a problem?
One of the key technical challenges in #719 is maintaining consistency between the base table (S3 data) and the derived table (OpenSearch index/materialized view).

What solution would you like?
One solution is to refresh new data from S3 to OpenSearch incrementally. We propose enhancing our query engine by unifying batch processing and stream processing in a single architecture, as existing systems such as Apache Flink and Spark do. In particular, the enhancement includes changes to query planning, the query execution engine, and the query plan itself.

PoC branch: https://github.com/opensearch-project/sql/tree/poc/maximus-m1. A detailed user manual and design doc will be published later, as planned below.

What alternatives have you considered?
The alternative is to rebuild the derived table (full refresh), either on user demand or on a regular basis. This can be done with the current batch processing architecture; however, it would introduce significant overhead for large S3 datasets.
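
To make the trade-off concrete, here is a minimal sketch contrasting the two refresh strategies. It is not taken from the PoC branch: the in-memory `base` dict stands in for an S3 bucket, `DerivedTable` stands in for the OpenSearch index, and all names are hypothetical. It also assumes S3 objects are append-only (new objects arrive, existing ones are never modified).

```python
from dataclasses import dataclass, field


@dataclass
class DerivedTable:
    """Hypothetical stand-in for the OpenSearch index / materialized view."""
    rows: list = field(default_factory=list)
    objects_read: int = 0  # total number of base-table objects scanned so far


def full_refresh(base: dict, derived: DerivedTable) -> None:
    """Rebuild the derived table by rescanning every object in the base table."""
    derived.rows = []
    for key in sorted(base):
        derived.rows.extend(base[key])
        derived.objects_read += 1


def incremental_refresh(base: dict, derived: DerivedTable, seen: set) -> None:
    """Append rows only from objects that no earlier refresh has processed."""
    for key in sorted(base.keys() - seen):
        derived.rows.extend(base[key])
        derived.objects_read += 1
        seen.add(key)


# Simulated bucket: object key -> rows (assumption: objects are append-only).
base = {"logs/001": [1, 2], "logs/002": [3]}
full, incr, seen = DerivedTable(), DerivedTable(), set()

full_refresh(base, full)
incremental_refresh(base, incr, seen)

base["logs/003"] = [4, 5]  # a new object arrives in the bucket
full_refresh(base, full)
incremental_refresh(base, incr, seen)

print(full.rows == incr.rows)                 # same derived content
print(full.objects_read, incr.objects_read)   # scan cost of each strategy
```

Both strategies end with the same derived rows, but the full refresh rescans every object on each run, while the incremental refresh only reads objects it has not seen before. The gap grows with the size of the base table, which is the overhead described above.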

Do you have any additional context?

Phase 1

Goal:

  • Ready for performance evaluation
  • Ready for feature evaluation
  • Missing
    • Failure recovery
    • Security

Tasks

Phase 2

Goal:

  • Ready for experimental release
  • Missing
    • Pipeline Execution
    • Distributed Execution

Tasks

Phase 3

Goal:

  • Ready for production deployment

Tasks

  • Pipeline Execution
  • Distributed Execution
@dai-chen dai-chen added the enhancement New feature or request label Oct 21, 2022
@dai-chen dai-chen self-assigned this Oct 21, 2022
@dai-chen dai-chen added the Meta Meta issue, not directly linked to a PR label Oct 21, 2022
@dai-chen dai-chen added this to the Maximus - M1 milestone Oct 21, 2022
@penghuo penghuo changed the title [FEATURE] Query plan enhancement for stream processing [FEATURE] Object Storage (S3) Data Ingestion through Streaming Query Oct 21, 2022
@penghuo penghuo removed this from the Maximus - M1 milestone Oct 21, 2022
@dai-chen dai-chen added feature and removed enhancement New feature or request labels Oct 24, 2022
@dai-chen
Collaborator Author

The work here is suspended due to ongoing research in opensearch-project/opensearch-spark#4.


4 participants