[EPIC] Zero-ETL - Spark streaming job computation cost reduction #196

dai-chen · 2023-12-13T22:32:53Z

Is your feature request related to a problem?

Currently, a streaming job keeps running even when there is no new data to update its index, i.e. 1 executor and 1 driver run continuously. This wastes resources.
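To make the idle cost concrete, here is a rough back-of-the-envelope sketch. The helper name and the on-demand schedule (24 runs per day, 5 minutes each) are hypothetical and chosen only for illustration; only the 1 driver + 1 executor footprint comes from the description above.

```python
def node_hours_per_day(nodes, runs_per_day, minutes_per_run):
    """Node-hours consumed per day by a job that runs `runs_per_day` times."""
    return nodes * runs_per_day * minutes_per_run / 60


# Always-on streaming job: 1 driver + 1 executor, running the whole day.
always_on = node_hours_per_day(nodes=2, runs_per_day=1, minutes_per_run=24 * 60)

# Hypothetical on-demand schedule: 24 runs per day, 5 minutes each.
on_demand = node_hours_per_day(nodes=2, runs_per_day=24, minutes_per_run=5)

print(always_on, on_demand)
```

Under these assumed numbers, most of the always-on node-hours would be spent idle whenever new source data arrives only occasionally.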

Out of Scope

This epic focuses only on reducing the unnecessary cost incurred while a streaming job is idle (no new source data to process). The following related topics won't be discussed here:

New file discovery optimization via efficient S3 listing or SNS notification
Resource usage reduction by dynamic resource allocation (DRA)
Computation result storage cost reduction by Flint index on S3

What solution would you like?

Proposed Solutions

The proposed items below approach the problem from different perspectives:

Enable user to stop/restart on demand: Reduce streaming job duration
Share Spark driver node with Flint REPL job: Avoid dedicated driver node
On-demand index build: Eliminate reliance on streaming job entirely

Proposed Items	Description	Size	Comment
Enable user to stop/restart on demand	User can control how long the streaming job runs via: 1. Job management API, such as SHOW/CANCEL for long-running jobs: #190; 2. Streaming job with AvailableNow trigger: #195. User triggers each job restart from the UI, a scheduler, an SNS notification, a Glue crawler event, etc.	Small	Provides APIs that serve different use cases. User can opt in and then trigger as often as needed.
Share Spark driver node with Flint REPL job	Currently, the streaming job is handled by a separate code path in Spark, which can be merged with and handled by the Flint REPL job code.	Med	Major code refactoring required
On-demand index build	User can choose to trigger the index build on demand at query time. This removes the requirement for a Spark streaming job: #118 (comment)	Large	This serves the specific use cases mentioned in the issue.

Preferred Solution

As an initial step, the tasks outlined in #195 under the first proposed item are a good starting point. In the short term, their manageable implementation size, moderate difficulty, and adaptability make them well suited to accommodating various use cases.
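For reference, the AvailableNow trigger in plain Spark Structured Streaming can be sketched as follows. This is a generic PySpark illustration, not Flint's implementation: the function name, the text source, the Parquet sink and all paths are assumptions made for the example, and `spark` is expected to be an existing SparkSession (the `availableNow` keyword requires PySpark 3.3 or later).

```python
def run_available_now_refresh(spark, source_path, sink_path, checkpoint_path):
    """Process all data currently available at the source, then stop.

    Hypothetical helper: `spark` is an existing SparkSession; the text source
    and Parquet sink stand in for the real source table and index sink.
    """
    query = (
        spark.readStream.format("text")
        .load(source_path)
        .writeStream.format("parquet")
        # The checkpoint lets the next run resume where this one stopped.
        .option("checkpointLocation", checkpoint_path)
        # AvailableNow: consume everything available now, then terminate.
        .trigger(availableNow=True)
        .start(sink_path)
    )
    # Returns once the available data is processed, so the driver and
    # executor can be released instead of idling.
    query.awaitTermination()
    return query
```

Each restart (from the UI, a scheduler, an SNS notification, etc.) would call such a function once; because progress is kept in the checkpoint, repeated runs should pick up only data that arrived since the previous run.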

Other impact/benefits:

TODO: evaluate from streaming job monitoring, deployment and REPL session aspect.

dai-chen added the Meta label (Meta issue, not directly linked to a PR) Dec 13, 2023

github-actions bot added the untriaged label Dec 13, 2023

dai-chen removed the untriaged label Dec 13, 2023

dai-chen mentioned this issue Dec 13, 2023

[Feature] OpenSearch and Apache Spark Integration #3

Closed
