[EPIC] Zero-ETL - Spark streaming job computation cost reduction #196

Open
dai-chen opened this issue Dec 13, 2023 · 0 comments
Labels: Meta (meta issue, not directly linked to a PR)

Is your feature request related to a problem?

Currently, the streaming job for each index keeps running even when there is no new data to process, i.e. 1 driver and 1 executor stay up indefinitely. This wastes resources.

Out of Scope

This epic focuses only on reducing the unnecessary cost incurred while a streaming job is idle (no new source data to process). Related topics that won't be discussed here:

  • New file discovery optimization via efficient S3 listing or SNS notifications
  • Resource usage reduction via dynamic resource allocation (DRA)
  • Storage cost reduction for computation results via Flint index on S3

What solution would you like?

Proposed Solutions

The proposed items below approach the problem from different perspectives:

  1. Enable user to stop/restart on demand: Reduce streaming job duration
  2. Share Spark driver node with Flint REPL job: Avoid dedicated driver node
  3. On-demand index build: Eliminate reliance on streaming job entirely

| Proposed Item | Description | Size | Comment |
|---|---|---|---|
| Enable user to stop/restart on demand | User can control how long the streaming job runs by: (1) job management API, such as SHOW/CANCEL for long-running jobs: #190; (2) streaming job with AvailableNow trigger: #195. User triggers each job restart from a UI, scheduler, SNS notification, Glue crawler event, etc. | Small | Provides APIs to serve different use cases. User can opt in and then trigger as often as needed. |
| Share Spark driver node with Flint REPL job | Currently the streaming job is handled by a separate code path in Spark, which can be merged with and handled by the Flint REPL job code. | Medium | Major code refactoring required. |
| On-demand index build | User can choose to trigger the index build on demand at query time. This removes the requirement for a Spark streaming job: #118 (comment) | Large | This serves the specific use cases mentioned in the issue. |
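
To make the job management idea in the first item more concrete, here is a minimal sketch of what SHOW and CANCEL could map onto inside a single Spark session. It relies only on Spark's existing `spark.streams.active` list and `StreamingQuery.stop()`. The function names `show_streaming_jobs` and `cancel_streaming_job` are hypothetical and are not the API proposed in #190; jobs running in other Spark applications would need a different mechanism.

```python
def show_streaming_jobs(spark):
    """List (id, name, is_active) for every streaming query in this session.

    `spark` is assumed to be an existing SparkSession.
    """
    return [(str(q.id), q.name, q.isActive) for q in spark.streams.active]


def cancel_streaming_job(spark, job_id):
    """Stop the streaming query whose id equals job_id.

    Returns True if a matching query was found and stopped, False otherwise.
    Hypothetical helper for illustration only.
    """
    for q in spark.streams.active:
        if str(q.id) == job_id:
            q.stop()  # releases the query; idle executors can then be reclaimed
            return True
    return False
```

A caller would first list the jobs, pick an id, and pass it to `cancel_streaming_job`. Restarting is then a matter of submitting the same query again with the same checkpoint location.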

Preferred Solution

A good starting point is the work outlined in #195 under the first proposed item. Its manageable implementation size, moderate difficulty, and flexibility make it well suited to accommodate various use cases in the short term.
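
As a rough illustration of the AvailableNow approach referenced in #195, the sketch below runs a streaming query that processes whatever data is currently available and then terminates, so the driver and executor do not stay up while idle. It uses the PySpark `trigger(availableNow=True)` option (Spark 3.3 or later). The text source, parquet sink, and the function name `refresh_once` are assumptions chosen to keep the example small; the actual Flint job writes to an index and is not shown here.

```python
def refresh_once(spark, source_path, sink_path, checkpoint_path):
    """Process all currently available source data, then stop.

    `spark` is assumed to be an existing SparkSession (Spark 3.3+).
    The paths are placeholders supplied by the caller.
    """
    query = (
        spark.readStream
        .format("text")                  # text source has a fixed schema
        .load(source_path)
        .writeStream
        .format("parquet")
        # The checkpoint records which files were already processed,
        # so each restart only picks up new data.
        .option("checkpointLocation", checkpoint_path)
        .trigger(availableNow=True)      # drain the backlog, then terminate
        .start(sink_path)
    )
    query.awaitTermination()             # returns once the backlog is processed
```

Each invocation, whether from a scheduler or an event notification, would call `refresh_once` with the same checkpoint path, which is what makes repeated short runs behave like one incremental stream.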

Other impact/benefits:

TODO: evaluate from the streaming job monitoring, deployment, and REPL session aspects.

dai-chen added the Meta label (meta issue, not directly linked to a PR) on Dec 13, 2023