Is your feature request related to a problem?
Currently, a streaming job keeps running even when there is no new data to update for any index, i.e. 1 executor and 1 driver run indefinitely. This wastes resources.
Out of Scope
This epic focuses only on reducing the unnecessary cost incurred while a streaming job is idle (no new source data to process). The following related topics are not discussed here:
New file discovery optimization via efficient S3 listing or SNS notification
Resource usage reduction by dynamic resource allocation (DRA)
Computation result storage cost reduction by Flint index on S3
What solution would you like?
Proposed Solutions
The proposed items below approach the problem from different perspectives:
Enable user to stop/restart on demand: Reduce streaming job duration
On-demand index build: Eliminate reliance on streaming job entirely
| Proposed Items | Description | Size | Comment |
|---|---|---|---|
| Enable user to stop/restart on demand | User can control how long the streaming job runs by: 1. Job management API, such as SHOW/CANCEL for long-running jobs: #190; 2. Streaming job with AvailableNow trigger: #195. User triggers each job restart from the UI, a scheduler, an SNS notification, a Glue crawler event, etc. | Small | Provides APIs to serve different use cases. User can opt in and then trigger as often as needed. |
| Share Spark driver node with Flint REPL job | Currently the streaming job is handled by a separate code path in Spark, which can be merged with and handled by the Flint REPL job code. | Med | Major code refactoring required |
| On-demand index build | User can choose to trigger index build on demand at query time. This removes the requirement on Spark streaming job: #118 (comment) | Large | This serves specific use cases mentioned in issue. |
Preferred Solution
As an initial step, addressing the tasks outlined in #195 in the first proposed item is a good starting point. In the short term, its manageable implementation size, moderate difficulty, and adaptable nature make it well-suited to accommodate various use cases.
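To make the intended behavior concrete, here is a minimal sketch in plain Python of the semantics an AvailableNow-style run is expected to have: drain whatever source data exists when the job starts, record progress in a checkpoint, and exit instead of idling. This is a toy model, not Flint or Spark code. The names `Checkpoint`, `run_available_now`, and `max_files_per_batch` are hypothetical and chosen only for illustration. In Spark Structured Streaming the corresponding trigger is `Trigger.AvailableNow`.

```python
from dataclasses import dataclass, field


@dataclass
class Checkpoint:
    """Remembers which source files earlier runs already processed."""
    processed: set[str] = field(default_factory=set)


def run_available_now(source_files, checkpoint, max_files_per_batch=2):
    """Process everything available right now, then return (the job exits).

    Returns the list of micro-batches that were executed.
    """
    # Snapshot the backlog once at start; files arriving later wait for the next run.
    pending = sorted(f for f in source_files if f not in checkpoint.processed)
    batches = []
    for start in range(0, len(pending), max_files_per_batch):
        batch = pending[start:start + max_files_per_batch]
        # Placeholder for the real index refresh work on this batch.
        checkpoint.processed.update(batch)
        batches.append(batch)
    return batches


checkpoint = Checkpoint()
source = ["a.json", "b.json", "c.json"]

first_run = run_available_now(source, checkpoint)  # drains the backlog in two batches
idle_run = run_available_now(source, checkpoint)   # nothing new, so no batches run
source.append("d.json")
later_run = run_available_now(source, checkpoint)  # only the new file is processed

print(first_run)
print(idle_run)
print(later_run)
```

The point of the sketch is that a run with no new data does no work and holds no driver or executor afterwards, while the checkpoint lets each externally triggered restart resume where the previous run stopped.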
Other impact/benefits:
TODO: evaluate from the streaming job monitoring, deployment, and REPL session aspects.