[Lake][Jobs][Incremental Pipeline] Use checkpoints and job metadata rather than inferring job progress through data.
Problem/motivation
I had also written this out somewhere, but basically: we're not keeping a record of our last_run_ts. I had mentioned in the past that we'll eventually need to track start/end timestamps for the GQL/ETL workflows, for example:
Run    Time
Run 1  1:00
Run 2  2:00
Run 3  3:00
To initially solve this and move us forward (like the OHLCV data factory), I wrote some logic such that we just query this data from our tables in order to identify the last_run_checkpoint.
To solve this in the past, I had proposed simply mutating my_ppss.yaml such that lake st_ts is modified with the last_run_timestamp and the pipeline enforces being incremental, rather than trying to do it through data inference... but this kind of breaks the pattern for how the yaml file is being used (the engine modifying it).
Further, it doesn't provide a way to track the ETL/workflow runs such that they can be rolled back/operated in a systematic way.
Proposed solution (a)
Start tracking the job metadata inside a jobs table in DuckDB,
and use this data to understand how to resume/rollback/operate jobs on the lake. A minimal sketch of such a jobs table is shown below.
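For illustration only, here is a rough sketch of what that could look like with DuckDB's Python API. The table name (_etl_jobs) and column names are hypothetical, not the actual pdr-backend schema:

```python
import duckdb

# Hypothetical jobs-metadata table (names are illustrative, not the actual
# pdr-backend schema). Each GQL/ETL run writes one row, so resuming and
# rolling back become simple queries instead of data inference.
con = duckdb.connect("lake.duckdb")

con.execute("""
    CREATE TABLE IF NOT EXISTS _etl_jobs (
        job_id     VARCHAR,    -- unique id for this run
        workflow   VARCHAR,    -- e.g. 'gql' or 'etl'
        st_ts      TIMESTAMP,  -- start of the data window processed
        end_ts     TIMESTAMP,  -- end of the data window processed
        status     VARCHAR,    -- 'running' | 'done' | 'failed'
        created_at TIMESTAMP DEFAULT current_timestamp
    )
""")

def last_checkpoint(workflow: str):
    """Return end_ts of the latest completed run, or None on the first run."""
    row = con.execute(
        "SELECT max(end_ts) FROM _etl_jobs WHERE workflow = ? AND status = 'done'",
        [workflow],
    ).fetchone()
    return row[0]

def record_run(job_id: str, workflow: str, st_ts, end_ts) -> None:
    """Mark a run as done for the [st_ts, end_ts] window it processed."""
    con.execute(
        "INSERT INTO _etl_jobs (job_id, workflow, st_ts, end_ts, status) "
        "VALUES (?, ?, ?, ?, 'done')",
        [job_id, workflow, st_ts, end_ts],
    )
```

A rollback then becomes "delete lake rows after timestamp X and the corresponding job rows", which is the kind of systematic operation described above.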
Proposed solution (b)
Just modify the ppss.yaml lake.st_ts when the job ends. If the user runs a CLI command to roll back the pipeline, lake.st_ts should also be updated to reflect the state of the lake. This is a KISS solution that lets us keep the SLA small.
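As a rough sketch of option (b), assuming PyYAML and that my_ppss.yaml has a lake section with an st_ts entry (the timestamp format below is illustrative):

```python
import yaml  # PyYAML

def bump_st_ts(ppss_path: str, new_st_ts: str) -> None:
    """Point lake.st_ts at the new checkpoint after a run ends (or after a
    CLI rollback), so the next run starts from there."""
    with open(ppss_path) as f:
        ppss = yaml.safe_load(f)
    ppss["lake"]["st_ts"] = new_st_ts  # assumes lake.st_ts already exists
    with open(ppss_path, "w") as f:
        yaml.safe_dump(ppss, f, sort_keys=False)

# e.g. bump_st_ts("my_ppss.yaml", "2024-05-01_00:00")
```

Note that this is exactly the "engine modifying the yaml" pattern flagged as a drawback above.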
Current solution (c)
Use min & max from pdr_predictions as the checkpoint for where data has been written to.
All other tables should fetch and update from this marker.
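A minimal sketch of that inference, assuming pdr_predictions has a timestamp column (adjust to the actual schema):

```python
import duckdb

con = duckdb.connect("lake.duckdb")

# Derive the checkpoint from the data itself rather than from job metadata.
st_ts, end_ts = con.execute(
    "SELECT min(timestamp), max(timestamp) FROM pdr_predictions"
).fetchone()

# Other tables then fetch/update relative to [st_ts, end_ts]
# instead of keeping their own markers.
```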
DoD:
Save ETL & workflow metadata to operate the lake.
Tasks