
[Lake][Jobs][Incremental Pipeline] Use checkpoints and job metadata rather than inferring job progress through data. #983

Closed
idiom-bytes opened this issue May 1, 2024 · 2 comments
Labels
Type: Enhancement New feature or request


idiom-bytes commented May 1, 2024

Problem/motivation

I also wrote this out somewhere... but basically, we're not keeping a record of our last_run_ts. I mentioned in the past that we'll eventually need to track start/end timestamps for the GQL/ETL workflows, e.g.:

|      | Run 1 | Run 2 | Run 3 |
|------|-------|-------|-------|
| Time | 1:00  | 2:00  | 3:00  |

To initially solve this and move us forward (like the OHLCV data factory does), I wrote some logic so that we just query this data from our tables in order to identify the last_run_checkpoint.

To solve this, I had previously proposed simply mutating my_ppss.yaml so that the lake st_ts is updated with the last_run_timestamp, and the pipeline enforces being incremental rather than trying to do it through data inference... but this kind of breaks the pattern for how the yaml file is used (the engine modifying it).

Further, it doesn't provide a way to track the ETL/workflow runs so that they can be rolled back / operated in a systematic way.

Proposed solution (a)

Start tracking job metadata inside a jobs table in DuckDB:

  • id
  • job_name
  • job_start_ts
  • job_end_ts
  • input metadata (filename, table, rows, ppss st_ts, ppss end_ts, slot, whatever)
  • output metadata (filename, table, rows, ppss st_ts, ppss end_ts, slot, whatever)

and use this data to understand how to resume/rollback/operate jobs on the lake.
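
A rough sketch of what (a) could look like with the DuckDB Python client; the table/column names, the `now()` timestamps, and the resume query are illustrative only, not a final schema:

```python
import json
import duckdb

# Minimal sketch, assuming a local DuckDB lake file; names are illustrative.
con = duckdb.connect("lake.duckdb")

con.execute("""
    CREATE TABLE IF NOT EXISTS jobs (
        id BIGINT,
        job_name VARCHAR,
        job_start_ts TIMESTAMP,
        job_end_ts TIMESTAMP,
        input_metadata VARCHAR,   -- JSON string: filename, table, rows, ppss st_ts/end_ts, slot, ...
        output_metadata VARCHAR   -- JSON string: same shape, for what the job produced
    )
""")

def record_job_start(con, job_id, job_name, input_meta):
    """Insert a row when a GQLDF/ETL job starts."""
    con.execute(
        "INSERT INTO jobs (id, job_name, job_start_ts, input_metadata) "
        "VALUES (?, ?, now(), ?)",
        [job_id, job_name, json.dumps(input_meta)],
    )

def record_job_end(con, job_id, output_meta):
    """Mark the job finished and attach its output metadata."""
    con.execute(
        "UPDATE jobs SET job_end_ts = now(), output_metadata = ? WHERE id = ?",
        [json.dumps(output_meta), job_id],
    )

# Resuming then reads the jobs table instead of inferring progress from the data:
last_end_ts = con.execute(
    "SELECT max(job_end_ts) FROM jobs WHERE job_name = 'etl'"
).fetchone()[0]
```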

Proposed solution (b)

Just modify the ppss.yaml lake.st_ts when the job ends.

If the user runs a CLI command to roll back the pipeline, lake.st_ts should also be updated to reflect the state of the lake.

This is a KISS solution that lets us keep the SLA small.
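
A sketch of what (b) amounts to, assuming the lake section of my_ppss.yaml carries an st_ts key; the helper name and timestamp format here are made up for illustration:

```python
import yaml

def bump_lake_st_ts(ppss_path, last_run_timestamp):
    """Rewrite lake.st_ts so the next incremental run starts where this one ended."""
    with open(ppss_path) as f:
        ppss = yaml.safe_load(f)

    ppss["lake"]["st_ts"] = last_run_timestamp

    with open(ppss_path, "w") as f:
        yaml.safe_dump(ppss, f, sort_keys=False)

# e.g. after Run 2 finishes at 2:00:
# bump_lake_st_ts("my_ppss.yaml", "2024-05-01_02:00")
```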

Current solution (c)

Use min & max from pdr_predictions as the checkpoint for where the data has been written to.

All other tables should fetch and update from this marker.
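
For reference, (c) boils down to something like the following data-inference query; the timestamp column name on pdr_predictions is assumed here:

```python
import duckdb

con = duckdb.connect("lake.duckdb")
st_ts, end_ts = con.execute(
    'SELECT min("timestamp"), max("timestamp") FROM pdr_predictions'
).fetchone()

# Other tables then fetch/update only rows within [st_ts, end_ts].
```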

DoD:

Save ETL & workflow metadata to operate the lake.

Tasks

  • Stop using data inference to start/resume GQLDF/ETL jobs
  • Implement another way to manage incremental runs that is easy to operate from the CLI.
@idiom-bytes idiom-bytes added the Type: Enhancement New feature or request label May 1, 2024
@idiom-bytes idiom-bytes changed the title [Lake][Incremental Pipeline] Save workflow metadata into duckdb such that we can more systemically operate the workflow [Lake][Jobs][Incremental Pipeline] Use jobs metadata on duckdb to manage all workflows May 1, 2024
@idiom-bytes idiom-bytes changed the title [Lake][Jobs][Incremental Pipeline] Use jobs metadata on duckdb to manage all workflows [Lake][Jobs][Incremental Pipeline] Use checkpoints and job metadata rather than inferring job progress through data. May 1, 2024

kdetry commented May 2, 2024

I have concerns about solution (b): editing the ppss.yaml file from a script is not expected behaviour.

@idiom-bytes
Member Author

Tracking issue in #1299 and closing this here
