
[Lake][Jobs][Incremental Pipeline] Use checkpoints and job metadata rather than inferring job progress through data. #983

Closed
idiom-bytes opened this issue May 1, 2024 · 2 comments
Labels
Type: Enhancement New feature or request


idiom-bytes commented May 1, 2024

Problem/motivation

I also wrote this out somewhere... but basically, we're not keeping a record of our last_run_ts. I mentioned in the past that we'll eventually need to track start/end timestamps for the GQL/ETL workflows, e.g.:

|      | Run 1 | Run 2 | Run 3 |
|------|-------|-------|-------|
| Time | 1:00  | 2:00  | 3:00  |

To initially solve this and move us forward (like the OHLCV data factory does), I wrote some logic so that we just query this data from our tables in order to identify the last_run_checkpoint.

To solve this, I had previously proposed simply mutating my_ppss.yaml so that the lake st_ts is updated with the last_run_timestamp, and the pipeline enforces being incremental rather than trying to do it through data inference... but this kind of breaks the pattern for how the yaml file is used (the engine modifying it).

Further, it doesn't provide a way to track the ETL/workflow runs so that they can be rolled back / operated in a systematic way.

Proposed solution (a)

Start tracking job metadata inside a jobs table in DuckDB:

  • id
  • job_name
  • job_start_ts
  • job_end_ts
  • input metadata (filename, table, rows, ppss st_ts, ppss end_ts, slot, whatever)
  • output metadata (filename, table, rows, ppss st_ts, ppss end_ts, slot, whatever)

and use this data to understand how to resume/rollback/operate jobs on the lake.
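
A rough sketch of what (a) could look like with the DuckDB Python client; the table/column names, the `now()` timestamps, and the resume query are illustrative only, not a final schema:

```python
import json
import duckdb

# Minimal sketch, assuming a local DuckDB lake file; names are illustrative.
con = duckdb.connect("lake.duckdb")

con.execute("""
    CREATE TABLE IF NOT EXISTS jobs (
        id BIGINT,
        job_name VARCHAR,
        job_start_ts TIMESTAMP,
        job_end_ts TIMESTAMP,
        input_metadata VARCHAR,   -- JSON string: filename, table, rows, ppss st_ts/end_ts, slot, ...
        output_metadata VARCHAR   -- JSON string: same shape, for what the job produced
    )
""")

def record_job_start(con, job_id, job_name, input_meta):
    """Insert a row when a GQLDF/ETL job starts."""
    con.execute(
        "INSERT INTO jobs (id, job_name, job_start_ts, input_metadata) "
        "VALUES (?, ?, now(), ?)",
        [job_id, job_name, json.dumps(input_meta)],
    )

def record_job_end(con, job_id, output_meta):
    """Mark the job finished and attach its output metadata."""
    con.execute(
        "UPDATE jobs SET job_end_ts = now(), output_metadata = ? WHERE id = ?",
        [json.dumps(output_meta), job_id],
    )

# Resuming then reads the jobs table instead of inferring progress from the data:
last_end_ts = con.execute(
    "SELECT max(job_end_ts) FROM jobs WHERE job_name = 'etl'"
).fetchone()[0]
```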

Proposed solution (b)

Just modify the ppss.yaml lake.st_ts when the job ends.

If the user runs a CLI command to roll back the pipeline, lake.st_ts should also be updated to reflect the state of the lake.

This is a KISS solution that lets us keep the SLA small.
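
A sketch of what (b) amounts to, assuming the lake section of my_ppss.yaml carries an st_ts key; the helper name and timestamp format here are made up for illustration:

```python
import yaml

def bump_lake_st_ts(ppss_path, last_run_timestamp):
    """Rewrite lake.st_ts so the next incremental run starts where this one ended."""
    with open(ppss_path) as f:
        ppss = yaml.safe_load(f)

    ppss["lake"]["st_ts"] = last_run_timestamp

    with open(ppss_path, "w") as f:
        yaml.safe_dump(ppss, f, sort_keys=False)

# e.g. after Run 2 finishes at 2:00:
# bump_lake_st_ts("my_ppss.yaml", "2024-05-01_02:00")
```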

Current solution (c)

Use min & max from pdr_predictions as the checkpoint for where the data has been written to.

All other tables should fetch and update from this marker.
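
For reference, (c) boils down to something like the following data-inference query; the timestamp column name on pdr_predictions is assumed here:

```python
import duckdb

con = duckdb.connect("lake.duckdb")
st_ts, end_ts = con.execute(
    'SELECT min("timestamp"), max("timestamp") FROM pdr_predictions'
).fetchone()

# Other tables then fetch/update only rows within [st_ts, end_ts].
```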

DoD:

Save ETL & workflow metadata to operate the lake.

Tasks

  • Stop using data inference to start/resume GQLDF/ETL jobs
  • Implement another way to manage incremental runs that is easy to operate from the CLI.
@idiom-bytes idiom-bytes added the Type: Enhancement New feature or request label May 1, 2024
@idiom-bytes idiom-bytes changed the title [Lake][Incremental Pipeline] Save workflow metadata into duckdb such that we can more systemically operate the workflow [Lake][Jobs][Incremental Pipeline] Use jobs metadata on duckdb to manage all workflows May 1, 2024
@idiom-bytes idiom-bytes changed the title [Lake][Jobs][Incremental Pipeline] Use jobs metadata on duckdb to manage all workflows [Lake][Jobs][Incremental Pipeline] Use checkpoints and job metadata rather than inferring job progress through data. May 1, 2024

kdetry commented May 2, 2024

I have concerns about solution (b): editing the ppss.yaml file from a script is not expected behaviour.

@idiom-bytes
Member Author

Tracking issue in #1299 and closing this here
