Allow deletion of older runs' records for scheduled reconciliation tasks #157

Open
jiawen-tw opened this issue Mar 21, 2022 · 2 comments
Labels
enhancement New feature or request good first issue Good for newcomers

Comments

jiawen-tw commented Mar 21, 2022

Context / Goal

Each reconciliation run generates many reconciliation run records in the database. Specifically, each run adds as many rows to the reconciliation_record table as there are migration keys in the dataset.

With each new reconciliation run, the results of older runs become less meaningful and are less likely to be accessed by users.

For regularly scheduled runs, this accumulates a large amount of data in the database, which can incur significant fees over time.

Expected Outcome

  • Provide a configuration option for the @scheduled reconciliation task that allows users to delete runs older than X, regardless of dataset
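
As a rough illustration of "older than X regardless of dataset", here is a minimal Python sketch. The issue does not say what form X takes, so this assumes X is a maximum age; the `Run` type, its field names, and `runs_older_than` are hypothetical and not taken from the project.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import List, Optional


@dataclass(frozen=True)
class Run:
    # Hypothetical shape of a reconciliation run; field names are assumptions.
    id: int
    dataset_id: str
    completed_time: Optional[datetime]  # None while the run is still in progress


def runs_older_than(runs: List[Run], max_age: timedelta, now: datetime) -> List[Run]:
    """Return completed runs whose completion time is at or before now - max_age.

    The dataset id is deliberately not consulted, so the same cutoff
    applies to every dataset.
    """
    cutoff = now - max_age
    return [
        run
        for run in runs
        if run.completed_time is not None and run.completed_time <= cutoff
    ]
```

Runs that have not completed are never selected, since there is no completion time to compare against the cutoff.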

Out of Scope

Additional context / implementation notes

@jiawen-tw jiawen-tw added the enhancement New feature or request label Mar 21, 2022
@chadlwilson chadlwilson added the good first issue Good for newcomers label Apr 28, 2022
@jordan396-tw jordan396-tw self-assigned this May 6, 2022

aditi-agarwal-tw commented May 12, 2022

A few questions:

  1. Will X be a timestamp field?
  2. Will the older runs be cleaned up only if the current run is a success? My worry is that if I have multiple consecutive failed runs and X is not chosen wisely, I could end up with no history of successful runs in my db.
  3. The schedule config is currently at the dataset level, so should X also be applied at the dataset level? What does "regardless of dataset" mean?
  4. Given that schedule is an optional config, if it is applied after some manual runs, will the cleanup also remove the manual runs before X, given that there is no way to differentiate manual and scheduled runs?
  5. Should the cleanup job compare X with the completedTime to determine which runs to remove?
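
To make the concern in question 2 concrete, here is a small Python sketch of one possible guard: a purely age-based cleanup, except that the most recent successful run of each dataset is always kept. This is only an illustration of the worry, not something decided in this thread; `RunSummary`, its fields, and `deletable_run_ids` are hypothetical names.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Dict, List


@dataclass(frozen=True)
class RunSummary:
    # Hypothetical summary of a completed run; field names are assumptions.
    id: int
    dataset_id: str
    completed_time: datetime
    succeeded: bool


def deletable_run_ids(runs: List[RunSummary], cutoff: datetime) -> List[int]:
    """Ids of runs completed at or before cutoff, sparing each dataset's
    latest successful run so a streak of failures cannot erase all
    successful history."""
    latest_success: Dict[str, RunSummary] = {}
    for run in runs:
        if not run.succeeded:
            continue
        best = latest_success.get(run.dataset_id)
        if best is None or run.completed_time > best.completed_time:
            latest_success[run.dataset_id] = run

    protected = {run.id for run in latest_success.values()}
    return [
        run.id
        for run in runs
        if run.completed_time <= cutoff and run.id not in protected
    ]
```

With one successful run followed by two failed runs, a cutoff that covers the first two runs would select only the older failed run here, whereas a plain age check would also remove the only successful one.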


aditi-agarwal-tw commented May 19, 2022

Tasks to be done:

  • Add a new optional config to the schedule object: a timestamp field. For now it is ignored, and it is only valid for schedules with a cron expression.
  • If the run is a scheduled run and is successful, call a handler for deleting records. The handler does nothing for now.
  • Implement the handler to check the validity of the timestamp field: throw an exception if invalid, do nothing if valid.
  • Implement the handler to delete records from the reconciliation run db, with cascade, for runs <= X, comparing based on completed time.
  • Write integration tests.
  • Modify the API to accept this parameter in the schedule request.
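
The handler steps above could look roughly like the following Python sketch, which uses an in-memory SQLite database purely for illustration. Only the `reconciliation_record` table name comes from this issue; the `reconciliation_run` table, all column names, the epoch-seconds storage, and both function names are assumptions. The project's real schema and persistence layer may differ.

```python
import sqlite3
from datetime import datetime, timezone

# Hypothetical schema: records reference their run and are removed with it.
SCHEMA = """
CREATE TABLE reconciliation_run (
    id INTEGER PRIMARY KEY,
    dataset_id TEXT NOT NULL,
    completed_time INTEGER          -- epoch seconds (UTC); NULL while running
);
CREATE TABLE reconciliation_record (
    id INTEGER PRIMARY KEY,
    reconciliation_run_id INTEGER NOT NULL
        REFERENCES reconciliation_run(id) ON DELETE CASCADE,
    migration_key TEXT NOT NULL
);
"""


def open_db() -> sqlite3.Connection:
    conn = sqlite3.connect(":memory:")
    # SQLite only enforces foreign keys (and cascades) when this is enabled.
    conn.execute("PRAGMA foreign_keys = ON")
    conn.executescript(SCHEMA)
    return conn


def delete_runs_completed_before(conn: sqlite3.Connection, cutoff: datetime) -> int:
    """Delete runs with completed_time <= cutoff and return how many runs went.

    Raises ValueError for a timezone-naive cutoff, standing in for the
    "throw exception if invalid" step.
    """
    if cutoff.tzinfo is None:
        raise ValueError("cutoff must be timezone-aware")
    cutoff_seconds = int(cutoff.timestamp())
    with conn:  # commits on success, rolls back on error
        cursor = conn.execute(
            "DELETE FROM reconciliation_run "
            "WHERE completed_time IS NOT NULL AND completed_time <= ?",
            (cutoff_seconds,),
        )
    # rowcount counts the deleted runs only, not the cascaded record rows.
    return cursor.rowcount
```

Letting the database cascade the delete keeps the handler to a single statement; runs still in progress are skipped because their completed time is NULL.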


4 participants