
[Batch] Add n_max_attempts against infinite retries/preemptions #14682

Open · cjllanwarne wants to merge 14 commits into main
Conversation

@cjllanwarne (Collaborator) commented Sep 12, 2024

Adds:

  • An n_max_attempts field to the client job object (defaulting to 20)
  • An n_max_attempts reader in the front_end, which stores the field into the database
    • And a database column to store it in
  • A change in the batch driver to fail jobs that have already used their maximum allowed number of attempts (a sketch follows this list)
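As a hedged illustration of the client-side shape, a job spec carrying the new field might look like the following (only 'always_run', 'job_group_id', and 'n_max_attempts' appear in this PR's diffs; the surrounding structure is hypothetical):

    # Hypothetical job spec as submitted by the client.
    spec = {
        'job_group_id': 1,
        'always_run': False,
        'n_max_attempts': 20,  # give up after 20 attempts/preemptions
    }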

Some judgement-call justification:

  • Why fail the job on attempt "n+1" instead of during the tidy-up of job "n"?
    • Personally, I prefer a code-driven max-out over adding any more business logic to database procedures.
    • The mark-job-complete action is triggered from the worker VM, not the driver, and I didn't want to add max-out business logic to the worker either.
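A minimal sketch of that driver-side behavior, with hypothetical function and attribute names (the real scheduling path is more involved):

    # Sketch: runs where the driver would otherwise schedule attempt n+1.
    async def maybe_schedule(job):
        if job.n_attempts >= job.n_max_attempts:
            # The job has used all of its allowed attempts; fail it
            # instead of creating another attempt.
            await mark_job_errored(job, reason='reached n_max_attempts')
            return
        await schedule_next_attempt(job)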

@chrisvittal (Collaborator) left a comment:

Looks great. Just a couple of questions.

@@ -0,0 +1 @@
ALTER TABLE jobs ADD COLUMN n_max_attempts INT NOT NULL DEFAULT 404, ALGORITHM=INSTANT;
Collaborator:

Why is this default 404?

@cjllanwarne (Author):

Oops, this was a placeholder while testing this out. Semantically this is slightly different from the defaults in the code (it's the value assigned to pre-existing batches, which had de-facto infinite retries, rather than a default going forward), but for the sake of consistency I think changing this to 20 as well is probably most sensible.
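With that change, the migration would presumably read:

    ALTER TABLE jobs ADD COLUMN n_max_attempts INT NOT NULL DEFAULT 20, ALGORITHM=INSTANT;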

@@ -1134,6 +1135,7 @@ async def _create_jobs(
spec['job_group_id'] = job_group_id

always_run = spec.pop('always_run', False)
n_max_attempts = spec.pop('n_max_attempts', 5)
Collaborator:

If n_max_attempts defaults to 20, why is it 5 here?

@cjllanwarne (Author):

Good spot; this should probably be 20 too.
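i.e., presumably:

    n_max_attempts = spec.pop('n_max_attempts', 20)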

@patrick-schultz (Collaborator):

Could you give some context on the two different places in the driver where we're checking the number of attempts? What are those two places, and why do we need both?

Also, have you done anything to test how much adding joins against the attempts table slows down those queries?

@cjllanwarne (Author):

@patrick-schultz -

  1. Assuming you mean pool.py and job-private.py as "the two different places in the driver": those are two kinds of the driver's more general "instance collection" concept. Pool represents the standard machine-type pools that jobs can be assigned to, and job-private looks after jobs that need their own dedicated (i.e. "job-private") VMs.
  2. I haven't... both that and double-checking the ALGORITHM=INSTANT claims (given how large jobs is) are things I'd like to do. Do you have any ideas on the best way to do that? Maybe I could copy @ehigham's tactic of temporarily duplicating the production DB and running custom queries against it? Or is there an easier/better/cheaper way? (One cheap first step is sketched below.)
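As a hedged aside: assuming the Batch database is MySQL 8.0.18 or later, one cheap first pass before duplicating the production DB is to compare plans and timings with EXPLAIN ANALYZE on a representative query, run with and without the join (the column names below are illustrative, not necessarily the actual Batch schema):

    -- Compare the plan and measured cost with and without the attempts join.
    EXPLAIN ANALYZE
    SELECT jobs.batch_id, jobs.job_id, COUNT(attempts.attempt_id) AS n_attempts
    FROM jobs
    LEFT JOIN attempts
           ON attempts.batch_id = jobs.batch_id
          AND attempts.job_id = jobs.job_id
    GROUP BY jobs.batch_id, jobs.job_id;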
