
Generalise submission and cancellation arguments #641

Open · wants to merge 18 commits into main
Conversation

AlecThomson

Closes #640

Member

@jacobtomlinson jacobtomlinson left a comment


This looks fine to me! I do wonder if we should expose cancel_command as a keyword argument on the SLURMCluster and pass that down to SLURMJob, in case users want to override it with something else.
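To illustrate the suggestion, here is a minimal, dependency-free sketch of what passing a `cancel_command` keyword from `SLURMCluster` down to `SLURMJob` could look like. This is not the dask-jobqueue implementation (the real classes take many more arguments and go through `JobQueueCluster` machinery); it only shows the pass-through pattern being proposed.

```python
# Simplified sketch: a cluster-level cancel_command override forwarded to
# the jobs the cluster creates. Class and attribute names mirror the PR
# discussion, but the bodies here are illustrative assumptions.

class SLURMJob:
    # Default cancel command for SLURM; a user-supplied value overrides it.
    cancel_command = "scancel"

    def __init__(self, cancel_command=None):
        if cancel_command is not None:
            self.cancel_command = cancel_command


class SLURMCluster:
    job_cls = SLURMJob

    def __init__(self, cancel_command=None, **job_kwargs):
        # Remember the override (if any) and hand it to every new job.
        self._job_kwargs = dict(job_kwargs)
        if cancel_command is not None:
            self._job_kwargs["cancel_command"] = cancel_command

    def new_job(self):
        return self.job_cls(**self._job_kwargs)


cluster = SLURMCluster(cancel_command="scancel --signal=SIGTERM")
job = cluster.new_job()
print(job.cancel_command)  # scancel --signal=SIGTERM
```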

@AlecThomson
Author

I found that the HTCondor class already had something similar. I've added this to the base Job class in core. The downside is that this adds some extra boilerplate.

I've reworked the HTCondorJob and SLURMJob to make use of the new functionality.
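A rough sketch of the base-class approach described here: the generic `Job` class handles the "extra arguments" boilerplate once, and scheduler-specific subclasses only declare their own defaults. The attribute name `submit_command_extra` matches the diff shown in this thread, but the method bodies below are illustrative assumptions, not the library's actual code.

```python
# Hypothetical consolidation of submit/cancel extras in a base Job class,
# with SLURM and HTCondor subclasses inheriting the shared handling.

class Job:
    submit_command = ""
    cancel_command = ""

    def __init__(self, submit_command_extra=None, cancel_command_extra=None):
        # Shared boilerplate lives here instead of in every subclass.
        self.submit_command_extra = list(submit_command_extra or [])
        self.cancel_command_extra = list(cancel_command_extra or [])

    def full_submit_command(self):
        return " ".join([self.submit_command, *self.submit_command_extra]).strip()

    def full_cancel_command(self):
        return " ".join([self.cancel_command, *self.cancel_command_extra]).strip()


class SLURMJob(Job):
    submit_command = "sbatch"
    cancel_command = "scancel"


class HTCondorJob(Job):
    submit_command = "condor_submit"
    cancel_command = "condor_rm"


job = SLURMJob(cancel_command_extra=["--signal=SIGTERM"])
print(job.full_cancel_command())  # scancel --signal=SIGTERM
```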

@AlecThomson AlecThomson changed the title Graceful Slurm job cancellation Generalise submission and cancellation arguments May 25, 2024
@@ -54,6 +55,28 @@ def __init__(
use_stdin = dask.config.get("jobqueue.%s.use-stdin" % self.config_name)
self.use_stdin = use_stdin

if self.submit_command_extra is None:
Member


Why is all this code both in the Core class and the class inheriting it?

@guillaumeeb
Member

Also, the test failures in slurm (and maybe others) look related to this change.

@jacobtomlinson
Member

CI is now mostly happy on main, so I've merged it in so we can see up-to-date CI failures.

@AlecThomson
Author

Sorry for the long lead time on this, everyone. I got myself tripped up between the methods on the base and inheriting classes, as noted in @guillaumeeb's comment. I believe I've got this all sorted now. Hopefully the CI will catch any lingering issues.

@jacobtomlinson
Member

Thanks @AlecThomson! Looks like there are some linting issues (make sure you run pre-commit install) and some slurm issues.

@AlecThomson
Author

Hmm - looks like some kind of timing error on the test. I don't quite understand why it's failing... 🤔

>                   assert time() < start + QUEUE_WAIT
E                   assert 1723021595.3378592 < (1723021535.2718754 + 60)
E                    +  where 1723021595.3378592 = time()

https://github.com/dask/dask-jobqueue/actions/runs/10278490598/job/28450471395#step:7:425

@jacobtomlinson
Member

jacobtomlinson commented Aug 7, 2024

It is calling cluster.scale(n) and then waiting for the cluster to scale. The time assertion is just a timeout, so it's not scaling to the correct number in the time allowed.

Note: We don't use client.wait_for_workers(n) because that checks for "at least n workers" so doesn't wait when scaling down (xref dask/distributed#6374).
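The poll-until-exact-count pattern described here can be sketched as below. This is not the actual dask-jobqueue test; `wait_for_exact_workers` and the toy counter are hypothetical names invented for illustration, and the timeout assertion mirrors the `time() < start + QUEUE_WAIT` check in the failing output above.

```python
import time

QUEUE_WAIT = 60  # seconds, matching the timeout in the failing assertion

def wait_for_exact_workers(get_count, n, timeout=QUEUE_WAIT, poll=0.1):
    # Poll until the worker count equals n exactly. Unlike an
    # "at least n workers" check, this also waits when scaling down,
    # and fails via the assertion if the timeout elapses first.
    start = time.time()
    while get_count() != n:
        assert time.time() < start + timeout, "cluster did not reach size"
        time.sleep(poll)

# Toy stand-in for a cluster that sheds one worker per poll.
workers = [4]
def fake_count():
    if workers[0] > 2:
        workers[0] -= 1
    return workers[0]

wait_for_exact_workers(fake_count, 2, timeout=5)
print(workers[0])  # 2
```

If scaling stalls (say the scheduler never removes the last worker), the loop trips the same kind of timeout assertion seen in the CI log, with the elapsed wall-clock time exceeding `start + QUEUE_WAIT`.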

Development

Successfully merging this pull request may close these issues.

More graceful job cancellation
3 participants