
Use Slurm array jobs to limit concurrent extraction jobs #335

Merged
takluyver merged 4 commits from feat/slurm-array-jobs into master
Sep 18, 2024

Conversation

takluyver
Member

When reprocessing a whole bunch of runs at once, submit them as a single Slurm array job, and ask Slurm to limit how many run concurrently. This should mitigate the 'database locked' errors without requiring people to manually batch their jobs and monitor their progress.

The limit is set at 30 concurrent jobs for now. Obviously we can make that configurable.

I've taken the shortcut of using the array task IDs as the run numbers. This avoids having to define a way to communicate 'array task 1 -> run 53', but it does mean that only the run numbers can vary within an array. So the job that updates the variables table in the DB is submitted separately, and if you do 'reprocess all' on a database spanning multiple proposals, each proposal is submitted as its own array, raising the effective limit to N * 30 jobs. That's not ideal, but so far it's largely theoretical.
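To make the idea concrete, here's a minimal sketch (not the actual DAMNIT code; the function name and script name are hypothetical) of building an sbatch `--array` specification where the task IDs are the run numbers and the `%` suffix caps how many tasks run at once:

```python
# Illustrative sketch of the approach described above. Slurm's --array
# option accepts comma-separated values and ranges, with an optional
# "%N" suffix limiting concurrent tasks, e.g. --array=53-55,60%30.

def array_spec(run_numbers, max_concurrent=30):
    """Build an sbatch --array value like '53-55,60%30' from run numbers."""
    runs = sorted(set(run_numbers))
    parts = []
    start = prev = runs[0]
    for r in runs[1:]:
        if r == prev + 1:       # extend the current consecutive range
            prev = r
            continue
        parts.append(f"{start}-{prev}" if prev > start else str(start))
        start = prev = r
    parts.append(f"{start}-{prev}" if prev > start else str(start))
    return ",".join(parts) + f"%{max_concurrent}"

# Hypothetical submission command; inside extract.sh, $SLURM_ARRAY_TASK_ID
# would then directly be the run number to process.
cmd = ["sbatch", f"--array={array_spec([53, 54, 55, 60])}", "extract.sh"]
```

The `%30` suffix is what gives the concurrency limit without any manual batching: Slurm queues all tasks but only dispatches 30 at a time.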

Closes #96.

@takluyver takluyver added the enhancement New feature or request label Sep 13, 2024

codecov bot commented Sep 13, 2024

Codecov Report

Attention: Patch coverage is 88.46154% with 6 lines in your changes missing coverage. Please review.

Project coverage is 75.27%. Comparing base (c295f8d) to head (579743e).
Report is 27 commits behind head on master.

Files with missing lines Patch % Lines
damnit/backend/extraction_control.py 89.79% 5 Missing ⚠️
damnit/gui/main_window.py 0.00% 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master     #335      +/-   ##
==========================================
+ Coverage   74.81%   75.27%   +0.46%     
==========================================
  Files          32       32              
  Lines        4892     4959      +67     
==========================================
+ Hits         3660     3733      +73     
+ Misses       1232     1226       -6     

☔ View full report in Codecov by Sentry.

@takluyver
Member Author

😮‍💨 No, we can't use array task IDs directly as run numbers, because:

The minimum index value is 0. The maximum value is one less than the configuration parameter MaxArraySize.

MaxArraySize            = 1001

So this will need a rework.
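For illustration only (this is not project code), the constraint can be sketched as a quick check, assuming the quoted MaxArraySize of 1001:

```python
# Slurm requires 0 <= task ID < MaxArraySize, so with MaxArraySize = 1001
# (as quoted above), task IDs stop at 1000 while run numbers at a facility
# routinely go far beyond that.
MAX_ARRAY_SIZE = 1001

def usable_as_task_id(run: int, max_array_size: int = MAX_ARRAY_SIZE) -> bool:
    """Can this run number double as a Slurm array task ID?"""
    return 0 <= run < max_array_size

print(usable_as_task_id(53))    # True
print(usable_as_task_id(1500))  # False: outside the allowed task-ID range
```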

@takluyver
Member Author

OK, now it creates one script per run in .tmp under the AMORE directory. The scripts clean themselves up when they run, so we don't get hundreds of tiny files left around. This allows more diverse jobs to be submitted as one array.
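A minimal sketch of that per-run-script scheme (file names, the template, and the helper are hypothetical, not the actual implementation): each task's script is written into `.tmp` and deletes itself as its first action, so nothing accumulates.

```python
# Hedged sketch: one small script per run in a .tmp directory. The script
# removes itself ('rm -- "$0"') as soon as it starts, so hundreds of runs
# don't leave hundreds of files behind.
from pathlib import Path

SCRIPT_TEMPLATE = """#!/bin/bash
rm -- "$0"   # self-cleanup: delete this script as soon as it runs
{command}
"""

def write_run_script(tmp_dir: Path, task_id: int, command: str) -> Path:
    """Write a self-deleting script for one array task; return its path."""
    tmp_dir.mkdir(parents=True, exist_ok=True)
    script = tmp_dir / f"task-{task_id}.sh"
    script.write_text(SCRIPT_TEMPLATE.format(command=command))
    script.chmod(0o700)
    return script
```

The array job can then run something like `.tmp/task-$SLURM_ARRAY_TASK_ID.sh`, which is why arbitrary mixes of jobs can share one array: the task ID only selects a script, and the script carries the actual command.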

@JamesWrigley
Member

I'm increasingly worried by this swathe of code that's entirely untested 😅 Maybe in the future we can add some tests that'll only run on Maxwell. Anyway, LGTM!

@takluyver
Member Author

Good point, I'll try to figure out some ways to test this stuff better. Thanks for the review!

@takluyver takluyver merged commit 2548d5e into master Sep 18, 2024
6 checks passed
@takluyver takluyver deleted the feat/slurm-array-jobs branch September 18, 2024 08:06