
Allow CWL workflows to have jobs use all of a Slurm node's memory #5052

Merged: 39 commits into master from issues/4971-slurm-node-memory on Oct 14, 2024

Conversation

@adamnovak (Member):

This should fix #4971. To run CWL jobs that lack their own ramMin with a full Slurm node's memory, you would now pass --no-cwl-default-ram --slurmDefaultAllMem=True instead of --defaultMemory 0.

This might cause some new problems:

  • Other internal CWL runner jobs that expected to use the default memory would now use all the memory on their node, if submitted to the cluster.
  • Unless the user passes --no-cwl-default-ram, we now use the CWL spec's required default memory for jobs that don't specify a limit, instead of --defaultMemory. Previously I think we were ignoring the spec and always using Toil's --defaultMemory. This might break some workflow runs that only used to work because we were giving them more memory than the spec called for.

Also, #4971 says we're supposed to implement a real framework for doing this kind of memory expansion across all batch systems that support it. But I didn't want to add a new boolean flag to Requirer for such a specific purpose. If we do need that, we should probably combine it with preemptible into some kind of tag/flag system. Or we could implement memory range requirements and allow the top of the range to be unbounded, or have the Slurm batch system treat some threshold upper limit as "all the node's memory".
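
To illustrate, a run using the two new options together might look like the following (a sketch only; the workflow, input, and job store names are placeholders, and only the two new flags come from this PR):

    toil-cwl-runner --jobStore ./example-job-store \
        --batchSystem slurm \
        --no-cwl-default-ram --slurmDefaultAllMem=True \
        workflow.cwl inputs.yml

CWL jobs in that run that don't set their own ramMin would then be submitted to Slurm with --mem=0 and get their whole node's memory.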

Changelog Entry

To be copied to the draft changelog by merger:

  • Toil now has a --slurmDefaultAllMem option to run jobs lacking their own memory requirements with Slurm's --mem=0, so they get a whole node's memory.
  • toil-cwl-runner now has --no-cwl-default-ram (and --cwl-default-ram) to control whether the CWL spec's default ramMin is applied, or Toil's own default memory logic is used.
  • The --dont_allocate_mem and --allocate_mem options have been deprecated and replaced with --slurmAllocateMem, which can be True or False.
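
As a rough sketch of how these options relate to the underlying Slurm submissions (illustrative only; the exact sbatch command lines Toil builds may differ, and job_script.sh is a placeholder):

    # --slurmDefaultAllMem=True: a job with no memory requirement of its own is
    # submitted with --mem=0, which Slurm interprets as all of the node's memory.
    sbatch --mem=0 ... job_script.sh

    # --slurmAllocateMem=False (replacing the deprecated --dont_allocate_mem):
    # no --mem option is passed to sbatch at all.
    sbatch ... job_script.sh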

Reviewer Checklist

  • Make sure it is coming from issues/XXXX-fix-the-thing in the Toil repo, or from an external repo.
    • If it is coming from an external repo, make sure to pull it in for CI with:
      contrib/admin/test-pr otheruser theirbranchname issues/XXXX-fix-the-thing
      
    • If there is no associated issue, create one.
  • Read through the code changes. Make sure they don't have:
    • Addition of trailing whitespace.
    • New variable or member names in camelCase that want to be in snake_case.
    • New functions without type hints.
    • New functions or classes without informative docstrings.
    • Changes to semantics not reflected in the relevant docstrings.
    • New or changed command line options for Toil workflows that are not reflected in docs/running/{cliOptions,cwl,wdl}.rst
    • New features without tests.
  • Comment on the lines of code where problems exist with a review comment. You can shift-click the line numbers in the diff to select multiple lines.
  • Finish the review with an overall description of your opinion.

Merger Checklist

  • Make sure the PR passes tests.
  • Make sure the PR has been reviewed since its last modification. If not, review it.
  • Merge with the GitHub "Squash and merge" feature.
    • If there are multiple authors' commits, add Co-authored-by to give credit to all contributing authors.
  • Copy its recommended changelog entry to the Draft Changelog.
  • Append the issue number in parentheses to the changelog entry.

@adamnovak changed the title from "Issues/4971 slurm node memory" to "Allow CWL workflows to have jobs use all of a Slurm node's memory" on Aug 8, 2024
@adamnovak (Member Author):

I still need to manually test this to make sure it actually does what it is meant to do.

@adamnovak (Member Author):

I wrote a test for this and it does indeed seem to issue jobs that ask for whole Slurm nodes when I use the two new options together.

I also fixed Slurm job cleanup when a workflow is killed; it wasn't happening before because shutdown() wasn't doing any killing in AbstractGridEngineBatchSystem. I needed this so the test doesn't leave pending jobs behind when no entire cluster node is free.

@adamnovak marked this pull request as ready for review on August 8, 2024 at 20:46
@adamnovak (Member Author):

@DailyDreaming Can you review this?

@DailyDreaming (Member) left a comment:
LGTM. Some minor clean-up.

@@ -131,6 +131,9 @@ There are several environment variables that affect the way Toil runs.
| TOIL_GOOGLE_PROJECTID | The Google project ID to use when generating |
| | Google job store names for tests or CWL workflows. |
+----------------------------------+----------------------------------------------------+
| TOIL_SLURM_ALLOCATE_MEM | Whether to akllocate memory in Slurm with --mem. |
Member:
akllocate

@@ -143,6 +146,8 @@ There are several environment variables that affect the way Toil runs.
| | jobs. |
| | There is no default value for this variable. |
+----------------------------------+----------------------------------------------------+
| TOIL_SLURM_TIME | Slurm job time limit, in [DD-]HH:MM:SS format. |
Member:
An example might be useful (literally adding e.g. TOIL_SLURM_TIME="04:00:00" for 4 hours, assuming that this produces a 4 hour time limit), or a link to SLURM docs if this is a direct alias for a slurm arg.
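
For reference, concrete values in that [DD-]HH:MM:SS format would look like the following (arbitrary example values):

    export TOIL_SLURM_TIME="04:00:00"    # a 4-hour limit
    export TOIL_SLURM_TIME="1-12:00:00"  # 1 day and 12 hours, using the optional DD- prefix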


Besides the normal Toil options and the options supported by cwltool, toil-cwl-runner adds some of its own options:

--bypass-file-store Do not use Toil's file store and assume all paths are accessible in place from all nodes.
Member:
Maybe add a further explanation that files are not redundantly copied to a potentially remote file store, which can speed up the workflow (as a reason why someone might want this). It has the downside that the workflow is not restartable from the file store if it made progress and then failed partway through.

--slurmTime SLURM_TIME
Slurm job time limit, in [DD-]HH:MM:SS format.
--slurmPE SLURM_PE Special partition to send Slurm jobs to if they ask
for more than 1 CPU.
Member:
Could this be explained further?

@@ -417,11 +433,11 @@ def issueBatchJob(self, command: str, jobDesc, job_environment: Optional[Dict[st
else:
gpus = jobDesc.accelerators
@DailyDreaming (Member), Sep 30, 2024:
This code looks like it was maybe left in and is a function now (i.e. self.count_needed_gpus)?

"--runImportsOnWorkers", "--run-imports-on-workers",
action="store_true",
default=False,
help=suppress_help or "Run the file imports on a worker instead of the leader. This is useful if the leader is not optimized for high network performance."
Member:
Space needed at the end of this sentence.

)
parser.add_argument(
"--importWorkersDisk", "--import-workers-disk",
help=suppress_help or "Specify the amount of disk space an import worker will use. If file streaming for input files is not available,"
Member:
Space needed at the end of this sentence.

# Reap it
child.wait()
# The test passes
return
Member:
I think that this return is a no-op?

@adamnovak (Member Author):
I think I put it there to make it clear that we intend to reach that point and end the test. But if you think it makes more sense without the return, I'll drop it.

@adamnovak merged commit 631ae0c into master on Oct 14, 2024
3 checks passed
@adamnovak deleted the issues/4971-slurm-node-memory branch on October 14, 2024 at 19:24
Linked issue that this pull request may close: toil-cwl-runner used to allow --defaultMemory 0, which has special meaning to Slurm, but now no longer does