Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support more methods of WDL task disk specification #5001

Merged
merged 78 commits into from
Oct 10, 2024

Conversation

stxue1
Copy link
Contributor

@stxue1 stxue1 commented Jun 29, 2024

Closes #4995

Changelog Entry

To be copied to the draft changelog by merger:

  • Support WDL disk specification as per 1.1 spec

Reviewer Checklist

  • Make sure it is coming from issues/XXXX-fix-the-thing in the Toil repo, or from an external repo.
    • If it is coming from an external repo, make sure to pull it in for CI with:
      contrib/admin/test-pr otheruser theirbranchname issues/XXXX-fix-the-thing
      
    • If there is no associated issue, create one.
  • Read through the code changes. Make sure that it doesn't have:
    • Addition of trailing whitespace.
    • New variable or member names in camelCase that want to be in snake_case.
    • New functions without type hints.
    • New functions or classes without informative docstrings.
    • Changes to semantics not reflected in the relevant docstrings.
    • New or changed command line options for Toil workflows that are not reflected in docs/running/{cliOptions,cwl,wdl}.rst
    • New features without tests.
  • Comment on the lines of code where problems exist with a review comment. You can shift-click the line numbers in the diff to select multiple lines.
  • Finish the review with an overall description of your opinion.

Merger Checklist

  • Make sure the PR passes tests.
  • Make sure the PR has been reviewed since its last modification. If not, review it.
  • Merge with the Github "Squash and merge" feature.
    • If there are multiple authors' commits, add Co-authored-by to give credit to all contributing authors.
  • Copy its recommended changelog entry to the Draft Changelog.
  • Append the issue number in parentheses to the changelog entry.

Copy link
Member

@adamnovak adamnovak left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we might need to go back and get clarification from the WDL folks before we can implement this properly.

The spec talks about "persistent volumes", but doesn't really explain what those are or which tasks would expect to be able to read which other tasks' writes. The implementation here doesn't actually provide any kind of persistence that I can see, unless running somewhere where the worker nodes have persistent filesystems.

It's not really clear to me whether we're meant to be mounting arbitrary host paths into the containers, or doing something more like Docker volumes. There's a requirement for the host-side path to exist but no other evidence that the task would be able to expect to actually access anything at that host-side path, and the execution engine is somehow responsible for making the volumes be the right size, which is impossible if it is just meant to mount across whatever's already there.

Can we dig up any workflows that genuinely use the mountpoint feature as more than just a test case, to see how they expect it to behave? Can we find or elicit any more explanation from the spec authors as to what the mount point feature is meant to accomplish?

It also might not really make sense to implement this on our own in Toil without some support from MiniWDL. We rely on MiniWDL for ordering up an appropriate container given the runtime spec, and unless we need to hook it into Toil's job requirements logic or make the batch system do special things with batch-system-specific persistent storage, it would be best if we could just get this feature for free when it shows up in MiniWDL.

Comment on lines 1805 to 1807
if not os.path.exists(part_mount_point):
# this isn't a valid mount point
raise NotImplementedError(f"Cannot use mount point {part_mount_point} as it does not exist")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is what the spec says we're supposed to require (the mount point needs to exist on the "host system"), but I've read the disks section of the spec twice and that requirement doesn't really make any sense, because we're mounting storage into the container. The spec doesn't say we actually do anything with this path on the host.

Comment on lines 2241 to 2285
singularity_original_prepare_mounts = task_container.prepare_mounts

def patch_prepare_mounts_singularity() -> List[Tuple[str, str, bool]]:
"""
Mount the mount points specified from the disk requirements.

The singularity and docker patch are separate as they have different function signatures
"""
# todo: support AWS EBS/Kubernetes persistent volumes
# this logic likely only works for local clusters as we don't deal with the size of each mount point
mounts: List[Tuple[str, str, bool]] = singularity_original_prepare_mounts()
# todo: support AWS EBS/Kubernetes persistent volumes
# this logic likely only works for local clusters as we don't deal with the size of each mount point
for mount_point, _ in self._mount_spec.items():
abs_mount_point = os.path.abspath(mount_point)
mounts.append((abs_mount_point, abs_mount_point, True))
return mounts
task_container.prepare_mounts = patch_prepare_mounts_singularity # type: ignore[method-assign]
elif isinstance(task_container, SwarmContainer):
docker_original_prepare_mounts = task_container.prepare_mounts

try:
# miniwdl depends on docker so this should be available but check just in case
import docker
# docker stubs are still WIP: https://github.com/docker/docker-py/issues/2796
from docker.types import Mount # type: ignore[import-untyped]

def patch_prepare_mounts_docker(logger: logging.Logger) -> List[Mount]:
"""
Same as the singularity patch but for docker
"""
mounts: List[Mount] = docker_original_prepare_mounts(logger)
for mount_point, _ in self._mount_spec.items():
abs_mount_point = os.path.abspath(mount_point)
mounts.append(
Mount(
abs_mount_point.rstrip("/").replace("{{", '{{"{{"}}'),
abs_mount_point.rstrip("/").replace("{{", '{{"{{"}}'),
type="bind",
)
)
return mounts
task_container.prepare_mounts = patch_prepare_mounts_docker # type: ignore[method-assign]
except ImportError:
logger.warning("Docker package not installed. Unable to add mount points.")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure it makes sense to add the ability to make these mounts in Toil as a monkey-patch. There's nothing here that wouldn't make just as much sense in MiniWDL (or more, since MiniWDL actually has a shared filesystem as an assumption), so instead of doing monkey-patches we should PR this machinery to MiniWDL.

Or if we're starting to add multiple monkey-patches to the TaskContainers, maybe we really want to extend them instead?

Copy link
Contributor Author

@stxue1 stxue1 Aug 23, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It probably is a good idea to PR this machinery to MiniWDL. The main reason why I had to monkeypatch this in the first place is because MiniWDL doesn't actually support the disks runtime attribute. Instead of PR-ing the code one-to-one, I think a PR that adds functionality to extend the list of mount points for both docker and singularity is best

# this logic likely only works for local clusters as we don't deal with the size of each mount point
for mount_point, _ in self._mount_spec.items():
abs_mount_point = os.path.abspath(mount_point)
mounts.append((abs_mount_point, abs_mount_point, True))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How sure are you that the spec actually means you're supposed to mount this particular host path through to the container?

It does say that the host path is required to exist. But it also says a lot about mounting "persistent volumes" at these mount points, and making them the required size. It doesn't seem to say anything about mounting those host paths specifically into the container.

What if you ask for a 100 GiB /tmp mount, and /tmp exists on the host, but /tmp on the host is only 10 GiB and Toil is actually doing all its work on a much larger /scratch/tmp? Shouldn't you get a 100 GiB /tmp mount in the container that actually refers to some location in /scratch/tmp on the host?

If the mount point feature was really meant to mount particular host paths across, wouldn't it take both a host-side path and a container-side path like the actual underlying mount functionality uses?

@adamnovak
Copy link
Member

@stxue1 Don't we now have the clarification we need so that we can finish this? I think the WDL spec is being revised to make it clearer that the mounts are meant to request so much storage available at such-and-such a path in the container, and that they are not actually meant to mount specific paths into the container. But we still need the changes in this PR adding array-of-mounts support.

Copy link
Member

@adamnovak adamnovak left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks better but I think there are still ways to break it.

If the parsing was broken out into its own function with tests for its behavior on both allowed and disallowed disk specification sets, it would be easier to convincingly say that the implementation is correct.

src/toil/wdl/wdltoil.py Outdated Show resolved Hide resolved
@@ -1979,15 +1984,23 @@ def run(self, file_store: AbstractFileStore) -> Promised[WDLBindings]:

per_part_size = convert_units(part_size, part_suffix)
total_bytes += per_part_size
if specified_mount_point is not None:
if mount_spec.get(specified_mount_point) is not None:
if mount_spec.get(specified_mount_point) is not None:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This check doesn't account for how local-disk and no mount point mean the same thing, so will still let you specify both.

If there was a test for this validation logic, that would catch the problem.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I mapped local-disk and no mount point to the same thing in a conditional branch below. I changed the logic a bit so it's more clear and so there's only one conditional branch that does this duplication check.

src/toil/wdl/wdltoil.py Outdated Show resolved Hide resolved
Copy link
Member

@adamnovak adamnovak left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this looks good now.

Comment on lines +2509 to +2513
source_location = os.path.join(tmpdir, str(uuid.uuid4()))
os.mkdir(source_location)
if mount_target is not None:
# None represents an omitted mount point, which will default to the task's work directory. MiniWDL's internals will mount the task's work directory by itself
mount_src_mapping[mount_target] = source_location
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This makes a directory for omitted mount points even if it never uses them.

We also sort of assume Docker/Singularity is pulling from the same storage we are for /, but never really enforce/check that.

@stxue1 stxue1 enabled auto-merge (squash) October 9, 2024 19:26
@stxue1 stxue1 merged commit 4514bfa into master Oct 10, 2024
1 check passed
@stxue1 stxue1 deleted the issues/4995-disk-spec-wdl branch October 10, 2024 05:51
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Allow array of disk specifications for WDL
2 participants