Add option to write BIDS directly to cloud storage (#35)
- Writes to cloud storage (S3 or GCS) without writing to local disk first.
- The bids() function is overloaded to apply storage() when the generated path contains a remote URI, so adding a gcs:// or s3:// prefix to the root (output folder) config variable is enough to make Snakemake write to a cloud remote.
- Rules that write zarr directly to the remote (without writing locally first) use a touch file as their output and receive the remote URI as a parameter; Snakemake's storage settings and the GCS credentials file also need to be passed through.
- This complicates a couple of other things, e.g. expand() cannot be applied to files carrying the storage tag, so another wrapper, expand_bids(), makes sure storage() is applied only after expanding (see the sketch after this list).
- Helper functions for the fsspec code live in workflow/lib/cloud_io.py. I considered moving them to zarrnii, but they are Snakemake-specific, so they are better kept as helpers in the Snakemake workflow.
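The wrapper code itself is not rendered on this page; the following is a minimal sketch of the idea described above, assuming a `bids()` override and an `expand_bids()` helper defined in the Snakefile (where Snakemake 8 makes the `storage` object available). Names and details are illustrative assumptions, not code copied from the commit.

```python
# Minimal sketch (assumptions, not verbatim from this commit): wrap snakebids.bids()
# so that storage() is applied only when root points at a remote URI, and expand
# before tagging because expand() cannot handle storage()-tagged files.
from snakebids import bids as _bids
from snakemake.io import expand


def bids(root=None, **entities):
    path = _bids(root=root, **entities)
    if str(root).startswith(("gcs://", "s3://")):
        # 'storage' is provided by Snakemake 8 inside the Snakefile namespace
        return storage(path)
    return path


def expand_bids(expand_kwargs, **bids_kwargs):
    # expand first, then tag each expanded path with storage() if remote
    paths = expand(_bids(**bids_kwargs), **expand_kwargs)
    root = str(bids_kwargs.get("root", ""))
    if root.startswith(("gcs://", "s3://")):
        return [storage(p) for p in paths]
    return paths
```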
akhanf authored Jul 21, 2024
1 parent dbefbf6 commit 15a01fc
Showing 16 changed files with 1,345 additions and 94 deletions.
4 changes: 2 additions & 2 deletions .github/workflows/test.yml
@@ -16,7 +16,7 @@ jobs:
         with:
           python-version: '3.11'
       - name: snakemake lint
-        run: poetry run snakemake --lint
+        run: poetry run snakemake --lint --storage-gcs-project placeholder
       - name: snakefmt
         run: poetry run snakefmt --check workflow
   test:
@@ -32,4 +32,4 @@ jobs:
           python-version: ${{ matrix.python-version }}
           install-library: true
       - name: Integration dry-run test
-        run: poetry run snakemake -np
+        run: poetry run snakemake -np --storage-gcs-project placeholder
4 changes: 4 additions & 0 deletions README.md
@@ -9,6 +9,8 @@ A Snakemake workflow for pre-processing single plane illumination microscopy (SP
 
 Takes TIF images (tiled or prestitched) and outputs a validated BIDS Microscopy dataset, with a multi-channel multi-scale OME-Zarr file for each scan, along with downsampled nifti images (in a derivatives folder).
 
+
+
 ## Requirements
 
 - Linux system with Singularity/Apptainer installed
@@ -41,6 +43,8 @@ source venv/bin/activate
 
 Note: The acquisition value must contain either `blaze` or `prestitched`, and defines which workflow will be used. E.g. for LifeCanvas data that is already stitched, you need to include `prestitched` in the acquisition flag.
 
+**New:** Writing output directly to cloud storage is now supported; enable this by using `s3://` or `gcs://` in the `root` variable, to point to a bucket you have write access to.
+
 5. The `config/config.yml` can be edited to customize any workflow parameters. The most important ones are the `root` and `work` variables. The `root` path is where the results will end up, by default this is a subfolder called `bids`. The `work` path is where any intermediate scratch files are produced. By default the files in `work` are deleted after they are no longer needed in the workflow, unless you use the `--notemp` command-line option. The workflow writes a large number of small files in parallel to the `work` folder, so for optimum performance this should be a fast local disk, and not a networked file system (i.e. shared disk).
 
 Note: you can use environment variables when specifying `root` or `work`, e.g. so `work: '$SLURM_TMPDIR` can be used on HPC servers.
4 changes: 3 additions & 1 deletion config/config.yml
@@ -1,9 +1,11 @@
 datasets: 'config/datasets.tsv'
 
 
-root: 'bids'
+root: 'bids' # can use a s3:// or gcs:// prefix to write output to cloud storage
 work: 'work'
 
+remote_creds: '~/.config/gcloud/application_default_credentials.json' #this is needed so we can pass creds to container
 
+write_ome_zarr_direct: True #use this to skip writing the final zarr output to work first and copying afterwards -- useful when work is not a fast local disk
 
 #import wildcards: tilex, tiley, channel, zslice (and prefix - unused)
1,025 changes: 1,020 additions & 5 deletions poetry.lock

Large diffs are not rendered by default.

5 changes: 5 additions & 0 deletions pyproject.toml
@@ -10,6 +10,11 @@ packages = []
 python = ">=3.11,<3.13"
 snakemake = "^8.0.0"
 snakebids = "^0.13.0"
+snakemake-storage-plugin-s3 = "^0.2.11"
+snakemake-storage-plugin-gcs = "^1.0.0"
+gcsfs = "^2024.3.1"
+s3fs = "^2024.3.1"
+universal-pathlib = "^0.2.2"
 
 [tool.poetry.group.dev.dependencies]
 snakefmt = "^0.10.0"
5 changes: 3 additions & 2 deletions workflow/Snakefile
@@ -1,6 +1,6 @@
 import json
-from pathlib import Path
-from snakebids import bids, set_bids_spec
+from upath import UPath as Path
+from snakebids import set_bids_spec
 import pandas as pd
 from collections import defaultdict
 import os
@@ -17,6 +17,7 @@ root = os.path.expandvars(config["root"])
 work = os.path.expandvars(config["work"])
 resampled = Path(root) / "derivatives" / "resampled"
 
+
 # this is needed to use the latest bids spec with the pre-release snakebids
 set_bids_spec("v0_11_0")
 
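For context on the import swap above: `UPath` keeps the familiar `pathlib` API but preserves remote URI schemes, which is what lets the same `Path(root) / ...` expressions work when `root` carries a `gcs://` or `s3://` prefix. A small illustration (the bucket name is a placeholder):

```python
from upath import UPath as Path

local = Path("bids") / "derivatives" / "resampled"
remote = Path("gcs://my-bucket/bids") / "derivatives" / "resampled"

print(local)            # bids/derivatives/resampled
print(remote)           # gcs://my-bucket/bids/derivatives/resampled
print(remote.protocol)  # gcs
```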
32 changes: 32 additions & 0 deletions workflow/lib/cloud_io.py
@@ -0,0 +1,32 @@
from upath import UPath as Path


def is_remote(uri_string):
    """Return True if the URI uses a supported cloud protocol (gcs or s3)."""
    uri = Path(uri_string)
    return uri.protocol in ("gcs", "s3")


def get_fsspec(uri_string, storage_provider_settings=None, creds=None):
    """Return an fsspec filesystem appropriate for the given URI.

    For gcs:// the project comes from Snakemake's storage provider settings and
    the token from the supplied credentials file; for s3:// credentials come
    from the environment; anything else falls back to the local filesystem.
    """
    uri = Path(uri_string)
    if uri.protocol == "gcs":
        from gcsfs import GCSFileSystem

        gcsfs_opts = {
            "project": storage_provider_settings["gcs"].get_settings().project,
            "token": creds,
        }
        fs = GCSFileSystem(**gcsfs_opts)
    elif uri.protocol == "s3":
        from s3fs import S3FileSystem

        s3fs_opts = {"anon": False}
        fs = S3FileSystem(**s3fs_opts)
    elif uri.protocol in ("file", "local", ""):
        # assumed to be a local file
        from fsspec.implementations.local import LocalFileSystem

        fs = LocalFileSystem()
    else:
        raise ValueError(f"unsupported protocol for remote data: {uri.protocol}")
    return fs
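A hedged sketch of how a rule's script might use these helpers when `write_ome_zarr_direct` is enabled: the OME-Zarr store is written straight to the remote URI, and the rule's tracked output is just a local touch file. The field names on the `snakemake` object, the import path, and the zarr calls are assumptions for illustration, not code shown in this commit.

```python
# Illustrative only -- field names on the snakemake object and the import path
# are assumptions; the real rules pass the URI, storage settings, and GCS creds
# as described in the commit message.
from pathlib import Path

import zarr

from lib.cloud_io import get_fsspec, is_remote

uri = snakemake.params.uri  # e.g. "gcs://my-bucket/bids/sub-01/micr/sub-01_SPIM.ome.zarr" (placeholder)

if is_remote(uri):
    fs = get_fsspec(
        uri,
        storage_provider_settings=snakemake.params.storage_provider_settings,
        creds=snakemake.input.creds,
    )
    store = zarr.storage.FSStore(uri, fs=fs, mode="w")
else:
    store = zarr.DirectoryStore(uri)

group = zarr.group(store=store, overwrite=True)
group.attrs["note"] = "written directly to the target store"

# Snakemake tracks completion via a local touch file rather than the remote zarr
Path(snakemake.output.flag).touch()
```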
10 changes: 5 additions & 5 deletions workflow/rules/bids.smk
@@ -5,7 +5,7 @@ rule raw_dataset_desc:
     params:
         dd=config["bids"]["raw"],
     output:
-        json=Path(root) / "dataset_description.json",
+        json=bids_toplevel(root, "dataset_description.json"),
     log:
         "logs/dd_raw.log",
     localrule: True
@@ -18,7 +18,7 @@ rule resampled_dataset_desc:
     params:
         dd=config["bids"]["resampled"],
     output:
-        json=Path(resampled) / "dataset_description.json",
+        json=bids_toplevel(resampled, "dataset_description.json"),
     log:
         "logs/dd_raw.log",
     localrule: True
@@ -31,7 +31,7 @@ rule bids_readme:
     input:
         config["bids"]["readme_md"],
     output:
-        Path(root) / "README.md",
+        bids_toplevel(root, "README.md"),
     log:
         "logs/bids_readme.log",
     localrule: True
@@ -43,7 +43,7 @@ rule bids_samples_json:
     input:
         config["bids"]["samples_json"],
     output:
-        Path(root) / "samples.json",
+        bids_toplevel(root, "samples.json"),
     log:
         "logs/bids_samples_json.log",
     localrule: True
@@ -55,7 +55,7 @@ rule create_samples_tsv:
     params:
         datasets_df=datasets,
     output:
-        tsv=Path(root) / "samples.tsv",
+        tsv=bids_toplevel(root, "samples.tsv"),
     log:
         "logs/bids_samples_tsv.log",
     localrule: True
