Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support for consolidated remote zarr #278

Merged
merged 22 commits into from
Jul 14, 2023
Merged

Conversation

berombau
Copy link
Contributor

Closes #275

  • Uses ._store.path and Path to get to subelements.
  • Ignores hidden files

@berombau berombau changed the title Support for consolidated remote zarr WIP: Support for consolidated remote zarr May 23, 2023
@codecov
Copy link

codecov bot commented May 24, 2023

Codecov Report

Merging #278 (7e1e57b) into main (cdddb8b) will decrease coverage by 0.16%.
The diff coverage is 80.85%.

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #278      +/-   ##
==========================================
- Coverage   90.75%   90.59%   -0.16%     
==========================================
  Files          36       36              
  Lines        4630     4670      +40     
==========================================
+ Hits         4202     4231      +29     
- Misses        428      439      +11     
Impacted Files Coverage Δ
src/spatialdata/__main__.py 0.00% <0.00%> (ø)
src/spatialdata/_core/spatialdata.py 93.60% <83.33%> (+0.02%) ⬆️
src/spatialdata/_io/io_zarr.py 90.29% <90.14%> (-5.49%) ⬇️
src/spatialdata/_io/io_points.py 100.00% <100.00%> (ø)
src/spatialdata/_logging.py 93.75% <100.00%> (+0.89%) ⬆️

... and 1 file with indirect coverage changes

@berombau berombau changed the title WIP: Support for consolidated remote zarr Support for consolidated remote zarr May 24, 2023
@berombau
Copy link
Contributor Author

I will test it with napari-spatialdata for the consolidated version at https://dl01.irc.ugent.be/spatial/cosmx/data.zarr/, but someone with an S3 bucket should test it with s3://, maybe there is still a bug there (@LucaMarconato).

@LucaMarconato
Copy link
Member

Thanks @berombau this is amazing! 🚀

Copy link
Collaborator

@kevinyamauchi kevinyamauchi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey @berombau ! This is super nice. I left a couple of very minor comments below. I think you can wait until @LucaMarconato does a full review to address them. I just wanted to write them down while I was having a look.

Thanks!

src/spatialdata/_io/io_zarr.py Outdated Show resolved Hide resolved
src/spatialdata/_io/io_zarr.py Outdated Show resolved Hide resolved
@kevinyamauchi
Copy link
Collaborator

cc @cavenel (since we were discussing consolidated metadata today)

@berombau
Copy link
Contributor Author

berombau commented Jun 8, 2023

fyi python -m spatialdata peek https://dl01.irc.ugent.be/spatial/cosmx/data.zarr/ takes a minute, downloading and serving locally is faster with python -m http.server -b 0.0.0.0 and python -m spatialdata peek http://0.0.0.0:8000/cosmx/data.zarr

Copy link
Member

@giovp giovp left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks a lot for this @berombau ! Approve in principle but this doesn't work for me,
python -m spatialdata peek https://dl01.irc.ugent.be/spatial/cosmx/data.zarr

Error: .zarr storage not found at https://dl01.irc.ugent.be/spatial/cosmx/data.zarr. Please specify a valid OME-NGFF spatial data (.zarr) file. Example "python -m spatialdata peek data.zarr"

although I can click on the link and take a look at the HTML. Do you happen of any idea why? I will try from another network just in case tomorrow

@berombau
Copy link
Contributor Author

I committed a change to the CLI a forgot to push. This should work now and is also only 10 seconds instead of minutes: python -m spatialdata peek https://dl01.irc.ugent.be/spatial/mibitof/data.zarr.

@LucaMarconato
Copy link
Member

LucaMarconato commented Jul 13, 2023

TLDR; I think it's ready to merge, I'll just wait for someone to have a look at my changes.

@berombau thanks for the PR (and sorry for the long silence on this). I tried it with the two urls you gave and the functionality is super cool!

Extended comments below.

Minor

  • I added docstrings for the selection parameters in the CLI and in the SpatialData.read and read_zarr functions.
  • I noticed that in the mibitof dataset what takes most of the time is loading the (small) table. There will have room to improve this in the future.
  • even with this PR napari doesn't work out-of-the-box. I think it's not much work to make it work with images, labels and points, we can circle back to this in a future PR.

Bug with .zmetadata

I tried to use this on the mibitof dataset, after I reuploaded it in S3 with the consolidated metadata but it didn’t work (it showed an empty dataset without any error). The reason was a bug in zarr zarr-developers/zarr-python#1121. In short zmetadata is written to disk instead of .zmetadata when the file separator is /, so when reading it fails. I believe that you didn't encounter the problem because the files present in the url that you used are "discoverable", while in a S3 storage one "can't do ls".

Bug in anndata._io.read_zarr

It looks like that anndata doesn't support reading remote zarr locations when the store is not "ls discoverable" (so again, in your case it works, but in my S3 storage it didn't. Therefore, I disabled reading the table for this last case and I will report the bug in the anndata repo.

To run the code

This code can be used to test the functionality (on the mibitof dataset, which has images, labels and the table), uploaded both in S3 and in a "ls-discoverable" network storage.

import spatialdata as sd
# benjamin's url (s3, ls discoverable)
f0 = 'https://dl01.irc.ugent.be/spatial/mibitof/data.zarr'
# luca's url (s3, non-ls discoverable)
f1 = 'https://s3.embl.de/spatialdata/test_remote/data.zarr'

sdata0 = sd.SpatialData.read(f0)
print(sdata0)

sdata1 = sd.SpatialData.read(f1)
print(sdata1)

or via terminal

python -m spatialdata peek https://dl01.irc.ugent.be/spatial/mibitof/data.zarr
python -m spatialdata peek https://s3.embl.de/spatialdata/test_remote/data.zarr

Work for the future

  • In the future we can build some tests working with remote data (we can use some datasets that I will make available online soon and that are generated nightly).
  • We can use that occasion to improve performance and to test also points on S3.
  • Note that the PR doesn't support shapes, when dask-geopandas will be available for loading shapes we will probably be able to adapt the current PR also for shapes.

Final notes

This said, I think the PR is good to merge. I wait for @giovp or @kevinyamauchi have one last look at my updates and then we can merge.

@LucaMarconato
Copy link
Member

Reported the bug in the anndata repo: scverse/anndata#1056.

Copy link
Member

@giovp giovp left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

minor nitpick on type and a suggestion that might not be the exact same 😅

src/spatialdata/__main__.py Outdated Show resolved Hide resolved
src/spatialdata/__main__.py Show resolved Hide resolved
src/spatialdata/_core/spatialdata.py Show resolved Hide resolved
path = os.path.join(f._store.path, f.path, "points.parquet")
# cache on remote file needed for parquet reader to work
# TODO: allow reading in the metadata without caching all the data
table = read_parquet("simplecache::" + path if "http" in path else path)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

googled this cause i didn't know what it should do and found this option catalyst-cooperative/pudl#1496 (comment) and intake/intake-parquet#18 (comment) , maybe better?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we would need to benchmark that. I'd do that in a follow up PR. @berombau WDYT?

@LucaMarconato
Copy link
Member

Reported the bug in the anndata repo: scverse/anndata#1056.

Fixed.

Thanks for the review Giovanni, merging now.

@LucaMarconato LucaMarconato enabled auto-merge (squash) July 14, 2023 16:55
@LucaMarconato LucaMarconato merged commit 9aa7525 into scverse:main Jul 14, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Reading from remote .zarr stores
4 participants