Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[WIP] adding an option to read file from s3 #1322

Open
wants to merge 3 commits into
base: main
Choose a base branch
from

Conversation

djarecka
Copy link

@djarecka djarecka commented Jan 19, 2024

Recently I've been testing an option to access files directly from s3 using remfile. remfile provides a file-like object for reading a remote file over HTTP, optimized for use with h5py.

I've added a simple change to the code and a simple test that access a big file directly from s3 and is able to check the file size without reading entire content. I'd be happy to expand the tests and changes if the package maintainers say that are interested in the addition.

Copy link

codecov bot commented Jan 19, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Comparison is base (c1afc3e) 85.70% compared to head (18ddcd0) 83.50%.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1322      +/-   ##
==========================================
- Coverage   85.70%   83.50%   -2.20%     
==========================================
  Files          34       34              
  Lines        5450     5451       +1     
==========================================
- Hits         4671     4552     -119     
- Misses        779      899     +120     
Flag Coverage Δ
gpu-tests ?

Flags with carried forward coverage won't be shown. Click here to find out more.

Files Coverage Δ
anndata/_core/file_backing.py 90.32% <100.00%> (+0.10%) ⬆️

... and 8 files with indirect coverage changes

@flying-sheep
Copy link
Member

Hi, I think enabling HDF5-on-S3 through such a small change is pretty cool. But I think it would make sense to make an issue first, to talk about support for other formats and how to tackle them.

@djarecka
Copy link
Author

@flying-sheep - this is a bit related to #634, I've commented there at some point, but didn't get answer. I will open a fresh issue to get some new feedback

@Zethson
Copy link
Member

Zethson commented Jan 23, 2024

Maybe @Koncopd has a comment here, too?

@Koncopd
Copy link
Member

Koncopd commented Jan 23, 2024

Yes, we use s3fs for this and it works without any problems and any changes in anndata code.

@djarecka
Copy link
Author

Do you have integration of s3fs and anndata?

How can I use ad.read_h5ad with backed="r"?

@Koncopd
Copy link
Member

Koncopd commented Jan 23, 2024

@djarecka https://github.com/laminlabs/lamindb/blob/bf1c81e7ae8c177a7338d560521cc5b0c1cdc1aa/lamindb/dev/storage/file.py#L84
This loads an AnnData object directly into memory. fs there is an s3fs filesystem (or any fsspec filesystem actually).
you can also try with backed="r".
We also have fully backed functionality (not only for .X as in anndata) here https://github.com/laminlabs/lamindb/blob/bf1c81e7ae8c177a7338d560521cc5b0c1cdc1aa/lamindb/dev/storage/_backed_access.py#L609

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

HDF5 Cloud Storage
4 participants