lazy dataframes in .obs and .var with backed="r" mode #981
open to a pure |
Hi @gszep! I like this idea! But I would note that @ilan-gold is already working on something similar over in: What do you think of that approach? |
Fetching partial data over the internet would make sense for datasets that are on the order of TB in size. Otherwise I don't see much of an advantage over downloading the datasets to local disk and reading from there. Note that I am not very familiar with zarr arrays, however, so I may be missing something. The aim with this PR is to expose more of the lazy, constant-time random access capabilities of the HDF5 format. It would be a shame to restrict it to only In terms of implementation I prefer a multiple-dispatch-like approach, so modifying the code in
@_REGISTRY.register_read(H5Group, IOSpec("dataframe", "0.2.0"))
@_REGISTRY.register_read(ZarrGroup, IOSpec("dataframe", "0.2.0"))
def read_dataframe(elem, _reader):
    return LazyDataFrame(elem, _reader)
adding a new |
@gszep My approach will almost certainly not be zarr-specific. It's just easier for me at the moment because it makes debugging easier when I can read what is happening off of an HTTP server log. Also the following:
is a non-trivial problem, which is partially solved by my PR. In any case, the Also, out of curiosity, what sort of data do you have that is a billion rows and hundreds of columns if you can share? |
This issue has been automatically marked as stale because it has not had recent activity. |
I am having issues with the time taken to analyze the memory usage of big datasets. After contributing to #1230, I implemented a pure
Then I profiled and compared the two. You may reproduce using this script taking an Note: the datasets can be found on the CellXGene data portal; the filename contains the size in cells ("_xk").
Investigating a bit with the debugger reveals that, when reading a backed h5ad dataset, the "attributes" sub-matrices are loaded into dataframes, as @gszep was pointing out. Lines 143 to 168 in c790113
So X stays backed, but the attributes are loaded: they are read from disk and allocated in RAM, which is taxing on resources. I see that there are some contributions to address this issue. However, it is now stale. Is this on the roadmap? |
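A minimal sketch of one way to compare how much an eager and a lazy loader allocate, using only the standard-library `tracemalloc` module. It is not the profiling script referred to above; `peak_bytes` and the two toy loaders are hypothetical names made up for this example. Note that `tracemalloc` only sees allocations made through Python's memory allocators, so memory allocated directly by a native library may not be counted.

```python
import tracemalloc


def peak_bytes(loader):
    """Run loader() and return (peak traced bytes during the call, result)."""
    tracemalloc.start()
    try:
        result = loader()
        _, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return peak, result


# Toy "eager" loader: allocates the whole buffer immediately.
eager_peak, _ = peak_bytes(lambda: bytearray(10_000_000))

# Toy "lazy" loader: returns a callable and allocates the buffer only when called.
lazy_peak, _ = peak_bytes(lambda: (lambda: bytearray(10_000_000)))
```

In practice the loaders would be replaced by the two read calls being compared, for example one that opens a file in backed mode and one that reads it fully into memory.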
@Neah-Ko I am actively working on a solution involving xarray. We have prototype branches (e.g., #1247). It would be great to have feedback. There are a few data type stumbling blocks with xarray which is why we are hesitant to release this. I am hopeful this can be in a release soon, though, as I am making progress. |
Relatedly, you could write your own i/o like here via read_dispatched. I imagine the arrays is |
Hi @ilan-gold, thanks for your answer. I ran my profiling test on your code: 1. check out the PR code, then 2. use it to load the data with the new feature;
On the same datasets as above, the results are as follows:
As a result we see:
|
@Neah-Ko I will need to check this but I would be shocked if that weren't a bug. We have tests to specifically check nothing is read in except what we have no choice but to read in (I think it's dataframes within |
@Neah-Ko Why are you using the
from anndata.tests.helpers import gen_adata
from anndata.experimental import read_backed
from scipy import sparse
adata = gen_adata((1000, 1000), sparse.csc_matrix)
backed_path = "backed.zarr"
adata.write_zarr(backed_path)
remote = read_backed(backed_path)
# remote = read_backed('47bb980c-9482-4aee-9558-443c04170dec.h5ad')
size_backed = remote.__sizeof__()
size_backed_with_disk = remote.__sizeof__(with_disk=True)
size_in_memory = remote.to_memory().__sizeof__()
# the difference is orders of magnitude
assert size_backed < size_backed_with_disk and size_backed_with_disk < size_in_memory
I also tried this with a dataset from CellXGene and Does this work for you? Am I misunderstanding the |
For the My initial use case was to estimate the size that a dataset is going to take up in memory simply by opening it in backed mode.
So it is a bit hard for me to fully understand your While looking at the original |
@Neah-Ko I think we may be talking past each other here, for which I apologize. Let me try to understand. You are interested in benchmarking how much data is read only into memory and not on disk. So my question is: why do you use the |
As far as I understand, backed mode works for .X (and possibly .layers?), as when I access that attribute an h5py.Dataset class is returned. But accessing .obs and .var still returns a pandas.DataFrame, which suggests that all the data is loaded into memory. Ideally, accessing .obs or .var would return a dask.DataFrame (probably using their own read functions) that loads column and row slices as and when needed.
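The behaviour requested here, reading only the column and row slices that are actually asked for, can be illustrated with a small sketch. Everything in it is hypothetical: `LazyTable` and `CountingColumn` are stand-ins written for this example and are neither a dask.DataFrame nor part of anndata. The store is assumed to be any mapping from column name to a sliceable one-dimensional array; an open h5py.Group of plain 1-D datasets might plug in the same way, though that is untested here.

```python
class CountingColumn:
    """Stand-in for an on-disk 1-D dataset that counts how often it is read."""

    def __init__(self, values):
        self._values = list(values)
        self.reads = 0

    def __getitem__(self, rows):
        self.reads += 1
        return self._values[rows]


class LazyTable:
    """Exposes column names up front but reads values only on request."""

    def __init__(self, store):
        self._store = store

    @property
    def columns(self):
        # Listing names does not touch any column data.
        return list(self._store.keys())

    def load(self, column, rows=slice(None)):
        # Only the requested column and row slice is read from the store.
        return self._store[column][rows]


store = {
    "cell_type": CountingColumn(["B", "T", "NK", "B"]),
    "n_genes": CountingColumn([120, 340, 95, 210]),
}
table = LazyTable(store)
first_two = table.load("n_genes", slice(0, 2))
```

After this runs, only the "n_genes" column has been read, and only once; "cell_type" has not been touched at all, which is the contrast with a fully materialized pandas.DataFrame.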