(feat): experimental read_backed method for zarr + hdf5 via read_dispatched #947

Closed
wants to merge 147 commits into from
147 commits
2f73576
Start backed sparse support for zarr
ivirshup Apr 29, 2022
df160f0
Merge branch 'master' into zarr-sparse-array
ivirshup Oct 31, 2022
7983291
Merge branch 'master' into zarr-sparse-array
ivirshup Nov 8, 2022
a5e0311
Fix sparse_to_dense
ivirshup Nov 8, 2022
b28448c
Merge branch 'master' into zarr-sparse-array
ivirshup Feb 8, 2023
5e3cb02
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Feb 8, 2023
3ee693c
Start write_dispatched
ivirshup Nov 21, 2022
7e0825a
(wip): remote reading via new AxisArrays and AnnData object
ilan-gold Feb 2, 2023
0b87230
(chore): rename
ilan-gold Feb 6, 2023
f2de515
(chore): `venv` to `.gitignore`
ilan-gold Feb 15, 2023
7bc0f76
(fix): `concatenation` test
ilan-gold Nov 29, 2022
7a12515
Revert changes to some backwards compat tests
ivirshup Jan 25, 2023
49e8069
Start fixing error reporting
ivirshup Jan 25, 2023
c3a5e07
Fixes after merge
ivirshup Feb 8, 2023
f22660d
Clean up error reporting + remove commented out code
ivirshup Feb 14, 2023
93b8778
(wip): semi-working demo?
ilan-gold Feb 17, 2023
6d32d8e
(chore): compat for old index key
ilan-gold Feb 23, 2023
3cf7036
(chore): only use `backed`
ilan-gold Feb 23, 2023
d99dd56
(feat): add custom `to_df` method
ilan-gold Feb 23, 2023
83aa3ab
(feat): get dataframe access working properly
ilan-gold Feb 23, 2023
66c86fe
(chore): remove TODO
ilan-gold Feb 23, 2023
fca6fe5
(chore): write up to-do's
ilan-gold Feb 23, 2023
2f31f91
(chore): add head method
ilan-gold Feb 23, 2023
2fad98e
(chore): add better check for `to_df`
ilan-gold Feb 27, 2023
c07e71a
(feat): categorical zarr array.
ilan-gold Feb 27, 2023
a785310
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Feb 27, 2023
85a9006
(feat): add categorical array to the `read_remote`
ilan-gold Mar 1, 2023
568241a
(chore): remove todo
ilan-gold Mar 1, 2023
a5bd7dc
(chore): remove commented out parts
ilan-gold Mar 1, 2023
c1b090c
(chore): remove more unused methods
ilan-gold Mar 1, 2023
ef3dd22
(chore): more cleanup
ilan-gold Mar 1, 2023
3b9b838
(chore): remove unused imports from `utils`
ilan-gold Mar 1, 2023
1827e26
(chore): refactor to use `cached_property`
ilan-gold Mar 1, 2023
4c5dcbe
(chore): more rebase cleanup
ilan-gold Mar 3, 2023
009a426
(fix): correct imports
ilan-gold Mar 3, 2023
289447e
(feat): begin base `AnnData` class
ilan-gold Mar 3, 2023
796839f
(feat): being in-place view mechanism
ilan-gold Mar 7, 2023
f2f4aca
(chore): remove from to-do list, at least for now
ilan-gold Mar 7, 2023
e8f654b
(feat): abstract `_init_as_actual`
ilan-gold Mar 7, 2023
63aabc5
(feat): begin reorganizing anndata initialization
ilan-gold Mar 7, 2023
5776095
(fix): fix args + checks so that tests run
ilan-gold Mar 8, 2023
87cf635
(fix): check _X for backed
ilan-gold Mar 8, 2023
c50d9de
(fix): ensure x_indices exists
ilan-gold Mar 8, 2023
e384a19
(feat): refactor `_init_as_actual` on remote anndata
ilan-gold Mar 8, 2023
4120ed3
(fix): revert sparse changes.
ilan-gold Mar 8, 2023
6d13c82
(chore): revert erroneous comment change
ilan-gold Mar 9, 2023
3773551
Merge branch 'main' into ig/read_remote_dispatched
ilan-gold Mar 9, 2023
7c38b30
(feat): consolidate metadata by default
ilan-gold Mar 9, 2023
515a6d2
(feat): begin unit tests
ilan-gold Mar 9, 2023
f4f5c7c
(fix): fix index on `obs.to_df()`
ilan-gold Mar 9, 2023
60fae25
(feat): swap out `zarr` categorical array for `xarray` (#946)
ilan-gold Mar 16, 2023
1582864
Merge branch 'ig/refactor_base_class' into ig/read_remote_dispatched
ilan-gold Mar 16, 2023
943a4f6
Merge branch 'ig/refactor_base_class' into ig/read_remote_dispatched
ilan-gold Mar 20, 2023
7d2d39f
(fix): implement `__ne__`
ilan-gold Mar 21, 2023
7988c4b
(feat): try not reading in index
ilan-gold Mar 21, 2023
70165bb
(fix): `string_array` -> `string-array`
ilan-gold Mar 21, 2023
5f19a37
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Mar 21, 2023
d159805
(chore): remove large comment
ilan-gold Mar 21, 2023
ff2fdfa
(style): `_to_backed` -> `to_backed`
ilan-gold Mar 21, 2023
01089fc
(fix): revert backed test
ilan-gold Mar 21, 2023
e038314
Merge branch 'main' into zarr-sparse-array
ilan-gold Mar 21, 2023
28f0218
(fix): add basic zarr backed reading for test
ilan-gold Mar 21, 2023
c552a73
(feat): add support for non-consolidated stores
ilan-gold Mar 23, 2023
58964c3
(feat): first stage of base class refactor
ilan-gold Mar 28, 2023
8ecc510
(feat): add categorical array view functionality
ilan-gold Mar 29, 2023
1e919e4
(feat): allow indexing into a view
ilan-gold Mar 29, 2023
03757e4
(chore): remove print statement
ilan-gold Mar 30, 2023
fef1e38
(style): batch `is_view` checks
ilan-gold Mar 30, 2023
30cf049
Merge branch 'zarr-sparse-array' into ig/read_remote_dispatched
ilan-gold Mar 30, 2023
6e592b3
(fix): use `sparse_dataset` in remote
ilan-gold Mar 30, 2023
c110b3b
(feat): add support for general index columns
ilan-gold Mar 30, 2023
c50e4f6
(feat): add support for `raw`
ilan-gold Mar 30, 2023
95f2e3f
(fix): ensure `file` is always present
ilan-gold Mar 30, 2023
0e45fa6
(feat): add some checks.
ilan-gold Mar 30, 2023
27849c5
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Mar 30, 2023
cec72bc
(feat): add `indptr` caching
ilan-gold Mar 30, 2023
de5b81f
Merge branch 'ig/read_remote_dispatched' of github.com:scverse/anndat…
ilan-gold Mar 30, 2023
ff20982
(fix): don't use `.X` in `__init__`
ilan-gold Apr 19, 2023
8d315be
(feat): use dask for raw arrays
ilan-gold Apr 19, 2023
772b968
(fix): clean categories
ilan-gold Apr 19, 2023
53b28a8
(fix): 1d axis array view `to_df`
ilan-gold Apr 19, 2023
cc76538
(chore): tests
ilan-gold Apr 19, 2023
e338ce1
(feat): lazy subset mechanism for lazy cat array
ilan-gold May 4, 2023
8233496
(feat): finish all dtypes
ilan-gold May 4, 2023
05423c3
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] May 4, 2023
1f592b1
(fix): use base compressed class directly
ilan-gold May 5, 2023
9e07da2
Merge branch 'ig/read_remote_dispatched' of github.com:scverse/anndat…
ilan-gold May 5, 2023
a22be07
(chore): `read_remote` -> `read_backed`
ilan-gold May 8, 2023
70b4bfa
(chore); add nullable bool/int tests
ilan-gold May 8, 2023
65a0bc2
(feat): reorganize dirs
ilan-gold May 9, 2023
1c03335
(chore): add access tracking test
ilan-gold May 9, 2023
dc107ad
(feat): add more array tests
ilan-gold May 9, 2023
7aae638
(fix): repr in jupyter setting + refactor
ilan-gold May 9, 2023
dfa6f52
(fix): fix layers support
ilan-gold May 9, 2023
d856f9b
(fix): add columns attr + cleanup
ilan-gold May 9, 2023
57b87a2
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] May 9, 2023
6ca0003
(fix): dont `deepcopy` the lazy array
ilan-gold May 10, 2023
0b2ae47
(feat): `to_memory` for `AnnData` object
ilan-gold May 10, 2023
0533748
(feat): add `to_memory` test + corresponding fixes
ilan-gold May 10, 2023
5e8f7cc
Merge branch 'ig/read_remote_dispatched' of github.com:scverse/anndat…
ilan-gold May 10, 2023
d0cdcd6
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] May 10, 2023
d7e61d8
Merge branch 'ig/refactor_base_class' into ig/read_remote_dispatched
ilan-gold May 16, 2023
f212e30
(fix): resolve tuple ambiguity
ilan-gold May 19, 2023
a52d494
(fix): copy using backing `categories` array, not values
ilan-gold May 19, 2023
e3dcec8
(feat): sparse arrays as dask
ilan-gold May 19, 2023
a418313
(chore): add access checks on `X`, `layers
ilan-gold May 19, 2023
800f688
(fix): ensure return type of dask array
ilan-gold May 24, 2023
8ef6aea
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] May 24, 2023
944503e
(feat): subset_idx on backed class
ilan-gold May 25, 2023
e1d7388
Merge branch 'ig/read_remote_dispatched' of github.com:scverse/anndat…
ilan-gold May 25, 2023
15bbf78
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] May 25, 2023
db2a3ef
(fix): ensure old tests pass
ilan-gold May 26, 2023
e4ffa0f
Merge branch 'ig/read_remote_dispatched' of github.com:scverse/anndat…
ilan-gold May 26, 2023
e93306c
(fix): name variable accurately
ilan-gold May 31, 2023
c9de21d
(fix): use correct access pattern for `to_memory`
ilan-gold Jun 1, 2023
7c911a8
(feat): add `exclude` feature for `to_memory`
ilan-gold Jun 1, 2023
4118928
(fix): efficient reading in of matrices by splitting up reading over …
ilan-gold Jun 1, 2023
5cedb9a
(fix): legacy backed mode
ilan-gold Jun 1, 2023
627be81
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Jun 1, 2023
1e190f4
Merge branch 'ig/refactor_base_class' into ig/read_remote_dispatched
ilan-gold Jun 6, 2023
b330443
(fix): remove `all`
ilan-gold Jun 6, 2023
74e71d0
Merge branch 'ig/read_remote_dispatched' of github.com:scverse/anndat…
ilan-gold Jun 6, 2023
64f178d
(chore): add docstrings
ilan-gold Jun 14, 2023
379c022
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Jun 14, 2023
716162d
(fix): h5py support
ilan-gold Jun 14, 2023
bdb93a2
(feat): migrate `Dask` to `DataArray`
ilan-gold Jul 17, 2023
f78a59b
(feat): catgoricals now using `DataArray` as well.
ilan-gold Jul 17, 2023
bc28910
(feat): xarray `Dataset` for `obs`/`var`
ilan-gold Jul 19, 2023
f3b7bb3
(fix): refactor `view` mechanism
ilan-gold Jul 19, 2023
c64c7e6
(fix): fix column handling for `to_memory`
ilan-gold Jul 19, 2023
853cc0b
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Jul 19, 2023
703812c
(feat): `obsm`/`varm` `xr.Dataset`
ilan-gold Jul 30, 2023
8ce994b
(chore): refactor `ZarrArray` `subset` function
ilan-gold Jul 30, 2023
e480700
Merge branch 'ig/refactor_base_class' into ig/read_remote_dispatched
ilan-gold Jul 30, 2023
cee7f6d
Merge branch 'ig/refactor_base_class' into ig/read_remote_dispatched
ilan-gold Jul 30, 2023
b41a9a5
Merge branch 'ig/refactor_base_class' into ig/read_remote_dispatched
ilan-gold Aug 1, 2023
6fe7016
(fix): `backed` for experimental `merge.py`
ilan-gold Aug 1, 2023
c3f6935
(fix): `pyproject.toml` missing comma
ilan-gold Aug 1, 2023
deced7c
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Aug 1, 2023
8617bd1
Merge branch 'ig/refactor_base_class' into ig/read_remote_dispatched
ilan-gold Aug 8, 2023
48a134b
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Aug 8, 2023
2806a9f
(chore): remove pre-commit deps
ilan-gold Aug 8, 2023
e35603a
Merge branch 'ig/read_remote_dispatched' of github.com:scverse/anndat…
ilan-gold Aug 8, 2023
516f984
(fix): don't let ruff change `==` for `DataFrame` to `is`
ilan-gold Aug 8, 2023
60b0ae6
(chore): move `xarray` to `test` deps
ilan-gold Aug 8, 2023
9d53307
(style): change folder structure
ilan-gold Aug 8, 2023
3a428f4
Merge branch 'ig/refactor_base_class' into ig/read_remote_dispatched
flying-sheep Oct 10, 2023
1 change: 1 addition & 0 deletions .gitignore
@@ -1,6 +1,7 @@
 # Temp files
 .DS_Store
 *~
+venv/

 # Compiled files
 __pycache__/
6 changes: 4 additions & 2 deletions anndata/_core/aligned_mapping.py
@@ -43,7 +43,7 @@ class AlignedMapping(cabc.MutableMapping, ABC):
 """The actual class (which has it’s own data) for this aligned mapping."""

 def __repr__(self):
-return f"{type(self).__name__} with keys: {', '.join(self.keys())}"
+return f"{type(self).__name__} with keys: {', '.join([k + f'[{str(self[k].dtype)}]' for k in self.keys()])}"

 def _ipython_key_completions_(self) -> List[str]:
 return list(self.keys())
@@ -119,7 +119,9 @@ def copy(self):

 def _view(self, parent: "anndata.AnnData", subset_idx: I):
 """Returns a subset copy-on-write view of the object."""
-return self._view_class(self, parent, subset_idx)
+if parent.is_view or subset_idx is not None: # and or or?
+return self._view_class(self, parent, subset_idx)
+return self

 @deprecated("dict(obj)")
 def as_dict(self) -> dict:
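The `__repr__` change in this file appends each value's `dtype` to its key. The snippet below is a minimal sketch of that formatting only. `KeyedArrays` and `FakeArray` are hypothetical stand-ins, not anndata classes, and the sketch assumes every stored value exposes a `dtype` attribute (a value without one would raise `AttributeError`).

```python
class FakeArray:
    """Hypothetical stand-in for an array-like value with a `dtype` attribute."""

    def __init__(self, dtype):
        self.dtype = dtype


class KeyedArrays(dict):
    """Hypothetical minimal mapping; not the real AlignedMapping class."""

    def __repr__(self):
        # Render each key followed by the dtype of its value in brackets.
        keys = ", ".join(f"{k}[{v.dtype}]" for k, v in self.items())
        return f"{type(self).__name__} with keys: {keys}"


summary = repr(KeyedArrays(X_pca=FakeArray("float32"), labels=FakeArray("int64")))
```

Building the key list with a generator over `items()` avoids the second lookup (`self[k]`) per key, which may matter if item access is expensive for lazily loaded values.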
10 changes: 7 additions & 3 deletions anndata/_core/anndata.py
@@ -46,7 +46,7 @@
 as_view,
 _resolve_idxs,
 )
-from .sparse_dataset import SparseDataset
+from .sparse_dataset import sparse_dataset
 from .. import utils
 from ..utils import convert_to_dict, ensure_df_homogeneous, dim_len
 from ..logging import anndata_logger as logger
@@ -58,6 +58,7 @@
 CupySparseMatrix,
 _move_adj_mtx,
 )
+from .sparse_dataset import BaseCompressedSparseDataset


 class StorageType(Enum):
@@ -67,6 +68,7 @@ class StorageType(Enum):
 ZarrArray = ZarrArray
 ZappyArray = ZappyArray
 DaskArray = DaskArray
+BaseCompressedSparseDataset = BaseCompressedSparseDataset
 CupyArray = CupyArray
 CupySparseMatrix = CupySparseMatrix

@@ -609,11 +611,13 @@ def X(self) -> Optional[Union[np.ndarray, sparse.spmatrix, ArrayView]]:
 self.file.open()
 X = self.file["X"]
 if isinstance(X, h5py.Group):
-X = SparseDataset(X)
+X = sparse_dataset(X)
 # This is so that we can index into a backed dense dataset with
 # indices that aren’t strictly increasing
 if self.is_view:
 X = _subset(X, (self._oidx, self._vidx))
+if isinstance(X, BaseCompressedSparseDataset):
+X = X.to_memory()
 elif self.is_view and self._adata_ref.X is None:
 X = None
 elif self.is_view:
@@ -674,7 +678,7 @@ def X(self, value: Optional[Union[np.ndarray, sparse.spmatrix]]):
 if self.is_view:
 X = self.file["X"]
 if isinstance(X, h5py.Group):
-X = SparseDataset(X)
+X = sparse_dataset(X)
 X[oidx, vidx] = value
 else:
 self._set_backed("X", value)
33 changes: 27 additions & 6 deletions anndata/_core/file_backing.py
@@ -7,8 +7,8 @@
 import h5py

 from . import anndata
-from .sparse_dataset import SparseDataset
-from ..compat import ZarrArray, DaskArray, AwkArray
+from .sparse_dataset import BaseCompressedSparseDataset
+from ..compat import ZarrArray, ZarrGroup, DaskArray, AwkArray


 class AnnDataFileManager:
@@ -39,11 +39,15 @@ def __contains__(self, x) -> bool:
 def __iter__(self) -> Iterator[str]:
 return iter(self._file)

-def __getitem__(self, key: str) -> Union[h5py.Group, h5py.Dataset, SparseDataset]:
+def __getitem__(
+self, key: str
+) -> Union[h5py.Group, h5py.Dataset, BaseCompressedSparseDataset]:
 return self._file[key]

 def __setitem__(
-self, key: str, value: Union[h5py.Group, h5py.Dataset, SparseDataset]
+self,
+key: str,
+value: Union[h5py.Group, h5py.Dataset, BaseCompressedSparseDataset],
 ):
 self._file[key] = value

@@ -110,8 +114,8 @@ def _(x, copy=False):
 return x[...]


-@to_memory.register(SparseDataset)
-def _(x: SparseDataset, copy=False):
+@to_memory.register(BaseCompressedSparseDataset)
+def _(x: BaseCompressedSparseDataset, copy=True):
 return x.to_memory()


@@ -133,3 +137,20 @@ def _(x, copy=False):
 return _copy(x)
 else:
 return x
+
+
+@singledispatch
+def filename(x):
+raise NotImplementedError(f"Not implemented for {type(x)}")
+
+
+@filename.register(h5py.Group)
+@filename.register(h5py.Dataset)
+def _(x):
+return x.file.filename
+
+
+@filename.register(ZarrArray)
+@filename.register(ZarrGroup)
+def _(x):
+return x.store.path
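The new `filename` helper above dispatches on the type of its argument via `functools.singledispatch`. The following is a minimal sketch of that pattern, not the PR's code: the `Fake*` classes are hypothetical stand-ins that only mimic the `.file.filename` and `.store.path` attributes the diff relies on, so the example runs without h5py or zarr installed.

```python
from functools import singledispatch


class FakeH5File:
    """Hypothetical stand-in for an open HDF5 file handle."""

    def __init__(self, filename):
        self.filename = filename


class FakeH5Dataset:
    """Hypothetical stand-in exposing `.file.filename`, as the diff assumes."""

    def __init__(self, filename):
        self.file = FakeH5File(filename)


class FakeZarrStore:
    """Hypothetical stand-in for a directory-backed store with a `.path`."""

    def __init__(self, path):
        self.path = path


class FakeZarrArray:
    """Hypothetical stand-in exposing `.store.path`, as the diff assumes."""

    def __init__(self, path):
        self.store = FakeZarrStore(path)


@singledispatch
def filename(x):
    # Fallback used when no implementation is registered for type(x).
    raise NotImplementedError(f"Not implemented for {type(x)}")


@filename.register(FakeH5Dataset)
def _(x):
    return x.file.filename


@filename.register(FakeZarrArray)
def _(x):
    return x.store.path


h5_name = filename(FakeH5Dataset("data.h5ad"))
zarr_name = filename(FakeZarrArray("data.zarr"))
```

One design consequence worth noting: any backing type without a registered implementation falls through to the `NotImplementedError` branch, so supporting a new storage backend means adding one more `register` call rather than editing a chain of `isinstance` checks.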
15 changes: 13 additions & 2 deletions anndata/_core/index.py
@@ -7,7 +7,7 @@
 import numpy as np
 import pandas as pd
 from scipy.sparse import spmatrix, issparse
-from ..compat import AwkArray, DaskArray, Index, Index1D
+from ..compat import AwkArray, DaskArray, Index, Index1D, ZarrArray


 def _normalize_indices(
@@ -116,16 +116,25 @@ def unpack_index(index: Index) -> Tuple[Index1D, Index1D]:


 @singledispatch
-def _subset(a: Union[np.ndarray, pd.DataFrame], subset_idx: Index):
+def _subset(a: np.ndarray, subset_idx: Index):
 # Select as combination of indexes, not coordinates
 # Correcting for indexing behaviour of np.ndarray
 if all(isinstance(x, cabc.Iterable) for x in subset_idx):
 subset_idx = np.ix_(*subset_idx)
 return a[subset_idx]


+@_subset.register(ZarrArray)
+def _subset_zarr(a: ZarrArray, subset_idx: Index):
+if all(isinstance(x, cabc.Iterable) for x in subset_idx):
+subset_idx = np.ix_(*subset_idx)
+return a.oindex[subset_idx]
+
+
 @_subset.register(DaskArray)
 def _subset_dask(a: DaskArray, subset_idx: Index):
+if isinstance(subset_idx, slice):
+return a[subset_idx]
 if all(isinstance(x, cabc.Iterable) for x in subset_idx):
 subset_idx = np.ix_(*subset_idx)
 return a.vindex[subset_idx]
@@ -147,6 +156,8 @@ def _subset_df(df: pd.DataFrame, subset_idx: Index):

 @_subset.register(AwkArray)
 def _subset_awkarray(a: AwkArray, subset_idx: Index):
+if isinstance(subset_idx, slice):
+return a[subset_idx]
 if all(isinstance(x, cabc.Iterable) for x in subset_idx):
 subset_idx = np.ix_(*subset_idx)
 return a[subset_idx]
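The `_subset` implementations above convert a pair of index lists with `np.ix_` so that they select "as combination of indexes, not coordinates". The sketch below illustrates that difference on a plain NumPy array only. The array and index values are made up for illustration, and it does not exercise the zarr `oindex` or dask `vindex` paths touched in the diff.

```python
import numpy as np

# A small 3x4 array whose entries equal their flat position, for easy checking.
a = np.arange(12).reshape(3, 4)
rows = [0, 2]
cols = [1, 3]

# Plain fancy indexing pairs the lists up as coordinates: (0, 1) and (2, 3).
pointwise = a[rows, cols]

# np.ix_ builds an open mesh, so the result is the 2x2 block formed by
# rows {0, 2} crossed with columns {1, 3}.
block = a[np.ix_(rows, cols)]
```

For an AnnData-style subset (some observations by some variables), the block form is the one wanted, which is presumably why each dispatch target normalises its index tuple through `np.ix_` before indexing.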
7 changes: 4 additions & 3 deletions anndata/_core/raw.py
@@ -9,7 +9,7 @@
 from . import anndata
 from .index import _normalize_index, _subset, unpack_index, get_vector
 from .aligned_mapping import AxisArrays
-from .sparse_dataset import SparseDataset
+from .sparse_dataset import BaseCompressedSparseDataset, sparse_dataset

 from ..compat import CupyArray, CupySparseMatrix

@@ -49,7 +49,7 @@ def _get_X(self, layer=None):
 return self.X

 @property
-def X(self) -> Union[SparseDataset, np.ndarray, sparse.spmatrix]:
+def X(self) -> Union[BaseCompressedSparseDataset, np.ndarray, sparse.spmatrix]:
 # TODO: Handle unsorted array of integer indices for h5py.Datasets
 if not self._adata.isbacked:
 return self._X
@@ -66,7 +66,7 @@ def X(self) -> Union[SparseDataset, np.ndarray, sparse.spmatrix]:
 f"{self._adata.file.filename}."
 )
 if isinstance(X, h5py.Group):
-X = SparseDataset(X)
+X = sparse_dataset(X)
 # Check if we need to subset
 if self._adata.is_view:
 # TODO: As noted above, implement views of raw
@@ -187,6 +187,7 @@ class _RawViewHack:
 def __init__(self, raw: Raw, vidx: Union[slice, np.ndarray]):
 self.parent_raw = raw
 self.vidx = vidx
+self.is_view = True

 @property
 def shape(self) -> Tuple[int, int]: