first attempt to support awkward arrays (#647)

* first attempt to support awkward arrays
* remove comments
* better comment
* add type to gen_adata
* first attempt at concat
* remove comment
* add outer concat
* add awkward to test dep
* add awk arr to data gen
* fix test base
* init test for concat
* fix concatenate tests
* create mock class for awkward array
* remove space
* import ak when needed
* relative import of awk array
* fix optional dep import
* resolve conflicts
* draft IO for awkward arrays
* add awkward to docs and save form to attrs
* Update dependencies
* Update dim_len
* ignore vscode directory
* Validate that awkward arrays align to axes
* Fix reindexing during merge
* fix lint
* remove duplicate import
* Test different types of awkward arrays in different slots
* Better function to generate awkward arrays
* Better dim_len for awkward arrays
* Working out how to best check the dim_len
* Only accept awkward arrays that are "regular" in the aligned dimension
  The conversion is left to the user. Explicit is better than implicit.
* Switch to v2 API
* WIP rewrite awkward array generation
* Improve awkward array generation and dim_len check
* Switch to new awkward array generation in all tests
* Fix test_transpose
* Fix/workaround more tests
* Add test for setting anndata slots to awkward arrays
* enable tests for 3d ragged array in layers
* Cleanup
* Fix that X could not be set when creating AnnData object from scratch.
  Apparently the checks are quite different than when adding a Layer.
* Remove code to make awkward array regular after merge.
  This is now done by the awkward array library.
* Do not explicitly copy awkward arrays
* Implement transposing awkward arrays
* Add docs stub and update type hints
* Fix: dtype not available during merge if both X are awkward
* Fix IO
* Request pre-release version of awkward
* Exclude awkward layer in loom tests
* Pull in only changes relevant to obsm/varm
* Update tests
* Fix type hints
* Update error message in aligned mapping
* Use compat module to support both awkward v1.9rc and 2.x
* restructure tests
* Add tests for copies and view
* Remove unused import
* Fix how actual shape is computed in aligned mapping
* Attempt to support views with ak.behavior
* Use shallow copy
* Add dim_len_awkward function including tests
* Test that assigning an awkward v1 array fails
* Add stub for element-wise IO tests
* Restructure dim_len_awkward
* Add more test cases for awkward IO
* WIP add tests for concatenating AwkArrays with missing values
* Fix AwkwardArrayView
* Simplify awkward array view code
* Use None to remove name from awkward array
* Mark test_no_awkward_v1 as xfail for uns
* Add test for categorical arrays
* Update docs/fileformat-prose.rst
  Co-authored-by: Isaac Virshup <[email protected]>
* Update anndata/_core/aligned_mapping.py
  Co-authored-by: Isaac Virshup <[email protected]>
* Update anndata/tests/helpers.py
  Co-authored-by: Isaac Virshup <[email protected]>
* Update awkward tests to use assert_equal with exact=True
* Bump required version
* Update categorical syntax, add new categorical test
* Start concat tests for awkward
* Add release notes
* Add testcases for dim_len with awkward arrays of strings
* Fix dim_len for arrays of strings
* [pre-commit.ci] auto fixes from pre-commit.com hooks
  for more information, see https://pre-commit.ci
* Awkward v2 fixes
  Several functions changed until the stable awkward v2 version was released.
* Exclude awkward arrays from fill_value concat test
* fix flake8
* Add IO testcase for AIRR data
* Fix link
* Get inner join working for concatenation
* Bump some concatenation cases to a later PR
* Generate empty arrays for outer join
* Raise NotImplementedError when creating a view of an awkward array with custom behavior
* Add warning when setting awkward array in aligned mapping
* Get much more of concatenation 'working'
* Use warning instead of logging
* extend todo comment about views
* Fix IO, and to_memory for views of awkward arrays
* Removed a number of test cases that we're not targeting
  This fixed a number of tests because we had a 1d awkward array being generated, and we currently don't support 1d arrays in obsm well. Tracked in #652.
* Implement outer indexing on axis 0 of an awkward array
* Fix gen_awkward when one of the dimensions has size 0
* Fix equality function for awkward arrays.
  Was throwing an error when the arrays weren't broadcastable.
* Modify outer concatenation test to accept current behaviour of awkward array
* Add tests for mixed type concatenation with awkward arrays
* Add warning about outer joins
* Call ak._util.arrays_approx_equal instead of rolling our own
* update awkward to 2.0.7 (unfortunately: errors)
* remove unnecessary checks from AwkwardArrayView
* Workaround scikit-hep/awkward#2209
* [pre-commit.ci] auto fixes from pre-commit.com hooks
  for more information, see https://pre-commit.ci
* Removed extra layer of nesting from on-disk format for awkward arrays

---------

Co-authored-by: Gregor Sturm <[email protected]>
Co-authored-by: Isaac Virshup <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
scverse · Feb 7, 2023 · a9e634c
1 parent 4ccf91c
Showing 19 changed files with 1,049 additions and 32 deletions.
diff --git a/.gitignore b/.gitignore
@@ -27,4 +27,5 @@ test.h5ad
 
 # IDEs
 /.idea/
+/.vscode/
 
diff --git a/anndata/__init__.py b/anndata/__init__.py
@@ -18,7 +18,12 @@
         read_mtx,
         read_zarr,
     )
-    from ._warnings import OldFormatWarning, WriteWarning, ImplicitModificationWarning
+    from ._warnings import (
+        OldFormatWarning,
+        WriteWarning,
+        ImplicitModificationWarning,
+        ExperimentalFeatureWarning,
+    )
 
     # backwards compat / shortcut for default format
     from ._io import read_h5ad as read
diff --git a/anndata/_core/aligned_mapping.py b/anndata/_core/aligned_mapping.py
@@ -1,18 +1,22 @@
 from abc import ABC, abstractmethod
 from collections import abc as cabc
+from copy import copy
 from typing import Union, Optional, Type, ClassVar, TypeVar  # Special types
 from typing import Iterator, Mapping, Sequence  # ABCs
 from typing import Tuple, List, Dict  # Generic base types
+import warnings
 
 import numpy as np
 import pandas as pd
 from scipy.sparse import spmatrix
 
-from ..utils import deprecated, ensure_df_homogeneous
+from ..utils import deprecated, ensure_df_homogeneous, dim_len
 from . import raw, anndata
 from .views import as_view
 from .access import ElementRef
 from .index import _subset
+from anndata.compat import AwkArray
+from anndata._warnings import ExperimentalFeatureWarning
 
 
 OneDIdx = Union[Sequence[int], Sequence[bool], slice]
@@ -46,15 +50,37 @@ def _ipython_key_completions_(self) -> List[str]:
 
     def _validate_value(self, val: V, key: str) -> V:
         """Raises an error if value is invalid"""
+        if isinstance(val, AwkArray):
+            warnings.warn(
+                "Support for Awkward Arrays is currently experimental. "
+                "Behavior may change in the future. Please report any issues you may encounter!",
+                ExperimentalFeatureWarning,
+                # stacklevel=3,
+            )
+            # Prevent from showing up every time an awkward array is used
+            # You'd think `once` works, but it doesn't at the repl and in notebooks
+            warnings.filterwarnings(
+                "ignore",
+                category=ExperimentalFeatureWarning,
+                message="Support for Awkward Arrays is currently experimental.*",
+            )
         for i, axis in enumerate(self.axes):
-            if self.parent.shape[axis] != val.shape[i]:
+            if self.parent.shape[axis] != dim_len(val, i):
                 right_shape = tuple(self.parent.shape[a] for a in self.axes)
-                raise ValueError(
-                    f"Value passed for key {key!r} is of incorrect shape. "
-                    f"Values of {self.attrname} must match dimensions "
-                    f"{self.axes} of parent. Value had shape {val.shape} while "
-                    f"it should have had {right_shape}."
-                )
+                actual_shape = tuple(dim_len(val, a) for a, _ in enumerate(self.axes))
+                if actual_shape[i] is None and isinstance(val, AwkArray):
+                    raise ValueError(
+                        f"The AwkwardArray is of variable length in dimension {i}. "
+                        f"Try ak.to_regular(array, {i}) before including the array in AnnData",
+                    )
+                else:
+                    raise ValueError(
+                        f"Value passed for key {key!r} is of incorrect shape. "
+                        f"Values of {self.attrname} must match dimensions "
+                        f"{self.axes} of parent. Value had shape {actual_shape} while "
+                        f"it should have had {right_shape}."
+                    )
+
         if not self._allow_df and isinstance(val, pd.DataFrame):
             name = self.attrname.title().rstrip("s")
             val = ensure_df_homogeneous(val, f"{name} {key!r}")
@@ -84,7 +110,11 @@ def parent(self) -> Union["anndata.AnnData", "raw.Raw"]:
     def copy(self):
         d = self._actual_class(self.parent, self._axis)
         for k, v in self.items():
-            d[k] = v.copy()
+            if isinstance(v, AwkArray):
+                # Shallow copy since awkward array buffers are immutable
+                d[k] = copy(v)
+            else:
+                d[k] = v.copy()
         return d
 
     def _view(self, parent: "anndata.AnnData", subset_idx: I):
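
The warn-then-filter pattern added to `_validate_value` can be tried in isolation. The following is a minimal stdlib-only sketch, not anndata code: `ExperimentalFeatureWarning` is redefined locally as a stand-in for the category imported in the diff, and `warn_experimental` and `count_emitted` are hypothetical helper names used only for illustration.

```python
import re
import warnings


class ExperimentalFeatureWarning(Warning):
    """Local stand-in for anndata's warning category (assumption for this sketch)."""


MESSAGE = "Support for Awkward Arrays is currently experimental."


def warn_experimental():
    # Emit the warning, then install an "ignore" filter for the same message
    # so that later calls stay silent.
    warnings.warn(MESSAGE, ExperimentalFeatureWarning)
    warnings.filterwarnings(
        "ignore",
        category=ExperimentalFeatureWarning,
        message=re.escape(MESSAGE) + ".*",
    )


def count_emitted(n_calls):
    # Record warnings inside a temporary filter context so global state is restored.
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")
        for _ in range(n_calls):
            warn_experimental()
    return len(caught)


emitted = count_emitted(3)
```

Because `filterwarnings` inserts its entry at the front of the filter list, the "ignore" rule takes precedence over the "always" rule from the second call onwards, so only the first call is recorded.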

diff --git a/anndata/_core/anndata.py b/anndata/_core/anndata.py
@@ -45,7 +45,7 @@
 )
 from .sparse_dataset import SparseDataset
 from .. import utils
-from ..utils import convert_to_dict, ensure_df_homogeneous
+from ..utils import convert_to_dict, ensure_df_homogeneous, dim_len
 from ..logging import anndata_logger as logger
 from ..compat import (
     ZarrArray,
@@ -55,6 +55,7 @@
     _move_adj_mtx,
     _overloaded_uns,
     OverloadedDict,
+    AwkArray,
 )
 
 
@@ -1861,7 +1862,7 @@ def _check_dimensions(self, key=None):
         if "obsm" in key:
             obsm = self._obsm
             if (
-                not all([o.shape[0] == self._n_obs for o in obsm.values()])
+                not all([dim_len(o, 0) == self._n_obs for o in obsm.values()])
                 and len(obsm.dim_names) != self._n_obs
             ):
                 raise ValueError(
@@ -1871,7 +1872,7 @@ def _check_dimensions(self, key=None):
         if "varm" in key:
             varm = self._varm
             if (
-                not all([v.shape[0] == self._n_vars for v in varm.values()])
+                not all([dim_len(v, 0) == self._n_vars for v in varm.values()])
                 and len(varm.dim_names) != self._n_vars
             ):
                 raise ValueError(
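
The switch from `.shape[0]` to `dim_len(o, 0)` matters because a ragged array has no single length along a variable-length axis. The diff treats a `None` result as "variable length in that dimension". The sketch below illustrates that idea on plain nested lists; `dim_len_nested` is a hypothetical helper written for this note and is not anndata's `dim_len`.

```python
def dim_len_nested(x, axis):
    """Return the length of nested list `x` along `axis`, or None if it is ragged there."""
    level = [x]
    # Descend `axis` levels, collecting every sub-list found at that depth.
    for _ in range(axis):
        level = [child for parent in level for child in parent]
    lengths = {len(item) for item in level}
    # A single shared length means the axis is regular; anything else is ragged.
    return lengths.pop() if len(lengths) == 1 else None


ragged = [[1, 2], [3], [4, 5, 6]]
regular = [[1, 2], [3, 4], [5, 6]]

ragged_axis0 = dim_len_nested(ragged, 0)    # outer axis is always well defined
ragged_axis1 = dim_len_nested(ragged, 1)    # inner lengths differ
regular_axis1 = dim_len_nested(regular, 1)  # inner lengths agree
```

Under this reading, an array aligned to `obs` only needs a well-defined length along axis 0; the inner axes may stay ragged.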

diff --git a/anndata/_core/file_backing.py b/anndata/_core/file_backing.py
@@ -8,7 +8,7 @@
 
 from . import anndata
 from .sparse_dataset import SparseDataset
-from ..compat import ZarrArray, DaskArray
+from ..compat import ZarrArray, DaskArray, AwkArray
 
 
 class AnnDataFileManager:
@@ -123,3 +123,13 @@ def _(x, copy=True):
 @to_memory.register(Mapping)
 def _(x: Mapping, copy=True):
     return {k: to_memory(v, copy=copy) for k, v in x.items()}
+
+
+@to_memory.register(AwkArray)
+def _(x, copy=True):
+    from copy import copy as shallow_copy  # alias: a bare `copy` would shadow the flag
+
+    if copy:
+        return shallow_copy(x)
+    else:
+        return x
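
The `.register` calls suggest that `to_memory` is a `functools.singledispatch` function. Assuming that, the sketch below shows how a type-specific handler with a `copy` flag can be registered. `FakeAwkArray` is a hypothetical stand-in class, and the standard-library `copy` function is imported under an alias so that the `copy` parameter keeps its boolean meaning inside the handler.

```python
from copy import copy as shallow_copy
from functools import singledispatch


class FakeAwkArray:
    """Hypothetical stand-in for an awkward array (assumption for this sketch)."""

    def __init__(self, data):
        self.data = data


@singledispatch
def to_memory(x, copy=True):
    # Fallback for types without a registered handler.
    raise NotImplementedError(f"No handler for {type(x).__name__}")


@to_memory.register(FakeAwkArray)
def _(x, copy=True):
    # Shallow copy only when requested; otherwise hand back the same object.
    return shallow_copy(x) if copy else x


arr = FakeAwkArray([1, 2, 3])
same = to_memory(arr, copy=False)
copied = to_memory(arr, copy=True)
```

A shallow copy creates a new wrapper object while sharing the underlying `data`, which matches the commit's reasoning that the buffers are immutable and need no deep copy.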
diff --git a/anndata/_core/index.py b/anndata/_core/index.py
@@ -7,7 +7,7 @@
 import numpy as np
 import pandas as pd
 from scipy.sparse import spmatrix, issparse
-from ..compat import DaskArray, Index, Index1D
+from ..compat import AwkArray, DaskArray, Index, Index1D
 
 
 def _normalize_indices(
@@ -145,6 +145,13 @@ def _subset_df(df: pd.DataFrame, subset_idx: Index):
     return df.iloc[subset_idx]
 
 
+@_subset.register(AwkArray)
+def _subset_awkarray(a: AwkArray, subset_idx: Index):
+    if all(isinstance(x, cabc.Iterable) for x in subset_idx):
+        subset_idx = np.ix_(*subset_idx)
+    return a[subset_idx]
+
+
 # Registration for SparseDataset occurs in sparse_dataset.py
 @_subset.register(h5py.Dataset)
 def _subset_dataset(d, subset_idx):
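
`_subset_awkarray` wraps the index in `np.ix_` so that a pair of index lists selects the full cross product of rows and columns rather than pairing them elementwise. The sketch below shows the difference on a NumPy array, used here as a stand-in for an awkward array; `outer_subset` is a hypothetical name mirroring the guard in the diff.

```python
from collections import abc as cabc

import numpy as np


def outer_subset(a, subset_idx):
    # Same guard as in the diff: build an open mesh only if every index is iterable.
    if all(isinstance(x, cabc.Iterable) for x in subset_idx):
        subset_idx = np.ix_(*subset_idx)
    return a[subset_idx]


a = np.arange(12).reshape(3, 4)
rows, cols = [0, 2], [1, 3]

pointwise = a[rows, cols]              # pairs (0, 1) and (2, 3)
outer = outer_subset(a, (rows, cols))  # rows {0, 2} crossed with columns {1, 3}
```

Slices are not iterable, so an index such as `(slice(None), [1, 3])` skips the `np.ix_` branch and is passed through unchanged.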

diff --git a/anndata/_core/merge.py b/anndata/_core/merge.py
@@ -18,7 +18,7 @@
     Literal,
 )
 import typing
-from warnings import warn
+from warnings import warn, filterwarnings
 
 from natsort import natsorted
 import numpy as np
@@ -27,9 +27,10 @@
 from scipy.sparse import spmatrix
 
 from .anndata import AnnData
-from ..utils import asarray
-from ..compat import DaskArray
+from ..compat import AwkArray, DaskArray
+from ..utils import asarray, dim_len
 from .index import _subset, make_slice
+from anndata._warnings import ExperimentalFeatureWarning
 
 T = TypeVar("T")
 
@@ -154,6 +155,13 @@ def equal_sparse(a, b) -> bool:
         return False
 
 
+@equal.register(AwkArray)
+def equal_awkward(a, b) -> bool:
+    from ..compat import awkward as ak
+
+    return ak.almost_equal(a, b)
+
+
 def as_sparse(x):
     if not isinstance(x, sparse.spmatrix):
         return sparse.csr_matrix(x)
@@ -366,12 +374,14 @@ def apply(self, el, *, axis, fill_value=None):
 
         Missing values are to be replaced with `fill_value`.
         """
-        if self.no_change and (el.shape[axis] == len(self.old_idx)):
+        if self.no_change and (dim_len(el, axis) == len(self.old_idx)):
             return el
         if isinstance(el, pd.DataFrame):
             return self._apply_to_df(el, axis=axis, fill_value=fill_value)
         elif isinstance(el, sparse.spmatrix):
             return self._apply_to_sparse(el, axis=axis, fill_value=fill_value)
+        elif isinstance(el, AwkArray):
+            return self._apply_to_awkward(el, axis=axis, fill_value=fill_value)
         elif isinstance(el, DaskArray):
             return self._apply_to_dask_array(el, axis=axis, fill_value=fill_value)
         else:
@@ -468,6 +478,22 @@ def _apply_to_sparse(self, el: spmatrix, *, axis, fill_value=None) -> spmatrix:
 
         return out
 
+    def _apply_to_awkward(self, el: AwkArray, *, axis, fill_value=None):
+        import awkward as ak
+
+        if self.no_change:
+            return el
+        elif axis == 1:  # Indexing by field
+            if self.new_idx.isin(self.old_idx).all():  # inner join
+                return el[self.new_idx]
+            else:  # outer join
+                # TODO: this code isn't actually hit, we should refactor
+                raise Exception("This should be unreachable, please open an issue.")
+        else:
+            if len(self.new_idx) > len(self.old_idx):
+                el = ak.pad_none(el, 1, axis=axis)  # axis == 0
+            return el[self.old_idx.get_indexer(self.new_idx)]
+
 
 def merge_indices(
     inds: Iterable[pd.Index], join: Literal["inner", "outer"]
@@ -534,6 +560,17 @@ def concat_arrays(arrays, reindexers, axis=0, index=None, fill_value=None):
         )
         df.index = index
         return df
+    elif any(isinstance(a, AwkArray) for a in arrays):
+        from ..compat import awkward as ak
+
+        if not all(
+            isinstance(a, AwkArray) or a is MissingVal or 0 in a.shape for a in arrays
+        ):
+            raise NotImplementedError(
+                "Cannot concatenate an AwkwardArray with other array types."
+            )
+
+        return ak.concatenate([f(a) for f, a in zip(reindexers, arrays)], axis=axis)
     elif any(isinstance(a, sparse.spmatrix) for a in arrays):
         sparse_stack = (sparse.vstack, sparse.hstack)[axis]
         return sparse_stack(
@@ -579,6 +616,15 @@ def gen_inner_reindexers(els, new_index, axis: Literal[0, 1] = 0):
             lambda x, y: x.intersection(y), (df_indices(el) for el in els)
         )
         reindexers = [Reindexer(df_indices(el), common_ind) for el in els]
+    elif any(isinstance(el, AwkArray) for el in els if not_missing(el)):
+        if not all(isinstance(el, AwkArray) for el in els if not_missing(el)):
+            raise NotImplementedError(
+                "Cannot concatenate an AwkwardArray with other array types."
+            )
+        common_keys = intersect_keys(el.fields for el in els)
+        reindexers = [
+            Reindexer(pd.Index(el.fields), pd.Index(list(common_keys))) for el in els
+        ]
     else:
         min_ind = min(el.shape[alt_axis] for el in els)
         reindexers = [
@@ -596,10 +642,38 @@ def gen_outer_reindexers(els, shapes, new_index: pd.Index, *, axis=0):
             else (lambda _, shape=shape: pd.DataFrame(index=range(shape)))
             for el, shape in zip(els, shapes)
         ]
-    else:
-        # if fill_value is None:
-        # fill_value = default_fill_value(els)
+    elif any(isinstance(el, AwkArray) for el in els if not_missing(el)):
+        import awkward as ak
 
+        if not all(isinstance(el, AwkArray) for el in els if not_missing(el)):
+            raise NotImplementedError(
+                "Cannot concatenate an AwkwardArray with other array types."
+            )
+        warn(
+            "Outer joins on awkward.Arrays will have different return values in the future. "
+            "For details, and to offer input, please see:\n\n\t"
+            "https://github.com/scverse/anndata/issues/898",
+            ExperimentalFeatureWarning,
+        )
+        filterwarnings(
+            "ignore",
+            category=ExperimentalFeatureWarning,
+            message=r"Outer joins on awkward.Arrays will have different return values.*",
+        )
+        # all_keys = union_keys(el.fields for el in els if not_missing(el))
+        reindexers = []
+        for el in els:
+            if not_missing(el):
+                reindexers.append(lambda x: x)
+            else:
+                reindexers.append(
+                    lambda x: ak.pad_none(
+                        ak.Array([]),
+                        len(x),
+                        0,
+                    )
+                )
+    else:
         max_col = max(el.shape[1] for el in els if not_missing(el))
         orig_cols = [el.shape[1] if not_missing(el) else 0 for el in els]
         reindexers = [