first attempt to support awkward arrays #647
Codecov Report
@@ Coverage Diff @@
## master #647 +/- ##
==========================================
+ Coverage 83.12% 83.21% +0.08%
==========================================
Files 34 34
Lines 4416 4503 +87
==========================================
+ Hits 3671 3747 +76
- Misses 745 756 +11
|
I doubt it; I also can't think of any scanpy function that could use such a representation out of the box. We'd have multiple use cases in squidpy though! Essentially, being able to slice/subset/concatenate and copy anndata while preserving the 0-axis of an awkward array would cover most of the cases I can think of right now. |
@giovp, for Could you modify |
Yeah, you can't have a multi-dimensional awkward array, but I think it would still be good to concatenate them across axis=0, so this should be supported and hence escape the current alternate_axes check?
I will do that and run the tests locally, sorry for that |
Ahh yeah, this is what I meant. Basically have a case for all elements being awkward arrays. |
Ok, at this stage concat works both for inner and outer on obsm, and subsetting works for varm.
Example
import numpy as np
import scanpy as sc
import squidpy as sq
import matplotlib.pyplot as plt
from numpy.random import default_rng
from sklearn.datasets import make_blobs
import awkward as ak
import pandas as pd
from cycler import cycler
sc.set_figure_params()
adata = sq.datasets.visium_hne_adata()
varm = adata.obsm["spatial"][15:30, :]
adata = adata[:10, :15].copy()
adata.obsm["spatial"] = (
    adata.obsm["spatial"] - np.std(adata.obsm["spatial"], 0)
) / np.mean(adata.obsm["spatial"], 0)
adata.varm["spatial"] = varm.copy()
adata.varm["spatial"] = (
    adata.varm["spatial"] - np.std(adata.varm["spatial"], 0)
) / np.mean(adata.varm["spatial"], 0)
obs_list = []
var_list = []
rng = default_rng(42)
for idx in adata.obs_names.values:
    coord, _ = make_blobs(
        n_samples=rng.integers(5, 15),
        cluster_std=0.02,
        centers=adata[idx].obsm["spatial"],
        random_state=42,
    )
    obs_list.append(coord)
for idx in adata.var_names.values:
    coord, _ = make_blobs(
        n_samples=rng.integers(5, 15),
        cluster_std=0.02,
        centers=adata[:, idx].varm["spatial"],
        random_state=42,
    )
    var_list.append(coord)
sub_obs = ak.Array(obs_list)
sub_var = ak.Array(var_list)
def plot_points(adata, main: np.ndarray, sub: ak.Array, axis: int, cmap_name):
    fig, ax = plt.subplots(1, 2, figsize=(10, 5))
    ax[1].set_prop_cycle(
        cycler("color", plt.get_cmap(cmap_name)(np.linspace(0, 1, len(main))))
    )
    ax[0].axis("equal")
    ax[0].scatter(
        x=main[:, 0], y=main[:, 1], c="grey", edgecolors="black", s=100, linewidths=1
    )
    for i in range(adata.shape[axis]):
        ax[1].scatter(
            x=sub[i, :, 0],
            y=sub[i, :, 1],
            edgecolors="black",
            alpha=0.7,
        )
    ax[1].axis("equal")
    return
plot_points(adata, adata.obsm["spatial"], sub_obs, 0, "winter")
plot_points(adata, adata.varm["spatial"], sub_var, 1, "cool")
[Figures: adata.obsm, adata.varm]
Slicing and copying also works
adata.obsm["sub_obs"] = sub_obs  # sub_obs is an awkward array
adata.varm["sub_var"] = sub_var
adata_subset = adata[:5, :7]  # let's subset
If I run test locally with |
I think you just need to add awkward array to the testing dependencies |
Also, for |
🤦 🤦 🤦 So I'm at a stage where I could keep fixing and adding tests for awkward arrays for ages, and I'm happy to do so, but I would like to get a sense of how welcome this PR is. As it stands, we'd have to include awkward array as an (optional) dependency, and it would be a non-negligible addition to the code base (although I expected worse; singledispatch made it quite slim). A major thing that could be impactful is that there is no Overall, I think we could use it right away in Squidpy, mostly as a way to enable anndata slicing and indexing while retaining sub-obs or sub-var info. We would not really use it to do any arithmetic or anything like that, although it would be cool to come up with function ideas, and the fact that it supports numba and jax jitting is cool too. So, to summarize, shall I go ahead? @ivirshup @hspitzer @Zethson @michalk8 ? |
I'm all for it. Having more record like data has been requested a number of times. I would note that there are some things in the awkward array api that will change (scikit-hep/awkward#1151), but that's mostly down the line stuff.
I believe it does have |
This fixed a number of tests because we had a 1d awkward array being generated, and we currently don't support 1d arrays in obsm well. Tracked in #652.
…the arrays weren't broadcastable.
@grst tests are passing! I think we need to open some issues on behavior we want to change, like the unions from outer concatenation. I would also like to take a look over the coverage, and see what we're missing. How's the tutorial going? |
I can finish that at the beginning of next week |
Congratulations!!! This is not an ordinary PR. |
First draft at supporting awkward arrays.
As discussed with @ivirshup, this would be useful for Squidpy @hspitzer (also discussed here: #609) and potentially for @Zethson's EHR project.
Here's a walkthrough showcasing some basic functionality we could use for sub-observation annotations (e.g. spatial coordinates of RNA molecules/segmentations).
What fails atm is adata.concatenate(), because of errors in reindexing of the alternate axis:
https://github.com/theislab/anndata/blob/286bc7f207863964cb861f17c96ab24fe0cf72ac/anndata/_core/merge.py#L478
In awkward arrays there is no shape attribute, so when array.shape[0] is needed we can resort to len(awkward_array), but for array.shape[1] we should (probably) simply skip it and concatenate. In awkward it would be like this:
TODO
- v2 api of awkward. v1 EOL is in four months and bugs are already not getting fixed
- [ ] address newly added TODOs in the code
Tests
- dim_len function
- .uns
- adata.obsm["x"] = awk)
- [ ] What's the right fill value for when join=outer and a.obsm["x"] exists but b.obsm["x"] doesn't
- merge="equal" (add more tests for the equality function)
Docs
- [ ] add examples