Alignment with xarray #744

Open
1 of 5 tasks
ivirshup opened this issue Mar 23, 2022 · 18 comments

@ivirshup
Member

ivirshup commented Mar 23, 2022

I'm opening this issue to track and discuss how our data structure differs from xarray. Ideally I would close it when AnnData could easily be implemented via xarray.

Some previous discussion: #308

The idea

I often think of AnnData as a kind of "special case" of xarray Datasets. We just improve convenience by specializing on the 2d case, plus a few other features. It would be nice if I didn't just think of it that way, and we could actually just use their code here.
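
To make the "special case" idea concrete, here is a minimal sketch of what an "anndata shaped" xarray Dataset could look like. The toy data and the variable names (`X`, `cell_type`, `highly_variable`, `X_pca`) are assumptions chosen to mirror AnnData's `X`, `obs`, `var` and `obsm`; this is not how anndata or sgkit lay out their data.

```python
import numpy as np
import xarray as xr

# Hypothetical toy data: 3 observations (cells) x 2 variables (genes).
obs_names = ["cell_a", "cell_b", "cell_c"]
var_names = ["gene_1", "gene_2"]

ds = xr.Dataset(
    data_vars={
        # Main matrix, aligned to both dimensions (like AnnData.X).
        "X": (("obs", "var"), np.arange(6.0).reshape(3, 2)),
        # Per-observation annotation (like a column of AnnData.obs).
        "cell_type": ("obs", np.array(["T", "B", "T"])),
        # Per-variable annotation (like a column of AnnData.var).
        "highly_variable": ("var", np.array([True, False])),
        # Observation-aligned embedding (like an entry of AnnData.obsm).
        "X_pca": (("obs", "pc"), np.zeros((3, 2))),
    },
    coords={"obs": obs_names, "var": var_names},
)

# Selecting by label subsets every obs-aligned variable at once.
subset = ds.sel(obs=["cell_c", "cell_a"])
```

Variables that only span `var` (here `highly_variable`) are left untouched by the `obs` selection, which is the alignment behaviour AnnData provides for the 2d case.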

sgkit basically accomplishes this. It uses a very "anndata shaped" [1] xarray Dataset [2] for representing genomics data. These data structures and our goals with them are so similar that searching for open issues by the sgkit devs on the xarray repository is a great way to find compatibility issues for anndata.

Additionally, zarr and OME-zarr are quite aligned with xarray.

What's missing

Some things we need, which xarray does not currently provide:

Footnotes

  1. Since we're in the same language, working with biological data, and using many of the same technologies it would make a lot of sense for us to have greater alignment with sgkit.

  2. More context: https://github.com/single-cell-data/matrix-api/issues/11#issuecomment-1072533371

@jakirkham

cc @jpivarski (who may be interested in the Awkward Array connection)

@jpivarski

Supporting Awkward Arrays would likely prevent full reimplementation of anndata with xarray alone, since xarrays can't contain Awkward Arrays or vice-versa. Even the "tree-like data structure" on xarray's road map (experimentally implemented by Datatree) is not quite the same thing, as Datatrees are more like nested groups in an HDF file (as seen in these docs): a small number of nested objects, which can each be large. Awkward Arrays represent a large number of nested objects. The comparison is like "AoS vs SoA" (just an analogy). This comment, pydata/xarray#4118 (comment), seems to be spelling out the difference, and I'm following up with the author on scikit-hep/awkward#1396.

As a side note, it looks like there could be some benefit to xarrays containing Awkward Arrays (and not the other way around). That's something I should probably ask the xarray developers someday. Datatree is extending Dataset in a bigger way than it would probably take to wrap an Awkward Array.

Unless/until we actually do that, implementation of anndata with xarray would have to have some way to handle the fact that Awkward Arrays are not included within xarray's data model.

@ivirshup
Member Author

ivirshup commented Apr 6, 2022

> Supporting Awkward Arrays would likely prevent full reimplementation of anndata with xarray alone, since xarrays can't contain Awkward Arrays or vice-versa.
> ...
> As a side note, it looks like there could be some benefit to xarrays containing Awkward Arrays

My mental model here was a 1d xr.DataArray containing an ak.Array. This seems fairly doable to me since you really only need labels -> positional indices. Figuring out the merging/concatenation semantics here could take some more doing, but also strikes me as possible.

Random thought: storing an arrow ListArray inside an xr.DataArray could get us part way here.

@jpivarski

Can you put Arrow data in xarray? Arrow is interchangeable with Awkward Array, so having Arrow can be seen as equivalent to having Awkward. The ak.to_arrow and ak.from_arrow functions are usually zero-copy, too. If that's already a possibility, it's more than part way there.


The main way in which Awkward Arrays differ from all the other array types is that Awkward Arrays do not have shape and dtype. (Same for Arrow arrays, for the same reason.) That's usually the first thing that we find when we attempt to put Awkward Arrays into Pandas or Dask naively. It's also why we can't participate in the Python array API standard.
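
A small NumPy-only illustration of why ragged data does not fit the shape/dtype model (this sketch deliberately avoids Awkward's own API): forcing variable-length rows into an ndarray only works with `dtype=object`, and the resulting shape says nothing about the inner lengths.

```python
import numpy as np

# Rectangular data: shape and dtype describe it completely.
regular = np.array([[1, 2], [3, 4], [5, 6]])

# Ragged data: rows have different lengths.
ragged_rows = [[1, 2, 3], [], [4, 5]]

# NumPy can only hold this as a 1-D array of Python objects.
as_objects = np.array(ragged_rows, dtype=object)

# The per-row lengths are invisible to shape and dtype.
row_lengths = [len(row) for row in as_objects]
```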

A single ak.Array can be split apart into a small number of buffers of different sizes, each of which can be an xr.DataArray, along with some metadata to put them back again. That was the idea for using Awkward Array in Zarr: one ak.Array becomes one Zarr group of datasets. Since xarray Datatree is like Zarr and HDF5 groups, one ak.Array could be decomposed into a Datatree using ak.to_buffers and reconstituted using ak.from_buffers.

@jakirkham

jakirkham commented Apr 6, 2022

> The main way in which Awkward Arrays differ from all the other array types is that Awkward Arrays do not have shape and dtype. (Same for Arrow arrays, for the same reason.) That's usually the first thing that we find when we attempt to put Awkward Arrays into Pandas or Dask naively. It's also why we can't participate in the Python array API standard.

Bit of a tangent, but it might be worthwhile to write up a Data Array API issue about the Awkward Array use case.

@jpivarski

> Bit of a tangent, but it might be worthwhile to write up a Data Array API issue about the Awkward Array use case.

We already talked about it here: data-apis/consortium-feedback#6. It sounded pretty clear that Awkward (and by extension, Arrow) are out of scope for Data Array API, and it's understandable that the scope would have to cut off somewhere.

@SimonHeybrock

If anyone is looking for more confusion, I'd like to mention scipp, and in particular its Binned data feature. This is somewhat similar to a DataArray containing an Awkward Array of records. Happy to share more info if someone is interested.

@ivirshup
Member Author

ivirshup commented Jun 7, 2022

@SimonHeybrock, thanks for pointing that out! From my initial look, the API for scipp looks quite nice. It does seem to cater to some use-cases we're looking at more than the more geospatial focus of xarray.

However, I really like that xarray can hold various types of python arrays. For instance, sparse arrays are very important to us – and I'd expect dask will become important as well.

@SimonHeybrock

@ivirshup The two things you point out (holding other Python arrays, dask support) are indeed somewhat sore points for us. We would like to do both, but currently have no funding to do so.

We have serialization compatible with dask, so a number of the dask multi-processing APIs can be used, but we do not have an implementation of the dask collections interface, i.e., we currently do not support chunking and operations in the style of xarray's dask support.

@ilan-gold
Contributor

Another potential ask here: not reading the dims (like the indices of a dataframe) into memory when a Dataset is declared.

@scverse scverse deleted a comment from github-actions bot Aug 1, 2023
@jhamman

jhamman commented Sep 27, 2023

👋 Hi folks! Xarray dev here. Just wanted to drop a note to say that we'd be happy to help move this issue forward if/when it becomes a priority. We've been making lots of progress toward flexible indexes and array backends that I assume would be of interest here.

@ivirshup
Member Author

Hey @jhamman! I think it's pretty close to becoming a priority. Figuring out how heavy of a lift sparse arrays will be is the main thing here. Could you point me to any recent developments around array backends? Are we even talking like a-couple-hours-ago recent?

@dcherian

dcherian commented Sep 27, 2023

Yes "couple of hours" recent. We will refactor out that NamedArray piece over the next couple of months to a new library with minimal dependencies (no pandas!) and support for any array API (+ other array protocols) compliant object.

Please read the design doc and let us know what you think. Your input will be very valuable!

> Figuring out how heavy of a lift sparse arrays will be is the main thing here.

pydata/sparse is supported. scipy.sparse needs to become array API compliant (which I think is on the cards? you'll know more!). Bottom line is we want to support any standards-conforming array library.


From the list in your initial post though, it seems like NamedArray isn't entirely what you want.

  • For hierarchies you'd want datatree (as noted), but that pulls xarray, which will pull pandas.
  • We haven't considered repeated dims yet, but I bet we could support some set of reasonable cases.
  • Categorical variables are interesting. Again, if there was some array standard compliant container, we'd want to be able to wrap that too.
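
On the categorical point, one way to see what a wrapped container would need to carry is the codes/categories split that pandas already exposes. This is only a sketch of the data model with made-up values; it is not the approach taken in anndata or planned for NamedArray.

```python
import numpy as np
import pandas as pd

cell_type = pd.Categorical(["T", "B", "T", "NK"], categories=["B", "NK", "T"])

# A categorical is an integer code per element plus a small lookup table.
codes = np.asarray(cell_type.codes)       # plain integer array
categories = list(cell_type.categories)   # lookup table of labels

# The two pieces are enough to rebuild the original values.
rebuilt = pd.Categorical.from_codes(codes, categories=categories)
```

The integer codes are an ordinary array that any array container can hold; the open question in the thread is where the categories live and how operations treat them.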

@ilan-gold
Contributor

ilan-gold commented Sep 28, 2023

@dcherian You can see here roughly what we have working at the moment for categoricals: https://github.com/scverse/anndata/pull/947/files#diff-3593f379977a83708f011798996a4e97ec3cf87f11055e3f93651a9718ae4db2R34 We also have something for nullable data types as well. Feedback welcome!

@jpivarski

Follow up on this topic at scikit-hep/ragged#6

@grst
Contributor

grst commented Jan 3, 2024

Just as a note, the scope of the ragged library does not cover what we are currently doing in scirpy (heavy use of RecordTypes), nor what @Zethson is planning in ehrapy (arbitrary nesting). So we'd likely need support for the full awkward array anyway.

@jpivarski

Right—sorry for the confusion. If all the conversations linked to the new one, this one is perhaps the least related. I know that you've used missing data and even unions, which will not be supported by the ragged library.

Also, it's no minor thing that you've adapted AnnData to use Awkward: the work has been done. I think the users of the Ragged library would be wanting to make smaller changes to adopt something that looks like a normal array.

@grst
Contributor

grst commented Jan 3, 2024

All good! Thanks for keeping us in the loop of that discussion!
