
Switch back to 2D-only arrays (WIP) #55

Merged: 5 commits, Sep 26, 2018

Conversation

@flying-sheep (Member) commented Aug 27, 2018

We always have either no data or 2D data. Representing this super-flexibly as a scalar or 1D array is error-prone, and getting a 1D array explicitly is just one .flatten() call away for the user.

Fixes #42, fixes #56

@flying-sheep changed the title from "Switch back to 2D-only arrays" to "Switch back to 2D-only arrays (WIP)" on Aug 27, 2018
@falexwolf (Member)

I'll have time again during the next couple of days - am just returning from my wedding weekend.

I oppose this change. Automatic flattening wasn't a mistake; it made a lot of code a lot simpler, which is why numpy arrays do it. If we go back to 2D-only arrays, many people's code will break.

@flying-sheep (Member, Author) commented Aug 27, 2018

your wedding was awesome, thank you again for inviting me!


regarding this one, i’m sad that we didn’t talk about your change before scanpy 1.0. i very deliberately decided to always leave the backing storage 2D when I implemented it.

numpy has the ndarray (n-D array). we have the 2D anndata.

slicing it will (even now) always return a 2D anndata, but with backing storage that either always stays 2D (sparse matrices) or can degrade to 1D or even 0D (dense arrays). that’s confusing! backing storage should always have the same shape as the anndata itself.
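To make that discrepancy concrete, here is a minimal sketch using plain numpy and scipy rather than AnnData itself; the shapes in the comments are standard numpy/scipy slicing behavior:

import numpy as np
from scipy import sparse

dense = np.ones((2, 2))
sp = sparse.csr_matrix(dense)

dense[:, 0].shape   # (2,)   -- the dense slice degrades to 1D
dense[0, 0].shape   # ()     -- and can even collapse to a 0D scalar
sp[:, 0].shape      # (2, 1) -- the sparse matrix always stays 2D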

i don’t think putting in .flatten() a few times makes users’ code noticeably more complex. however, we currently have at least two bugs in our own code caused by that behavior. We probably have more, and users probably do too.

anndata is not 1.0 yet, but obviously people won’t install scanpy with an older anndata version. we could say anndata~=0.7.0 in the requirements and release anndata 0.8 with the new 2D behavior, but that means scanpy 1.0 users will never get anndata 0.9 with whatever other improvements it may bring.

@falexwolf (Member) commented Aug 28, 2018

Sure, I'm glad you came! 😄


I already made the change more than a year ago, when anndata was still part of Scanpy: d1f28b8

It's been in the docs for more than a year:

A data matrix is flattened if either n_obs or n_vars is 1, so that numpy’s slicing behavior is reproduced:

adata = AnnData(np.ones((2, 2)))
adata[:, 0].X == adata.X[:, 0]

Scanpy 1.0 was released in April. It's unfortunate that you weren't aware of the change.


i very deliberately decided to always leave the backing storage 2D when I implemented it.

I know, we even talked about it! This concerns a discussion we had in March 2017. I had many discussions about this in April and May 2017 with other users and made the change probably around that time.

The consequence was that a lot of .flatten() calls disappeared from users' code. The prototypical call of

adata[:, 'Gata1'].X

then always returned a dense one-dimensional vector instead of a 2D array. With that, AnnData became essentially as convenient as a dataframe, where the call is

df['Gata1']

and returns a Series object.
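For comparison, both pandas and plain numpy drop the extra dimension in this situation; a small illustrative sketch (the column names just mirror the example above):

import numpy as np
import pandas as pd

X = np.array([[1.0, 2.0], [3.0, 4.0]])
df = pd.DataFrame(X, columns=['Gata1', 'Gata2'])

df['Gata1']   # a 1D Series, not a one-column DataFrame
X[:, 0]       # numpy flattens too: shape (2,)
X[:, [0]]     # keeping 2D requires explicit list indexing: shape (2, 1)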

I acknowledge that it might have been better to introduce another access operation for retrieving a 1D column, but before getting into that, there are further arguments.


AnnData is a container object for data and annotations. To me, data is always a tensor where the first axis labels the observations, and the second axis starts labeling the dimensions of the observations. For structured data, these are scalars and a simple label (like a gene name) is enough. But now consider spatial transcriptomics, where you have, say, a 2D array of measurements for each gene. Then .X will be a 4D tensor.
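For concreteness, such a spatial layout might look as follows; the sizes are made up purely for illustration:

import numpy as np

n_obs, n_genes, height, width = 100, 500, 8, 8   # made-up sizes
X = np.zeros((n_obs, n_genes, height, width))    # .X as a 4D tensor
X.shape   # (100, 500, 8, 8): observations x genes x a 2D measurement grid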

Having multiple observations of a scalar is most conveniently represented as a flat vector, not a 2D array. As the whole point of AnnData is convenient access to data and annotations, I'm strongly advocating keeping things as they are.


I have not yet tried to fix the bug @mckinsel discovered (#42). But I'm sure it's easy to fix. I don't know which problems @M0hammadL had, but I'm sure they are easy to fix, too.


Why are you using the notion "backing storage"? Let's stick with the notion of a data matrix. In the future, this will be a data tensor, all the more so as people already use anndata with TensorFlow.


What do you think?

@flying-sheep (Member, Author) commented Aug 29, 2018

The core of my argument was that this behavior is confusing and error-prone (as evidenced by the two mentioned bugs in scanpy: #42 & #56). They’re both caused by the fact that subsetting AnnData objects can change the shape of .X, which proves my point: People (evidently even the authors of scanpy) can’t remember this edge-case behavior and are bitten by it in a weird and hard-to-trace way: TypeError: 'numpy.int64' object is not iterable doesn’t help at all!
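A minimal reproduction of how that error message can arise (this is not the actual scanpy code from #42/#56, just the failure mode): indexing both axes of an integer array silently collapses it to a 0D numpy scalar, and any downstream loop over it then fails with exactly that message:

import numpy as np

X = np.ones((1, 1), dtype=np.int64)
value = X[0, 0]    # silently collapses to a 0D numpy.int64 scalar
for v in value:    # TypeError: 'numpy.int64' object is not iterable
    pass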

Experienced users might learn what the weird error message means after a while (oh, that one again!), but newbies will always have to waste (their and/or our) time when encountering it.

Storing higher-dimensional data is beside the point; let me generalize my argument: if we have an N-D tensor, it should stay N-D and not degrade to N-1 or N-2 dimensions when subsetting observations and/or variables. Actually I think that’s a great argument for my position: things will get even more confusing once, say, a 2D tensor is only interpretable by cross-referencing its shape with the AnnData’s shape.

You also didn’t address my other argument: that only ndarrays degrade, not sparse matrices. They always stay 2D. This introduces further inconsistency and limits the benefit of not having to call .flatten() to dense arrays.

Therefore I think we should have gone with the option you mentioned: Another API for getting rid of one or the other dimension.

@falexwolf (Member)

People (evidently even the authors of scanpy) can’t remember this edge-case behavior and are bitten by it in a weird and hard-to-trace way: TypeError: 'numpy.int64' object is not iterable doesn’t help at all!

I'm sorry that you missed the change 1.5 years ago. It astonishes me that you weren't bitten by it earlier.

Storing higher-dimensional data is beside the point; let me generalize my argument: if we have an N-D tensor, it should stay N-D and not degrade to N-1 or N-2 dimensions when subsetting observations and/or variables. Actually I think that’s a great argument for my position: things will get even more confusing once, say, a 2D tensor is only interpretable by cross-referencing its shape with the AnnData’s shape.

Arrays were designed to automatically flatten upon slicing because many people (me included) consider this highly intuitive behavior. Not only for 2D→1D but also for 4D→3D (slicing a gene in the spatial case, for instance). Also, in the high-dimensional case, .obs and .var won't have that simple correspondence to the axes of .X anymore: .obs will match the first index, and .var will correspond to the "interpretable" variables or to a collection of variable axes, but certainly not to pixels.

You also didn’t address my other argument: that only ndarrays degrade, not sparse matrices. They always stay 2D. This introduces further inconsistency and limits the benefit of not having to call .flatten() to dense arrays.

That's not true. Slicing a variable from an AnnData that has a sparse .X has been returning a dense (1D) vector for more than a year. I don't know why you missed it. This feature has added a lot of convenience for me when working with AnnData.


I completely agree that the code becomes simpler if one doesn't cast .X to different dimensions. And probably a "convenience accessor" instead of casting .X would have been a better solution. But it's way too late for that change, and even with the present solution it's definitely not a lot of work to fix the remaining bugs. Also for the future, I don't see fundamental problems, just a bit more bookkeeping to be done. After 1.5 years of bug reports, I don't think there are any problems for users: both #42 and #56 are absolutely non-canonical use cases. That's why I'm taking so much time with fixing them.

@flying-sheep (Member, Author)

OK, you do the lion’s share of all the coding and bug fixing, so of course you have the last word.

@codecov bot commented Sep 26, 2018

Codecov Report

Merging #55 into master will not change coverage.
The diff coverage is 100%.


@@           Coverage Diff           @@
##           master      #55   +/-   ##
=======================================
  Coverage   65.19%   65.19%           
=======================================
  Files           9        9           
  Lines         727      727           
=======================================
  Hits          474      474           
  Misses        253      253
Impacted Files Coverage Δ
anndata/readwrite/read.py 64.68% <100%> (-0.4%) ⬇️
anndata/readwrite/write.py 69.34% <100%> (+0.5%) ⬆️

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@falexwolf (Member) commented Sep 26, 2018

I'm adapting this such that ._X actually always remains 2D, but .X still gets flattened upon slicing. Then the whole thing is backwards compatible.

We can then go and see how we move from here: #60 (comment)

One option would be to move to an accessor .values that would always be 2D and deprecate .X in the long run. But this will be another pull request...

@falexwolf (Member) commented Sep 26, 2018

OK, I'm merging this now. ._X is now always 2D. .X is presented to the user flattened when slicing a single dimension.

This now also fixes #42. We can look at #56 in a new pull request. Should be easy to address now.

Same for the discussion #60 (comment) and the comment about .values above.
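A minimal sketch of the pattern described above, not the actual AnnData implementation: the internal matrix stays 2D, and only the user-facing accessor flattens when one dimension has length 1 (the class name and everything apart from ._X/.X are hypothetical):

import numpy as np

class Matrix2DView:
    def __init__(self, X):
        self._X = np.atleast_2d(np.asarray(X))   # storage always stays 2D

    @property
    def X(self):
        # present a flattened view when one dimension has length 1
        if 1 in self._X.shape:
            return self._X.flatten()
        return self._X

m = Matrix2DView(np.ones((2, 1)))
m._X.shape   # (2, 1) -- internal storage remains 2D
m.X.shape    # (2,)   -- the user-facing view is flattened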

Successfully merging this pull request may close these issues:

Can’t concat single-observation anndatas
Failure reading single observation h5ads