Switch back to 2D-only arrays (WIP) #55
Conversation
I'll have time again during the next couple of days; I'm just returning from my wedding weekend. I'm opposing this change. Automatic flattening wasn't a mistake; it made a lot of code a lot simpler. This is why numpy arrays do it. If we go back to 2d arrays, many people's code will break.
your wedding was awesome, thank you again for inviting me! regarding this one, i'm sad that we didn't talk about your change before scanpy 1.0. i very deliberately decided to always leave the backing storage 2D when I implemented it. numpy has the ndarray (n-D array); we have the 2D anndata. slicing it will (even now) always return a 2D anndata, but with a backing storage that's either always 2D (sparse matrices) or can degrade to 1D or even 0D. that's confusing! backing storage should always have the same shape as the anndata itself. i don't think putting in a breaking change is a problem, since anndata is not 1.0 yet, but obviously people won't install scanpy with an older anndata version. we could say
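(For readers following along, here is a minimal sketch of the shape behavior described above, using plain numpy and scipy rather than anndata itself: dense arrays drop dimensions under scalar indexing, while sparse matrices always stay 2D.)

```python
import numpy as np
import scipy.sparse as sp

X_dense = np.arange(12).reshape(3, 4)   # 2D backing storage
print(X_dense[:, 1].shape)              # (3,)  -> degraded to 1D
print(X_dense[1, 1].shape)              # ()    -> degraded to 0D (a scalar)

X_sparse = sp.csr_matrix(X_dense)       # the same data, stored sparsely
print(X_sparse[:, 1].shape)             # (3, 1) -> still 2D
print(X_sparse[1:2, 1:2].shape)         # (1, 1) -> still 2D
```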
Sure, I'm glad you came! 😄 I made the change more than a year ago, when anndata was still part of Scanpy: d1f28b8. It's been in the docs for more than a year:
Scanpy 1.0 was released in April. It's unfortunate that you weren't aware of the change.
I know, we even talked about it! This concerns a discussion we had in March 2017. I had many discussions about this in April and May 2017 with other users and made the change probably around that time. The consequence was that a lot of
always returned a dense one-dimensional vector and no longer a 2d array. With that, AnnData became essentially as convenient as a dataframe, where the call is
and returns a Series object. I acknowledge that one might better have introduced another access operation for retrieving a 1d array column, but before getting into that, there are further arguments.

AnnData is a container object of data and annotations. To me, data is always a tensor where the first axis labels the observations and the second axis starts labeling the dimensions of the observations. For structured data, these are scalars and a simple label (like a gene name) is enough. But now consider spatial transcriptomics, where you have, say, a 2d array of measurements for each gene. Then

Having multiple observations of a scalar is most conveniently represented as a column vector, not a 2d array. As the whole point of AnnData is convenient access to data and annotations, I'm strongly advocating to keep things as they are.

I have not yet tried to fix the bug @mckinsel discovered (#42), but I'm sure it's easy to fix. I don't know which problems @M0hammadL had, but I'm sure they are easy to fix, too.

Why are you using the notion "backing storage"? Let's stick with the notion of a data matrix. In the future, this will be a data tensor, even more so as people use anndata with tensorflow, which is already the case. What do you think?
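(A rough sketch of the dataframe analogy, with plain pandas and numpy standing in, since the actual AnnData call is elided above.)

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.rand(5, 3), columns=["gene_a", "gene_b", "gene_c"])
s = df["gene_a"]                    # a pandas Series: a labeled, one-dimensional column
print(type(s).__name__, s.shape)    # Series (5,)

# the flattening behavior under discussion is analogous to this on a plain array:
X = df.to_numpy()
print(X[:, 0].shape)                # (5,) -> a dense one-dimensional vector
```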
The core of my argument was that this behavior is confusing and error-prone (as evidenced by the two mentioned bugs in scanpy: #42 & #56). They're both caused by subsetting AnnData objects resulting in the mentioned change of shape, which proves my point: people (evidently even the authors of scanpy) can't remember this edge-case behavior and get bitten by it in a weird and hard-to-trace way. Experienced users might learn what the weird error message means after a while (oh, that one again!), but newbies will always have to waste (their and/or our) time when encountering it.

Storing higher-dimensional data is beside the point, so let me generalize my argument: if we have an ND tensor, it should stay ND and not degrade to N-1 or N-2 dimensions when subsetting observations and/or variables. Actually I think that's a great argument for my position: things will get even more confusing once we have, e.g., a 2D tensor that is only interpretable by cross-referencing its shape with the AnnData's shape.

You also didn't address my other argument: only ndarrays degrade, not sparse matrices; they always stay 2D. This introduces further inconsistency and limits the benefit of not having to call "flatten" to dense arrays. Therefore I think we should have gone with the option you mentioned: another API for getting rid of one or the other dimension.
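(A small illustration of the "stay ND" behavior argued for here, in plain numpy terms: indexing with slices or lists instead of scalars preserves the number of dimensions.)

```python
import numpy as np

T = np.zeros((10, 20, 30))      # a 3D tensor

# scalar index: one dimension is dropped
print(T[:, 0, :].shape)         # (10, 30)

# slice or list index: dimensionality is preserved
print(T[:, 0:1, :].shape)       # (10, 1, 30)
print(T[:, [0], :].shape)       # (10, 1, 30)
```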
I'm sorry that you missed the change 1.5 years ago. It astonishes me that you weren't bitten by it earlier.
Arrays were designed to automatically flatten upon slicing because many people, me included, consider this highly intuitive behavior. Not only for 2d → 1d but also for 4d → 3d (slicing a gene in the spatial case, for instance). Also, in the high-dimensional case,
That's not true. Slicing a variable from an AnnData that has a sparse

I completely agree that the code becomes simpler if one doesn't cast
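(The comment above is truncated, so the following is only a hedged illustration of the sparse case as I read it: slicing a single variable out of a sparse data matrix stays 2D and sparse, so an explicit cast is needed to get a dense 1D vector.)

```python
import numpy as np
import scipy.sparse as sp

X = sp.random(100, 50, density=0.1, format="csr")   # a sparse data matrix

col = X[:, 3]                           # still a (100, 1) sparse matrix, not a vector
print(type(col).__name__, col.shape)

# getting a dense one-dimensional vector requires an explicit cast:
vec = col.toarray().ravel()
print(vec.shape)                        # (100,)
```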
OK, you do the lion's share of all the coding and bug fixing, so of course you have the last word.
Codecov Report
@@           Coverage Diff           @@
##           master      #55   +/-   ##
=======================================
  Coverage   65.19%   65.19%
=======================================
  Files           9        9
  Lines         727      727
=======================================
  Hits          474      474
  Misses        253      253
=======================================
Continue to review full report at Codecov.
I'm adapting this such that

We can then go and see how we move on from here: #60 (comment)

One option would be to replace

to an accessor
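(A purely hypothetical sketch of the accessor idea mentioned above; the name get_vector is a placeholder and not whatever ends up being decided in #60. The point is just that plain slicing could stay 2D while one explicit helper returns a dense 1D column.)

```python
import numpy as np
import scipy.sparse as sp

def get_vector(X, j):
    """Return variable j of a 2D data matrix X as a dense 1D vector.

    Placeholder for an explicit accessor: plain slicing keeps results
    2D, and this one call is the opt-in way to get a flat column.
    """
    col = X[:, j:j + 1]          # slice, so both dense and sparse stay 2D
    if sp.issparse(col):
        col = col.toarray()
    return np.ravel(col)
```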
Ok, I'm merging this now. This also fixes #42. We can look at #56 in a new pull request; it should be easy to address now. Same for the discussion in #60 (comment) and the comment about
We always have either no data or 2D data. Representing this super-flexibly as a scalar or 1D array is error-prone, and making it one explicitly is one .flatten() away for the user.

Fixes #42, fixes #56
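(To make the ".flatten() away" point concrete, a minimal sketch with a plain 2D numpy array standing in for the data matrix; the AnnData slicing call itself is not shown.)

```python
import numpy as np

X = np.random.rand(100, 2000)   # n_obs x n_vars data matrix, always 2D

col_2d = X[:, 0:1]              # a single-variable slice keeps its 2D shape
col_1d = col_2d.flatten()       # the explicit, user-visible step to a 1D vector

assert col_2d.shape == (100, 1)
assert col_1d.shape == (100,)
```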