
Switch back to 2D-only arrays (WIP) #55

Merged: 5 commits, Sep 26, 2018

Conversation

@flying-sheep (Member) commented Aug 27, 2018

We always have either no data or 2D data. Representing this super-flexibly as a scalar or 1D array is error-prone, and getting a 1D array explicitly is just one .flatten() call away for the user.

Fixes #42, fixes #56

@flying-sheep changed the title from "Switch back to 2D-only arrays" to "Switch back to 2D-only arrays (WIP)" on Aug 27, 2018
@falexwolf (Member)

I'll have time again during the next couple of days - am just returning from my wedding weekend.

I oppose this change. Automatic flattening wasn't a mistake; it made a lot of code a lot simpler, which is why numpy arrays do it. If we go back to 2D-only arrays, many people's code will break.

@flying-sheep (Member, Author) commented Aug 27, 2018

your wedding was awesome, thank you again for inviting me!


regarding this one, i’m sad that we didn’t talk about your change before scanpy 1.0. i very deliberately decided to always leave the backing storage 2D when I implemented it.

numpy has the ndarray (n-D array). we have the 2D anndata.

slicing it will (even now) always return a 2D anndata, but with backing storage that either always stays 2D (sparse matrices) or can degrade to 1D or even 0D (dense arrays). that’s confusing! backing storage should always have the same shape as the anndata itself.
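To make that discrepancy concrete, here is a minimal sketch using plain numpy and scipy rather than AnnData itself; the shapes in the comments are standard numpy/scipy slicing behavior:

import numpy as np
from scipy import sparse

dense = np.ones((2, 2))
sp = sparse.csr_matrix(dense)

dense[:, 0].shape   # (2,)   -- the dense slice degrades to 1D
dense[0, 0].shape   # ()     -- and can even collapse to a 0D scalar
sp[:, 0].shape      # (2, 1) -- the sparse matrix always stays 2D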

i don’t think putting in .flatten() a few times makes users’ code noticeably more complex. however, we currently have at least two bugs in our own code caused by that behavior. We probably have more, and users probably do too.

anndata is not 1.0 yet, but obviously people won’t install scanpy with an older anndata version. we could say anndata~=0.7.0 in the requirements and release anndata 0.8 with the new 2D behavior, but that means scanpy 1.0 users will never get anndata 0.9 with whatever other improvements it may bring.

@falexwolf (Member) commented Aug 28, 2018

Sure, I'm glad you came! 😄


I already made the change more than a year ago, when anndata was still part of Scanpy: d1f28b8

It's been in the docs for more than a year:

A data matrix is flattened if either n_obs or n_vars is 1, so that numpy’s slicing behavior is reproduced:

adata = AnnData(np.ones((2, 2)))
adata[:, 0].X == adata.X[:, 0]

Scanpy 1.0 was released in April. It's unfortunate that you weren't aware of the change.


i very deliberately decided to always leave the backing storage 2D when I implemented it.

I know, we even talked about it! This concerns a discussion we had in March 2017. I had many discussions about this in April and May 2017 with other users and made the change probably around that time.

The consequence was that a lot of .flatten() calls disappeared from users' code. The prototypical call of

adata[:, 'Gata1'].X

then always returned a dense one-dimensional vector instead of a 2D array. With that, AnnData became essentially as convenient as a dataframe, where the call is

df['Gata1']

and returns a Series object.
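For comparison, both pandas and plain numpy drop the extra dimension in this situation; a small illustrative sketch (the column names just mirror the example above):

import numpy as np
import pandas as pd

X = np.array([[1.0, 2.0], [3.0, 4.0]])
df = pd.DataFrame(X, columns=['Gata1', 'Gata2'])

df['Gata1']   # a 1D Series, not a one-column DataFrame
X[:, 0]       # numpy flattens too: shape (2,)
X[:, [0]]     # keeping 2D requires explicit list indexing: shape (2, 1)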

I acknowledge that it might have been better to introduce another access operation for retrieving a 1D column, but before getting into that, there are further arguments.


AnnData is a container object for data and annotations. To me, data is always a tensor where the first axis labels the observations, and the second axis starts labeling the dimensions of the observations. For structured data, these are scalars and a simple label (like a gene name) is enough. But now consider spatial transcriptomics, where you have, say, a 2D array of measurements for each gene. Then .X will be a 4D tensor.
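For concreteness, such a spatial layout might look as follows; the sizes are made up purely for illustration:

import numpy as np

n_obs, n_genes, height, width = 100, 500, 8, 8   # made-up sizes
X = np.zeros((n_obs, n_genes, height, width))    # .X as a 4D tensor
X.shape   # (100, 500, 8, 8): observations x genes x a 2D measurement grid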

Having multiple observations of a scalar is most conveniently represented as a flat vector, not a 2D array. As the whole point of AnnData is convenient access to data and annotations, I'm strongly advocating keeping things as they are.


I have not yet tried to fix the bug @mckinsel discovered (#42). But I'm sure it's easy to fix. I don't know which problems @M0hammadL had, but I'm sure they are easy to fix, too.


Why are you using the notion "backing storage"? Let's stick with the notion of a data matrix. In the future, this will be a data tensor, all the more so as people already use anndata with TensorFlow.


What do you think?

@flying-sheep (Member, Author) commented Aug 29, 2018

The core of my argument was that this behavior is confusing and error-prone (as evidenced by the two mentioned bugs in scanpy: #42 & #56). They’re both caused by the fact that subsetting AnnData objects can change the shape of .X, which proves my point: People (evidently even the authors of scanpy) can’t remember this edge-case behavior and are bitten by it in a weird and hard-to-trace way: TypeError: 'numpy.int64' object is not iterable doesn’t help at all!
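A minimal reproduction of how that error message can arise (this is not the actual scanpy code from #42/#56, just the failure mode): indexing both axes of an integer array silently collapses it to a 0D numpy scalar, and any downstream loop over it then fails with exactly that message:

import numpy as np

X = np.ones((1, 1), dtype=np.int64)
value = X[0, 0]    # silently collapses to a 0D numpy.int64 scalar
for v in value:    # TypeError: 'numpy.int64' object is not iterable
    pass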

Experienced users might learn what the weird error message means after a while (oh, that one again!), but newbies will always have to waste (their and/or our) time when encountering it.

Storing higher-dimensional data is beside the point; let me generalize my argument: if we have an N-D tensor, it should stay N-D and not degrade to N-1 or N-2 dimensions when subsetting observations and/or variables. Actually I think that’s a great argument for my position: things will get even more confusing once, say, a 2D tensor is only interpretable by cross-referencing its shape with the AnnData’s shape.

You also didn’t address my other argument: that only ndarrays degrade, not sparse matrices. They always stay 2D. This introduces further inconsistency and limits the benefit of not having to call .flatten() to dense arrays.

Therefore I think we should have gone with the option you mentioned: Another API for getting rid of one or the other dimension.

@falexwolf (Member)

People (evidently even the authors of scanpy) can’t remember this edge-case behavior and are bitten by it in a weird and hard-to-trace way: TypeError: 'numpy.int64' object is not iterable doesn’t help at all!

I'm sorry that you missed the change 1.5 years ago. It astonishes me that you weren't bitten by it earlier.

Storing higher-dimensional data is beside the point; let me generalize my argument: if we have an N-D tensor, it should stay N-D and not degrade to N-1 or N-2 dimensions when subsetting observations and/or variables. Actually I think that’s a great argument for my position: things will get even more confusing once, say, a 2D tensor is only interpretable by cross-referencing its shape with the AnnData’s shape.

Arrays were designed to automatically flatten upon slicing because many people (me included) consider this highly intuitive behavior. Not only for 2D→1D but also for 4D→3D (slicing a gene in the spatial case, for instance). Also, in the high-dimensional case, .obs and .var won't have that simple correspondence to the axes of .X anymore: .obs will match the first index, and .var will correspond to the "interpretable" variables or to a collection of variable axes, but certainly not to pixels.

You also didn’t address my other argument: that only ndarrays degrade, not sparse matrices. They always stay 2D. This introduces further inconsistency and limits the benefit of not having to call .flatten() to dense arrays.

That's not true. Slicing a variable from an AnnData that has a sparse .X has been returning a dense (1D) vector for more than a year. I don't know why you missed it. This feature has added a lot of convenience for me when working with AnnData.


I completely agree that the code becomes simpler if one doesn't cast .X to different dimensions. And probably a "convenience accessor" instead of casting .X would have been a better solution. But it's way too late for that change, and even with the present solution it's definitely not a lot of work to fix the remaining bugs. Also for the future, I don't see fundamental problems, just a bit more bookkeeping to be done. After 1.5 years of bug reports, I don't think there are any problems for users: both #42 and #56 are absolutely non-canonical use cases. That's why I'm taking so much time with fixing them.

@flying-sheep (Member, Author)

OK, you do the lion’s share of all the coding and bug fixing, so of course you have the last word.

@codecov bot commented Sep 26, 2018

Codecov Report

Merging #55 into master will not change coverage.
The diff coverage is 100%.


@@           Coverage Diff           @@
##           master      #55   +/-   ##
=======================================
  Coverage   65.19%   65.19%           
=======================================
  Files           9        9           
  Lines         727      727           
=======================================
  Hits          474      474           
  Misses        253      253
Impacted Files Coverage Δ
anndata/readwrite/read.py 64.68% <100%> (-0.4%) ⬇️
anndata/readwrite/write.py 69.34% <100%> (+0.5%) ⬆️

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@falexwolf (Member) commented Sep 26, 2018

I'm adapting this such that ._X actually always remains 2D, but .X still gets flattened upon slicing. Then the whole thing is backwards compatible.

We can then go and see how we move from here: #60 (comment)

One option would be to move to an accessor .values that would always be 2D and deprecate .X in the long run. But this will be another pull request...

@falexwolf (Member) commented Sep 26, 2018

OK, I'm merging this now. ._X is now always 2D. .X is presented to the user flattened when slicing a single dimension.

This now also fixes #42. We can look at #56 in a new pull request. Should be easy to address now.

Same for the discussion #60 (comment) and the comment about .values above.
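A minimal sketch of the pattern described above, not the actual AnnData implementation: the internal matrix stays 2D, and only the user-facing accessor flattens when one dimension has length 1 (the class name and everything apart from ._X/.X are hypothetical):

import numpy as np

class Matrix2DView:
    def __init__(self, X):
        self._X = np.atleast_2d(np.asarray(X))   # storage always stays 2D

    @property
    def X(self):
        # present a flattened view when one dimension has length 1
        if 1 in self._X.shape:
            return self._X.flatten()
        return self._X

m = Matrix2DView(np.ones((2, 1)))
m._X.shape   # (2, 1) -- internal storage remains 2D
m.X.shape    # (2,)   -- the user-facing view is flattened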

Successfully merging this pull request may close these issues:

Can’t concat single-observation anndatas
Failure reading single observation h5ads