Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Error slicing when there is only one observation #60

Closed
bretttully opened this issue Sep 11, 2018 · 13 comments
Closed

Error slicing when there is only one observation #60

bretttully opened this issue Sep 11, 2018 · 13 comments
Labels

Comments

@bretttully
Copy link

This may be intentional, but there seems to be an issue when the raw data only has one row. We have come across this in our unit tests when creating a small example data set with only one observation.

In [1]: import anndata

In [2]: anndata.__version__
Out[2]: '0.6.10'

In [3]: import pandas as pd

In [4]: d = [
   ...:     (1, 'A', 'a', 'Z', 'z'),
   ...:     (2, 'A', 'b', 'Z', 'z'),
   ...:     (3, 'B', 'c', 'Z', 'z'),
   ...:     (4, 'B', 'd', 'Z', 'z'),
   ...: ]

In [5]: df = pd.DataFrame(d, columns='c0 c1 c2 c3 c4'.split())
   ...: df
Out[5]:
   c0 c1 c2 c3 c4
0   1  A  a  Z  z
1   2  A  b  Z  z
2   3  B  c  Z  z
3   4  B  d  Z  z

In [6]: df = df.set_index('c1 c2 c3 c4'.split())['c0'].unstack(level=[2, 3]).T
   ...: df
Out[6]:
c1     A     B
c2     a  b  c  d
c3 c4
Z  z   1  2  3  4

In [7]: def convert_idx(x): return x.to_frame().reset_index(drop=True)

In [8]: obs = convert_idx(df.index)

In [9]: var = convert_idx(df.columns)

In [10]: a = anndata.AnnData(X=df.values, obs=obs, var=var)

In [11]: a
Out[11]:
AnnData object with n_obs × n_vars = 1 × 4
    obs: 'c3', 'c4'
    var: 'c1', 'c2'

In [12]: a.obs
Out[12]:
  c3 c4
0  Z  z

In [13]: a.var
Out[13]:
  c1 c2
0  A  a
1  A  b
2  B  c
3  B  d

In [14]: assert a.shape == a.X.shape, '{} != {}'.format(a.shape, a.X.shape)
---------------------------------------------------------------------------
AssertionError                            Traceback (most recent call last)
<ipython-input-14-7f38676dce05> in <module>()
----> 1 assert a.shape == a.X.shape, '{} != {}'.format(a.shape, a.X.shape)

AssertionError: (1, 4) != (4,)

In [15]: a[0, 0]
---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
<ipython-input-15-0d768f50cb80> in <module>()
----> 1 a[0, 0]

~/anaconda3/envs/dev-procanswon-py/lib/python3.6/site-packages/anndata/base.py in __getitem__(self, index)
   1292     def __getitem__(self, index):
   1293         """Returns a sliced view of the object."""
-> 1294         return self._getitem_view(index)
   1295
   1296     def _getitem_view(self, index):

~/anaconda3/envs/dev-procanswon-py/lib/python3.6/site-packages/anndata/base.py in _getitem_view(self, index)
   1296     def _getitem_view(self, index):
   1297         oidx, vidx = self._normalize_indices(index)
-> 1298         return AnnData(self, oidx=oidx, vidx=vidx, asview=True)
   1299
   1300     # this is used in the setter for uns, if a view

~/anaconda3/envs/dev-procanswon-py/lib/python3.6/site-packages/anndata/base.py in __init__(self, X, obs, var, uns, obsm, varm, layers, raw, dtype, shape, filename, filemode, asview, oidx, vidx)
    674             if not isinstance(X, AnnData):
    675                 raise ValueError('`X` has to be an AnnData object.')
--> 676             self._init_as_view(X, oidx, vidx)
    677         else:
    678             self._init_as_actual(

~/anaconda3/envs/dev-procanswon-py/lib/python3.6/site-packages/anndata/base.py in _init_as_view(self, adata_ref, oidx, vidx)
    735         # set data
    736         if self.isbacked: self._X = None
--> 737         else: self._init_X_as_view()
    738
    739         self._layers = AnnDataLayers(self, adata_ref=adata_ref, oidx=oidx, vidx=vidx)

~/anaconda3/envs/dev-procanswon-py/lib/python3.6/site-packages/anndata/base.py in _init_X_as_view(self)
    750             self._X = None
    751             return
--> 752         X = self._adata_ref.X[self._oidx, self._vidx]
    753         if len(X.shape) == 2:
    754             n_obs, n_vars = X.shape

IndexError: too many indices for array
@falexwolf
Copy link
Member

This is intentional and the same behavior as of numpy arrays, see

adata = AnnData(np.ones((2, 2)))
adata[:, 0].X == adata.X[:, 0]

as mentioned in the docs: https://anndata.readthedocs.io/en/latest/anndata.AnnData.html

Just recently we had a discussion of whether this helpful or not. I personally think it's helpful. What do you think.

@bretttully
Copy link
Author

I can see why in your example this behaviour is good, but in the two errors in the original post it is very confusing.

Without knowing more, my intuition was:

  • the shape of the AnnData and the shape of the underlying X should be the same
  • if the size of .obs is >=1 and size of .var is >=1 I would expect the be able to slice the AnnData object with a 2D index.

Generally, I've come across scenarios in numpy and pandas where applying the same function on an array will return different 'shape' results based on the input. With pandas, there are times when it remains as a DataFrame or can switch to a Series. These are pretty annoying to deal with as a library author, but it probably indicates that these edge cases are hard... With numpy, the reshape function with a -1 is normally an easy way out, so maybe you could surface this with AnnData?

@falexwolf
Copy link
Member

Yes, very happy to add a AnnData.reshape function!

Also happy to change the output of AnnData.shape to (,1) if .X gets flattened.

That's an interesting point that you make about pandas. I know that

df = pd.DataFrame(np.ones(3, 2))
df.loc[0, :]
df.loc[:, 0]

returns a Series - hence a 1d object - in both of the last lines; so this is consistent with slicing a numpy array and an AnnData. As for numpy arrays, consider it highly convenient that a DataFrame behaves this way. Slicing out a single vector is one of the most common operations that you apply to a DataFrame and I'd consider it highly bothersome if one had to .flatten() a (1, n)-shaped object every time.

Why is it bothersome to you the other way round?

@bretttully
Copy link
Author

Here is the anti-pattern we are trying to avoid for our users:

anndata_obj = a_function_filtering_our_db_based_on_user_params(**kwargs)
if anndata_obj.X.ndim == 1:
    first_obs_data = anndata_obj[:]
else:
    first_obs_data = anndata_obj[0, :]

If we can deal with the if statement inside our a_function_filtering_our_db_based_on_user_params() function so that we promise our users they can always index with combination of obs and var with an AnnData object returned by our API.

Thinking about the reshape function, it really doesn't need the power of numpy's version (in fact, that might be actively bad, as the shape of X may get out of sync with n_obs and n_var). Really, the method is enforce_2d(flag: bool) -- perhaps this sets a flag in the metadata of the object so that it is persisted through the hdf backend?

@falexwolf
Copy link
Member

Makes sense! If it's only about retaining indexing with var and obs indices, this can easily be enabled even if AnnData's with 1d X are allowed (also right now, the AnnData knows whether the 1d vector is an obs or a var).

But I see that an .enforce_2d(flag: bool) could be useful in several instances. I'm sure you'd like to avoid your user having to call this explicitly. So, do you want, additionally to a method .enforce_2d() an enforce_2d parameter in the constructor?

I'd tend to call it force_2d if that's not bad English (we have a force_sparse parameter somewhere else, which I could alternatively rename to enforce_sparse).

@bretttully
Copy link
Author

But I see that an .enforce_2d(flag: bool) could be useful in several instances. I'm sure you'd like to avoid your user having to call this explicitly. So, do you want, additionally to a method .enforce_2d() an enforce_2d parameter in the constructor?

This would be useful in the constructor, yes -- would you try and persist that when creating a view/copy of the object? I'm ambivalent as to how that should work.

I'd tend to call it force_2d

No worries, not bad English at all

@falexwolf
Copy link
Member

This would be useful in the constructor, yes -- would you try and persist that when creating a view/copy of the object? I'm ambivalent as to how that should work.

Of course, this would persist throughout copies and views. Wouldn't be a problem.

Just yesterday, another question came up regarding efficient extraction of a slice of .X from an AnnData (scverse/scanpy#244 (comment)). As it's likely needed that another accessor is written for this, and as this could serve as a convenience accessor that provides (n x 1) matrices as vectors to users, I can now think of an AnnData whose data matrix is always 2d. (I previously admitted that "another access operation" that would provide 1d vectors would have been the better design choice and @flying-sheep always wanted to keep things 2d: #55)

  • We need to think about how to make this an easy transition for users, many of them have the call adata[:, 'gene_name'].X in their code and expect it to return a vector. While enforcing ._X to be always 2d could be done in a backwards compatible way by calling .flatten in the property .X if ._X is (n x 1), this wouldn't make you happy, right? Your test above would still break.
  • We need an elegant name for the convenience access: I guess I'd call it .slice as .sliceX is ugly. Calling .slice wouldn't slice the whole AnnData but only return a slice of the data matrix potentially returning something flattened.

@flying-sheep: You don't need to go through the whole issue, just consider the two bullets above. What do you think? I guess you would be happier with this

@falexwolf
Copy link
Member

Another option: Instead of letting.X directly point to ._X, we make it a convenience accessor that enables slicing by strings and get's flattened if shaped (n x 1). At the same time, we provide the underlying data ._X (np.ndarray, sp.spmatrix, h5py.Dataset), which cannot be sliced by strings and is always 2d, with a different name (e.g. .X2d).

The big advantage is that this is backwards compatible. But I fear it's super confusing to have access to the data matrix twice in the object...

@bretttully
Copy link
Author

Not withstanding backwards compatibility, I would take a different approach with something akin to the following:

def extract_vars(self, vars: Union[str, List[str]], as_adata: bool = False):
    # most efficient method here...
    result = self[:, vars]
    if not as_adata:
        result = result._X
        if not isinstance(vars, list):
            result = result.flatten()
    return result

That way, you have better encapsulation and if you need to change implementation details like the shape of _X you can do it without impacting users. It also seemed like that other thread was not sure about most efficient way to do things, so perhaps this makes it more explicit?

@bretttully
Copy link
Author

Worth mentioning that this definitely makes users' code more verbose and is not as simple as the pandas df[vars] -- however, having watched people new to pandas get up to speed, the amount of confusion between df[var_name], df[[var_name]], df[vars] and df.loc[:, vars], and when to use which -- and then patterns emerge like df.loc[:, [var_name]].copy() to always get a DataFrame back that you can go on an add more columns. There is a part of me that thinks more verbosity and explicitness can be beneficial.

I'm on the fence, and you guys have better knowledge about how AnnData is currently being used.

@falexwolf
Copy link
Member

falexwolf commented Sep 26, 2018

OK, so now we at least have ._X always 2d. #55 I'm still not fully convinced by your super-verbose solution. ;) Also not by my own.

One other solution: go away completely from .X and transition to .values, which returns a container for arrays, sparse matrices and backing files.

Calling adata[:, 'gene1'].values will always return something 2d. But calling adata.values[:, 'gene1'] will automatically be flattened, as in numpy, pandas, tensorflow...

.X will remain for backwards compat.

What do you think, @flying-sheep (also @bretttully if you want to comment).

@github-actions
Copy link

This issue has been automatically marked as stale because it has not had recent activity.
Please add a comment if you want to keep the issue open. Thank you for your contributions!

@github-actions github-actions bot added the stale label Jul 24, 2023
@flying-sheep
Copy link
Member

Fixed by #171

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

3 participants