Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Decision: Could .X = None? #464

Closed
ivirshup opened this issue Nov 24, 2020 · 8 comments
Closed

Decision: Could .X = None? #464

ivirshup opened this issue Nov 24, 2020 · 8 comments

Comments

@ivirshup
Copy link
Member

ivirshup commented Nov 24, 2020

I'd like X to be optional. I mean, it already is, but apparently more thorough testing is needed to verify that it works properly.

What is the opinion of the major anndata developers? 👍 or 👎?

If we agree X should be able to be None, we can use this PR to add tests and fix the code where necessary.

Originally posted by @rcannood in #463 (comment)

@ivirshup
Copy link
Member Author

ivirshup commented Nov 24, 2020

Some additional context

This has come up previously in #330. It seems that at some point this may have worked, but it's not covered in many tests.

Why

If X is just another layer, why not? It would be a more flexible model. If X is optional, the only truly necessary attributes would be obs_names and var_names.

As mentioned by @rcannood (#463 (comment)), you could leave X out of the saved data – storing just annotations. Now you could store different sets of annotations separately, then merge the elements you want together (#441 (comment)).

I think this could be a nice way to structure the output of operations when inplace=None. It could return an object with a similar structure, but only the resulting set of elements. The end of sc.pp.neighbors could look like:

result = AnnData(
    obs=input.obs[[]],  # Just the index
    var = input.var[[]],
    obsp={"distances": distances, "connectivities": connectivities},
    uns={"neighbors": metadata}
)
if inplace:
    input.update(result)
else:
    return result

Why not

I think most code assumes that X exists. That's certainly true for some code in anndata. There would probably need to be updates in scanpy to handle this case as well.

My main worry is that this touches the X property, and the AnnData initializer, which I think are the roughest parts of the codebase.

Current status

What currently works

You can currently instantiate an AnnData object with X = None (relevant code).

a = AnnData(shape=(10, 20))

This can be read and written.

For Raw, passing X=None seems to mean X is either backed, or should be copied from the parent. This constructor is pretty weird though.

What doesn't

Copying and raw for sure. adata.X = None and del adata.X error as well, but the linked PR has fixes. Just search the codebase for .X and you'll find code that doesn't handle this case (e.g. concat, .chunk_X, .transpose). For elements in layers, this is handled by throwing key errors (when a key is specified) or iterating over the extant elements.

While it turns out we do have some tests around this case, we don't have much. I think we should be able to parameterize existing tests around this though.

Miscellaneous thoughts

Ping: @Koncopd @gokceneraslan @falexwolf @flying-sheep

@flying-sheep
Copy link
Member

flying-sheep commented Nov 24, 2020

I’m in favor. I agree that since X is just the default layer, that means that the list of layers should also be able to be empty.

For Raw, we can use the empty pattern to take the place of the default, right? With None just meaning None:

# utils.py
class Empty(Enum):
    token = auto()

# _core/raw.py
def __init__(
    self,
    ...
    X: Union[np.ndarray, sparse.spmatrix, None, Empty] = Empty.token,
    ...
): ...

Regarding initializers: I think it’s really hard to write them well, as you can’t depend on most things being there yet, so most assumptions in methods are broken, and one needs to be clever, arrange things carefully, and probably still repeat some code. The only fix I can come up with is “composition instead of inheritance”: Break an object down into independent parts which are created individually and then combined. Python isn’t the ideal language for it.

@ivirshup
Copy link
Member Author

@flying-sheep

I agree that since X is just the default layer, that means that the list of layers should also be able to be empty.

Unless we actually store X in layers and have an attribute saying which layer corresponds to X, I think X will remain special. That would be a big API change, though could be closer to what SingleCellExperiment and loom do. It would also make backed modification harder.

For Raw, we can use the empty pattern to take the place of the default, right? With None just meaning None:

Isn't None sufficient here? Backed mode can be handled the same way as AnnData, e.g. an .is_backed property. I think the issue with raw is more that it's initializer is broken. See #449 for some examples. Basically it assumes the only way it's going to be called by a user is adata.raw = adata. In most other cases, it's going to do the wrong thing.

Regarding initializers: I think it’s really hard to write them well, as you can’t depend on most things being there yet.

I think we also have an issue of handling too many cases in one function right now. I think it would be reasonable to split out backed initialization into a seperate function, as well as I also think this could even make this even more simple. After setting .obs_names and .var_names, you've got shape. The rest of the attributes can just be assigned through their respective setter methods.

@ivirshup ivirshup added this to the 0.8 milestone Nov 25, 2020
@LuckyMD
Copy link

LuckyMD commented Nov 25, 2020

Another argument for adata.X being allowed to be None is that there was just a preprint posted to BioRxiv about insufficient cellular metadata being stored in GEO for scRNA-seq submissions. AnnData might be a good standard format for metadata storage also in GEO?

@ivirshup
Copy link
Member Author

There has also been broad agreement with this on slack (@rcannood did you want to be invited to that btw?). I'd be happy with aiming for this in 0.8.

@rcannood, how much of this were you interested in helping with?

Of the top of my head, these are the features we'll need check:

  • Getter, setter, deleter
  • Initializer (seems to work, needs testing)
  • Views (basically just check that .X = None doesn't change other view behavior, a[idx].X doesn't do anything weird, and copying views still works)
  • _subset_inplace
  • Copying
  • Backed mode
  • Concatenation
  • Transpose
  • Methods which default to values from X, e.g. obs_vector, to_df.
  • IO (seems to work, needs testing)
    • Maybe specialized writers like to_csvs will need some attention here?
  • Raw (pretty similar set of things to be checked)
    • Initializer
    • Backed
    • Copies
    • Concatenation
    • IO

It's a long list, but I'm being granular. These should each be quite small tasks. Ideally many of these can be covered by a refactor of the tests.

@ivirshup
Copy link
Member Author

@LuckyMD I would be thrilled if GEO started having some standards for reporting beyond data being in SRA.

@rcannood
Copy link
Contributor

rcannood commented Nov 26, 2020

I can take a crack at it, as long as it's reviewed by a core anndata developer :) I already have small changes in #463. Do you want to create a separate feature/x_is_none branch for this? Or do we continue from the PR I created earlier?

@ivirshup
Copy link
Member Author

ivirshup commented Nov 27, 2020

That sounds great!

Do you want to create a separate feature/x_is_none branch for this?

I was thinking this should be split into multiple PRs, since many of these are orthogonal. Plus it's much easier to review and merge small PRs. Let's keep that PR's scope to just fixing up the .X property.


I'm gonna mark this as resolved and start tracking what needs to be done in #467

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

4 participants