Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Remove inplace argument in scanpy 2.0 #2583

Open
grst opened this issue Aug 1, 2023 · 7 comments
Open

Remove inplace argument in scanpy 2.0 #2583

grst opened this issue Aug 1, 2023 · 7 comments

Comments

@grst
Copy link
Contributor

grst commented Aug 1, 2023

What kind of feature would you like to request?

Other?

Please describe your wishes

Another point that I'd like to throw into the scanpy 2.0 discussion:

Right now, many functions have the inplace argument, that determines if a function should write back to adata or return the result instead.

With this behavior it is hard to make correct type hints. While it is possible with @overload, it is cumbersome because it requires to type out the entire function signature twice. When I asked if it is possible to write these overloads in a more concise way on stackoverflow several users argued that changing the return type based on an argument is an anti-pattern, and I think they convinced me.

Alternative approach

Have two API levels, e.g.

def scanpy.tl.pca(adata: AnnData, **kwargs) -> None: ...

and

def scanpy.lowlevel.tl.pca(data: np.ndarray | sp.spmatrix, n_pcs) -> np.ndarray: ...

Where the former is a wrapper for the latter. This allows to separate the implementation of the actual method using only numpy/scipy data types from the scverse-specific behavior.

@grst
Copy link
Contributor Author

grst commented Aug 3, 2023

@flying-sheep suggested to write a hatch-plugin that can automatically create the respective @overload type hints

@ivirshup
Copy link
Member

ivirshup commented Aug 3, 2023

Notes from discussion with @Intron7, @flying-sheep, @gtca, and @grst at hackathon:

  • Updated idea from @gtca, based on: [RFC] Make .X a layer anndata#706

    • Use layer_to, layer_fromas argument. Has possibility to still do operations inplace on arrays if you only pass layer_from
    • Could be useful to have semantically meaningful default arguments e.g. layers_from="counts"``layers_to="normalized"
  • Returning a new AnnData object with only new arrays could be a flexible base, as discussed in ad.merge function anndata#658

    • inplace=False returning function specific types (sometimes an array, sometimes a dict of arrays) is bad. This would be an alternative.
  • Being able to update arrays inplace is still important for memory usage

  • Lots of discussion of when/ how we want to modify the AnnData

@flying-sheep
Copy link
Member

flying-sheep commented Aug 3, 2023

Unformatted notes by me:

Behaviors that exist for inplace/copy:

  • update AnnData in place (where appropriate, choose target layer, obsm[key], …)
  • leave original AnnData alone, return
    • new AnnData
    • newly created array

inplace=False/copy=True returning array instead of whole object (AnnData) is confusing, but is sometimes done.
but having a choice to return the array makes sense.

@grst
Copy link
Contributor Author

grst commented Aug 11, 2023

What if we had copy-on-write behavior for AnnData? Then we could never modify AnnData inplace but always return a view of an AnnData with references to the objects that were unchanged and only the new data added.

Pandas seems to be going that route: https://github.com/pandas-dev/pandas/blob/57390ada100466dac777e5b66d5a4f2a72700c38/web/pandas/pdeps/0008-inplace-methods-in-pandas.md (HT @bernheder)

@flying-sheep
Copy link
Member

Our views are a limited version of COW (_init_as_actual).

If I understand the idea correctly, I’d be for it if we could guarantee that all nested data structures delegate modification to us. I don’t know if that’s possible with things like dask arrays and (at the time of writing) pandas data frames.

@gtca
Copy link

gtca commented Aug 11, 2023

@flying-sheep This guarantee seems hard to ask for though?

@grst It would be great but it's unclear how to do that properly for a container-like data structure such as AnnData. E.g. even for data frames for value changes that document already states the following:

In contrast, the inplace parameter will be kept for any methods that only modify the underlying data of a pandas object.

@grst
Copy link
Contributor Author

grst commented Aug 11, 2023

@flying-sheep This guarantee seems hard to ask for though?

Agree it's challenging, but maybe not impossible

@grst It would be great but it's unclear how to do that properly for a container-like data structure such as AnnData.

A couple of examples how I think this would look like:

  • sc.tl.pca adds a key to .varm, to .obsm and to .uns. We create a new AnnData object where every layer, every element of varm, every element of .obsm, .X, .obs, and .var are references to the objects already present in the previous AnnData object. There's now just additional keys in .obsm and .varm pointing to the new objects.
  • assign a new column in obs: adata.obs["key"] = value. Create a new AnnData object where all elements are pointers to the objects that already exist. Only .obs becomes a copy of the previous obs data frame (for objects that support copy on write themselves, like pandas does (or will?), the behavior could be delegated and not even the entire data frame would need to be copied.
  • modify a value in X: adata.X[20, 25] = 42. Again everything would be pointers to the old objects, but X needs to be copied.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Development

No branches or pull requests

4 participants