MuData API considerations #383

Closed
grst opened this issue Feb 20, 2023 · 2 comments · Fixed by #356

grst commented Feb 20, 2023

Description of feature

In the course of implementing the new data structure (#327), I plan to make MuData the default way
of interacting with paired single-cell gene expression/AIRR data.

I'm thinking about how the API should be adapted for this.

Data structure recap

We are talking about a MuData object that looks like this:

MuData object with n_obs × n_vars = 3000 × 30727
  2 modalities
    gex:	3000 x 30727
      obs:	'cluster_orig', 'patient', 'sample', 'source'
      uns:	'cluster_orig_colors'
      obsm:	'X_umap_orig'
    airr:	3000 x 0
      obs:	'high_confidence', 'is_cell', 'clonotype_orig'
      obsm:	'airr', 'chain_indices'

The gex modality contains the gene expression data, the airr modality the
receptor data. The airr modality has no .X; the relevant data are stored in .obsm.

  • Most scirpy functions only operate on the airr modality.
  • Some functions use both airr and gex data.
  • For visualization, it is useful to plot airr.obs on top of gex embeddings, or use columns from both gex.obs and airr.obs in a single plot.

Since the airr modality only has obs and obsm, it would be conceivable to
(additionally) support the use of a single AnnData object with gene expression data in .X and receptor data in .obsm.

API consideration for unimodal data

(i.e. scirpy functions that only use the airr modality)

1. For a function that only operates on the AIRR data, what is the preferred option to interact with mudata?

ir.tl.chain_qc(mdata, airr_key="airr", **kwargs)

or

ir.tl.chain_qc(mdata['airr'], **kwargs)

2. Should a function that only operates on the AIRR data add columns to mdata or adata?

import numpy as np

def chain_qc(mdata, airr_key="airr", **kwargs):
    adata = mdata[airr_key]
    # the new annotation is written to the AIRR modality's .obs
    adata.obs["new_col"] = np.zeros((adata.n_obs,))
    # should this be called by the function automatically?
    mdata.update_obs()

3. Use muon for plotting or scanpy?

Is it preferable to call

mu.pl.umap(mdata, color="gex:cluster")

or

sc.pl.umap(mdata['gex'], color="cluster")

If the former, is there a recommended way to transfer .obsm from the GEX AnnData to MuData (similar to update_obs for .obs)?

API considerations for multimodal data

(i.e. functions that consume both the airr and gex modalities)

I have a function that depends on a gene expression neighborhood graph and .obs annotations based on AIRR data.

API options

  1. pass both modalities (probably not)
    ir.tl.clonotype_modularity(mdata['gex'], mdata['airr'], airr_col="clone_id")
  2. pass mdata and mod_keys
    ir.tl.clonotype_modularity(mdata, gex_mod="gex", airr_col="airr:clone_id")
  3. Store the gene expression neighborhood graph in mudata
    # is there something like mdata.update_obsm() ? 
    mdata.obsp["connectivities"] = mdata["gex"].obsp["connectivities"]
    ir.tl.clonotype_modularity(mdata, airr_col="airr:clone_id")
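
Options 2 and 3 both rely on a prefixed key such as "airr:clone_id" being resolved into a modality and a column. A minimal sketch of that lookup, using plain dicts as stand-ins for the per-modality .obs tables; `split_mod_key` and `get_obs_column` are hypothetical helpers for illustration, not part of scirpy or muon:

```python
def split_mod_key(key, default_mod=None):
    """Split "airr:clone_id" into ("airr", "clone_id").

    Keys without a "mod:" prefix fall back to `default_mod`.
    """
    mod, sep, col = key.partition(":")
    if not sep:
        return default_mod, key
    return mod, col


def get_obs_column(mods, key, default_mod="airr"):
    """Look up an obs column in a dict of {modality: {column: values}}."""
    mod, col = split_mod_key(key, default_mod)
    if mod not in mods:
        raise KeyError(f"unknown modality {mod!r}")
    return mods[mod][col]


# Plain dicts standing in for mdata[mod].obs (assumption, for illustration)
mods = {
    "gex": {"cluster": ["c0", "c1", "c0"]},
    "airr": {"clone_id": ["7", "7", "12"]},
}
```

With this, get_obs_column(mods, "airr:clone_id") and get_obs_column(mods, "clone_id") would resolve to the same column, which is roughly what a default airr modality implies.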

Possible solution

I'm leaning towards having all functions operate on MuData directly,
i.e.

ir.tl.something(mdata, airr_key="airr", col="airr:xxx")

with the option to also pass an anndata object for backwards-compatibility (in that case, airr_key will be ignored).

ir.tl.something(adata, col="xxx")
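
A rough sketch of how that dispatch could look. It assumes only that a MuData-like container exposes its modalities through a `.mod` mapping while a plain AnnData does not; `resolve_airr` is a hypothetical helper, and the `SimpleNamespace` objects are stand-ins rather than real AnnData/MuData instances:

```python
from types import SimpleNamespace


def resolve_airr(data, airr_key="airr"):
    """Return the object holding the receptor data.

    Container-like input (anything with a `.mod` mapping) is indexed by
    `airr_key`; anything else is assumed to be the AIRR AnnData itself,
    in which case `airr_key` is ignored.
    """
    mod = getattr(data, "mod", None)
    if mod is None:
        return data
    try:
        return mod[airr_key]
    except KeyError:
        raise KeyError(
            f"no modality {airr_key!r}; available: {list(mod)}"
        ) from None


# Stand-ins for AnnData and MuData, for illustration only
adata = SimpleNamespace(obs={"clone_id": ["7", "12"]})
mdata = SimpleNamespace(mod={"gex": SimpleNamespace(obs={}), "airr": adata})
```

Every public function could then start with a call like resolve_airr(data, airr_key), so that the MuData and AnnData code paths share the same body.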

grst commented Feb 20, 2023

@gtca, would be great to get your input on this!

gtca commented Feb 24, 2023

Hey!

I'll try to briefly comment on this below (and I'm happy to catch up later to discuss it further).

API consideration for unimodal data

  1. For a function that only operates on the AIRR data, what is the preferred option to interact with mudata?

muon, for example, supports both. Directly referring to the mdata['airr'] AnnData object might be cleaner and simpler, but in many cases it would be enough to write ir.tl.chain_qc(mdata) thanks to the defaults, which is nice (though it requires an additional parameter).

  2. Should a function that only operates on the AIRR data add columns to mdata or adata?

I would say adata. And calling the .update_obs() inside might have some unintended consequences — e.g. if this "synchronisation" hasn't been performed before, more things will happen than just copying one column. (My current thinking is that only functions that are going to break without .update() can run it internally with some log trace.) Plus we might also reconsider in future if we should always copy the columns with updates by default.
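
To illustrate the concern, here is a toy sketch. `FakeMuData` is a made-up stand-in whose `update_obs` copies every modality column to the global obs under a "mod:column" name; the real MuData behaviour may differ in detail. `annotate` and its `needs_sync` flag are likewise hypothetical:

```python
import logging

logger = logging.getLogger("sketch")


class FakeMuData:
    """Tiny stand-in for MuData: per-modality obs dicts plus a global obs."""

    def __init__(self, mod):
        self.mod = mod  # {modality name: {column: values}}
        self.obs = {}

    def update_obs(self):
        # Copies ALL columns of ALL modalities, not just the newest one
        for name, obs in self.mod.items():
            for col, values in obs.items():
                self.obs[f"{name}:{col}"] = values


def annotate(mdata, airr_key="airr", needs_sync=False):
    """Add a column to the AIRR modality; synchronise only if required."""
    obs = mdata.mod[airr_key]
    n_obs = len(next(iter(obs.values())))
    obs["new_col"] = [0.0] * n_obs
    if needs_sync:
        logger.info("calling mdata.update_obs() to synchronise annotations")
        mdata.update_obs()


mdata = FakeMuData(
    {"airr": {"clone_id": ["7", "12"]}, "gex": {"cluster": ["c0", "c1"]}}
)
annotate(mdata)  # writes to the modality only; global obs stays untouched
```

Calling annotate(mdata, needs_sync=True) afterwards would also pull "gex:cluster" into the global obs, which is the kind of side effect described above.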

  3. Use muon for plotting or scanpy?

I think mu.pl starts being very useful when information from different modalities is used. I am not sure whether it makes more sense semantically to have X_umap in mdata['gex'].obsm or in mdata.obsm, but generally, mu.pl.embedding(mdata, basis="gex:X_umap", color="gex:cluster") should address this point.

For the last point in the original post, I think the proposed solution is a reasonable one.

Exciting to see this taking shape!

grst closed this as completed in #356 on Apr 7, 2023
grst added this to scirpy-dev on May 28, 2024
grst moved this to Done in scirpy-dev on May 28, 2024