MuData API considerations #383

grst · 2023-02-20T06:48:38Z

Description of feature

In the course of implementing the new data structure (#327), I plan to make MuData the default way
of interacting with paired single-cell gene expression/AIRR data.

I'm thinking about how the API should be adapted for this.

Data structure recap

We are talking about a MuData object that looks like this:

MuData object with n_obs × n_vars = 3000 × 30727
  2 modalities
    gex:	3000 x 30727
      obs:	'cluster_orig', 'patient', 'sample', 'source'
      uns:	'cluster_orig_colors'
      obsm:	'X_umap_orig'
    airr:	3000 x 0
      obs:	'high_confidence', 'is_cell', 'clonotype_orig'
      obsm:	'airr', 'chain_indices'

The gex modality contains the gene expression data, the airr modality the
receptor data. The airr modality has no .X; the relevant data are stored in .obsm.

Most scirpy functions only operate on the airr modality.
Some functions use both airr and gex data.
For visualization, it is useful to plot airr.obs on top of gex embeddings, or use columns from both gex.obs and airr.obs in a single plot.

Since the airr modality only has obs and obsm, it would be conceivable to
(additionally) support a single AnnData object with gene expression data in .X and receptor data in .obsm.

API considerations for unimodal data

(i.e. scirpy functions that only use the airr modality)

1. For a function that only operates on the AIRR data, what is the preferred option to interact with mudata?

ir.tl.chain_qc(mdata, airr_key="airr", **kwargs)

or

ir.tl.chain_qc(mdata['airr'], **kwargs)

2. Should a function that only operates on the AIRR data add columns to mdata or adata?

import numpy as np

def chain_qc(mdata, airr_key="airr", **kwargs):
    # the new column is written to the AIRR AnnData, not to mdata
    adata = mdata[airr_key]
    adata.obs["new_col"] = np.zeros((adata.n_obs,))
    # should this be called by the function automatically?
    mdata.update_obs()

3. Use muon for plotting or scanpy?

Is it preferable to call

mu.pl.umap(mdata, color="gex:cluster")

or

sc.pl.umap(mdata['gex'], color="cluster")

If the former, is there a recommended way to transfer .obsm from the GEX AnnData to MuData (similar to update_obs for .obs)?

API considerations for multimodal data

(i.e. functions that consume both the airr and gex modalities)

I have a function that depends on a gene expression neighborhood graph and .obs annotations based on AIRR data.

API options

pass both modalities (probably not)

ir.tl.clonotype_modularity(mdata['gex'], mdata['airr'], airr_col="clone_id")

pass mdata and mod_keys

ir.tl.clonotype_modularity(mdata, gex_mod="gex", airr_col="airr:clone_id")

Store the gene expression neighborhood graph in mudata

# is there something like mdata.update_obsm() ? 
mdata.obsp["connectivities"] = mdata["gex"].obsp["connectivities"]
ir.tl.clonotype_modularity(mdata, airr_col="airr:clone_id")
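
The last two options rely on column keys of the form "airr:clone_id", i.e. a modality prefix followed by a column name. A minimal sketch of how such a key could be split is shown here. The helper name split_mod_key and its default_mod parameter are hypothetical and not part of scirpy or muon; this only illustrates the convention, not how either library resolves keys.

```python
def split_mod_key(key, default_mod=None):
    """Split a key such as "airr:clone_id" into ("airr", "clone_id").

    Hypothetical helper for illustration only. Keys without a prefix are
    returned together with `default_mod`, which covers the case where a
    plain AnnData object is passed and no modality is involved.
    """
    # partition() splits on the first colon only
    mod, sep, col = key.partition(":")
    if not sep:
        return default_mod, key
    if not mod or not col:
        raise ValueError(f"malformed key {key!r}; expected 'modality:column'")
    return mod, col
```

With this convention, the same function body could accept airr_col="airr:clone_id" for MuData and airr_col="clone_id" for AnnData.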

Possible solution

I'm leaning towards having all functions operate on MuData directly,
i.e.

ir.tl.something(mdata, airr_key="airr", col="airr:xxx")

with the option to also pass an AnnData object for backwards compatibility (in that case, airr_key will be ignored).

ir.tl.something(adata, col="xxx")
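
A rough sketch of how this dual input could be handled inside a function is given here. The helper resolve_airr is hypothetical and not existing scirpy API. It assumes that a MuData object can be recognised by its .mod mapping of modalities and that anything without that attribute is a plain AnnData.

```python
def resolve_airr(data, airr_key="airr"):
    """Return the AnnData-like object that holds the receptor data.

    Hypothetical helper for illustration only. A MuData-like container is
    detected via its `.mod` mapping; any other object is assumed to be a
    plain AnnData and is returned unchanged, so `airr_key` is ignored.
    """
    mod = getattr(data, "mod", None)
    if mod is None:
        # backwards-compatible path: an AnnData object was passed directly
        return data
    if airr_key not in mod:
        raise KeyError(
            f"modality {airr_key!r} not found; available: {sorted(mod)}"
        )
    return mod[airr_key]
```

Each public function would then start with adata = resolve_airr(data, airr_key) and continue as before. Duck typing is used here only to keep the sketch free of dependencies; an isinstance check against MuData would work as well.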


grst · 2023-02-20T06:48:55Z

@gtca, would be great to get your input on this!

gtca · 2023-02-24T08:32:27Z

Hey!

I'll try to briefly comment on this below (and I'm happy to catch up later to discuss it further).

API considerations for unimodal data

For a function that only operates on the AIRR data, what is the preferred option to interact with mudata?

The way this is addressed in muon, for example, is to support both. Referring directly to the mdata['airr'] AnnData object might be cleaner and simpler, but in many cases it would be enough to write ir.tl.chain_qc(mdata) thanks to the defaults, which is nice (although it requires an additional parameter).

Should a function that only operates on the AIRR data add columns to mdata or adata?

I would say adata. Calling .update_obs() inside the function might have unintended consequences: if this "synchronisation" hasn't been performed before, more will happen than just copying one column. (My current thinking is that only functions that would break without .update() should run it internally, with some log trace.) Plus, we might reconsider in the future whether columns should always be copied on update by default.
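
One way to read this suggestion is sketched here: the function writes to the AIRR AnnData and only synchronises the MuData object when explicitly asked to, leaving a log message when it does. The helper add_airr_column and its sync flag are hypothetical. The sketch assumes that mdata.mod maps modality names to AnnData objects and that mdata.update_obs() exists, as used in the question above.

```python
import logging

logger = logging.getLogger(__name__)


def add_airr_column(mdata, name, values, airr_key="airr", sync=False):
    """Write `values` to the AIRR modality's .obs; sync to mdata only on request.

    Hypothetical helper for illustration only, not scirpy or muon API.
    """
    adata = mdata.mod[airr_key]
    # the new column lives on the modality, not on the MuData object
    adata.obs[name] = values
    if sync:
        # update_obs() may copy more than this one column, so leave a trace
        logger.info("Calling mdata.update_obs() after adding %r", name)
        mdata.update_obs()
    return adata
```

With sync=False, which is the default in this sketch, the caller decides when to run mdata.update_obs().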

Use muon for plotting or scanpy?

I think mu.pl starts being very useful when information from different modalities is used. I am not sure whether it makes more sense semantically to have X_umap in mdata['gex'].obsm or in mdata.obsm, but generally, mu.pl.embedding(mdata, basis="gex:X_umap", color="gex:cluster") should address this point.

For the last point in the original post, I think the proposed solution is a reasonable one.

Exciting to see this taking shape!

grst mentioned this issue Feb 20, 2023

Implement scverse datastucture #356

Merged

48 tasks

grst closed this as completed in #356 Apr 7, 2023

grst added this to scirpy-dev May 28, 2024

grst moved this to Done in scirpy-dev May 28, 2024
