Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement scverse datastucture #356

Merged
merged 173 commits into from
Apr 7, 2023
Merged

Implement scverse datastucture #356

merged 173 commits into from
Apr 7, 2023

Conversation

grst
Copy link
Collaborator

@grst grst commented Aug 13, 2022

Implementing the changes suggested in #327.

Close #327
Close #184
Close #383

Requires

  • awkward_array >=2.0.8
  • Anndata >=0.9 installed from the val_shape branch

TODO

  • store index pointing to primary and secondary chains somehwere
  • update the _check_upgrade_schema() wrappers to detect the 0.7 < x <= 0.11 data structure.
  • phase out support for the scirpy <0.7 data structure. The additional effort maintaining this is not worth it.
  • do i still need has_ir and multichain columns?
    -> NO. Get rid of them everywhere.
  • call chain indices independent of IO
    • multi chain must be called at this point because it depends on the strategy (e.g. ignore non-productive chains?)
    • Should chain_qc be called implicitly here? No good reason to change that. Also it adds to obs and that should not happen implicitly.
  • We don't need multichain labels anymore at import time. Actually we only really need it for the chain_qc function.
  • There should not be string representations of "nan", "None", "True", "False" in adata.obsm["airr"]. These special cases must be caught beforehand (it loading/saving doesn't interfere, it should definitely be possible to rely on the data being cleaned during IO)

  • first-class mudata support -> this should just work, but we need to document this.
  • can we get rid of the merge_adata functions? (-> We got rid of merge_with_ir, and replace merge_airr_chains with merge_airr. There is probably some functionality to expand (e.g. add airr chains on IR object with different dimensions), but we'll see how that works out in practice (see 'documentation' section of this checklist).

Spatial & bulk ready

  • Store version of receptor model in adata.uns

For spatial data (visium) we may have spots instead of cells. That means our usual receptor model doesn't fit.
While I am not going to implement all required changes for that in this PR, it would be good if the data structure
could already be used without backwards-incompatible changes for that.

  • consider changing .obsm["chain_indices"] to an awkward array with
[
    {"VJ": [0], "VDJ": [1], "multichain": False}, # single pair
    {"VJ": [0, 2], "VDJ": [1,3], "multichain": False}, # dual IR
    {"VJ": [1,2,3,6,7,8,12,14], "VDJ": None, "multichain": True}, # other indexing strategy for bulk/visium 
]

  • update all tests and test data
  • edge cases:
    • empty anndata objects (gives reasonable error)
    • missing cdr3 sequences, ...
  • How to deal with cells that have no receptors (i.e. obsm['chain_indices'] is NaN) versus cells that have a receptor, but no CDR3 sequences. Currently (also in the old implementation) they are treated differently (the former receives 'nan' as a clonotype, the latter a separate clonotype

  • consistent use of airr_key / chain_idx_key in all functions?
    • maybe better to not index_chains by default during IO, but instead perform it on-the-fly when missing, as done elsewhere.
  • conversion from old schema to new datastructure
    • It might be better to have a warning message that says that .obsm[airr_key] is missing instead of complaining that the schema is outdated, as this is only one possibility why the key is missing. Or change the check to having IR_VJ... columns in adata.obs. But those could also have been added manually 🤷
  • useful error messages if airr or chain_indices is missing from obsm
  • Do API considertions still hold when there are cells with AIRR but no GEX, or cells with GEX but no AIRR?
  • Checks: Search for occurences of get_airr, get.airr, "chain_indices", _has_ir.
  • Check internal functions that take adata as attribute
  • Not all functions need the _ParamsCheck. In particular, some functions only consume obs and mudata/anndata can be used interchangeably with the corresponding keys: gex:xxx, airr:xxx.
  • update example data in scirpy.datasets
  • add tests that everything works with both MuData and AnnData
  • Check systematically where inplace operations write to

Get module

  • Do we provide a context manager for obs?
  • Tests for context managers
  • [ ] Do we want a most_frequent function or others?

Documentation

  • changelog (ideally covering changes unrelated to the datastructure)
  • Go through tutorials
  • Go through docstrings
  • everything in API docs that should be there?
  • Mudata support and highlighted as recommended way of working with AIRR + GEX data
  • concatenation and merging:
# Merging AIRR with gex data
# 1. MuData (recommended)
TODO
# 2. assigning `.obsm["airr"]`
adata_gex.obsm["airr"] = adata.obsm["airr"]
ir.pp.index_chains(adata_gex)
# (in both cases, `obs_names` need to be the same)
  • changed behavior of merge_airr_chains
    -> this function now returns an AnnData object and doesn't modify inplace anymore. Also it removes all non-airr information.
  • Support of non-IGMT loci (given the new modularization, this should be easy. Maybe still not very often requested, but add to receptor model and open an issue!)
  • revisit IO constraints section in the docs (we are probably going to address all of these issues implicitly in this PR)
  • By buildin an awkward array of record types, the keys become the outer join of all individual chain dictionaries. Missing fields are represented with an option type and not distinguishable from a missing value.
  • Breaking change: API update of pl.spectratype and tl.spectratype
  • changed behavior of how it is dealt with missing chains during clonotype calling (0456bff)

Final checks:

  • keep public functions that were not deprecated before and add stub functions
  • search for occurence of IR_ / IR_VJ / IR_VDJ just in case something is not covered by tests
  • search for occurence of has_ir, multichain
  • glossary for awkward array
  • search for TODOs

Follow-up

  • dandelion shared data structure --> open Issue

@grst grst mentioned this pull request Oct 9, 2022
@grst grst marked this pull request as ready for review March 29, 2023 09:35
@grst grst merged commit d8ec147 into master Apr 7, 2023
@grst grst deleted the scverse_datastructure branch April 7, 2023 06:39
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

MuData API considerations scverse datastructure for AIRR data Easy-access getter functions
1 participant