Implement scverse datastructure #356

grst · 2022-08-13T17:52:48Z

Implementing the changes suggested in #327.

Close #327
Close #184
Close #383

Requires

awkward_array >=2.0.8
Anndata >=0.9 ~~installed from the val_shape branch~~

TODO

first-class mudata support -> this should just work, but we need to document this.
can we get rid of the merge_adata functions? -> We got rid of merge_with_ir and replaced merge_airr_chains with merge_airr. There is probably some functionality to expand (e.g. adding AIRR chains to an IR object with different dimensions), but we'll see how that works out in practice (see the 'Documentation' section of this checklist).

Spatial & bulk ready

Store version of receptor model in adata.uns

For spatial data (visium) we may have spots instead of cells. That means our usual receptor model doesn't fit.
While I am not going to implement all required changes for that in this PR, it would be good if the data structure
could already be used without backwards-incompatible changes for that.

consider changing .obsm["chain_indices"] to an awkward array with

[
    {"VJ": [0], "VDJ": [1], "multichain": False}, # single pair
    {"VJ": [0, 2], "VDJ": [1,3], "multichain": False}, # dual IR
    {"VJ": [1,2,3,6,7,8,12,14], "VDJ": None, "multichain": True}, # other indexing strategy for bulk/visium 
]
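
To make the proposed layout concrete, here is a minimal plain-Python sketch (no awkward array and no scirpy involved) of how one such index record could be resolved against the list of chains of an observation. The function name `resolve_chains` and the example chain dictionaries are hypothetical and only serve as illustration; `"VDJ": None` is assumed to mean "no indices for this chain type".

```python
from typing import Dict, List, Optional


def resolve_chains(chains: List[dict], indices: Dict[str, object]) -> Dict[str, object]:
    """Look up the chain records referenced by one `chain_indices` entry."""

    def pick(idx: Optional[List[int]]) -> List[dict]:
        # Assumption: `None` means this chain type is not indexed for the observation
        return [] if idx is None else [chains[i] for i in idx]

    return {
        "VJ": pick(indices["VJ"]),
        "VDJ": pick(indices["VDJ"]),
        "multichain": indices["multichain"],
    }


# Hypothetical chains of a single cell with one TRA/TRB pair
cell_chains = [
    {"locus": "TRA", "junction_aa": "CAVRDSNYQLIW"},
    {"locus": "TRB", "junction_aa": "CASSLGTDTQYF"},
]
single_pair = {"VJ": [0], "VDJ": [1], "multichain": False}

resolved = resolve_chains(cell_chains, single_pair)
print([c["locus"] for c in resolved["VJ"]], [c["locus"] for c in resolved["VDJ"]])
```

The same lookup works unchanged for the bulk/visium record, where all indices live under `"VJ"` and `"VDJ"` is `None`.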

update all tests and test data
edge cases:
- empty anndata objects (gives reasonable error)
- missing cdr3 sequences, ...
How to deal with cells that have no receptors (i.e. obsm['chain_indices'] is NaN) versus cells that have a receptor but no CDR3 sequences? Currently (as in the old implementation) they are treated differently: the former receive 'nan' as a clonotype, the latter a separate clonotype.
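
The distinction in the last edge case can be illustrated with a small plain-Python sketch. This is not scirpy's clonotype-calling code; the helper `clonotype_key` and the cell data are hypothetical, and it is an assumption of this sketch that cells with a receptor but without CDR3 sequences end up sharing one clonotype.

```python
from typing import Dict, List, Optional, Tuple


def clonotype_key(cdr3s: Optional[List[Optional[str]]]) -> Optional[Tuple[Optional[str], ...]]:
    """Return a grouping key, or None if the cell has no receptor at all."""
    if cdr3s is None:
        # No receptor: the cell will be reported with the clonotype 'nan'
        return None
    # Receptor present: missing CDR3 sequences stay in the key as None
    return tuple(cdr3s)


cells: Dict[str, Optional[List[Optional[str]]]] = {
    "cell_a": ["CAVRDSNYQLIW", "CASSLGTDTQYF"],
    "cell_b": None,          # no receptor
    "cell_c": [None, None],  # receptor, but no CDR3 sequences
    "cell_d": [None, None],  # receptor, but no CDR3 sequences
}

clonotype_ids: Dict[tuple, int] = {}
assignment: Dict[str, str] = {}
for cell, cdr3s in cells.items():
    key = clonotype_key(cdr3s)
    if key is None:
        assignment[cell] = "nan"
    else:
        # Reuse the id of an already seen key, otherwise hand out the next one
        assignment[cell] = str(clonotype_ids.setdefault(key, len(clonotype_ids)))

print(assignment)
```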

Get module

Do we provide a context manager for obs?
Tests for context managers
~~[ ] Do we want a most_frequent function or others?~~

Documentation

changelog (ideally covering changes unrelated to the datastructure)
Go through tutorials
Go through docstrings
everything in API docs that should be there?
MuData support, highlighted as the recommended way of working with AIRR + GEX data
concatenation and merging:

# Merging AIRR with gex data
# 1. MuData (recommended)
# TODO
# 2. assigning `.obsm["airr"]`
adata_gex.obsm["airr"] = adata.obsm["airr"]
ir.pp.index_chains(adata_gex)
# (in both cases, `obs_names` need to be the same)

changed behavior of merge_airr_chains
-> this function now returns an AnnData object and no longer modifies the input in place. It also removes all non-AIRR information.
Support of non-IMGT loci (given the new modularization, this should be easy. It may still not be requested very often, but add it to the receptor model and open an issue!)
revisit IO constraints section in the docs (we are probably going to address all of these issues implicitly in this PR)

By building an awkward array of record types, the keys become the outer join of all individual chain dictionaries. A field that is absent from a chain dictionary is represented with an option type and is therefore not distinguishable from a field whose value is missing.
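
A minimal plain-Python sketch of this "outer join" of keys is shown below. It only mimics the effect on ordinary dictionaries and does not use awkward array itself; the helper `outer_join_records` and the chain dictionaries are hypothetical.

```python
from typing import Dict, List, Optional


def outer_join_records(records: List[Dict[str, Optional[str]]]) -> List[Dict[str, Optional[str]]]:
    """Give every record the union of all keys, filling absent fields with None."""
    keys: List[str] = []
    for rec in records:
        for key in rec:
            if key not in keys:  # keep first-seen order of the keys
                keys.append(key)
    return [{key: rec.get(key) for key in keys} for rec in records]


chains = [
    {"locus": "TRA", "junction_aa": "CAVRDSNYQLIW"},            # no `d_call` field at all
    {"locus": "TRB", "junction_aa": None, "d_call": "TRBD1"},   # `junction_aa` explicitly missing
]

joined = outer_join_records(chains)
for rec in joined:
    print(rec)
```

After the join, the absent `d_call` of the first chain and the explicitly missing `junction_aa` of the second chain are both `None`, which is the loss of information described above.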
Breaking change: API update of pl.spectratype and tl.spectratype

changed how missing chains are handled during clonotype calling (0456bff)

Final checks:

keep public functions that were not deprecated before and add stub functions
search for occurrences of IR_ / IR_VJ / IR_VDJ just in case something is not covered by tests
search for occurrences of has_ir, multichain
glossary for awkward array
search for TODOs

Follow-up

dandelion shared data structure --> open Issue

grst added 11 commits August 13, 2022 19:49

Create Awkward AnnData instead of putting everything in obs

a34afc1

add todo

e65c138

Get chain indices for primary and secondary chains

e1646b7

WIP get module

799f611

Implement ir.get.airr

da0b096

Clean up AirrCell

1a3e4d7

WIP restructure IO module

c7280b5

fix imports

6b36de0

Add helper function for unit tests

d13c254

tl.chain_qc successfully runs on the new datastructure

f3d82fb

Update convert anndata

966d17d

grst mentioned this pull request Oct 9, 2022

Scalability to >1M cells #370

Open

grst added 18 commits October 10, 2022 15:51

Merge branch 'master' into scverse_datastructure

4e20248

Merge branch 'master' into scverse_datastructure

6be7e85

switch to obsm-based data structure

423f447

update get module

6924d2a

Update anndata schema check and _make_adata util function.

bd6bb86

fix _make_adata

7fe9ac9

update fixtures

2dadc53

Fix a couple of tests

b906fce

Re-add to_airr_cells

fe150e0

Fix couple more tests

5ba4891

Fix more IO tests

bc859c4

More IO tests [skip ci]

cad4733

Merge remote-tracking branch 'origin/master' into scverse_datastructure

08341a4

Cleanup has_ir

6204a09

WIP fix clonotype neighbors [skip ci]

6ccf781

WIP fix distance tests

9a091f2

WIP fix clonotype cluster tests

5a185d1

Fix spectratype functions [skip ci]

3ba6051

Update MuData section [skip ci]

17e7a17

grst force-pushed the scverse_datastructure branch from d60ad64 to 17e7a17 Compare March 25, 2023 11:03

grst added 6 commits March 28, 2023 17:20

WIP update IO tutorial

984441e

Update IO tutorial

bf2b11b

Update datastructure section with info about single AnnData object

6284f6f

Update main tutorial

e75dde5

Update API docs page

c7aaf1d

Minor doc amendments

e31484c

grst marked this pull request as ready for review March 29, 2023 09:35

grst and others added 19 commits March 29, 2023 13:21

WIP update docstrings

d81c4ad

Fix docstrings

0a288ed

Fix TODOs

c278c05

Fix sphinx warnings

be67fab

update isort

ce30a82

[pre-commit.ci] auto fixes from pre-commit.com hooks

8ad2e8b

for more information, see https://pre-commit.ci

constrain pandas

cd0a9fc

Pandas workarounds

6e19241

Revert "Pandas workarounds"

0c3640e

This reverts commit 6e19241.

pandas version

718125b

Fix problem with color by gene in clonotype_network

bcc5a09

fix missing import in datasets

f1831d3

cancel previous CI jobs automatically

ccf2604

test ci

b0e6807

Concurrency should be outside 'jobs'

70d7bea

test ci

04a9062

Merge remote-tracking branch 'origin/master' into scverse_datastructure

945c328

Update dependencies

3df7f5f

Update conda dependencies

7c80137

Will fail, because anndata 0.9rc1 is not on conda.

grst merged commit d8ec147 into master Apr 7, 2023

grst deleted the scverse_datastructure branch April 7, 2023 06:39
