Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Read error on TabulaSapients 10X data #22

Open
rasmushenningsson opened this issue Aug 19, 2024 · 9 comments
Open

Read error on TabulaSapients 10X data #22

rasmushenningsson opened this issue Aug 19, 2024 · 9 comments

Comments

@rasmushenningsson
Copy link
Collaborator

Transferred issue from here:
rasmushenningsson/SingleCell10x.jl#5

Hi Rasmus,

Just for fun, I thought I'd try applying SingleCellProjections to a dataset I had on hand, the TabulaSapiens eye .h5ad file. If you want to grab this file yourself, you can find it here: https://figshare.com/articles/dataset/Tabula_Sapiens_release_1_0/14267219?file=34701970.

However, it seems that load_h5ad isn't particularly happy with this file:

(jl_NPMjKi) pkg> st # Julia 1.11.0-rc2
Status `/tmp/jl_NPMjKi/Project.toml`
  [03d38035] SingleCellProjections v0.4.0

julia> counts = load_counts("/data/gene_variants_data/extracted/tabula_sapiens/TS_Eye.h5ad/TS_Eye.h5ad")
ERROR: unexpected character '\x04' after quoted field at row 250 column 1
Stacktrace:
  [1] error(s::String)
    @ Base ./error.jl:35
  [2] 
    @ DelimitedFiles ~/.julia/packages/DelimitedFiles/aGcsu/src/DelimitedFiles.jl:727
  [3] readdlm_string(sbuff::String, dlm::Char, T::Type, eol::Char, auto::Bool, optsd::Dict{Symbol, Union{…}})
    @ DelimitedFiles ~/.julia/packages/DelimitedFiles/aGcsu/src/DelimitedFiles.jl:461
  [4] readdlm_auto(input::IOStream, dlm::Char, T::Type, eol::Char, auto::Bool; opts::@Kwargs{})
    @ DelimitedFiles ~/.julia/packages/DelimitedFiles/aGcsu/src/DelimitedFiles.jl:231
  [5] readdlm_auto
    @ ~/.julia/packages/DelimitedFiles/aGcsu/src/DelimitedFiles.jl:231 [inlined]
  [6] readdlm
    @ ~/.julia/packages/DelimitedFiles/aGcsu/src/DelimitedFiles.jl:226 [inlined]
  [7] readdlm
    @ ~/.julia/packages/DelimitedFiles/aGcsu/src/DelimitedFiles.jl:86 [inlined]
  [8] _read10x_features(io::IOStream; delim::Char)
    @ SingleCell10x ~/.julia/packages/SingleCell10x/XLhW5/src/fileio.jl:306
  [9] _read10x_features
    @ ~/.julia/packages/SingleCell10x/XLhW5/src/fileio.jl:305 [inlined]
 [10] #19
    @ ~/.julia/packages/SingleCell10x/XLhW5/src/fileio.jl:315 [inlined]
 [11] #1
    @ ~/.julia/packages/SingleCell10x/XLhW5/src/fileio.jl:57 [inlined]
 [12] open(f::SingleCell10x.var"#1#2"{SingleCell10x.var"#19#20"{@Kwargs{delim::Char}}, Bool}, args::String; kwargs::@Kwargs{})
    @ Base ./io.jl:410
 [13] open
    @ ./io.jl:407 [inlined]
 [14] _open(f::SingleCell10x.var"#19#20"{@Kwargs{delim::Char}}, filename::String)
    @ SingleCell10x ~/.julia/packages/SingleCell10x/XLhW5/src/fileio.jl:55
 [15] #_read10x_features#18
    @ ~/.julia/packages/SingleCell10x/XLhW5/src/fileio.jl:314 [inlined]
 [16] _read10x_features_triplet(filename::String; guess::Function, kwargs::@Kwargs{})
    @ SingleCell10x ~/.julia/packages/SingleCell10x/XLhW5/src/fileio.jl:329
 [17] _read10x_features_triplet
    @ ~/.julia/packages/SingleCell10x/XLhW5/src/fileio.jl:327 [inlined]
 [18] #_read10x_features_autodetect#22
    @ ~/.julia/packages/SingleCell10x/XLhW5/src/fileio.jl:324 [inlined]
 [19] _read10x_features_autodetect
    @ ~/.julia/packages/SingleCell10x/XLhW5/src/fileio.jl:320 [inlined]
 [20] read10x_features(io::String, featuretype::Type; kwargs::@Kwargs{})
    @ SingleCell10x ~/.julia/packages/SingleCell10x/XLhW5/src/fileio.jl:357
 [21] read10x_features
    @ ~/.julia/packages/SingleCell10x/XLhW5/src/fileio.jl:356 [inlined]
 [22] _load10x_metadata
    @ ~/.julia/packages/SingleCellProjections/0yZXZ/src/load.jl:96 [inlined]
 [23] load10x(filename::String; lazy::Bool, var_id::Nothing, var_id_delim::Char, kwargs::@Kwargs{})
    @ SingleCellProjections ~/.julia/packages/SingleCellProjections/0yZXZ/src/load.jl:135
 [24] load10x
    @ ~/.julia/packages/SingleCellProjections/0yZXZ/src/load.jl:130 [inlined]
 [25] #29
    @ ./broadcast.jl:1306 [inlined]
 [26] _broadcast_getindex_evalf
    @ ./broadcast.jl:673 [inlined]
 [27] _broadcast_getindex
    @ ./broadcast.jl:646 [inlined]
 [28] getindex
    @ ./broadcast.jl:605 [inlined]
 [29] copy
    @ ./broadcast.jl:906 [inlined]
 [30] materialize
    @ ./broadcast.jl:867 [inlined]
 [31] load_counts(loadfun::typeof(load10x), filenames::String; sample_names::Nothing, sample_name_col::Nothing, lazy::Bool, lazy_merge::Bool, obs_id_col::String, obs_id_delim::Char, obs_id_prefixes::Nothing, extra_var_id_cols::Symbol, duplicate_var::Nothing, duplicate_obs::Nothing, callback::Nothing, kwargs::@Kwargs{})
    @ SingleCellProjections ~/.julia/packages/SingleCellProjections/0yZXZ/src/load.jl:221
 [32] load_counts
    @ ~/.julia/packages/SingleCellProjections/0yZXZ/src/load.jl:199 [inlined]
 [33] load_counts(filenames::String)
    @ SingleCellProjections ~/.julia/packages/SingleCellProjections/0yZXZ/src/load.jl:227
 [34] top-level scope
    @ REPL[8]:1
Some type information was truncated. Use `show(err)` to see complete types.

Locally modifying _read10x_features to get the first 200 bytes of the IO that readdlm was called on, this is what I find:

\x89HDF\r\n\x1a\n\0\0\0\0\0\b\b\0\x04\0\x10\0\0\0\0\0\0\0\0\0\0\0\0\0\xff\xff\xff\xff\xff\xff\xff\xff\$\x84\xd5F\0\0\0\0\xff\xff\xff\xff\xff\xff\xff\xff\0\0\0\0\0\0\0\0`\0\0\0\0\0\0\0\x01\0\0\0\0\0\0\0\x88\0\0\0\0\0\0\0\xa8\x02\0\0\0\0\0\0\x01\0\x01\0\x01\0\0\0\x18\0\0\0\0\0\0\0\x11\0\x10\0\0\0\0\0\x88\0\0\0\0\0\0\0\xa8\x02\0\0\0\0\0\0TREE\0\0\x01\0\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\0\0\0\0\0\0\0\0\xe0\x05\0\0\0\0\0\0 \0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0...
@rasmushenningsson
Copy link
Collaborator Author

Hi @tecosaur,

Unfortunately the h5ad support is experimental and not really tested. (There are different versions of the .h5ad format and I've only tried with one of them, quite some time ago.)

At the moment, you need to use loadh5ad, not load_counts to load .h5ad files. But I tried with your file and that doesn't work either. 🤦
(I should really try to improve the interface here.)

The good news is that it's probably an easy fix. I'll try to take look soon!

@tecosaur
Copy link

I'm guessing you've been short on time, do let me know when you've been able to spare a bit more on this though.

Speaking of the interface: from the impression I was able to get at JuliaCon and here, it does seem like SingleCellProjections is 80-90% of the way to a really nice go-to package for single cell analysis, with polish being the main thing needed. I still think it could be fun to see if you could write a julialang.org blog post on it once that's happened 🙂.

@rasmushenningsson
Copy link
Collaborator Author

Yes, I've been really busy lately unfortunately. But I did find some time today. Work in progress here: https://github.com/BioJulia/SingleCellProjections.jl/tree/h5adFixes

What's missing for loadh5ad is some code to figure out that certain annotation are categorical etc. and decode that. I expect it to be fairly trivial. When that's done, I'll merge and release a new version.

Down the line, I'll probably refactor the h5ad reading a bit. h5ad is basically a container format, and it seems tricky to really know what to expect inside. I think an easy to use interface to inspect what's inside a h5ad file (without loading) and then let the user choose what to load into the appropriate SingleCellProjections objects is the way to go.

Thanks for the encouragement! I'm working on some things that I really think will make it easier to use SingleCellProjections.jl. When that's done, a blog post sounds like a good idea!

@tecosaur
Copy link

Down the line, I'll probably refactor the h5ad reading a bit. h5ad is basically a container format, and it seems tricky to really know what to expect inside. I think an easy to use interface to inspect what's inside a h5ad file (without loading) and then let the user choose what to load into the appropriate SingleCellProjections objects is the way to go.

A quick comment on this: adding Muon as a lazy-dep for h5ad might help here?

@rasmushenningsson
Copy link
Collaborator Author

I initially want to avoid a lazy-dep on Muon, since it will load the entire h5ad file. For use in SingleCellProjections, you typically only want part of it (e.g. the raw count data and the annotations), so loading it all potentially wastes a lot of RAM.

However, I realize now that it's by far the quickest way to get good h5ad support:

  • Supporting different versions of the h5ad file format
  • Handling the different kinds of annotations, such as categorical, ordered, etc.

Down the line, I can then replace the Muon dep with a lazy h5ad reader. (Probably building on the one from Muon. In the ideal case, the Muon devs would also be interested in this. We'll see.)

@tecosaur
Copy link

Perhaps the Muon devs might even be interested in adding a way to just load the bits you need in SingleCellProjectsions?

@rasmushenningsson
Copy link
Collaborator Author

Here's a PR #24 which implements loading of h5ad files using Muon.
If you have any feedback on the interface, I can incorporate that before merging the PR.
Here's the docstring for create_datamatrix the main function in the PR:

  create_datamatrix([T], a::AnnData; add_var=false, add_obs=false)
  create_datamatrix([T], am::AlignedMapping, name; add_var=false, add_obs=false)

Creates a DataMatrix from an AnnData object.
By default, the main matrix X is retrieved from a::AnnData.
It is also possible to create DataMatrices from named objects in: a.layers, a.obsm, a.obsp, a.varm and a.varp. See examples below.

The optional parameter T determines the eltype of the returned matrix. If specified, the matrix will be converted to have this eltype.

kwargs:

  • add_var: Add var from the AnnData object to the returned DataMatrix (when applicable).
  • add_obs: Add obs from the AnnData object to the returned DataMatrix (when applicable).

Examples

All examples below assume that an AnnData object has been loaded first:

julia> using Muon

julia> a = readh5ad("path/to/file.h5ad");
  • Load the main matrix X from an AnnData object.
julia> create_datamatrix(a)
DataMatrix (123 variables and 456 observations)
  SparseMatrixCSC{Float32, Int32}
  Variables: id
  Observations: cell_id
  • Load the main matrix X from an AnnData object, and add var/obs annotations.
julia> create_datamatrix(a; add_var=true, add_obs=true)
DataMatrix (123 variables and 456 observations)
  SparseMatrixCSC{Float32, Int32}
  Variables: id, feature_type, ...
  Observations: cell_id, cell_type, ...
  • Load the main matrix X from an AnnData object, with eltype Int. NB: This will fail if the matrix is not a count matrix.
julia> create_datamatrix(Int, a)
DataMatrix (123 variables and 456 observations)
  SparseMatrixCSC{Int64, Int32}
  Variables: id
  Observations: cell_id
  • Load the matrix named raw_counts from layers, with eltype Int. NB: This will fail if the matrix is not a count matrix.
julia> create_datamatrix(Int, a.layers, "raw_counts")
DataMatrix (123 variables and 456 observations)
  SparseMatrixCSC{Int64, Int32}
  Variables: id
  Observations: cell_id
  • Load the matrix named UMAP from obsm.
julia> create_datamatrix(a.obsm, "UMAP")
DataMatrix (2 variables and 456 observations)
  Matrix{Float64}
  Variables: id
  Observations: cell_id

@rasmushenningsson
Copy link
Collaborator Author

I've just registered a new release including the Muon.jl extension.

After loading SingleCellProjections.jl and Muon.jl, you should be able to load the dataset with:

julia> anndata = readh5ad("path/to/TS_Eye.h5ad")
AnnData object 1065058870

julia> counts = SingleCellProjections.create_datamatrix(Int, anndata.layers, "raw_counts", add_var=true, add_obs=true)
DataMatrix (58870 variables and 10650 observations)
  SparseArrays.SparseMatrixCSC{Int64, Int32}
  Variables: id, gene_symbol, feature_type, ensemblid, highly_variable, means, dispersions, dispersions_norm, mean, ...
  Observations: cell_id, organ_tissue, method, donor, anatomical_information, n_counts_UMIs, n_genes, cell_ontology_class, free_annotation, ...

In general, you need to look inside the anndata object to figure out which matrix to load. The raw counts are usually stored in anndata.X or as a layer in anndata.layers. Note matrices are stored as floats in h5ad, so I pass Int as the first parameter to convert the count to integers which is needed in SingleCellProjections.jl.

@tecosaur
Copy link

tecosaur commented Nov 6, 2024

Brilliant, thanks for working on this!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants