Reading a .tsv.gz file #18

Open
slowkow opened this issue Apr 11, 2020 · 2 comments
slowkow commented Apr 11, 2020

Suppose we have a matrix of single-cell RNA-seq data that looks like this:

zcat exprMatrix.tsv.gz | head | cut -f1-5

gene    sci3-me-001.GTCGGAGTTTGAGGTAGAA sci3-me-001.ATTAGTCTGTGTATAATACG        sci3-me-001.GAGGAACTTAATACCATCC sci3-me-001.TTCGCGGATACTCTCTCAA
ENSMUSG00000051951.5|Xkr4       0       0       0       0
ENSMUSG00000103377.1|Gm37180    0       0       0       0
ENSMUSG00000104017.1|Gm37363    0       0       0       0
ENSMUSG00000103025.1|Gm37686    0       0       0       0
ENSMUSG00000089699.1|Gm1992     0       0       0       0
ENSMUSG00000103201.1|Gm37329    0       0       0       0
ENSMUSG00000103161.1|Gm38148    0       0       0       0
ENSMUSG00000102331.1|Gm19938    0       0       0       0
ENSMUSG00000102343.1|Gm37381    0       0       0       0

Rows:

zcat exprMatrix.tsv.gz | wc -l

26184

Columns:

zcat exprMatrix.tsv.gz | head -n1 | wc -w

2058653

Could I please ask if you might be able to share an R code snippet for how to use the beachmat package to read this data into a sparse matrix (dgCMatrix)?

LTLA (Member) commented Apr 12, 2020

There seem to be a few misconceptions here.

beachmat provides an API for reading from (and writing to) R matrices from within C++ code inside R packages, typically via Rcpp. The intended application is that you already have your matrix or matrix-like object in R and you want to perform some computation on its columns or rows in C++. In this case, one would write C++ code that links to beachmat's headers in order to ensure that the C++ code works for any matrix representation, e.g., ordinary, sparse, file-backed, and so on.

beachmat is not directly designed for I/O between R and the file system. One might say that it operates between C++ and the file system, but that's only for file-backed matrices like HDF5Matrix objects. So, the technically correct answer to your question would be to create a DelayedArray backend that reads from the TSV file, which would then be used by beachmat to read rows (and columns, though that's much harder here) on demand in any C++ code that links to beachmat.

Probably the more useful answer to your question is to use scater::readSparseCounts to read it in as a sparse matrix. Note that dgCMatrix objects do not support more than .Machine$integer.max non-zero observations, so YMMV.
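To put the integer.max caveat in numbers, here is a rough back-of-the-envelope sketch using the counts reported above. It assumes the first row is a header, the first column holds the gene IDs, and no column name contains whitespace (otherwise `wc -w` would overcount). The variable names are only illustrative.

```shell
# Counts reported earlier in this thread; both include the header.
lines=26184       # zcat exprMatrix.tsv.gz | wc -l
fields=2058653    # zcat exprMatrix.tsv.gz | head -n1 | wc -w

genes=$((lines - 1))        # assumption: first row is the header
cells=$((fields - 1))       # assumption: first column is the "gene" ID column
entries=$((genes * cells))  # total number of matrix entries (needs 64-bit shell arithmetic)
int_max=2147483647          # value of .Machine$integer.max in R

# Largest percentage of non-zero entries that still fits under int_max.
max_pct=$(awk -v e="$entries" -v m="$int_max" 'BEGIN { printf "%.2f", 100 * m / e }')

echo "genes=$genes cells=$cells entries=$entries"
echo "non-zero entries must stay below about ${max_pct}% of the matrix"
```

With these figures the matrix has 53901685316 entries, so it would only fit in a single dgCMatrix if fewer than roughly 3.98% of them are non-zero.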

slowkow (Author) commented Apr 12, 2020

Thanks for the helpful description of beachmat.

scater::readSparseCounts() is exactly what I was looking for. Thanks!
