This repository has been archived by the owner on Oct 12, 2020. It is now read-only.

Showing how to use R & memory-mapping to analyze data encoded as large matrices #4

Open
privefl opened this issue Oct 20, 2017 · 4 comments

@privefl
Contributor

privefl commented Oct 20, 2017

For many types of genomic data, most of the information can be stored as matrices. The most striking example is SNP data, which can be stored as matrices with thousands to hundreds of thousands of rows (samples) and hundreds of thousands to tens of millions of columns (SNPs) (Bycroft et al. 2017). This results in datasets of gigabytes to terabytes.

Other fields in genomics, such as proteomics or expression data, also use data stored as matrices that can be larger than the available memory.

To handle large datasets in R, we can use memory-mapping to access large matrices stored on disk instead of in RAM. This has been possible in R for several years thanks to the bigmemory package (Kane, Emerson, and Weston 2013).
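To make the principle concrete, here is a minimal, language-neutral sketch of memory-mapping, written in Python with only the standard library. It is not the bigmemory or bigstatsr API. The file name `matrix.bk`, the helper names, and the tiny 4 x 3 matrix are assumptions chosen for illustration. The matrix is stored on disk in column-major order (the layout R uses for matrices), and column means are computed by reading one column at a time through the memory map, so the whole matrix never needs to be loaded into RAM.

```python
import mmap
import os
import struct
import tempfile

DOUBLE = struct.calcsize("<d")  # 8 bytes per stored value


def create_backing_file(path, n_rows, n_cols):
    """Write an n_rows x n_cols matrix of doubles to disk in column-major order."""
    with open(path, "wb") as f:
        for j in range(n_cols):
            # Illustrative values: column j holds j*n_rows, j*n_rows + 1, ...
            column = [float(i + j * n_rows) for i in range(n_rows)]
            f.write(struct.pack(f"<{n_rows}d", *column))


def column_means(path, n_rows, n_cols):
    """Compute column means, decoding one column at a time from the memory map."""
    means = []
    with open(path, "rb") as f, \
            mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
        for j in range(n_cols):
            offset = j * n_rows * DOUBLE  # byte position where column j starts
            column = struct.unpack_from(f"<{n_rows}d", mm, offset)
            means.append(sum(column) / n_rows)
    return means


# Hypothetical backing file in a temporary directory
path = os.path.join(tempfile.mkdtemp(), "matrix.bk")
create_backing_file(path, n_rows=4, n_cols=3)
means = column_means(path, n_rows=4, n_cols=3)
print(means)  # [1.5, 5.5, 9.5]
```

Only the pages of the file that are actually touched get read from disk, which is what makes the same access pattern workable for matrices larger than memory.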

More recently, two packages that use the same principle as bigmemory have been developed: bigstatsr and bigsnpr (Privé, Aschard, and Blum 2017). Package bigstatsr implements many statistical tools for several types of Filebacked Big Matrices (FBMs), making it usable for any type of genomic data that can be encoded as a matrix. The statistical tools in bigstatsr include implementations of multivariate sparse linear models, Principal Component Analysis (PCA), matrix operations, and numerical summaries. Package bigsnpr implements algorithms that are specific to the analysis of SNP arrays, building on features already implemented in bigstatsr.

In this short tutorial, we’ll see the potential benefits of using memory-mapping instead of standard in-memory R matrices, using bigstatsr and bigsnpr.


You can find the first version of the tutorial there.

@zhenyisong
Contributor

I think this should definitely include imaging data. R has interfaces for parsing image data, whatever the format. Imaging data from confocal equipment (or other modern microscopes) are huge and complex to process.

@privefl
Contributor Author

privefl commented Oct 21, 2017

Do you have some example data? Is it stored as matrices?

@zhenyisong
Contributor

No. I plan to use R to process imaging data, including fMRI, in the near future. Imaging data are certainly matrices, so we can use our linear algebra knowledge to deal with them. But our current task does not seem to mention this type of analysis. And here is an interesting link.

@zhenyisong
Contributor

Great. I learned a lot from your elegant code. I know Paris through the work of an American chef, and I learned about social etiquette in Paris from his book, The Sweet Life in Paris.
