RPCA-MKL is an Intel MKL-based, out-of-core C++ implementation of randomized SVD for rank-k approximation of large matrices. It is primarily intended to be used via an R wrapper.
- Clone this git repository
cd fastRPCA
R CMD INSTALL . --no-staged-install
Please see the documentation for usage: ?fastPCA
Note: The pre-compiled MKL libraries are only included for Linux and OS X, so this installation will not work on Windows. Windows users can see Development for details about compiling the code.
Test the code by running the following:
source(sprintf("%s/test.R", system.file("tests", package="fastRPCA")))
- All matrix algebra is done with Intel MKL (pre-compiled version already linked) making it extremely fast
- Row-centering and column-centering (without duplicating the matrix)
- The calculations are 'blocked' allowing it to be 'out-of-core'. This functionality has not been tested appropriately; please see oocRPCA for a functioning version.
- CSV files: when too large for the memory, read block by block from the hard drive
- BED files: when too large for the memory, stored in a compressed 2 bit-per-SNP format, and then decompressed block by block for calculations
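The centering and blocking features above can be illustrated with a small sketch. It is not the package's MKL code: `read_block`, `column_means`, and `centered_times_vector` are hypothetical names, only column-centering is shown, and a plain CSV stream stands in for a file on disk. The idea is that a product with the centered matrix never needs a centered copy, because (A - 1 mean^T) x = A x - (mean . x) for every row, and both passes only hold one block of rows in memory at a time.

```cpp
#include <cassert>
#include <cstddef>
#include <istream>
#include <sstream>
#include <string>
#include <vector>

using Row = std::vector<double>;

// Read up to max_rows comma-separated rows from the stream.
// Returns fewer rows (possibly none) once the input is exhausted.
std::vector<Row> read_block(std::istream& in, std::size_t max_rows) {
    std::vector<Row> block;
    std::string line;
    while (block.size() < max_rows && std::getline(in, line)) {
        if (line.empty()) continue;
        Row row;
        std::stringstream fields(line);
        std::string cell;
        while (std::getline(fields, cell, ',')) row.push_back(std::stod(cell));
        block.push_back(row);
    }
    return block;
}

// Pass 1: accumulate column means with only one block in memory at a time.
Row column_means(std::istream& in, std::size_t block_rows) {
    Row sums;
    std::size_t n = 0;
    for (auto block = read_block(in, block_rows); !block.empty();
         block = read_block(in, block_rows)) {
        for (const Row& row : block) {
            if (sums.empty()) sums.assign(row.size(), 0.0);
            assert(row.size() == sums.size());
            for (std::size_t j = 0; j < row.size(); ++j) sums[j] += row[j];
            ++n;
        }
    }
    for (double& s : sums) s /= static_cast<double>(n);
    return sums;
}

// Pass 2: y = (A - 1 * mean^T) x, computed row by row as A_i . x - mean . x,
// so the centered matrix is never materialized.
Row centered_times_vector(std::istream& in, const Row& mean, const Row& x,
                          std::size_t block_rows) {
    assert(mean.size() == x.size());
    double shift = 0.0;
    for (std::size_t j = 0; j < x.size(); ++j) shift += mean[j] * x[j];
    Row y;
    for (auto block = read_block(in, block_rows); !block.empty();
         block = read_block(in, block_rows)) {
        for (const Row& row : block) {
            assert(row.size() == x.size());
            double dot = 0.0;
            for (std::size_t j = 0; j < x.size(); ++j) dot += row[j] * x[j];
            y.push_back(dot - shift);
        }
    }
    return y;
}
```

To use it, open the same input twice (for example, two `std::ifstream` objects on one CSV file): the first stream goes to `column_means`, the second to `centered_times_vector`. The block size controls peak memory, not the result.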
This package uses the Intel Math Kernel Library, which has highly optimized implementations of BLAS and LAPACK (available as a free download). The lib folder contains a custom-built shared library (see lib/generate_custom_mkl.sh), but the headers cannot be distributed. As such, to compile from source, Intel MKL must be installed on your machine.
- Update the MKL_INCLUDE path in the Makefile. Make sure the compiler in the CC variable is OpenMP-compatible. On OS X, for example, you can use Homebrew to install libiomp5 for Clang/LLVM. Just be sure that it is the compiler being used.
- Run make in the src folder.
- You may need to export LD_LIBRARY_PATH or (on OS X) DYLD_LIBRARY_PATH so that the new executable can find the necessary dynamic libraries in the lib folder. For example: export LD_LIBRARY_PATH="$LD_LIBRARY_PATH:/home/user/Downloads/fastPCA/lib"
This implementation is based on Algorithm 2 of the following paper:
George C. Linderman, Manas Rachh, Jeremy G. Hoskins, Stefan Steinerberger, Yuval Kluger. (2019). Fast interpolation-based t-SNE for improved visualization of single-cell RNA-seq data. Nature Methods.
An excellent reference for randomized SVD is the following paper:
Halko, Nathan, Per-Gunnar Martinsson, and Joel A. Tropp. "Finding structure with randomness: Probabilistic algorithms for constructing approximate matrix decompositions." SIAM review 53.2 (2011): 217-288.
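As a rough illustration of the randomized approach discussed by Halko et al., the sketch below implements only the basic randomized range finder: multiply the matrix by a Gaussian test matrix, then orthonormalize the result so that Q Q^T A approximates A. It is not the package's code and not necessarily the algorithm variant the package implements. The `Matrix` type and every function name here are hypothetical, and plain loops stand in for the BLAS/LAPACK calls an MKL-based implementation would use.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <random>
#include <vector>

// Hypothetical dense row-major matrix type, used only in this sketch.
struct Matrix {
    std::size_t rows, cols;
    std::vector<double> data;
    Matrix(std::size_t r, std::size_t c) : rows(r), cols(c), data(r * c, 0.0) {}
    double& operator()(std::size_t i, std::size_t j) { return data[i * cols + j]; }
    double operator()(std::size_t i, std::size_t j) const { return data[i * cols + j]; }
};

// Plain triple-loop product C = A * B (a BLAS routine would normally do this).
Matrix multiply(const Matrix& a, const Matrix& b) {
    assert(a.cols == b.rows);
    Matrix c(a.rows, b.cols);
    for (std::size_t i = 0; i < a.rows; ++i)
        for (std::size_t k = 0; k < a.cols; ++k)
            for (std::size_t j = 0; j < b.cols; ++j)
                c(i, j) += a(i, k) * b(k, j);
    return c;
}

Matrix transpose(const Matrix& a) {
    Matrix t(a.cols, a.rows);
    for (std::size_t i = 0; i < a.rows; ++i)
        for (std::size_t j = 0; j < a.cols; ++j) t(j, i) = a(i, j);
    return t;
}

double euclidean_norm(const std::vector<double>& v) {
    double s = 0.0;
    for (double x : v) s += x * x;
    return std::sqrt(s);
}

// Orthonormalize the columns of Y with two passes of modified Gram-Schmidt,
// dropping columns that are numerically dependent on earlier ones.
Matrix orthonormalize(const Matrix& y, double tol = 1e-10) {
    std::vector<std::vector<double>> basis;
    for (std::size_t j = 0; j < y.cols; ++j) {
        std::vector<double> v(y.rows);
        for (std::size_t i = 0; i < y.rows; ++i) v[i] = y(i, j);
        const double original = euclidean_norm(v);
        for (int pass = 0; pass < 2; ++pass) {
            for (const auto& q : basis) {
                double dot = 0.0;
                for (std::size_t i = 0; i < v.size(); ++i) dot += q[i] * v[i];
                for (std::size_t i = 0; i < v.size(); ++i) v[i] -= dot * q[i];
            }
        }
        const double remaining = euclidean_norm(v);
        if (original > 0.0 && remaining > tol * original) {
            for (double& x : v) x /= remaining;
            basis.push_back(v);
        }
    }
    Matrix q(y.rows, basis.size());
    for (std::size_t j = 0; j < basis.size(); ++j)
        for (std::size_t i = 0; i < y.rows; ++i) q(i, j) = basis[j][i];
    return q;
}

// Range finder: Q spans (approximately) the range of A, using k + oversample
// Gaussian test vectors drawn from a seeded generator.
Matrix randomized_range(const Matrix& a, std::size_t k, std::size_t oversample,
                        unsigned seed) {
    std::mt19937 gen(seed);
    std::normal_distribution<double> gauss(0.0, 1.0);
    Matrix omega(a.cols, k + oversample);
    for (double& w : omega.data) w = gauss(gen);
    return orthonormalize(multiply(a, omega));
}

// Frobenius norm of A - Q (Q^T A), i.e. how much of A the basis Q misses.
double projection_residual(const Matrix& a, const Matrix& q) {
    const Matrix approx = multiply(q, multiply(transpose(q), a));
    double s = 0.0;
    for (std::size_t i = 0; i < a.data.size(); ++i) {
        const double d = a.data[i] - approx.data[i];
        s += d * d;
    }
    return std::sqrt(s);
}
```

A full rank-k approximation would continue by taking the SVD of the small matrix Q^T A, which this sketch leaves out. For a matrix whose rank is at most k, `projection_residual` should be close to zero; for general matrices, the residual depends on how quickly the singular values decay.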