
computeSumFactors converts some sparse matrices to dense #70

Open
scottgigante opened this issue Sep 11, 2020 · 5 comments

scottgigante commented Sep 11, 2020

Some sparse matrix formats cause massive memory allocation. dgCMatrix works perfectly, allocating only ~2 MB in total:

> library(scran)
> library(lineprof)
> 
> nrow = 5000
> ncol = 100000
> nnonzero = nrow*100
> 
> entry_index <- sample(nrow*ncol-1, nnonzero, replace=FALSE)
> X <- Matrix::sparseMatrix(i=(entry_index %/% ncol)+1, j=(entry_index %% ncol)+1, x=1)
> sce <- SingleCellExperiment(list(counts=as(X, "dgCMatrix")))
> 
> prof <- lineprof(computeSumFactors(sce))
dimnames(.) <- NULL:  translated to 
dimnames(.) <- list(NULL,NULL)  <==>  unname(.)
> prof
Reducing depth to 2 (from 34)
   time alloc release dups                                                ref                                        src
1 0.046 2.245       0  639                c(".calculate_sum_factors", "ncol") .calculate_sum_factors/ncol               
2 0.009 0.224       0    4 c(".calculate_sum_factors", ".limit_cluster_size") .calculate_sum_factors/.limit_cluster_size
3 0.006 0.273       0    1                           ".calculate_sum_factors" .calculate_sum_factors                    
4 0.021 0.347       0    6 c(".calculate_sum_factors", ".limit_cluster_size") .calculate_sum_factors/.limit_cluster_size
5 0.013 0.103       0    3               c(".calculate_sum_factors", "split") .calculate_sum_factors/split              
6 0.040 0.986       0  152     c(".calculate_sum_factors", ".guess_min_mean") .calculate_sum_factors/.guess_min_mean    
7 0.041 0.616       0  530            c(".calculate_sum_factors", "bplapply") .calculate_sum_factors/bplapply

while dgTMatrix allocates a whopping 65 MB for the same operation:

> library(scran)
> library(lineprof)
> 
> nrow = 5000
> ncol = 100000
> nnonzero = nrow*100
> 
> entry_index <- sample(nrow*ncol-1, nnonzero, replace=FALSE)
> X <- Matrix::sparseMatrix(i=(entry_index %/% ncol)+1, j=(entry_index %% ncol)+1, x=1)
> sce <- SingleCellExperiment(list(counts=as(X, "dgTMatrix")))
> 
> prof <- lineprof(computeSumFactors(sce))
dimnames(.) <- NULL:  translated to 
dimnames(.) <- list(NULL,NULL)  <==>  unname(.)

> prof
Reducing depth to 2 (from 49)
    time  alloc release dups                                                ref                                        src
1  0.031  2.516   0.000  659                c(".calculate_sum_factors", "ncol") .calculate_sum_factors/ncol               
2  0.001  0.049   0.000   14                           ".calculate_sum_factors" .calculate_sum_factors                    
3  0.006  0.176   0.000    1 c(".calculate_sum_factors", ".limit_cluster_size") .calculate_sum_factors/.limit_cluster_size
4  0.001  0.048   0.000    0                           ".calculate_sum_factors" .calculate_sum_factors                    
5  0.001  0.049   0.000    1 c(".calculate_sum_factors", ".limit_cluster_size") .calculate_sum_factors/.limit_cluster_size
6  0.003  0.224   0.000    0                           ".calculate_sum_factors" .calculate_sum_factors                    
7  0.019  0.347   0.000    6 c(".calculate_sum_factors", ".limit_cluster_size") .calculate_sum_factors/.limit_cluster_size
8  0.011  0.124   0.000    3               c(".calculate_sum_factors", "split") .calculate_sum_factors/split              
9  0.094  1.874   0.000  183     c(".calculate_sum_factors", ".guess_min_mean") .calculate_sum_factors/.guess_min_mean    
10 4.964 65.128  19.824 1874            c(".calculate_sum_factors", "bplapply") .calculate_sum_factors/bplapply 

The same issue occurs with dgRMatrix, which allocates ~40 MB:

library(scran)
library(lineprof)

nrow = 100000
ncol = 5000
nnonzero = ncol*100

entry_index <- sample(nrow*ncol-1, nnonzero, replace=FALSE)
X <- Matrix::sparseMatrix(i=(entry_index %/% ncol)+1, j=(entry_index %% ncol)+1, x=1)
X <- new("dgRMatrix", j=X@i, p=X@p, x=rep(1, nnonzero), Dim=as.integer(c(ncol, nrow)))
sce <- SingleCellExperiment(list(counts=X))

prof <- lineprof(computeSumFactors(sce))
prof
    time  alloc release dups                                                ref                                        src
1  0.011  2.304       0  659                c(".calculate_sum_factors", "ncol") .calculate_sum_factors/ncol               
2  0.001  0.049       0   14                           ".calculate_sum_factors" .calculate_sum_factors                    
3  0.011  0.224       0    1 c(".calculate_sum_factors", ".limit_cluster_size") .calculate_sum_factors/.limit_cluster_size
4  0.001  0.000       0    0                           ".calculate_sum_factors" .calculate_sum_factors                    
5  0.001  0.049       0    1 c(".calculate_sum_factors", ".limit_cluster_size") .calculate_sum_factors/.limit_cluster_size
6  0.006  0.224       0    0                           ".calculate_sum_factors" .calculate_sum_factors                    
7  0.028  0.347       0    6 c(".calculate_sum_factors", ".limit_cluster_size") .calculate_sum_factors/.limit_cluster_size
8  0.022  0.123       0    3               c(".calculate_sum_factors", "split") .calculate_sum_factors/split              
9  0.156  1.462       0  288     c(".calculate_sum_factors", ".guess_min_mean") .calculate_sum_factors/.guess_min_mean    
10 0.001  0.067       0   10       c(".calculate_sum_factors", ".subset2index") .calculate_sum_factors/.subset2index      
11 0.698 38.735       0 1080            c(".calculate_sum_factors", "bplapply") .calculate_sum_factors/bplapply           
12 0.001  0.000       0   16                                       character(0)                                           

Session info:

> sessionInfo()
R version 4.0.2 (2020-06-22)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Arch Linux

Matrix products: default
BLAS/LAPACK: /usr/lib/libopenblas_haswellp-r0.3.10.so

Random number generation:
 RNG:     Mersenne-Twister 
 Normal:  Inversion 
 Sample:  Rounding 
 
locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8     LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                  LC_ADDRESS=C               LC_TELEPHONE=C             LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] parallel  stats4    stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] lineprof_0.1.9001           scran_1.16.0                SingleCellExperiment_1.10.1 SummarizedExperiment_1.18.2 DelayedArray_0.14.1         matrixStats_0.56.0         
 [7] Biobase_2.48.0              GenomicRanges_1.40.0        GenomeInfoDb_1.24.2         IRanges_2.22.2              S4Vectors_0.26.1            BiocGenerics_0.34.0        

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.5                rsvd_1.0.3                locfit_1.5-9.4            lattice_0.20-41           R6_2.4.1                  ggplot2_3.3.2            
 [7] pillar_1.4.6              zlibbioc_1.34.0           rlang_0.4.7               rstudioapi_0.11           irlba_2.3.3               Matrix_1.2-18            
[13] BiocNeighbors_1.6.0       BiocParallel_1.22.0       statmod_1.4.34            stringr_1.4.0             igraph_1.2.5              RCurl_1.98-1.2           
[19] munsell_0.5.0             beachmat_2.4.0            compiler_4.0.2            vipor_0.4.5               BiocSingular_1.4.0        pkgconfig_2.0.3          
[25] ggbeeswarm_0.6.0          tidyselect_1.1.0          tibble_3.0.3              gridExtra_2.3             GenomeInfoDbData_1.2.3    edgeR_3.30.3             
[31] viridisLite_0.3.0         crayon_1.3.4              dplyr_1.0.2               bitops_1.0-6              grid_4.0.2                gtable_0.3.0             
[37] lifecycle_0.2.0           magrittr_1.5              scales_1.1.1              dqrng_0.2.1               stringi_1.5.3             XVector_0.28.0           
[43] viridis_0.5.1             limma_3.44.3              scater_1.16.2             DelayedMatrixStats_1.10.1 ellipsis_0.3.1            generics_0.0.2           
[49] vctrs_0.3.4               tools_4.0.2               glue_1.4.2                beeswarm_0.2.3            purrr_0.3.4               colorspace_1.4-1
@LTLA
Collaborator

LTLA commented Sep 11, 2020

It's probably hitting beachmat code in C++, which doesn't know anything about dgTMatrix or dgRMatrix. All such "unknown" matrices are instead handled via block processing, which realizes blocks of the matrix as ordinary dense arrays with an upper memory usage defined by DelayedArray::getAutoBlockSize(); this defaults to 100 MB.
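For reference, the block-size cap mentioned above can be inspected and adjusted at the R level with DelayedArray's accessors; a quick sketch:

```r
library(DelayedArray)

# Cap (in bytes) on how much of an "unknown" matrix is realized as an
# ordinary dense array per block during block processing.
getAutoBlockSize()     # defaults to 1e8, i.e. 100 MB

# Lowering the cap trades peak memory for more, smaller blocks:
setAutoBlockSize(1e7)  # 10 MB per block
```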

There are no plans to provide native support for dgTMatrix or dgRMatrix. The former can't be accessed efficiently, and R-level operations convert it to dgCMatrix anyway; as for the latter, I have never seen it used in real analyses.

@scottgigante
Author

You may never have seen it, but it's the reason my jobs are failing, hence this issue report :) It seems like a fairly trivial fix to include a

if (is(X, "sparseMatrix")) X <- as(X, "CsparseMatrix")

rather than silently coercing the matrix to dense and hoping the user notices and fixes it themselves.
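To make the suggestion concrete, a hypothetical guard might look like the sketch below. Note this is not scran's actual code, and `normalize_input` is an invented name for illustration only:

```r
library(Matrix)
library(methods)

# Hypothetical pre-processing guard; "normalize_input" is an invented
# name, not a scran function. It coerces any non-CSC sparse matrix
# (dgTMatrix, dgRMatrix, ...) to the compressed-column form that
# beachmat handles natively, leaving other inputs untouched.
normalize_input <- function(X) {
  if (is(X, "sparseMatrix") && !is(X, "CsparseMatrix")) {
    X <- as(X, "CsparseMatrix")
  }
  X
}
```

Coercing a dgTMatrix through `as(., "CsparseMatrix")` yields a dgCMatrix, so downstream code would always see the fast path.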

@LTLA
Collaborator

LTLA commented Sep 11, 2020

It's not as if the entire matrix is being coerced to dense form: usage is bounded by the 100 MB limit defined by getAutoBlockSize(), which your jobs should be able to handle. The same code handles all unknown matrices (e.g., RleMatrix, ResidualMatrix, and various other DelayedMatrix representations); I don't see why dgTMatrix and dgRMatrix should get special treatment here.

@scottgigante
Author

In this toy example, the dgTMatrix also takes ~100x longer than the dgCMatrix. Seems like a good reason to me.
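The timing gap is easy to check on identical data; a sketch, assuming `X`, scran, and SingleCellExperiment from the first snippet above:

```r
# Compare wall-clock time for the two storage formats on the same data;
# assumes X was built as in the reproduction at the top of this issue.
sce_c <- SingleCellExperiment(list(counts = as(X, "dgCMatrix")))
sce_t <- SingleCellExperiment(list(counts = as(X, "dgTMatrix")))

system.time(computeSumFactors(sce_c))  # native sparse path
system.time(computeSumFactors(sce_t))  # block-processing fallback
```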

@LTLA
Collaborator

LTLA commented Sep 11, 2020

One could say that about many matrix formats.

For example, I could store a sparse matrix in an HDF5Matrix, and under this reasoning the function would be expected to convert it to a dgCMatrix for further processing. This is not a decision the function should be allowed to make; for example, it would not be aware of the memory constraints that motivated the use of the HDF5Matrix in the first place.

In the specific case of the dgTMatrix, an automated conversion is unwise if there are deliberate reasons for storing it as a dgTMatrix instead of a dgCMatrix. The most obvious is that the latter fails due to integer overflow in its p vector when the matrix has more than ~2e9 non-zero elements; the former stays operational at the cost of speed and memory.
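To illustrate the overflow constraint: the `p` slot of a dgCMatrix is a 32-bit integer vector of cumulative per-column non-zero counts, so its last entry (the total non-zero count) is capped by R's integer maximum:

```r
# The p slot of a dgCMatrix stores cumulative non-zero counts per column
# as 32-bit integers, so the total non-zero count cannot exceed:
.Machine$integer.max  # 2147483647, i.e. ~2.1e9

# A dgTMatrix stores explicit (i, j, x) triplets with no cumulative
# pointer vector, so it is not subject to this particular limit.
```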

Given that you're talking about toy examples, the easiest solution is just to convert your matrix to the desired format yourself. scran won't do any conversion; that is not its decision to make. For real analyses, all of the ingestion pipelines I use produce a dgCMatrix directly, so I don't think the lack of automated conversion has much practical consequence.
