Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

row/colVars() and row/colSds() methods for xgCMatrix objects silently ignore the supplied center #13

Open
hpages opened this issue Nov 13, 2020 · 5 comments

Comments

@hpages
Copy link

hpages commented Nov 13, 2020

See for example the definition of the rowVars() method:

library(sparseMatrixStats)

selectMethod("rowVars", "xgCMatrix")
# Method Definition:
# 
# function (x, rows = NULL, cols = NULL, na.rm = FALSE, center = NULL, ...)
# {
#     .local <- function (x, rows = NULL, cols = NULL, na.rm = FALSE)
#     {
#         if (!is.null(rows)) {
#             x <- x[rows, , drop = FALSE]
#         }
#         if (!is.null(cols)) {
#             x <- x[, cols, drop = FALSE]
#         }
#         dgCMatrix_rowVars(x, na_rm = na.rm)
#     }
#     .local(x, rows, cols, na.rm, ...)
# }
# <bytecode: 0x8146588>
# <environment: namespace:sparseMatrixStats>
# 
# Signatures:
#         x
# target  "xgCMatrix"
# defined "xgCMatrix"

Here is a concrete example using a dgCMatrix object:

library(Matrix)
i <- 1:5
j <- 2:6
m0 <- sparseMatrix(i, j, x=10*i+j, dims=c(5, 7))  # dgCMatrix object

center not supplied:

matrixStats::rowVars(as.matrix(m0))
# [1]  20.57143  75.57143 165.14286 289.28571 448.00000
rowVars(m0)
# [1]  20.57143  75.57143 165.14286 289.28571 448.00000

center supplied:

matrixStats::rowVars(as.matrix(m0), center=0)
# [1]  24.00000  88.16667 192.66667 337.50000 522.66667
rowVars(m0, center=0)  # 'center' is ignored!
# [1]  20.57143  75.57143 165.14286 289.28571 448.00000

Thanks,
H.

sessionInfo():

R Under development (unstable) (2020-10-28 r79382)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 16.04.7 LTS

Matrix products: default
BLAS:   /home/hpages/R/R-4.1.r79382/lib/libRblas.so
LAPACK: /home/hpages/R/R-4.1.r79382/lib/libRlapack.so

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] Matrix_1.2-18           sparseMatrixStats_1.3.0 MatrixGenerics_1.3.0   
[4] matrixStats_0.57.0     

loaded via a namespace (and not attached):
[1] compiler_4.1.0  Rcpp_1.0.5      grid_4.1.0      lattice_0.20-41
@const-ae
Copy link
Owner

Hi Hervé,

I understood Henrik in HenrikBengtsson/matrixStats#183 that center must be an honest center estimate of the column means:

This calculation is based on the assumption that center is a proper estimate of the column means.
HenrikBengtsson/matrixStats#183 (comment)

Accordingly, rowVars(m0, center=0) is undefined behavior.

I could of course replicate the behavior of matrixStats, but as the result is non-sense, I am not sure what we would gain by that.

My thinking for not implementing the center parameter was that at the moment center is just a performance optimization when one calculates mean and var with matrixStats and thus wasn't a priority for sparseMatrixStats.
Best, Constantin

@hpages
Copy link
Author

hpages commented Nov 14, 2020

Hi Constantin,

My use case is to efficiently compute:

f <- function(v) {
    v <- v[!is.na(v)]
    sum(v^2) / max(1L, length(v) - 1L)
}
apply(m, 1L, f)

I was hoping I could do it with:

rowVars(m, center=0, na.rm=TRUE)

which works fine when m is an ordinary matrix but not when it's a sparse matrix.

But that's ok, I guess I can use:

rowSums(m^2 / (rowSums(!is.na(m)) - 1L), na.rm=TRUE)

instead.

Doesn't seem that it would be too hard to generalize this to an arbitrary center e.g. by doing something like this in the rowVars() method for xgCMatrix objects:

if (is.null(center)) {
    dgCMatrix_rowVars(x, na_rm=na.rm)
} else {
    rowSums((m-center)^2) / (rowSums(!is.na(m)) - 1L, na.rm=na.rm)
}

Anyways, at the very least the methods in sparseMatrixStats should issue a warning or an error that center is not supported. Silently ignoring the argument is no good.

Thanks!

P.S.: FWIW my use case comes from here: https://github.com/Bioconductor/DelayedArray/blob/aa9a74556f5c36307ceb638eb31199477139fea8/R/DelayedArray-utils.R#L857
Not that tx will ever be a dgCMatrix object here but I was playing around a little with rowSds(m, center=0, na.rm=TRUE) to try to get a sense of how much I can trust this and my conclusion was: not too much. This could easily hit someone who wants to implement a native scale() method for dgCMatrix objects.

@HenrikBengtsson
Copy link

@hpages, I don't think this is a good idea. There are at least two different ways to calculate the variance and it's not safe to make assumptions on which one is used. This is normally not a problem when there's no center argument. Your use case is one example of why I'm getting more and more concerned about exposing the center argument. It might become used, or who knows is already used, in ways that give incorrect results, and then if we switch the calculation internally things might break silently.

@hpages
Copy link
Author

hpages commented Nov 20, 2020

I agree. Better not having it than having it do something that is not well defined.

@LTLA
Copy link

LTLA commented Nov 24, 2020

I'll just mention that I've been obsessively using center= with some precomputed row means under the assumption that it was more efficient. But upon inspecting the code, it's not always obvious that's the case, given that the center= path contains an extra allocation of the same dimensions as x. Some testing bears that out:

library(matrixStats)
y <- matrix(rnorm(1e8), ncol=1000)
system.time(out <- rowVars(y))
##    user  system elapsed 
##   0.371   0.000   0.371 

rm <- rowMeans(y)
system.time(out2 <- rowVars(y, center=rm))
##    user  system elapsed 
##   0.298   0.144   0.441 

(As an aside: in this case, using center= gets better when we increase the number of columns - presumably the cache misses start to add up in the C_rowVars function, beyond the cost of doing an extra allocation? In my own applications, I use Welford's algorithm to get around this via column-major processing - seems to work well enough.)

Anyway, the tl;dr is that there doesn't seem to be an unambiguous performance advantage to using center=, so I guess I'd be happy to see it gone. Certainly it would make the lives of the downstream *MatrixStats developers easier.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants