Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Interpreting NATIVE_UINT8 as integer in sparse matrices #56

Open
LTLA opened this issue Jul 7, 2023 · 3 comments
Open

Interpreting NATIVE_UINT8 as integer in sparse matrices #56

LTLA opened this issue Jul 7, 2023 · 3 comments

Comments

@LTLA
Copy link

LTLA commented Jul 7, 2023

Sometimes I store sparse counts in the HDF5 file as unsigned 8-bit integers to save some space. This is fine but the subsequent H5SparseMatrix instance is not able to participate in arithmetic operations:

library(rhdf5)
y <- abs(round(Matrix::rsparsematrix(100, 100, 0.01) * 10))

tmp <- tempfile(fileext=".h5")
h5createFile(tmp)
h5createGroup(tmp, "matrix")
h5createDataset(tmp, "matrix/data", length(y@x), H5type="H5T_NATIVE_UINT8")
h5write(y@i, tmp, "matrix/indices", length(y@i))
h5write(y@p, tmp, "matrix/indptr", length(y@p))

fhandle <- H5Fopen(tmp)
ghandle <- H5Gopen(fhandle, "matrix")
h5writeDataset(y@x, ghandle, "data")
H5Gclose(ghandle)
H5Fclose(fhandle)

library(HDF5Array)
seed <- H5SparseMatrixSeed(tmp, "matrix", dim=c(100, 100), sparse.layout="csc")
mat <- DelayedArray(seed)
type(mat)
## [1] "raw"

mat + 1
## Error in h(simpleError(msg, call)) :
##   error in evaluating the argument 'x' in selecting a method for function 'type': non-numeric argument to binary operator

This might be easily solved with a type= option, just like in the HDF5ArraySeed constructor for the dense case.

Or even better, a dedicated as.integer= option that treats all HDF5 integer types as R integers. This would allow me to just set as.integer=TRUE and everything should work; otherwise, even with a type= option, I need to first create the HDF5ArraySeed with default arguments, check if it's type(mat) == "raw", and then create it again with type="integer". (Presumably, I can't just set type="integer" all the time, otherwise bad things will happen for floating-point matrices.)

Session information
R version 4.3.0 Patched (2023-05-04 r84398)
Platform: aarch64-apple-darwin22.3.0 (64-bit)
Running under: macOS Ventura 13.2.1

Matrix products: default
BLAS:   /Users/luna/Software/R/R-4-3-branch/lib/libRblas.dylib
LAPACK: /Users/luna/Software/R/R-4-3-branch/lib/libRlapack.dylib;  LAPACK version 3.11.0

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

time zone: America/Los_Angeles
tzcode source: internal

attached base packages:
[1] stats4    stats     graphics  grDevices utils     datasets  methods
[8] base

other attached packages:
 [1] HDF5Array_1.29.3      DelayedArray_0.27.9   SparseArray_1.1.10
 [4] S4Arrays_1.1.4        IRanges_2.35.2        S4Vectors_0.39.1
 [7] MatrixGenerics_1.13.0 matrixStats_1.0.0     BiocGenerics_0.47.0
[10] Matrix_1.5-4.1        rhdf5_2.45.0

loaded via a namespace (and not attached):
[1] zlibbioc_1.47.0     lattice_0.21-8      rhdf5filters_1.13.3
[4] XVector_0.41.1      Rhdf5lib_1.23.0     grid_4.3.0
[7] compiler_4.3.0      tools_4.3.0         crayon_1.5.2
@hpages
Copy link
Contributor

hpages commented Oct 6, 2023

Hi @LTLA,

Sorry for the slow response on this. Note that you can change the type of a DelayedArray object anytime with:

if (type(mat) == "raw")
    type(mat) <- "integer"

In particular, you don't need to re-create the HDF5ArraySeed or H5SparseMatrixSeed object if its type turns out to be raw.

type<- is a delayed op so a big advantage of this approach is that it's completely generic i.e. it works on any DelayedArray object independently of the kind of seed (e.g. HDF5ArraySeed, H5SparseMatrixSeed, TileDBArraySeed, SparseArray, etc...), and regardless of whether it carries delayed operations or not, or whether it's sparse or not. It always works!

Would that be good enough or do you feel strongly about adding the as.integer argument to the H5SparseMatrixSeed() constructor? There might be a very small performance benefit in doing this, at least in theory, because the as.integer argument would then be passed down to h5mread(), which should be sligthly more efficient at turning those H5T_NATIVE_UINT8 values into integers. However I doubt it would be noticeable in practice. Also this change would require to add the as_integer slot to the H5SparseMatrixSeed objects so that they can "remember" the value of the as.integer argument.

Finally note that there would be no reason to not make similar changes to the HDF5ArraySeed() and HDF5Array() constructors, and to the HDF5ArraySeed class, for consistency.

H.

@LTLA
Copy link
Author

LTLA commented Oct 9, 2023

Would that be good enough or do you feel strongly about adding the as.integer argument to the H5SparseMatrixSeed() constructor?

Yes, that would solve my immediate problem (of getting an integer sparse matrix).

However, this might cause some performance issues down the line... but not for the reasons you've stated. In my various frameworks (e.g., alabaster.xxx, chihaya), I can perform some optimizations if I detect that the DelayedArray is pristine. For example, I could consider creating a symlink to the underlying HDF5 file when saving a HDF5Array via alabaster.base::stageObject(), rather than performing a round of block-processing to load and save data in a new file.

If I use the type<- method, the DelayedArray is no longer pristine as it gets a storage.mode<- delayed op in a DelayedUnaryIsoOpStack layer. This precludes the optimization mentioned above. Of course, I can get around this by more deeply inspecting the DelayedArray's operations to ignore certain operations. This is a little tedious... which is acceptable, and normally I'd just go ahead and handle it on my end. But then I noticed that HDF5ArraySeed already had a type= option, so it seems like it wouldn't hurt to do the same for H5SparseMatrixSeed for consistency.

(Also having thought about it some more, type= would be fine. I don't really need an as_integer option as I already know the R type I want to use to interpret the HDF5 dataset.)

Finally note that there would be no reason to not make similar changes to the HDF5ArraySeed() and HDF5Array() constructors, and to the HDF5ArraySeed class, for consistency.

Yes, that's what I was thinking, given that both of these already have a type= option in their constructors, whereas the sparse matrices do not.

@LTLA
Copy link
Author

LTLA commented Nov 28, 2023

FWIW I recently encountered a related problem, where my data was stored as a H5T_NATIVE_UINT32 with values that exceeded .Machine$integer.max. I would have liked to instruct H5SparseMatrix to load them as doubles, but currently the code just silently caps the values at .Machine$integer.max.

library(HDF5Array)

m0 <- matrix(0, nrow=25, ncol=12,
    dimnames=list(letters[1:25], LETTERS[1:12]))
m0[cbind(2:24, c(12:1, 2:12))] <- 100 + sample(55, 23, replace=TRUE)
out_file <- tempfile()
M0 <- writeTENxMatrix(m0, out_file, group="m0")

# Let's make the data more exciting.
fhandle <- H5Fopen(out_file, "H5F_ACC_RDWR")
ghandle <- H5Gopen(fhandle, "m0")
H5Ldelete(ghandle, "data")

shandle <- H5Screate_simple(23)
dhandle <- H5Dcreate(ghandle, "data", dtype_id="H5T_NATIVE_UINT32", h5space=shandle)
H5Dwrite(dhandle, .Machine$integer.max + as.double(1:23))

H5Dclose(dhandle)
H5Sclose(shandle)
H5Gclose(ghandle)
H5Fclose(fhandle)

H5SparseMatrix(out_file, "m0")
## <25 x 12> sparse H5SparseMatrix object of type "integer":
##        [,1]  [,2]  [,3]  [,4] ...       [,9]      [,10]      [,11]      [,12]
##  [1,]     0     0     0     0   .          0          0          0          0
##  [2,]     0     0     0     0   .          0          0          0 2147483647
##  [3,]     0     0     0     0   .          0          0 2147483647          0
##  [4,]     0     0     0     0   .          0 2147483647          0          0
##  [5,]     0     0     0     0   . 2147483647          0          0          0
##   ...     .     .     .     .   .          .          .          .          .
## [21,]     0     0     0     0   . 2147483647          0          0          0
## [22,]     0     0     0     0   .          0 2147483647          0          0
## [23,]     0     0     0     0   .          0          0 2147483647          0
## [24,]     0     0     0     0   .          0          0          0 2147483647
## [25,]     0     0     0     0   .          0          0          0          0

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants