Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Feature Request: Support for writing sparse matrices to H5AD files #50

Open
mojaveazure opened this issue Aug 23, 2022 · 7 comments
Open

Comments

@mojaveazure
Copy link

Hello,

I would like to be able to write a sparse matrix to an H5AD file in a blocked manner using write_block(); I know we can do this with HDF5RealizationSink but that results in a dense matrix, which limits downstream functionality. As H5AD files have a specific format for sparse matrices on disk, a new H5AD-specific sink should be able to handle this, unless I'm missing something?

I'd be happy to help build this out, but I'm not familiar with the internals of RealizationSink so I'm not sure how much I can meaningfully contribute

@LTLA
Copy link

LTLA commented Aug 23, 2022

zellkonverter has some code for this:

https://github.com/theislab/zellkonverter/blob/f9303c4695221060e551ddffe2e40e58aeb66a0e/R/write.R#L158-L242

I use a variant of this for some in-house applications.

@hpages
Copy link
Contributor

hpages commented Aug 23, 2022

Hi @mojaveazure,

I suppose HDF5Array could provide an H5ADRealizationSink class similar to the existing TENxRealizationSink class. The latter can be used to write blocks of a sparse matrix to the 10x Genomics sparse format. It is used internally by writeTENxMatrix().

I just want to emphasize the fact that, like for the other data writing capabilities in HDF5Array, this new functionality would focus on writing the count data to disk, plus possibly the rownames (/var/_index in the h5ad file) and colnames (/obs/_index in the h5ad file) if present. This is because HDF5Array is meant to be a low-level package so all the other things that are typically found in an h5ad file would need to be taken care of by some higher-level functionalities implemented somewhere else.

I'm not too familiar with the h5ad format but is an h5ad file with only the /X group plus /obs/_index and /var/_index datasets considered valid? This is what the user would end up with when writing to a new file.

I was also going to mention zellkonverter but I see that @LTLA just did this while I was typing my answer. I only knew about zellkonverter::readH5AD() and zellkonverter::writeH5AD() though, to read/write a SingleCellExperiment object from/to an h5ad file. But it seems that they also have low-level utilities for writing data by block so maybe that's it 😄

H.

@LTLA
Copy link

LTLA commented Aug 23, 2022

IMO a worthwhile functionality would be more like a H5SparseRealizationSink that could be re-used in higher-level packages like zellkonverter. Then HDF5Array just needs to concern itself with the block-wise deposition of CSC/R content into a HDF5 file, while higher-level packages like zellkonverter can worry about the formatting miscellany.

@mojaveazure
Copy link
Author

Zellkonverter looks promising, but I don't think I can use it. Ideally, I want to write a function like this:

function(mat, sink) {
  grid <- colAutoGrid(mat)
  for (i in seq_len(length(grid))) {
    vp <- grid[[i]]
    x <- read_block(mat, vp, is_sparse(mat))
    # do some processing
    x <- f(x)
    write_block(sink, vp, x)
  }
  return(as(sink, "DelayedArray"))
}

And have it work with both H5AD and TileDB files, along with their respective sinks. I can use HDF5RealizationSink for H5AD files, but that writes a dense matrix on-disk, which ends up being cumbersome to use in downstream steps. I have no need for DelayedArray/HDF5Array to handle anything other than /X or /layers from an H5AD file and I agree that extra information should be in a higher-level package like zellkonverter

@mojaveazure
Copy link
Author

As for whether an H5AD file with only /X, /obs/_index, and /var/_index is valid, it would appear so

script hidden for brevity
#!/usr/bin/env python3

import h5py
import random
import anndata

from string import ascii_lowercase
from scipy.sparse import csr_matrix
from numpy.random import random_integers

print(anndata.__version__) # 0.8.0

#   Create some random data
x = random_integers(low=0, high=7, size=30)
x.shape = (5, 6)
mat = csr_matrix(x)
cells = [''.join(random.sample(ascii_lowercase, 5)) for _ in range(mat.shape[0])]
genes = [''.join(random.sample(ascii_lowercase, 7)) for _ in range(mat.shape[1])]

#   Create an H5AD file
adata = anndata.AnnData(mat)
adata.obs_names = cells
adata.var_names = genes
adata.write_h5ad("adata_test.h5ad")

#   Remove all groups except X, obs, and var
hfile = h5py.File("adata_test.h5ad", mode="r+")
items = {x[0] for x in hfile.items()}
items.difference_update({'X', 'obs', 'var'})

for it in items:
    hfile.pop(it)


hfile.close()

#   Check the file
adata2 = anndata.read_h5ad("adata_test.h5ad")
print(adata2.X != adata.X)
print(adata2.obs_names != adata.obs_names)
print(adata2.var_names != adata.var_names)


# Check backed matrix
adata3 = anndata.read_h5ad("adata_test.h5ad", backed=True)
print(adata3.X.to_memory() != adata.X)
print(adata3.obs_names != adata.obs_names)
print(adata3.var_names != adata.var_names)

In my experience, AnnData is pretty robust to incomplete H5AD files, and is capable of filling in the gaps when needed

@hpages
Copy link
Contributor

hpages commented Aug 24, 2022

@LTLA Yeah that's what I'm offering: to focus on block-wise deposition of the count data into the file. Whether that will be done thru something called H5SparseRealizationSink or H5ADRealizationSink is implementation detail at this point. It's actually very possible that I will introduce more than one RealizationSink subclass for this. Didn't really think about the details yet.

@mojaveazure Good to know about a minimalist H5AD file with only /X, /obs/_index, and /var/_index. Thanks for providing the Python code. So I think you have valid feature request. Now I'll need to be able to find some time to work on this. Can't promise anything at this point.

@mojaveazure
Copy link
Author

Awesome, thank you! I'd also be happy to help in any way I can

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants