Feature Request: Support for writing sparse matrices to H5AD files #50
zellkonverter has some code for this; I use a variant of it for some in-house applications.
Hi @mojaveazure, I suppose HDF5Array could provide an H5ADRealizationSink class similar to the existing TENxRealizationSink class. The latter can be used to write blocks of a sparse matrix to the 10x Genomics sparse format.

I just want to emphasize that, like the other data-writing capabilities in HDF5Array, this new functionality would focus on writing the count data to disk, plus possibly the rownames.

I was also going to mention zellkonverter, but I see that @LTLA just did this while I was typing my answer.
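To make the target on-disk format concrete, here is a minimal h5py sketch of the compressed-sparse layout that AnnData uses for `X`: three datasets (`data`, `indices`, `indptr`) plus `encoding-type`, `encoding-version`, and `shape` attributes on the group. The attribute names and values follow my reading of the anndata element specification and should be treated as assumptions to check against your anndata version:

```python
# Sketch of the H5AD-style on-disk layout for a CSR matrix (assumed names).
import h5py
import numpy as np
from scipy.sparse import csr_matrix, random as sparse_random

mat = sparse_random(5, 6, density=0.4, format="csr")

with h5py.File("sparse_layout_demo.h5", "w") as f:
    g = f.create_group("X")
    g.attrs["encoding-type"] = "csr_matrix"     # assumed per anndata spec
    g.attrs["encoding-version"] = "0.1.0"       # assumed per anndata spec
    g.attrs["shape"] = np.array(mat.shape, dtype=np.int64)
    g.create_dataset("data", data=mat.data)
    g.create_dataset("indices", data=mat.indices)
    g.create_dataset("indptr", data=mat.indptr)

# Read the three components back and reassemble the matrix
with h5py.File("sparse_layout_demo.h5", "r") as f:
    g = f["X"]
    back = csr_matrix((g["data"][:], g["indices"][:], g["indptr"][:]),
                      shape=tuple(g.attrs["shape"]))

print((back != mat).nnz)  # 0: round-trip preserved every entry
```

A sink class would wrap exactly this bookkeeping, which is why appending blocks (rather than writing the whole matrix at once) is the interesting part.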
IMO a worthwhile functionality would be more like a ...
Zellkonverter looks promising, but I don't think I can use it. Ideally, I want to write a function like this:

```r
function(mat, sink) {
  grid <- colAutoGrid(mat)
  for (i in seq_len(length(grid))) {
    vp <- grid[[i]]
    x <- read_block(mat, vp, is_sparse(mat))
    # do some processing
    x <- f(x)
    write_block(sink, vp, x)
  }
  return(as(sink, "DelayedArray"))
}
```

And have it work with both H5AD and TileDB files, along with their respective sinks. I can use ...
As for whether an H5AD file with only `X`, `obs`, and `var` is valid, this script suggests it is:

```python
#!/usr/bin/env python3
import h5py
import random
import anndata
from string import ascii_lowercase
from scipy.sparse import csr_matrix
from numpy.random import randint  # random_integers was removed from NumPy

print(anndata.__version__)  # 0.8.0

# Create some random data
x = randint(low=0, high=8, size=30)  # high is exclusive for randint
x.shape = (5, 6)
mat = csr_matrix(x)
cells = [''.join(random.sample(ascii_lowercase, 5)) for _ in range(mat.shape[0])]
genes = [''.join(random.sample(ascii_lowercase, 7)) for _ in range(mat.shape[1])]

# Create an H5AD file
adata = anndata.AnnData(mat)
adata.obs_names = cells
adata.var_names = genes
adata.write_h5ad("adata_test.h5ad")

# Remove all groups except X, obs, and var
hfile = h5py.File("adata_test.h5ad", mode="r+")
items = {x[0] for x in hfile.items()}
items.difference_update({'X', 'obs', 'var'})
for it in items:
    hfile.pop(it)
hfile.close()

# Check the file
adata2 = anndata.read_h5ad("adata_test.h5ad")
print(adata2.X != adata.X)
print(adata2.obs_names != adata.obs_names)
print(adata2.var_names != adata.var_names)

# Check backed matrix
adata3 = anndata.read_h5ad("adata_test.h5ad", backed=True)
print(adata3.X.to_memory() != adata.X)
print(adata3.obs_names != adata.obs_names)
print(adata3.var_names != adata.var_names)
```

In my experience, AnnData is pretty robust to incomplete H5AD files, and is capable of filling in the gaps when needed.
@LTLA Yeah, that's what I'm offering: to focus on block-wise deposition of the count data into the file. Whether that will be done through something called H5SparseRealizationSink or H5ADRealizationSink is an implementation detail at this point. It's actually very possible that I will introduce more than one RealizationSink subclass for this; I haven't really thought about the details yet.

@mojaveazure Good to know that a minimalist H5AD file with only `X`, `obs`, and `var` works.
Awesome, thank you! I'd also be happy to help in any way I can.
Hello,

I would like to be able to write a sparse matrix to an H5AD file in a blocked manner using `write_block()`. I know we can do this with `HDF5RealizationSink`, but that results in a dense matrix, which limits downstream functionality. As H5AD files have a specific format for sparse matrices on disk, a new H5AD-specific sink should be able to handle this, unless I'm missing something?

I'd be happy to help build this out, but I'm not familiar with the internals of `RealizationSink`, so I'm not sure how much I can meaningfully contribute.
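To make the cost of the dense workaround concrete, here is a small sketch that stores the same 1000 x 1000 matrix (1% nonzero) both densely and as CSR components, then compares the resulting file sizes (filenames are arbitrary):

```python
# Illustration of why a dense sink is wasteful for sparse data.
import os
import h5py
from scipy.sparse import random as sparse_random

mat = sparse_random(1000, 1000, density=0.01, format="csr")

with h5py.File("dense.h5", "w") as f:
    f.create_dataset("X", data=mat.toarray())   # one million float64s

with h5py.File("sparse.h5", "w") as f:
    g = f.create_group("X")
    g.create_dataset("data", data=mat.data)     # ~10,000 float64s
    g.create_dataset("indices", data=mat.indices)
    g.create_dataset("indptr", data=mat.indptr)

# The dense file is many times larger than the sparse one
print(os.path.getsize("dense.h5") > 5 * os.path.getsize("sparse.h5"))  # True
```

The same ratio applies to the memory needed to realize the matrix downstream, which is the limitation described above.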