Chunked writing of h5py.Dataset and zarr.Array #1624

Open: ivirshup wants to merge 9 commits into main

@ivirshup (Member) commented Aug 28, 2024

This PR fixes #1623 by writing backed dense arrays in chunks.

I'm very open to feedback on how the chunking pattern for writes is selected. Maybe we should prioritize the chunking of the destination array over the chunking of the source array?
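To make the idea concrete, here is a minimal sketch of a chunk-by-chunk copy. It is not the code in this PR: `iter_chunk_slices` and `copy_in_chunks` are hypothetical names, and the sketch only assumes that `src` and `dst` expose `.shape` and accept tuples of slices as indices.

```python
from itertools import product


def iter_chunk_slices(shape, chunks):
    """Yield one tuple of slices per chunk-aligned block of an array of `shape`."""
    # Start offsets of the blocks along each axis
    ranges = [range(0, dim, step) for dim, step in zip(shape, chunks)]
    for starts in product(*ranges):
        # Clip the last block on each axis to the array bounds
        yield tuple(
            slice(start, min(start + step, dim))
            for start, step, dim in zip(starts, chunks, shape)
        )


def copy_in_chunks(src, dst, chunks):
    """Copy `src` into `dst` block by block, reading one block at a time."""
    for block in iter_chunk_slices(src.shape, chunks):
        dst[block] = src[block]
```

With h5py or zarr, `src` and `dst` would be the arrays themselves, and `chunks` could come from either `src.chunks` or `dst.chunks` (for h5py this is `None` when the dataset is contiguous, so a fallback would be needed). Which of the two to prefer is the open question above.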

cc: @ebezzi

Some proof it works:

%load_ext memory_profiler

import h5py
import numpy as np
from anndata.experimental import write_elem

f = h5py.File("tmp.h5", "w")
X = np.ones((10_000, 10_000))

# Write the in-memory array
%memit write_elem(f, "X", X)
# peak memory: 945.67 MiB, increment: 0.56 MiB

# Write from the backed h5py.Dataset
%memit write_elem(f, "X2", f["X"])
# peak memory: 1047.00 MiB, increment: 101.12 MiB

# Same source, with gzip compression on the destination
%memit write_elem(f, "X3", f["X"], dataset_kwargs={"compression": "gzip"})
# peak memory: 1068.03 MiB, increment: 6.14 MiB
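
For scale, a rough calculation of what holding the whole dataset in memory would cost, assuming the float64 dtype that `np.ones` uses by default:

```python
n_rows, n_cols = 10_000, 10_000
bytes_per_value = 8  # float64, assumed from the np.ones default dtype

# Size of the full array in MiB (1 MiB = 2**20 bytes)
full_array_mib = n_rows * n_cols * bytes_per_value / 2**20
print(f"{full_array_mib:.2f} MiB")  # 762.94 MiB
```

The 101.12 MiB and 6.14 MiB increments above are both well below that figure.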

codecov bot commented Aug 29, 2024

Codecov Report

Attention: Patch coverage is 95.23810% with 1 line in your changes missing coverage. Please review.

Project coverage is 78.31%. Comparing base (af6480e) to head (690b682).

Files with missing lines            Patch %   Lines
src/anndata/_io/specs/methods.py    95.23%    1 Missing ⚠️

❗ There is a different number of reports uploaded between BASE (af6480e) and HEAD (690b682).

HEAD has 1 upload less than BASE
Flag    BASE (af6480e)    HEAD (690b682)
        3                 2
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1624      +/-   ##
==========================================
- Coverage   87.01%   78.31%   -8.70%     
==========================================
  Files          40       40              
  Lines        6059     6074      +15     
==========================================
- Hits         5272     4757     -515     
- Misses        787     1317     +530     
Files with missing lines            Coverage Δ
src/anndata/_io/specs/methods.py    80.86% <95.23%> (-7.68%) ⬇️

... and 15 files with indirect coverage changes

@ilan-gold ilan-gold added this to the 0.10.10 milestone Aug 29, 2024
@ilan-gold ilan-gold modified the milestones: 0.10.10, 0.11.1, 0.11.2 Nov 7, 2024
Successfully merging this pull request may close these issues.

Writing a h5py.Dataset loads the whole thing into memory