Chunked writing of h5py.Dataset and zarr.Array #1624

ivirshup · 2024-08-28T23:44:39Z

- Closes #1623 (Writing a h5py.Dataset loads the whole thing into memory)
- Tests added
  - This might already be good on tests, but I should check
- Release note added (or unnecessary)
- Add benchmarks

This PR fixes #1623 by writing backed dense arrays in chunks.

Very open to feedback on the logic for selecting the chunking pattern of writes. Maybe we should prioritize the chunking of the destination array over that of the source array?
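To make the idea concrete, here is a minimal sketch of block-wise copying. It is not the code in this PR: `iter_chunk_slices` and `copy_in_chunks` are hypothetical names, and the chunk shape is assumed to be passed in by the caller. For an `h5py.Dataset` it could come from the source's `chunks` attribute, which is `None` for contiguous datasets, so a fallback would be needed there.

```python
from itertools import product


def iter_chunk_slices(shape, chunks):
    """Yield tuples of slices that tile an array of `shape` in blocks of `chunks`."""
    if len(shape) != len(chunks):
        raise ValueError("shape and chunks must have the same number of dimensions")
    # Start offsets along each axis, e.g. range(0, 10, 4) gives 0, 4, 8
    starts_per_axis = [range(0, dim, step) for dim, step in zip(shape, chunks)]
    for starts in product(*starts_per_axis):
        # Clip the last block on each axis so it does not run past the edge
        yield tuple(
            slice(start, min(start + step, dim))
            for start, step, dim in zip(starts, chunks, shape)
        )


def copy_in_chunks(src, dest, chunks):
    """Copy `src` into `dest` one block at a time (hypothetical helper).

    Works for any array-likes that expose `.shape` and accept a tuple of
    slices for reading and writing, so only one block is held in memory.
    """
    for sl in iter_chunk_slices(src.shape, chunks):
        dest[sl] = src[sl]
```

Whether `chunks` should follow the source or the destination layout is exactly the open question above; the sketch leaves that choice to the caller.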

cc: @ebezzi

Some proof it works:

%load_ext memory_profiler

import h5py
from anndata.experimental import write_elem
import numpy as np

f = h5py.File("tmp.h5", "w")
X = np.ones((10_000, 10_000))

%memit write_elem(f, "X", X)
# peak memory: 945.67 MiB, increment: 0.56 MiB

%memit write_elem(f, "X2", f["X"])
# peak memory: 1047.00 MiB, increment: 101.12 MiB

%memit write_elem(f, "X3", f["X"], dataset_kwargs={"compression":"gzip"})
# peak memory: 1068.03 MiB, increment: 6.14 MiB


codecov · 2024-08-29T00:10:41Z

Codecov Report

Attention: Patch coverage is 95.23810% with 1 line in your changes missing coverage. Please review.

Project coverage is 78.31%. Comparing base (af6480e) to head (690b682).

Files with missing lines	Patch %	Lines
src/anndata/_io/specs/methods.py	95.23%	1 Missing ⚠️

❗ There is a different number of reports uploaded between BASE (af6480e) and HEAD (690b682). HEAD has 1 upload less than BASE (3 uploads on BASE, 2 on HEAD).

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #1624      +/-   ##
==========================================
- Coverage   87.01%   78.31%   -8.70%     
==========================================
  Files          40       40              
  Lines        6059     6074      +15     
==========================================
- Hits         5272     4757     -515     
- Misses        787     1317     +530

Files with missing lines	Coverage Δ
src/anndata/_io/specs/methods.py	`80.86% <95.23%> (-7.68%)`	⬇️

... and 15 files with indirect coverage changes


ivirshup and others added 3 commits August 28, 2024 16:40

Chunked writing of h5py.Dataset and zarr.Array

d60c3ab

[pre-commit.ci] auto fixes from pre-commit.com hooks

232bee4

for more information, see https://pre-commit.ci

Make n-dimensional

c43c5e2

Add some tests, which fail :(

749880b

ilan-gold reviewed Aug 29, 2024

src/anndata/_io/specs/methods.py (outdated, resolved)

src/anndata/_io/specs/methods.py (resolved)

src/anndata/_io/specs/methods.py (outdated, resolved)

ilan-gold added the skip-gpu-ci label Aug 29, 2024

ilan-gold added this to the 0.10.10 milestone Aug 29, 2024

Fix up chunking algorithm + add some types

32e008d

ilan-gold modified the milestones: 0.10.10, 0.11.1, 0.11.2 Nov 7, 2024

ilan-gold added 4 commits November 15, 2024 15:52

(chore): remove unneeded check?

b2192a2

(fix): dispatch to chunked writing for dense arrays

5938d86

(chore): remove unnecessary methods

99d4400

Merge branch 'main' into more-efficient-dense-writing

690b682
