Writing a h5py.Dataset loads the whole thing into memory #1623

Open
3 tasks done
ivirshup opened this issue Aug 28, 2024 · 2 comments · May be fixed by #1624
ivirshup commented Aug 28, 2024

Please make sure these conditions are met

  • I have checked that this issue has not already been reported.
  • I have confirmed this bug exists on the latest version of anndata.
  • (optional) I have confirmed this bug exists on the master branch of anndata.

Report

Code:

%load_ext memory_profiler

import h5py
from anndata.experimental import write_elem
import numpy as np

f = h5py.File("tmp.h5", "w")
X = np.ones((10_000, 10_000))

%memit write_elem(f, "X", X)
# peak memory: 940.14 MiB, increment: 0.00 MiB

%memit write_elem(f, "X2", f["X"])
# peak memory: 1702.89 MiB, increment: 762.75 MiB
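A plausible minimal reproduction of the underlying behavior (assuming the writer materializes the source with `[...]`; that assumption is not confirmed against anndata internals here): indexing an `h5py.Dataset` with `[...]` reads the entire dataset into a single in-memory ndarray. For a 10_000 × 10_000 float64 array that is ~763 MiB, which matches the measured increment above.

```python
import h5py
import numpy as np

with h5py.File("tmp_demo.h5", "w") as f:
    f.create_dataset("X", data=np.ones((1_000, 1_000)))
    src = f["X"]
    # src[...] pulls the whole dataset into memory at once; a writer
    # doing this on the 10_000 x 10_000 array above allocates ~763 MiB.
    arr = src[...]
    assert isinstance(arr, np.ndarray)
    assert arr.nbytes == 1_000 * 1_000 * 8  # 8 MB for this smaller example
```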

The second write roughly doubles peak memory: the source h5py.Dataset is read fully into memory before being written. We could move to a chunked approach to writing pretty easily, based on the solution suggested here:

# src_ds must be a chunked dataset for iter_chunks() to work
src_ds = f["X"]
dst_ds = f.create_dataset_like("dst", src_ds)

# copy one chunk at a time instead of materializing the whole array
for chunk in src_ds.iter_chunks():
    dst_ds[chunk] = src_ds[chunk]

Versions

-----
IPython             8.26.0
anndata             0.11.0.dev168+g8cc5a18
h5py                3.11.0
numpy               1.26.4
session_info        1.0.0
-----
asciitree           NA
asttokens           NA
bottleneck          1.4.0
cloudpickle         3.0.0
cython_runtime      NA
dask                2024.8.1
dateutil            2.9.0.post0
decorator           5.1.1
executing           2.0.1
importlib_metadata  NA
jedi                0.19.1
jinja2              3.1.4
markupsafe          2.1.5
memory_profiler     0.61.0
msgpack             1.0.8
natsort             8.4.0
numcodecs           0.13.0
numexpr             2.10.1
packaging           24.1
pandas              2.2.1
parso               0.8.4
prompt_toolkit      3.0.47
psutil              5.9.8
pure_eval           0.2.2
pyarrow             15.0.2
pygments            2.18.0
pytz                2024.1
scipy               1.12.0
setuptools          70.3.0
six                 1.16.0
stack_data          0.6.3
tblib               3.0.0
tlz                 0.12.1
toolz               0.12.1
traitlets           5.14.3
typing_extensions   NA
wcwidth             0.2.13
yaml                6.0.1
zarr                2.18.2
zipp                NA
-----
Python 3.11.9 | packaged by conda-forge | (main, Apr 19 2024, 18:36:13) [GCC 12.3.0]
Linux-6.8.0-1010-aws-x86_64-with-glibc2.39
-----
Session information updated at 2024-08-28 22:36
@ivirshup commented:
Some complications:

  • iter_chunks errors if the hdf5 array is memory mapped and not chunked
  • What if the output is chunked but the input isn't?
  • What if neither are chunked? It would still be valuable to cut down memory usage.
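One way to cover all three cases is a copy helper that uses `iter_chunks()` when the source is chunked and falls back to fixed-size row slabs otherwise. A sketch (the helper name, slab size, and fallback strategy are assumptions, not the linked PR's implementation):

```python
import h5py
import numpy as np

def copy_in_chunks(src, dst, rows_per_block=1024):
    """Copy src into dst without materializing src in memory.

    Hypothetical helper: uses iter_chunks() when src is chunked,
    otherwise iterates fixed-size row slabs, so contiguous
    (non-chunked) sources are handled too.
    """
    if src.chunks is not None:
        for chunk in src.iter_chunks():
            dst[chunk] = src[chunk]
    else:
        for start in range(0, src.shape[0], rows_per_block):
            stop = min(start + rows_per_block, src.shape[0])
            dst[start:stop] = src[start:stop]

with h5py.File("tmp_copy.h5", "w") as f:
    # contiguous layout, so iter_chunks() alone would raise
    src = f.create_dataset("X", data=np.ones((5_000, 100)))
    dst = f.create_dataset_like("X2", src)
    copy_in_chunks(src, dst)
    assert bool((f["X2"][:] == 1).all())
```

Peak memory is then bounded by one chunk or slab rather than the full array, regardless of the source layout.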

@ivirshup ivirshup linked a pull request Aug 28, 2024 that will close this issue
@ilan-gold ilan-gold assigned ilan-gold and ivirshup and unassigned ilan-gold Aug 29, 2024
github-actions bot commented:

This issue has been automatically marked as stale because it has not had recent activity.
Please add a comment if you want to keep the issue open. Thank you for your contributions!

@github-actions github-actions bot added the stale label Oct 29, 2024