Writing dict in uns with many keys is slow #1684

Open
2 of 3 tasks
grst opened this issue Sep 21, 2024 · 0 comments

grst commented Sep 21, 2024

Please make sure these conditions are met

  • I have checked that this issue has not already been reported.
  • I have confirmed this bug exists on the latest version of anndata.
  • (optional) I have confirmed this bug exists on the master branch of anndata.

Report

Code:

import anndata
import numpy as np

adata = anndata.AnnData()
adata.uns["x"] = {str(i): np.array(str(i), dtype="object") for i in range(20000)}

# %%time
adata.write_h5ad("/tmp/anndata.h5ad")

# %%time
anndata.read_h5ad("/tmp/anndata.h5ad")

On my machine, this takes 7s to write and 4s to load for a dictionary with only 20k elements.
How hard would it be to make this (significantly) faster?

Additional context

In scirpy, I use dicts of arrays (one index referring to $n$ cells) to store clonotype clusters. The dictionary is not (necessarily) aligned to one of the axes, so it lives in uns. Now that we have sped up the clonotype clustering steps, saving the object has become a major bottleneck, as this dict can have several hundred thousand keys.

We could possibly change the dictionary to something more efficient, but that would mean breaking our data format. Therefore I first wanted to check if it can be made faster on the anndata side.
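
For illustration, here is a minimal sketch of what "something more efficient" could look like: the dict of per-cluster arrays is flattened into three flat sequences (keys, concatenated values, and offsets), so the number of stored elements no longer grows with the number of keys. This is not the current scirpy format and not an anndata API; `pack_ragged`, `unpack_ragged`, and the example cluster contents are hypothetical names and data chosen for this sketch, and whether this layout actually writes faster is an untested assumption.

```python
from itertools import accumulate


def pack_ragged(mapping):
    """Flatten a dict of sequences into three flat lists: keys, values, offsets."""
    keys = list(mapping)
    # values[offsets[i]:offsets[i + 1]] holds the entries that belong to keys[i]
    offsets = [0, *accumulate(len(mapping[k]) for k in keys)]
    values = [item for k in keys for item in mapping[k]]
    return keys, values, offsets


def unpack_ragged(keys, values, offsets):
    """Rebuild the original dict (with list values) from the packed form."""
    return {
        k: values[offsets[i]:offsets[i + 1]]
        for i, k in enumerate(keys)
    }


# Hypothetical clonotype clusters: cluster id -> cell identifiers
clusters = {"0": ["cell_a", "cell_b"], "1": ["cell_c"], "2": []}
keys, values, offsets = pack_ragged(clusters)

# The round trip restores the original mapping, including empty clusters
assert unpack_ragged(keys, values, offsets) == clusters
```

The three lists could be converted with `np.asarray` and stored under three fixed keys in uns, at the cost of exactly the format break mentioned above.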

CC @felixpetschko

Versions

-----
anndata             0.9.2
numpy               1.24.4
session_info        1.0.0
-----
asciitree           NA
asttokens           NA
awkward             2.6.4
awkward_cpp         NA
backcall            0.2.0
cloudpickle         2.2.1
comm                0.1.4
cython_runtime      NA
dask                2023.8.1
dateutil            2.8.2
debugpy             1.6.8
decorator           5.1.1
entrypoints         0.4
executing           1.2.0
fasteners           0.18
fsspec              2023.6.0
h5py                3.9.0
importlib_metadata  NA
ipykernel           6.25.0
jedi                0.19.0
...
Python 3.10.12 | packaged by conda-forge | (main, Jun 23 2023, 22:40:32) [GCC 12.3.0]
Linux-6.10.5-arch1-1-x86_64-with-glibc2.40
-----
Session information updated at 2024-09-21 14:49