Report

On my machine, this takes 7s to write and 4s to load for a dictionary with only 20k elements.
How hard would it be to make this (significantly) faster?

Additional context

In scirpy, I use dicts of arrays (each index referring to $n$ cells) to store clonotype clusters. The dictionary is not (necessarily) aligned to one of the axes, so it lives in uns. As we sped up the clonotype clustering steps, saving the object has become a major bottleneck, since this dict can have several hundred thousand keys.
We could possibly change the dictionary to something more efficient, but that would mean breaking our data format. I therefore first wanted to check whether it can be made faster on the anndata side.
CC @felixpetschko
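The reproduction snippet from the original report wasn't captured above. A minimal sketch matching the description (the key names, array sizes, and file name are assumptions, not the original code):

```python
# Minimal sketch of the reported pattern (not the original repro;
# sizes and key names are assumed from the description above).
import time

import numpy as np
import anndata as ad

adata = ad.AnnData(np.zeros((100, 10)))
# A dict with ~20k entries, each value a small numpy array,
# stored in uns because it is not aligned to obs/var.
adata.uns["clonotype_clusters"] = {
    f"clonotype_{i}": np.arange(5) for i in range(20_000)
}

t0 = time.time()
adata.write_h5ad("test.h5ad")
print(f"write: {time.time() - t0:.1f}s")

t0 = time.time()
ad.read_h5ad("test.h5ad")
print(f"read: {time.time() - t0:.1f}s")
```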
Hmmm @grst I would suspect the issue is that we recursively write each key's value as its native data type, which means you end up creating thousands of zarr/hdf5 arrays. I'm not really sure we can do much about that at the moment. But with the coming zarr v3 we might in theory be able to do this in parallel, which would be a big boost. So I think we should wait for that: #1726 will be a first step just to get things working.
I'm not sure the async/parallel zarr stuff works with v2, but I think it does.
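For reference, the "more efficient" representation mentioned in the report could be built on the scirpy side by packing the dict into a few large arrays, so the store writes three datasets instead of one per key. A hypothetical sketch (function names are made up; it assumes a non-empty dict of 1-D arrays):

```python
# Sketch of a possible user-side workaround (not an anndata API):
# flatten {key: 1-D array} into three large arrays before writing.
import numpy as np

def pack_dict_of_arrays(d):
    """Flatten {key: 1-D array} into (keys, offsets, values)."""
    keys = np.array(list(d.keys()))
    lengths = np.array([len(v) for v in d.values()])
    # offsets[i]:offsets[i+1] delimits the i-th value in `values`
    offsets = np.concatenate([[0], np.cumsum(lengths)])
    values = np.concatenate(list(d.values()))
    return keys, offsets, values

def unpack_dict_of_arrays(keys, offsets, values):
    """Inverse of pack_dict_of_arrays."""
    return {
        k: values[offsets[i] : offsets[i + 1]]
        for i, k in enumerate(keys)
    }
```

The three packed arrays could then be stored in uns in place of the dict, at the cost of the format break mentioned in the report.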