
[Feature]: Write zarr without using pickle #171

Open

magland opened this issue Feb 29, 2024 · 5 comments
Labels: category: enhancement (improvements of code or code behavior), priority: low (alternative solution already working and/or relevant to only specific user(s))

Comments


magland commented Feb 29, 2024

What would you like to see added to HDMF-ZARR?

I was working on adding NWB-Zarr support to Neurosift and ran into a problem: many of the datasets within the Zarr store were pickle-encoded. Pickled data is difficult to read in languages other than Python, and practically impossible to read with JavaScript in the browser. I would therefore like to request that hdmf-zarr be updated to avoid pickling when writing datasets.

Is your feature request related to a problem?

No response

What solution would you like?

@bendichter and I created a script that does the HDF5-to-Zarr conversion without using pickle:
https://gist.github.com/magland/39c5c5c5afabb982b5dd5d2367f2179b

We confirmed that (a) the output is loadable in hdmf-zarr and (b) it works with Neurosift.

So far we have only tried it on one NWB file.

Perhaps this script could be helpful in making the modifications.

I perused the hdmf-zarr source code, but it wasn't clear to me where the changes needed to take place.

Tagging: @rly

Do you have any interest in helping implement the feature?

Yes.


oruebel (Contributor) commented Feb 29, 2024

Thanks for the suggestion. I think this means updating the calls to require_dataset in backend.py to set the object_codec. I believe the object_codec is currently not being specified, so it should use whatever the Zarr default is. The question is which encoder is the best one to use, i.e., which is most portable and most performant in terms of en/decoding speed and size. Do you have any thoughts on the encoder to use?
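
For context, here is a minimal sketch (assuming Zarr v2 semantics; this is not hdmf-zarr code) of how an object_codec is passed to require_dataset:

import zarr
import numcodecs

root = zarr.group()  # in-memory group, for illustration only

# dtype=object requires an object_codec; numcodecs.JSON() is one
# portable alternative to numcodecs.pickles.Pickle().
dset = root.require_dataset("strings", shape=(3,), dtype=object,
                            object_codec=numcodecs.JSON())
dset[:] = ["a", "b", "c"]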

oruebel (Contributor) commented Feb 29, 2024

@magland is this something you are interested in helping with by contributing a PR or would you like us to take a look at this?

magland (Author) commented Feb 29, 2024

@oruebel I am happy to create a PR, but I will need some help knowing what to change.

Now that you've pointed me to require_dataset, I am seeing lines like this:

dset = parent.require_dataset(name,
                              shape=shape,
                              dtype=object,
                              object_codec=self.__codec_cls(),
                              **options['io_settings'])

and

self.__codec_cls = numcodecs.pickles.Pickle if object_codec_class is None else object_codec_class

That should give me something to go on.

I'll work on putting something together for you to look at.

oruebel (Contributor) commented Feb 29, 2024

I'll work on putting something together for you to look at.

Great! Thanks for the help!

A first test could be to just set the object_codec_class parameter when creating the ZarrIO object via ZarrIO.__init__. If that produces files that work for you, then this may be as simple as changing the default for the object codec here:

self.__codec_cls = numcodecs.pickles.Pickle if object_codec_class is None else object_codec_class
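
A hedged sketch of that first test (the path "out.zarr" and the container object are placeholders; object_codec_class is the parameter visible in the quoted default):

from hdmf_zarr import ZarrIO
import numcodecs

# Pass the codec *class*, not an instance -- the quoted line shows
# ZarrIO instantiating it later via self.__codec_cls().
with ZarrIO("out.zarr", mode="w", object_codec_class=numcodecs.JSON) as io:
    io.write(container)  # container: an HDMF container built elsewhere (placeholder)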

I think there may be a couple of places where the object_codec is not being set. I'm not sure right now whether we need to add the object codec as an additional argument to those calls, or whether those calls don't create datasets that require object encoding. However, I think the object codec is ignored by Zarr for all other data (i.e., if we don't store objects), so it may be safe to just add it to those calls. From a quick look, these are the two calls that don't specify the object codec (see the quick check after them):

dset = parent.require_dataset(name, shape=data_shape, dtype=dtype, **io_settings)

dset = parent.require_dataset(name, shape=(1, ), dtype=dtype, **io_settings)
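
A quick check of the "safe to just add it" assumption (hedged; based on Zarr v2, where an unneeded object_codec is ignored with a warning rather than an error):

import zarr
import numcodecs

root = zarr.group()
# int32 does not need an object codec; Zarr v2 warns
# 'an object_codec is only needed for object arrays' and ignores it.
dset = root.require_dataset("ints", shape=(4,), dtype="int32",
                            object_codec=numcodecs.JSON())
dset[:] = [1, 2, 3, 4]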

In addition to datasets, another place where encoding could come up is when writing attributes. I don't think this should be an issue because (a) Zarr stores attributes as JSON, and (b) I don't think NWB stores objects in attributes, but I wanted to mention it just in case. If attribute encoding does come up as an issue, attributes are written here:

def write_attributes(self, **kwargs):
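
To illustrate point (a), a small check (in-memory store, Zarr v2) that attributes land in JSON with no pickling involved:

import zarr

g = zarr.group()
g.attrs["neurodata_type"] = "NWBFile"  # written to .zattrs as plain JSON
print(dict(g.attrs))                   # {'neurodata_type': 'NWBFile'}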

magland (Author) commented Mar 5, 2024

Thanks @oruebel

I tried out changing the default codec class in ZarrIO to numcodecs.JSON as you suggested, and indeed that worked! At least for the dataset I tested.

I'll make a PR and we'll see if it passes the tests.
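
For reference, a quick numcodecs.JSON round-trip showing why the output is readable outside Python (the trailing dtype/shape metadata in the encoded buffer is a numcodecs detail):

import numcodecs
import numpy as np

codec = numcodecs.JSON()
encoded = codec.encode(np.array(["a", "b", "c"], dtype=object))
print(encoded)                # plain JSON bytes, parseable from JavaScript
print(codec.decode(encoded))  # round-trips back to the object array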

@mavaylon1 mavaylon1 self-assigned this Apr 16, 2024
@mavaylon1 mavaylon1 added the category: enhancement and priority: low labels Apr 16, 2024
@mavaylon1 mavaylon1 added this to the 0.8.0 milestone Apr 16, 2024
@rly rly modified the milestones: Next Major Release: 1.0.0, 1.1.0 Nov 27, 2024