Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Specify zarr codec when creating a dataset #35

Merged
merged 6 commits into from
Apr 19, 2024
Merged

Specify zarr codec when creating a dataset #35

merged 6 commits into from
Apr 19, 2024

Conversation

magland
Copy link
Collaborator

@magland magland commented Apr 3, 2024

The goal here is to allow writing datasets with Zarr-supported codecs that are not available with hdf5 (e.g., zstd and custom codecs). Obviously, the h5py API does not have such a mechanism.

I considered adding a zarr_compressor parameter to create_dataset() but I thought it's best not to make the interface of that function differ between LindiH5pyFile and h5py.File. So instead I made a new function called create_dataset_with_zarr_compressor(..., compressor=codec)

@magland magland requested a review from rly April 3, 2024 20:39
@codecov-commenter
Copy link

codecov-commenter commented Apr 3, 2024

Codecov Report

Attention: Patch coverage is 47.61905% with 22 lines in your changes are missing coverage. Please review.

Project coverage is 81.53%. Comparing base (bc71a9d) to head (22fc137).

Files Patch % Lines
...indi/LindiH5pyFile/writers/LindiH5pyGroupWriter.py 46.66% 16 Missing ⚠️
...ndi/conversion/create_zarr_dataset_from_h5_data.py 50.00% 6 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main      #35      +/-   ##
==========================================
- Coverage   82.33%   81.53%   -0.80%     
==========================================
  Files          25       25              
  Lines        1715     1755      +40     
==========================================
+ Hits         1412     1431      +19     
- Misses        303      324      +21     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@rly
Copy link
Contributor

rly commented Apr 4, 2024

This looks reasonable to me, but how this would be used? It seems like this would be used for write outside of PyNWB, which is fine, but I am not totally seeing the use case.

@magland
Copy link
Collaborator Author

magland commented Apr 4, 2024

This looks reasonable to me, but how this would be used? It seems like this would be used for write outside of PyNWB, which is fine, but I am not totally seeing the use case.

@rly I am thinking specifically about three codecs I would like to use. It's true that this would not be possible via pynwb at this point (it would need to be manual creating of datasets) but maybe at some point in the future this could be somehow supported by pynwb.

The three codecs are:

  • zstd - compared with gzip, it seems to be faster for encoding/decoding and the CR is higher.
  • mp4/h.264 video compression - working on this here
  • Lossy compression of raw ecephys - see this

tag @bendichter

@oruebel
Copy link

oruebel commented Apr 4, 2024

It's true that this would not be possible via pynwb at this point

With the ZarrIO backend you can specify dataset codecs via ZarrDataIO https://github.com/hdmf-dev/hdmf-zarr/blob/4c9af4a56336da2c421531dfa4487a4b1ff7977a/src/hdmf_zarr/utils.py#L399 If I remember correctly you are here going through HDF5IO so you would need a way to pass the code through via H5DataIO https://github.com/hdmf-dev/hdmf/blob/d85d0cbc36d2e0fdb25e8fbea14d58ba7bf24a40/src/hdmf/backends/hdf5/h5_utils.py#L431 You could make a derived class for H5DataIO for LINDI.

Base automatically changed from write to main April 4, 2024 18:53
@magland
Copy link
Collaborator Author

magland commented Apr 18, 2024

note: made a plan with ryan...

@magland
Copy link
Collaborator Author

magland commented Apr 19, 2024

After chatting with @rly , we decided not to introduce a new special function, but to allow the compression parameter (which is supported by h5py) to be either a string or a numcodecs Codec. In the case of a Codec, this is passed through to zarr as the compressor parameter. In the case of a string, the following logic is used:

        elif isinstance(compression, str):
            if compression == 'gzip':
                if compression_opts is None:
                    level = 4  # default for h5py
                elif isinstance(compression_opts, int):
                    level = compression_opts
                else:
                    raise Exception(f'Unexpected type for compression_opts: {type(compression_opts)}')
                _zarr_compressor = numcodecs.GZip(level=level)
            else:
                raise Exception(f'Compression {compression} is not supported')

Then _zarr_compressor is passed through to zarr.

@rly
Copy link
Contributor

rly commented Apr 19, 2024

Looks good. We can add support for other h5py compressions, including from hdf5plugin, later as needed.

@magland magland merged commit 3c4562f into main Apr 19, 2024
6 checks passed
@magland magland deleted the codec branch April 19, 2024 15:38
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants