Specify zarr codec when creating a dataset #35

magland · 2024-04-03T20:39:19Z

The goal here is to allow writing datasets with Zarr-supported codecs that are not available with hdf5 (e.g., zstd and custom codecs). Obviously, the h5py API does not have such a mechanism.

I considered adding a zarr_compressor parameter to create_dataset() but I thought it's best not to make the interface of that function differ between LindiH5pyFile and h5py.File. So instead I made a new function called create_dataset_with_zarr_compressor(..., compressor=codec)

codecov-commenter · 2024-04-03T20:40:00Z

Codecov Report

Attention: Patch coverage is 47.61905% with 22 lines in your changes are missing coverage. Please review.

Project coverage is 81.53%. Comparing base (bc71a9d) to head (22fc137).

Files	Patch %	Lines
...indi/LindiH5pyFile/writers/LindiH5pyGroupWriter.py	46.66%	16 Missing ⚠️
...ndi/conversion/create_zarr_dataset_from_h5_data.py	50.00%	6 Missing ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##             main      #35      +/-   ##
==========================================
- Coverage   82.33%   81.53%   -0.80%     
==========================================
  Files          25       25              
  Lines        1715     1755      +40     
==========================================
+ Hits         1412     1431      +19     
- Misses        303      324      +21

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

rly · 2024-04-04T07:35:58Z

This looks reasonable to me, but how this would be used? It seems like this would be used for write outside of PyNWB, which is fine, but I am not totally seeing the use case.

magland · 2024-04-04T15:36:44Z

This looks reasonable to me, but how this would be used? It seems like this would be used for write outside of PyNWB, which is fine, but I am not totally seeing the use case.

@rly I am thinking specifically about three codecs I would like to use. It's true that this would not be possible via pynwb at this point (it would need to be manual creating of datasets) but maybe at some point in the future this could be somehow supported by pynwb.

The three codecs are:

zstd - compared with gzip, it seems to be faster for encoding/decoding and the CR is higher.
mp4/h.264 video compression - working on this here
Lossy compression of raw ecephys - see this

tag @bendichter

oruebel · 2024-04-04T16:35:31Z

It's true that this would not be possible via pynwb at this point

With the ZarrIO backend you can specify dataset codecs via ZarrDataIO https://github.com/hdmf-dev/hdmf-zarr/blob/4c9af4a56336da2c421531dfa4487a4b1ff7977a/src/hdmf_zarr/utils.py#L399 If I remember correctly you are here going through HDF5IO so you would need a way to pass the code through via H5DataIO https://github.com/hdmf-dev/hdmf/blob/d85d0cbc36d2e0fdb25e8fbea14d58ba7bf24a40/src/hdmf/backends/hdf5/h5_utils.py#L431 You could make a derived class for H5DataIO for LINDI.

magland · 2024-04-18T21:58:52Z

note: made a plan with ryan...

magland · 2024-04-19T01:27:35Z

After chatting with @rly , we decided not to introduce a new special function, but to allow the compression parameter (which is supported by h5py) to be either a string or a numcodecs Codec. In the case of a Codec, this is passed through to zarr as the compressor parameter. In the case of a string, the following logic is used:

        elif isinstance(compression, str):
            if compression == 'gzip':
                if compression_opts is None:
                    level = 4  # default for h5py
                elif isinstance(compression_opts, int):
                    level = compression_opts
                else:
                    raise Exception(f'Unexpected type for compression_opts: {type(compression_opts)}')
                _zarr_compressor = numcodecs.GZip(level=level)
            else:
                raise Exception(f'Compression {compression} is not supported')

Then _zarr_compressor is passed through to zarr.

rly · 2024-04-19T15:25:27Z

Looks good. We can add support for other h5py compressions, including from hdf5plugin, later as needed.

add create_dataset_with_zarr_compressor()

d58d935

magland requested a review from rly April 3, 2024 20:39

magland added 2 commits April 4, 2024 11:30

Merge branch 'write' into codec

799dd90

fix create_dataset_with_zarr_compressor

c294428

Base automatically changed from write to main April 4, 2024 18:53

magland added 2 commits April 9, 2024 16:44

Merge branch 'main' into codec

35c6722

Merge branch 'main' into codec

55cce74

compression parameter in create_dataset

22fc137

magland merged commit 3c4562f into main Apr 19, 2024
6 checks passed

magland deleted the codec branch April 19, 2024 15:38

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Specify zarr codec when creating a dataset #35

Specify zarr codec when creating a dataset #35

magland commented Apr 3, 2024

codecov-commenter commented Apr 3, 2024 •

edited

Loading

rly commented Apr 4, 2024

magland commented Apr 4, 2024

oruebel commented Apr 4, 2024

magland commented Apr 18, 2024

magland commented Apr 19, 2024

rly commented Apr 19, 2024

Specify zarr codec when creating a dataset #35

Specify zarr codec when creating a dataset #35

Conversation

magland commented Apr 3, 2024

codecov-commenter commented Apr 3, 2024 • edited Loading

Codecov Report

rly commented Apr 4, 2024

magland commented Apr 4, 2024

oruebel commented Apr 4, 2024

magland commented Apr 18, 2024

magland commented Apr 19, 2024

rly commented Apr 19, 2024

codecov-commenter commented Apr 3, 2024 •

edited

Loading