Commit
Merge branch 'dev' into write-references-array
oruebel authored Feb 2, 2024
2 parents 9447447 + f5f09c8 commit b33301c
Showing 11 changed files with 480 additions and 45 deletions.
31 changes: 20 additions & 11 deletions CHANGELOG.md
@@ -1,5 +1,14 @@
# HDMF-ZARR Changelog

## 0.6.0 (Upcoming)

### Enhancements
* Enhanced `ZarrIO` and `ZarrDataIO` to infer io settings (e.g., chunking and compression) from HDF5 datasets to preserve storage settings on export where possible (see the sketch below). @oruebel [#153](https://github.com/hdmf-dev/hdmf-zarr/pull/153)
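
  A minimal sketch of the export workflow this enhancement targets, assuming hypothetical file names; exporting from a non-Zarr backend requires `write_args={'link_data': False}`:

  ```python
  # Hedged sketch: export an HDF5-backed HDMF file to Zarr. On export, ZarrIO
  # infers io settings (chunking, compression) from the source h5py datasets.
  from hdmf.common import get_manager
  from hdmf.backends.hdf5 import HDF5IO
  from hdmf_zarr import ZarrIO

  with HDF5IO("data.h5", manager=get_manager(), mode="r") as src_io:
      with ZarrIO("data.zarr", mode="w") as export_io:
          export_io.export(src_io=src_io, write_args={"link_data": False})
  ```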

### Bug Fixes
* Fixed bug when converting HDF5 datasets with unlimited dimensions @oruebel [#155](https://github.com/hdmf-dev/hdmf-zarr/pull/155)
* Adjusted gallery tests to not fail on deprecation warnings from pandas. @rly [#157](https://github.com/hdmf-dev/hdmf-zarr/pull/157)

## 0.5.0 (December 8, 2023)

### Enhancements
@@ -32,11 +41,11 @@

### New Features
* Added support, tests, and docs for using ``DirectoryStore``, ``TempStore``, and
``NestedDirectoryStore`` Zarr storage backends with ``ZarrIO`` and ``NWBZarrIO``.
@oruebel [#62](https://github.com/hdmf-dev/hdmf-zarr/pull/62)

### Minor enhancements
* Updated handling of references on read to simplify future integration of file-based Zarr
stores (e.g., ZipStore or database stores). @oruebel [#62](https://github.com/hdmf-dev/hdmf-zarr/pull/62)
* Added ``can_read`` classmethod to ``ZarrIO``. @bendichter [#97](https://github.com/hdmf-dev/hdmf-zarr/pull/97)

@@ -45,7 +54,7 @@
@oruebel [#62](https://github.com/hdmf-dev/hdmf-zarr/pull/62)
* Fixed CI testing of minimum and optional installation requirements. @rly
[#99](https://github.com/hdmf-dev/hdmf-zarr/pull/99)
* Updated tests to handle upcoming changes to ``HDMFIO``. @rly
[#102](https://github.com/hdmf-dev/hdmf-zarr/pull/102)


@@ -66,25 +75,25 @@
links/reference when moving Zarr files @oruebel [#46](https://github.com/hdmf-dev/hdmf-zarr/pull/46)
* Fixed bugs in requirements defined in setup.py @oruebel [#46](https://github.com/hdmf-dev/hdmf-zarr/pull/46)
* Fixed bug regarding Sphinx external links @mavaylon1 [#53](https://github.com/hdmf-dev/hdmf-zarr/pull/53)
* Updated gallery tests to use test_gallery.py and necessary package dependencies
@mavaylon1 [#53](https://github.com/hdmf-dev/hdmf-zarr/pull/53)
* Updated dataset used in conversion tutorial, which caused warnings
@oruebel [#56](https://github.com/hdmf-dev/hdmf-zarr/pull/56)

### Docs
* Added tutorial illustrating how to create a new NWB file with NWBZarrIO
@oruebel [#46](https://github.com/hdmf-dev/hdmf-zarr/pull/46)
* Added docs for describing the mapping of HDMF schema to Zarr storage
@oruebel [#48](https://github.com/hdmf-dev/hdmf-zarr/pull/48)
* Added ``docs/gallery/resources`` for storing local files used by the tutorial galleries
@oruebel [#61](https://github.com/hdmf-dev/hdmf-zarr/pull/61)
* Removed dependency on ``dandi`` library for data download in the conversion tutorial by storing the NWB files as
local resources @oruebel [#61](https://github.com/hdmf-dev/hdmf-zarr/pull/61)

## 0.1.0 (August 23, 2022)

### New features

* Created new optional Zarr-based I/O backend for writing files using Zarr's `zarr.store.DirectoryStore` backend,
including support for iterative write, chunking, compression, simple and compound data types, links, object
references, namespace and spec I/O.
23 changes: 14 additions & 9 deletions requirements-dev.txt
@@ -1,10 +1,15 @@
# pinned dependencies to reproduce an entire development environment to use HDMF, run HDMF tests, check code style,
# compute coverage, and create test environments
coverage==6.4.2
flake8==5.0.4
flake8-debugger==4.1.2
flake8-print==5.0.0
pytest==7.1.2
pytest-cov==3.0.0
# pinned dependencies to reproduce an entire development environment to use HDMF-Zarr,
# run HDMF-Zarr tests, check code style,
# compute coverage, and create test environments. Note that depending on the version of Python installed,
# different versions of requirements may be installed due to package incompatibilities.
#
black==23.10.1
codespell==2.2.6
coverage==7.3.2
hdf5plugin==4.3.0 # hdf5plugin is used to test conversion of plugin filters
pre-commit==3.5.0
pytest==7.4.3
pytest-cov==4.1.0
python-dateutil==2.8.2
tox==3.25.1
ruff==0.1.3
tox==4.11.3
2 changes: 1 addition & 1 deletion requirements-doc.txt
@@ -1,4 +1,4 @@
# dependencies to generate the documentation for HDMF
# dependencies to generate the documentation for HDMF-Zarr
matplotlib
sphinx>=4 # improved support for docutils>=0.17
sphinx_rtd_theme>=1 # <1 does not work with docutils>=0.17
6 changes: 3 additions & 3 deletions requirements-opt.txt
@@ -1,3 +1,3 @@
tqdm==4.65.0
fsspec==2023.10.0
s3fs==2023.10.0
tqdm==4.66.1
fsspec==2023.12.2
s3fs==2023.12.2
8 changes: 4 additions & 4 deletions requirements.txt
@@ -1,7 +1,7 @@
# pinned dependencies to reproduce an entire development environment to use HDMF-ZARR
hdmf==3.9.0
zarr==2.11.0
hdmf==3.12.0
zarr==2.16.1
pynwb==2.5.0
numpy==1.24.0
numcodecs==0.11.0
numpy==1.26.3
numcodecs==0.12.1
threadpoolctl==3.2.0
26 changes: 19 additions & 7 deletions src/hdmf_zarr/backend.py
@@ -344,8 +344,9 @@ def export(self, **kwargs):
)

if not isinstance(src_io, ZarrIO) and write_args.get('link_data', True):
raise UnsupportedOperation("Cannot export from non-Zarr backend %s to Zarr with write argument "
"link_data=True." % src_io.__class__.__name__)
raise UnsupportedOperation(f"Cannot export from non-Zarr backend {src_io.__class__.__name__} "
                           "to Zarr with write argument link_data=True. "
                           "Set write_args={'link_data': False}")

write_args['export_source'] = src_io.source # pass export_source=src_io.source to write_builder
ckwargs = kwargs.copy()
@@ -938,6 +939,11 @@ def write_dataset(self, **kwargs): # noqa: C901
name = builder.name
data = builder.data if force_data is None else force_data
options = dict()
# Check if data is an h5py.Dataset to infer I/O settings if necessary
if ZarrDataIO.is_h5py_dataset(data):
# Wrap the h5py.Dataset in ZarrDataIO with chunking and compression settings inferred from the input data
data = ZarrDataIO.from_h5py_dataset(h5dataset=data)
# Separate data values and io_settings for write
if isinstance(data, ZarrDataIO):
options['io_settings'] = data.io_settings
link_data = data.link_data
@@ -1202,9 +1208,8 @@ def __list_fill__(self, parent, name, data, options=None): # noqa: C901
io_settings = dict()
if options is not None:
dtype = options.get('dtype')
io_settings = options.get('io_settings')
if io_settings is None:
io_settings = dict()
if options.get('io_settings') is not None:
io_settings = options.get('io_settings')
# Determine the dtype
if not isinstance(dtype, type):
try:
@@ -1219,9 +1224,16 @@ def __list_fill__(self, parent, name, data, options=None): # noqa: C901
# Determine the shape and update the dtype if necessary when dtype==object
if 'shape' in io_settings: # Use the shape set by the user
data_shape = io_settings.pop('shape')
# If we have a numeric numpy array then use its shape
# If we have a numeric numpy-like array (e.g., numpy.array or h5py.Dataset) then use its shape
elif isinstance(dtype, np.dtype) and np.issubdtype(dtype, np.number) or dtype == np.bool_:
data_shape = get_data_shape(data)
# HDMF's get_data_shape may return the maxshape of an HDF5 dataset, which can include None values
# that Zarr does not allow for dataset shapes. Check for the shape attribute first before falling
# back on get_data_shape.
if hasattr(data, 'shape') and data.shape is not None:
data_shape = data.shape
# This is a fallback just in case. However, this should not happen for standard numpy and h5py arrays
else: # pragma: no cover
data_shape = get_data_shape(data) # pragma: no cover
# Deal with object dtype
elif isinstance(dtype, np.dtype):
data = data[:] # load the data in case we come from HDF5 or another on-disk data source we don't know
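
A hedged sketch (hypothetical file name) of the corner case handled above: for an HDF5 dataset with an unlimited dimension, `get_data_shape` can return the dataset's `maxshape`, which contains `None` and is therefore not a valid Zarr shape:

```python
import h5py
import numpy as np

with h5py.File("demo.h5", "w") as f:
    dset = f.create_dataset("x", data=np.zeros((10, 5)), maxshape=(None, 5))
    print(dset.shape)     # (10, 5)   -> usable as a Zarr dataset shape
    print(dset.maxshape)  # (None, 5) -> not allowed as a Zarr dataset shape
```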
78 changes: 76 additions & 2 deletions src/hdmf_zarr/utils.py
@@ -461,13 +461,87 @@ def __init__(self, **kwargs):
self.__iosettings['filters'] = filters

@property
def link_data(self):
def link_data(self) -> bool:
"""Bool indicating should it be linked to or copied. NOTE: Only applies to zarr.Array type data"""
return self.__link_data

@property
def io_settings(self):
def io_settings(self) -> dict:
"""Dict with the io settings to use"""
return self.__iosettings

@staticmethod
def from_h5py_dataset(h5dataset, **kwargs):
"""
Factory method to create a ZarrDataIO instance from an h5py.Dataset.
The ZarrDataIO object wraps the h5py.Dataset, and the io filter settings
are inferred from the filters used in h5py so that the options in Zarr match
(if possible) the options used in HDF5.
:param h5dataset: h5py.Dataset object that should be wrapped
:type h5dataset: h5py.Dataset
:param kwargs: Other keyword arguments to pass to ZarrDataIO.__init__
:returns: ZarrDataIO object wrapping the dataset
"""
filters = ZarrDataIO.hdf5_to_zarr_filters(h5dataset)
fillval = h5dataset.fillvalue if 'fillvalue' not in kwargs else kwargs.pop('fillvalue')
if isinstance(fillval, bytes): # bytes are not JSON serializable so use string instead
fillval = fillval.decode("utf-8")
chunks = h5dataset.chunks if 'chunks' not in kwargs else kwargs.pop('chunks')
data_io = ZarrDataIO(
data=h5dataset,
filters=filters,
fillvalue=fillval,
chunks=chunks,
**kwargs)
return data_io
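
A hedged usage sketch for this factory method (file and dataset names hypothetical):

```python
import h5py
from hdmf_zarr.utils import ZarrDataIO

with h5py.File("source.h5", "r") as f:
    dset = f["data"]
    # Chunks, fill value, and compatible filters are inferred from the HDF5 dataset
    wrapped = ZarrDataIO.from_h5py_dataset(h5dataset=dset)
    print(wrapped.io_settings)  # inferred Zarr io settings (e.g., 'chunks', 'filters')
```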

@staticmethod
def hdf5_to_zarr_filters(h5dataset) -> list:
"""From the given h5py.Dataset infer the corresponding filters to use in Zarr"""
# Based on https://github.com/fsspec/kerchunk/blob/617d9ce06b9d02375ec0e5584541fcfa9e99014a/kerchunk/hdf.py#L181
filters = []
# Check for unsupported filters
if h5dataset.scaleoffset:
# TODO: translate to numcodecs.fixedscaleoffset.FixedScaleOffset()
warn(f"{h5dataset.name} HDF5 scaleoffset filter ignored in Zarr")
if h5dataset.compression in ("szip", "lzf"):
warn(f"{h5dataset.name} HDF5 szip or lzf compression ignored in Zarr")
# Add the shuffle filter if possible
if h5dataset.shuffle and h5dataset.dtype.kind != "O":
# cannot use shuffle if we materialised objects
filters.append(numcodecs.Shuffle(elementsize=h5dataset.dtype.itemsize))
# iterate through all the filters and add them to the list
for filter_id, properties in h5dataset._filters.items():
filter_id_str = str(filter_id)
if filter_id_str == "32001":
blosc_compressors = ("blosclz", "lz4", "lz4hc", "snappy", "zlib", "zstd")
(_1, _2, bytes_per_num, total_bytes, clevel, shuffle, compressor) = properties
pars = dict(
blocksize=total_bytes,
clevel=clevel,
shuffle=shuffle,
cname=blosc_compressors[compressor])
filters.append(numcodecs.Blosc(**pars))
elif filter_id_str == "32015":
filters.append(numcodecs.Zstd(level=properties[0]))
elif filter_id_str == "gzip":
filters.append(numcodecs.Zlib(level=properties))
elif filter_id_str == "32004":
warn(f"{h5dataset.name} HDF5 lz4 compression ignored in Zarr")
elif filter_id_str == "32008":
warn(f"{h5dataset.name} HDF5 bitshuffle compression ignored in Zarr")
elif filter_id_str == "shuffle": # already handled above
pass
else:
warn(f"{h5dataset.name} HDF5 filter id {filter_id} with properties {properties} ignored in Zarr.")
return filters
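
A hedged demonstration of this filter mapping (hypothetical file name): an HDF5 dataset written with gzip compression and the shuffle filter should map to `numcodecs.Shuffle` and `numcodecs.Zlib`:

```python
import h5py
import numpy as np
from hdmf_zarr.utils import ZarrDataIO

with h5py.File("filters_demo.h5", "w") as f:
    dset = f.create_dataset(
        "x", data=np.arange(100.0), chunks=(10,),
        compression="gzip", compression_opts=4, shuffle=True,
    )
    filters = ZarrDataIO.hdf5_to_zarr_filters(dset)
    print(filters)  # expected: [Shuffle(elementsize=8), Zlib(level=4)]
```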

@staticmethod
def is_h5py_dataset(obj):
"""Check if the object is an instance of h5py.Dataset without requiring import of h5py"""
return (obj.__class__.__module__, obj.__class__.__name__) == ('h5py._hl.dataset', 'Dataset')
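
A small hedged illustration: the check compares the class's module and name, so it safely returns `False` for non-h5py objects even when h5py is not installed:

```python
import numpy as np
from hdmf_zarr.utils import ZarrDataIO

print(ZarrDataIO.is_h5py_dataset(np.arange(3)))  # False: a numpy array
print(ZarrDataIO.is_h5py_dataset([1, 2, 3]))     # False: a plain list
# For an actual h5py.Dataset, (module, name) == ('h5py._hl.dataset', 'Dataset') -> True
```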

class ZarrReference(dict):
"""
20 changes: 14 additions & 6 deletions test_gallery.py
@@ -49,26 +49,30 @@ def _import_from_file(script):
"length of electrodes. Your data may be transposed."
)

_deprication_warning_map = (
_deprecation_warning_map = (
'Classes in map.py should be imported from hdmf.build. Importing from hdmf.build.map will be removed '
'in HDMF 3.0.'
)

_deprication_warning_fmt_docval_args = (
_deprecation_warning_fmt_docval_args = (
"fmt_docval_args will be deprecated in a future version of HDMF. Instead of using fmt_docval_args, "
"call the function directly with the kwargs. Please note that fmt_docval_args "
"removes all arguments not accepted by the function's docval, so if you are passing kwargs that "
"includes extra arguments and the function's docval does not allow extra arguments (allow_extra=True "
"is set), then you will need to pop the extra arguments out of kwargs before calling the function."
)

_deprication_warning_call_docval_func = (
_deprecation_warning_call_docval_func = (
"call the function directly with the kwargs. Please note that call_docval_func "
"removes all arguments not accepted by the function's docval, so if you are passing kwargs that "
"includes extra arguments and the function's docval does not allow extra arguments (allow_extra=True "
"is set), then you will need to pop the extra arguments out of kwargs before calling the function."
)

_deprecation_warning_pandas_pyarrow_re = (
r"\nPyarrow will become a required dependency of pandas.*"
)
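
A hedged mini-example of how such a regex filter behaves; `warnings.filterwarnings` treats `message` as a regular expression matched against the start of the warning text:

```python
import warnings

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("error")  # any unmatched warning would raise
    warnings.filterwarnings(
        "ignore",
        message=r"\nPyarrow will become a required dependency of pandas.*",
        category=DeprecationWarning,
    )
    # Hypothetical warning text mimicking the pandas message
    warnings.warn(
        "\nPyarrow will become a required dependency of pandas 3.0",
        DeprecationWarning,
    )
assert caught == []  # the filter suppressed the warning
```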


def run_gallery_tests():
global TOTAL, FAILURES, ERRORS
@@ -96,13 +100,13 @@ def run_gallery_tests():
try:
with warnings.catch_warnings(record=True):
warnings.filterwarnings(
"ignore", message=_deprication_warning_map, category=DeprecationWarning
"ignore", message=_deprecation_warning_map, category=DeprecationWarning
)
warnings.filterwarnings(
"ignore", message=_deprication_warning_fmt_docval_args, category=PendingDeprecationWarning
"ignore", message=_deprecation_warning_fmt_docval_args, category=PendingDeprecationWarning
)
warnings.filterwarnings(
"ignore", message=_deprication_warning_call_docval_func, category=PendingDeprecationWarning
"ignore", message=_deprecation_warning_call_docval_func, category=PendingDeprecationWarning
)
warnings.filterwarnings(
"ignore", message=_experimental_warning_re, category=UserWarning
@@ -127,6 +131,10 @@
# this warning is triggered when downstream code such as pynwb uses pkg_resources>=5.13
"ignore", message=_pkg_resources_declare_warning_re, category=DeprecationWarning
)
warnings.filterwarnings(
# this warning is triggered from pandas
"ignore", message=_deprecation_warning_pandas_pyarrow_re, category=DeprecationWarning
)
_import_from_file(script_abs)
except Exception:
print(traceback.format_exc())
3 changes: 2 additions & 1 deletion tests/unit/base_tests_zarrio.py
@@ -1584,7 +1584,8 @@ def close(self):

with OtherIO(manager=get_foo_buildmanager()) as read_io:
with ZarrIO(self.store[1], mode='w') as export_io:
msg = "Cannot export from non-Zarr backend OtherIO to Zarr with write argument link_data=True."
msg = ("Cannot export from non-Zarr backend OtherIO to Zarr with write argument link_data=True. "
"Set write_args={'link_data': False}")
with self.assertRaisesWith(UnsupportedOperation, msg):
export_io.export(src_io=read_io, container=foofile)
