Commit
Merge branch 'dev' into write-references-array
oruebel authored Feb 2, 2024
2 parents 9447447 + f5f09c8 commit b33301c
Showing 11 changed files with 480 additions and 45 deletions.
31 changes: 20 additions & 11 deletions CHANGELOG.md
@@ -1,5 +1,14 @@
# HDMF-ZARR Changelog

## 0.6.0 (Upcoming)

### Enhancements
* Enhanced `ZarrIO` and `ZarrDataIO` to infer io settings (e.g., chunking and compression) from HDF5 datasets to preserve storage settings on export where possible (see the sketch below). @oruebel [#153](https://github.com/hdmf-dev/hdmf-zarr/pull/153)
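
  A minimal sketch of the export workflow this enhancement targets, assuming hypothetical file names; exporting from a non-Zarr backend requires `write_args={'link_data': False}`:

  ```python
  # Hedged sketch: export an HDF5-backed HDMF file to Zarr. On export, ZarrIO
  # infers io settings (chunking, compression) from the source h5py datasets.
  from hdmf.common import get_manager
  from hdmf.backends.hdf5 import HDF5IO
  from hdmf_zarr import ZarrIO

  with HDF5IO("data.h5", manager=get_manager(), mode="r") as src_io:
      with ZarrIO("data.zarr", mode="w") as export_io:
          export_io.export(src_io=src_io, write_args={"link_data": False})
  ```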

### Bug Fixes
* Fixed bug when converting HDF5 datasets with unlimited dimensions @oruebel [#155](https://github.com/hdmf-dev/hdmf-zarr/pull/155)
* Adjusted gallery tests to not fail on deprecation warnings from pandas. @rly [#157](https://github.com/hdmf-dev/hdmf-zarr/pull/157)

## 0.5.0 (December 8, 2023)

### Enhancements
@@ -32,11 +41,11 @@

### New Features
* Added support, tests, and docs for using ``DirectoryStore``, ``TempStore``, and
``NestedDirectoryStore`` Zarr storage backends with ``ZarrIO`` and ``NWBZarrIO``.
@oruebel [#62](https://github.com/hdmf-dev/hdmf-zarr/pull/62)

### Minor enhancements
* Updated handling of references on read to simplify future integration of file-based Zarr
stores (e.g., ZipStore or database stores). @oruebel [#62](https://github.com/hdmf-dev/hdmf-zarr/pull/62)
* Added ``can_read`` classmethod to ``ZarrIO``. @bendichter [#97](https://github.com/hdmf-dev/hdmf-zarr/pull/97)

@@ -45,7 +54,7 @@
@oruebel [#62](https://github.com/hdmf-dev/hdmf-zarr/pull/62)
* Fixed CI testing of minimum and optional installation requirements. @rly
[#99](https://github.com/hdmf-dev/hdmf-zarr/pull/99)
* Updated tests to handle upcoming changes to ``HDMFIO``. @rly
[#102](https://github.com/hdmf-dev/hdmf-zarr/pull/102)


@@ -66,25 +75,25 @@
links/reference when moving Zarr files @oruebel [#46](https://github.com/hdmf-dev/hdmf-zarr/pull/46)
* Fixed bugs in requirements defined in setup.py @oruebel [#46](https://github.com/hdmf-dev/hdmf-zarr/pull/46)
* Fixed bug regarding Sphinx external links @mavaylon1 [#53](https://github.com/hdmf-dev/hdmf-zarr/pull/53)
* Updated gallery tests to use test_gallery.py and necessary package dependencies
@mavaylon1 [#53](https://github.com/hdmf-dev/hdmf-zarr/pull/53)
* Updated dataset used in conversion tutorial, which caused warnings
@oruebel [#56](https://github.com/hdmf-dev/hdmf-zarr/pull/56)

### Docs
* Added tutorial illustrating how to create a new NWB file with NWBZarrIO
@oruebel [#46](https://github.com/hdmf-dev/hdmf-zarr/pull/46)
* Added docs for describing the mapping of HDMF schema to Zarr storage
@oruebel [#48](https://github.com/hdmf-dev/hdmf-zarr/pull/48)
* Added ``docs/gallery/resources`` for storing local files used by the tutorial galleries
@oruebel [#61](https://github.com/hdmf-dev/hdmf-zarr/pull/61)
* Removed dependency on ``dandi`` library for data download in the conversion tutorial by storing the NWB files as
local resources @oruebel [#61](https://github.com/hdmf-dev/hdmf-zarr/pull/61)

## 0.1.0 (August 23, 2022)

### New features

* Created new optional Zarr-based I/O backend for writing files using Zarr's `zarr.store.DirectoryStore` backend,
including support for iterative write, chunking, compression, simple and compound data types, links, object
references, namespace and spec I/O.
23 changes: 14 additions & 9 deletions requirements-dev.txt
@@ -1,10 +1,15 @@
# pinned dependencies to reproduce an entire development environment to use HDMF, run HDMF tests, check code style,
# compute coverage, and create test environments
coverage==6.4.2
flake8==5.0.4
flake8-debugger==4.1.2
flake8-print==5.0.0
pytest==7.1.2
pytest-cov==3.0.0
# pinned dependencies to reproduce an entire development environment to use HDMF-Zarr,
# run HDMF-Zarr tests, check code style,
# compute coverage, and create test environments. Note that depending on the version of Python installed,
# different versions of requirements may be installed due to package incompatibilities.
#
black==23.10.1
codespell==2.2.6
coverage==7.3.2
hdf5plugin==4.3.0 # hdf5plugin is used to test conversion of plugin filters
pre-commit==3.5.0
pytest==7.4.3
pytest-cov==4.1.0
python-dateutil==2.8.2
tox==3.25.1
ruff==0.1.3
tox==4.11.3
2 changes: 1 addition & 1 deletion requirements-doc.txt
@@ -1,4 +1,4 @@
# dependencies to generate the documentation for HDMF
# dependencies to generate the documentation for HDMF-Zarr
matplotlib
sphinx>=4 # improved support for docutils>=0.17
sphinx_rtd_theme>=1 # <1 does not work with docutils>=0.17
6 changes: 3 additions & 3 deletions requirements-opt.txt
@@ -1,3 +1,3 @@
tqdm==4.65.0
fsspec==2023.10.0
s3fs==2023.10.0
tqdm==4.66.1
fsspec==2023.12.2
s3fs==2023.12.2
8 changes: 4 additions & 4 deletions requirements.txt
@@ -1,7 +1,7 @@
# pinned dependencies to reproduce an entire development environment to use HDMF-ZARR
hdmf==3.9.0
zarr==2.11.0
hdmf==3.12.0
zarr==2.16.1
pynwb==2.5.0
numpy==1.24.0
numcodecs==0.11.0
numpy==1.26.3
numcodecs==0.12.1
threadpoolctl==3.2.0
26 changes: 19 additions & 7 deletions src/hdmf_zarr/backend.py
@@ -344,8 +344,9 @@ def export(self, **kwargs):
)

if not isinstance(src_io, ZarrIO) and write_args.get('link_data', True):
raise UnsupportedOperation("Cannot export from non-Zarr backend %s to Zarr with write argument "
"link_data=True." % src_io.__class__.__name__)
raise UnsupportedOperation(f"Cannot export from non-Zarr backend {src_io.__class__.__name__} "
                           "to Zarr with write argument link_data=True. "
                           "Set write_args={'link_data': False}")

write_args['export_source'] = src_io.source # pass export_source=src_io.source to write_builder
ckwargs = kwargs.copy()
@@ -938,6 +939,11 @@ def write_dataset(self, **kwargs): # noqa: C901
name = builder.name
data = builder.data if force_data is None else force_data
options = dict()
# Check if data is an h5py.Dataset to infer I/O settings if necessary
if ZarrDataIO.is_h5py_dataset(data):
# Wrap the h5py.Dataset in ZarrDataIO with chunking and compression settings inferred from the input data
data = ZarrDataIO.from_h5py_dataset(h5dataset=data)
# Separate data values and io_settings for write
if isinstance(data, ZarrDataIO):
options['io_settings'] = data.io_settings
link_data = data.link_data
@@ -1202,9 +1208,8 @@ def __list_fill__(self, parent, name, data, options=None): # noqa: C901
io_settings = dict()
if options is not None:
dtype = options.get('dtype')
io_settings = options.get('io_settings')
if io_settings is None:
io_settings = dict()
if options.get('io_settings') is not None:
io_settings = options.get('io_settings')
# Determine the dtype
if not isinstance(dtype, type):
try:
@@ -1219,9 +1224,16 @@ def __list_fill__(self, parent, name, data, options=None): # noqa: C901
# Determine the shape and update the dtype if necessary when dtype==object
if 'shape' in io_settings: # Use the shape set by the user
data_shape = io_settings.pop('shape')
# If we have a numeric numpy array then use its shape
# If we have a numeric numpy-like array (e.g., numpy.array or h5py.Dataset) then use its shape
elif isinstance(dtype, np.dtype) and np.issubdtype(dtype, np.number) or dtype == np.bool_:
data_shape = get_data_shape(data)
# HDMF's get_data_shape may return the maxshape of an HDF5 dataset, which can include None values
# that Zarr does not allow for dataset shapes. Check for the shape attribute first before falling
# back on get_data_shape.
if hasattr(data, 'shape') and data.shape is not None:
data_shape = data.shape
# This is a fallback just in case. However, this should not happen for standard numpy and h5py arrays
else: # pragma: no cover
data_shape = get_data_shape(data) # pragma: no cover
# Deal with object dtype
elif isinstance(dtype, np.dtype):
data = data[:] # load the data in case we come from HDF5 or another on-disk data source we don't know
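
A hedged sketch (hypothetical file name) of the corner case handled above: for an HDF5 dataset with an unlimited dimension, `get_data_shape` can return the dataset's `maxshape`, which contains `None` and is therefore not a valid Zarr shape:

```python
import h5py
import numpy as np

with h5py.File("demo.h5", "w") as f:
    dset = f.create_dataset("x", data=np.zeros((10, 5)), maxshape=(None, 5))
    print(dset.shape)     # (10, 5)   -> usable as a Zarr dataset shape
    print(dset.maxshape)  # (None, 5) -> not allowed as a Zarr dataset shape
```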
78 changes: 76 additions & 2 deletions src/hdmf_zarr/utils.py
@@ -461,13 +461,87 @@ def __init__(self, **kwargs):
self.__iosettings['filters'] = filters

@property
def link_data(self):
def link_data(self) -> bool:
"""Bool indicating should it be linked to or copied. NOTE: Only applies to zarr.Array type data"""
return self.__link_data

@property
def io_settings(self):
def io_settings(self) -> dict:
"""Dict with the io settings to use"""
return self.__iosettings

@staticmethod
def from_h5py_dataset(h5dataset, **kwargs):
"""
Factory method to create a ZarrDataIO instance from an h5py.Dataset.
The ZarrDataIO object wraps the h5py.Dataset, and the io filter settings
are inferred from the filters used in h5py so that the options in Zarr match
(if possible) the options used in HDF5.
:param h5dataset: h5py.Dataset object that should be wrapped
:type h5dataset: h5py.Dataset
:param kwargs: Other keyword arguments to pass to ZarrDataIO.__init__
:returns: ZarrDataIO object wrapping the dataset
"""
filters = ZarrDataIO.hdf5_to_zarr_filters(h5dataset)
fillval = h5dataset.fillvalue if 'fillvalue' not in kwargs else kwargs.pop('fillvalue')
if isinstance(fillval, bytes): # bytes are not JSON serializable so use string instead
fillval = fillval.decode("utf-8")
chunks = h5dataset.chunks if 'chunks' not in kwargs else kwargs.pop('chunks')
data_io = ZarrDataIO(
data=h5dataset,
filters=filters,
fillvalue=fillval,
chunks=chunks,
**kwargs)
return data_io
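
A hedged usage sketch for this factory method (file and dataset names hypothetical):

```python
import h5py
from hdmf_zarr.utils import ZarrDataIO

with h5py.File("source.h5", "r") as f:
    dset = f["data"]
    # Chunks, fill value, and compatible filters are inferred from the HDF5 dataset
    wrapped = ZarrDataIO.from_h5py_dataset(h5dataset=dset)
    print(wrapped.io_settings)  # inferred Zarr io settings (e.g., 'chunks', 'filters')
```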

@staticmethod
def hdf5_to_zarr_filters(h5dataset) -> list:
"""From the given h5py.Dataset infer the corresponding filters to use in Zarr"""
# Based on https://github.com/fsspec/kerchunk/blob/617d9ce06b9d02375ec0e5584541fcfa9e99014a/kerchunk/hdf.py#L181
filters = []
# Check for unsupported filters
if h5dataset.scaleoffset:
# TODO: translate to numcodecs.fixedscaleoffset.FixedScaleOffset()
warn(f"{h5dataset.name} HDF5 scaleoffset filter ignored in Zarr")
if h5dataset.compression in ("szip", "lzf"):
warn(f"{h5dataset.name} HDF5 szip or lzf compression ignored in Zarr")
# Add the shuffle filter if possible
if h5dataset.shuffle and h5dataset.dtype.kind != "O":
# cannot use shuffle if we materialised objects
filters.append(numcodecs.Shuffle(elementsize=h5dataset.dtype.itemsize))
# iterate through all the filters and add them to the list
for filter_id, properties in h5dataset._filters.items():
filter_id_str = str(filter_id)
if filter_id_str == "32001":
blosc_compressors = ("blosclz", "lz4", "lz4hc", "snappy", "zlib", "zstd")
(_1, _2, bytes_per_num, total_bytes, clevel, shuffle, compressor) = properties
pars = dict(
blocksize=total_bytes,
clevel=clevel,
shuffle=shuffle,
cname=blosc_compressors[compressor])
filters.append(numcodecs.Blosc(**pars))
elif filter_id_str == "32015":
filters.append(numcodecs.Zstd(level=properties[0]))
elif filter_id_str == "gzip":
filters.append(numcodecs.Zlib(level=properties))
elif filter_id_str == "32004":
warn(f"{h5dataset.name} HDF5 lz4 compression ignored in Zarr")
elif filter_id_str == "32008":
warn(f"{h5dataset.name} HDF5 bitshuffle compression ignored in Zarr")
elif filter_id_str == "shuffle": # already handled above
pass
else:
warn(f"{h5dataset.name} HDF5 filter id {filter_id} with properties {properties} ignored in Zarr.")
return filters
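
A hedged demonstration of this filter mapping (hypothetical file name): an HDF5 dataset written with gzip compression and the shuffle filter should map to `numcodecs.Shuffle` and `numcodecs.Zlib`:

```python
import h5py
import numpy as np
from hdmf_zarr.utils import ZarrDataIO

with h5py.File("filters_demo.h5", "w") as f:
    dset = f.create_dataset(
        "x", data=np.arange(100.0), chunks=(10,),
        compression="gzip", compression_opts=4, shuffle=True,
    )
    filters = ZarrDataIO.hdf5_to_zarr_filters(dset)
    print(filters)  # expected: [Shuffle(elementsize=8), Zlib(level=4)]
```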

@staticmethod
def is_h5py_dataset(obj):
"""Check if the object is an instance of h5py.Dataset without requiring import of h5py"""
return (obj.__class__.__module__, obj.__class__.__name__) == ('h5py._hl.dataset', 'Dataset')
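
A small hedged illustration: the check compares the class's module and name, so it safely returns `False` for non-h5py objects even when h5py is not installed:

```python
import numpy as np
from hdmf_zarr.utils import ZarrDataIO

print(ZarrDataIO.is_h5py_dataset(np.arange(3)))  # False: a numpy array
print(ZarrDataIO.is_h5py_dataset([1, 2, 3]))     # False: a plain list
# For an actual h5py.Dataset, (module, name) == ('h5py._hl.dataset', 'Dataset') -> True
```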

class ZarrReference(dict):
"""
20 changes: 14 additions & 6 deletions test_gallery.py
@@ -49,26 +49,30 @@ def _import_from_file(script):
"length of electrodes. Your data may be transposed."
)

_deprication_warning_map = (
_deprecation_warning_map = (
'Classes in map.py should be imported from hdmf.build. Importing from hdmf.build.map will be removed '
'in HDMF 3.0.'
)

_deprication_warning_fmt_docval_args = (
_deprecation_warning_fmt_docval_args = (
"fmt_docval_args will be deprecated in a future version of HDMF. Instead of using fmt_docval_args, "
"call the function directly with the kwargs. Please note that fmt_docval_args "
"removes all arguments not accepted by the function's docval, so if you are passing kwargs that "
"includes extra arguments and the function's docval does not allow extra arguments (allow_extra=True "
"is set), then you will need to pop the extra arguments out of kwargs before calling the function."
)

_deprication_warning_call_docval_func = (
_deprecation_warning_call_docval_func = (
"call the function directly with the kwargs. Please note that call_docval_func "
"removes all arguments not accepted by the function's docval, so if you are passing kwargs that "
"includes extra arguments and the function's docval does not allow extra arguments (allow_extra=True "
"is set), then you will need to pop the extra arguments out of kwargs before calling the function."
)

_deprecation_warning_pandas_pyarrow_re = (
r"\nPyarrow will become a required dependency of pandas.*"
)
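
A hedged mini-example of how such a regex filter behaves; `warnings.filterwarnings` treats `message` as a regular expression matched against the start of the warning text:

```python
import warnings

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("error")  # any unmatched warning would raise
    warnings.filterwarnings(
        "ignore",
        message=r"\nPyarrow will become a required dependency of pandas.*",
        category=DeprecationWarning,
    )
    # Hypothetical warning text mimicking the pandas message
    warnings.warn(
        "\nPyarrow will become a required dependency of pandas 3.0",
        DeprecationWarning,
    )
assert caught == []  # the filter suppressed the warning
```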


def run_gallery_tests():
global TOTAL, FAILURES, ERRORS
@@ -96,13 +100,13 @@ def run_gallery_tests():
try:
with warnings.catch_warnings(record=True):
warnings.filterwarnings(
"ignore", message=_deprication_warning_map, category=DeprecationWarning
"ignore", message=_deprecation_warning_map, category=DeprecationWarning
)
warnings.filterwarnings(
"ignore", message=_deprication_warning_fmt_docval_args, category=PendingDeprecationWarning
"ignore", message=_deprecation_warning_fmt_docval_args, category=PendingDeprecationWarning
)
warnings.filterwarnings(
"ignore", message=_deprication_warning_call_docval_func, category=PendingDeprecationWarning
"ignore", message=_deprecation_warning_call_docval_func, category=PendingDeprecationWarning
)
warnings.filterwarnings(
"ignore", message=_experimental_warning_re, category=UserWarning
@@ -127,6 +131,10 @@
# this warning is triggered when downstream code such as pynwb uses pkg_resources>=5.13
"ignore", message=_pkg_resources_declare_warning_re, category=DeprecationWarning
)
warnings.filterwarnings(
# this warning is triggered from pandas
"ignore", message=_deprecation_warning_pandas_pyarrow_re, category=DeprecationWarning
)
_import_from_file(script_abs)
except Exception:
print(traceback.format_exc())
3 changes: 2 additions & 1 deletion tests/unit/base_tests_zarrio.py
@@ -1584,7 +1584,8 @@ def close(self):

with OtherIO(manager=get_foo_buildmanager()) as read_io:
with ZarrIO(self.store[1], mode='w') as export_io:
msg = "Cannot export from non-Zarr backend OtherIO to Zarr with write argument link_data=True."
msg = ("Cannot export from non-Zarr backend OtherIO to Zarr with write argument link_data=True. "
"Set write_args={'link_data': False}")
with self.assertRaisesWith(UnsupportedOperation, msg):
export_io.export(src_io=read_io, container=foofile)
