-
Notifications
You must be signed in to change notification settings - Fork 26
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix conversion of Zarr string dataset to HDF5 #1171
base: dev
Are you sure you want to change the base?
Conversation
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## dev #1171 +/- ##
==========================================
- Coverage 88.88% 88.83% -0.06%
==========================================
Files 45 45
Lines 9835 9841 +6
Branches 2795 2798 +3
==========================================
Hits 8742 8742
- Misses 776 779 +3
- Partials 317 320 +3 ☔ View full report in Codecov by Sentry. |
@rly could you please take a look to see if this is right way to address this issue |
if hasattr(data, 'attrs') and 'zarr_dtype' in data.attrs and data.attrs['zarr_dtype'] == 'str': | ||
return cls.__dtypes['text'] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we add:
try:
from zarr import Array as ZarrArray
import numcodecs
ZARR_INSTALLED = True
except ImportError:
ZARR_INSTALLED = False
def _is_zarr_array(value):
return ZARR_INSTALLED and isinstance(value, ZarrArray)
Can we change these lines to:
if _is_zarr_array(data) and data.filters:
if numcodecs.VLenUTF8() in data.filters:
return cls.__dtypes['text']
elif numcodecs.VLenBytes() in data.filters:
return cls.__dtypes['ascii']
It would be nice if HDF5IO
can accept any Zarr array just like it accepts any Numpy array, instead of accepting only Zarr arrays that have been written with hdmf-zarr. The Zarr array filters should show that the array dtype is a variable-length string and its encoding.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That said, I don't seem to be able to create a test case where these lines (regardless of yours or mine) matters. Could you add one?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Note from discussion: this probably can be removed and was there to be overly cautious. Ryan TODO: try a conversion and confirm.
Motivation
Fix hdmf-dev/hdmf-zarr#211
Zarr does not natively support variable length strings, instead the data is stored as objects with an encoder. When converting from Zarr to HDF5 the
ObjectMapper
tries to change the dtype in some cases for text datasets to unicode, which in turn can cause reading the data from Zarr to fail. This PR changes this behavior to: 1) not change the dtype for Zarr string datasets, and 2) detect the correct dtype to use in HDF5 in the backend.Checklist
CHANGELOG.md
with your changes?