-
Notifications
You must be signed in to change notification settings - Fork 6
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
default object_codec_class -> JSON #173
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you also please add a line to the CHANGELOG https://github.com/hdmf-dev/hdmf-zarr/blob/dev/CHANGELOG.md
I think the main reason I didn't use JSON at the time was because of cases where objects are not JSON serializable. However, I don't remember whether that was an actual issue I encountered or whether I was just being overly cautious and used Pickle because I knew it would work. But I think if the test suite passes we should be Ok. |
Do you want me to add it under the 0.6.0 heading, or make a new heading? (0.6.1, 0.7.0) |
Please add it under a new heading "0.7.0 (upcoming)" since 0.6.0 has already been released |
Done. |
Now that I made the actual modification, some tests are failing on my fork. So I guess the process is going to be more involved. |
It looks like there are two main types of errors related to the JSON encoding:
|
The point of failure for this error appears to be in all cases in the hdmf-zarr/src/hdmf_zarr/backend.py Lines 1055 to 1061 in 556ed12
#146 updated the writing of compound datasets to fix the determination of the dtype and may be relevant here. |
Thanks, I'm working on a potential solution. |
@oruebel I have pushed a solution where it resorts to a Pickle codec in certain (hopefully rare) cases, and prints a warning. The cases are: (a) structured array and (b) compound dtype with embedded references For the purposes of neurosift, I am okay with this, because it seems to be a rare occurrence. But I wasn't sure whether you wanted to either suppress the warnings or try to serialize this in a different way. |
Thanks a lot! I think it would be great to see if we can use JSON for those cases too. The TimeSeriesReferenceVectorData is probably the main case in NWB. This is a compound of two ints and a reference. References are stored as a jSOn string, so I think those should be serializable with JSON too. @mavaylon1 could you take a look at this PR to see if we can use JSON serialization for the compound data case that @magland mentioned as well. It would be good to be consistent in how we serialize objects. I think thus is related to the recent PR on compound data types. |
I found this particularly challenging example in the wild. It seems it has a dataset that is a list of lists with embedded references. 000713 - Allen Institute - Visual Behavior - Neuropixels https://api.dandiarchive.org/api/assets/b2391922-c9a6-43f9-8b92-043be4015e56/download/ intervals/grating_presentations/timeseries
|
Yes, that is a |
Note, in Zarr, the object reference is being represented as a JSON, so in Zarr the dtype for |
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## dev #173 +/- ##
==========================================
+ Coverage 86.04% 86.07% +0.03%
==========================================
Files 5 5
Lines 1161 1178 +17
Branches 296 303 +7
==========================================
+ Hits 999 1014 +15
Misses 107 107
- Partials 55 57 +2 ☔ View full report in Codecov by Sentry. |
for c in np.ndindex(data_shape): | ||
o = data | ||
for i in c: | ||
o = o[i] | ||
if isinstance(o, np.void) and o.dtype.names is not None: | ||
has_structured_array = True |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this seems pretty slow, iterating through every value in the array to check type. in structured arrays, wouldn't you just have to check the dtype of the whole array? in any case shouldn't we break
the first time we get a True
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
also couldn't this switch happen here:
hdmf-zarr/src/hdmf_zarr/backend.py
Lines 1254 to 1258 in 29adcca
for substype in dtype.fields.items(): | |
if np.issubdtype(substype[1][0], np.flexible) or np.issubdtype(substype[1][0], np.object_): | |
dtype = object | |
io_settings['object_codec'] = self.__codec_cls() | |
break |
Can we turn the codec class into a getter method so we can encapsulate all switching logic in one place? currently there's: hdmf-zarr/src/hdmf_zarr/backend.py Line 154 in 29adcca
the logic of when to use which codec is split in a few places, but it seems to ultimately depend on this check here: hdmf-zarr/src/hdmf_zarr/backend.py Line 1054 in 29adcca
has its roots here: hdmf-zarr/src/hdmf_zarr/backend.py Line 955 in 29adcca
and similarly this check: hdmf-zarr/src/hdmf_zarr/backend.py Line 1279 in 29adcca
has its roots here: hdmf-zarr/src/hdmf_zarr/backend.py Line 1218 in 29adcca
called from here: hdmf-zarr/src/hdmf_zarr/backend.py Line 1071 in 29adcca
which is just the other leg of the above check, and L1276-1291 is very similar to L1037-1050.
not my package, but when i was working on this before i had left this note: hdmf-zarr/src/hdmf_zarr/backend.py Lines 1033 to 1036 in 29adcca
__resolve_dtype_helper__ method which could then make get_codec_cls() well defined as well.
|
@sneakers-the-rat some more details below, but to make a long story short, my hope is that we can:
@magland @sneakers-the-rat sorry that I have not had the chance to dive into the code for this PR yet, but those are things I hope we can do before merging this PR, i.e., try to do 1. if possible and implement 2. if we can't make 1. work.
TBH, my hope is that we can avoid having to use different codecs if possible. The use of
Currently the backend uses a single codec throughout, so a getter was not need, but if we really need to use multiple codes (which, again, I ideally would like to avoid if possible), then I totally agree that encapsulating this logic in a method is a good idea. We probably, then also need to get rid of hdmf-zarr/src/hdmf_zarr/backend.py Lines 95 to 97 in 897d3b9
Yes, I believe that should work. The |
@oruebel I'll dive in this week. |
Recap and Updates: |
I have an "almost" proof of concept for this. I will share this week if it holds up with more exploration. |
Motivation
See #171
In this PR I adjusted the default object_codec_class to numcodecs.JSON.
In addition, I also exposed this parameter in the constructor of NWBZarrIO
Fix #171
How to test the behavior?
I ran the following with my forked version, and verified that the resulting zarr opens properly in Neurosift:
Checklist
Note: I wasn't sure how to update CHANGELOG.md -- I mean where to put the new tet
ruff
from the source directory.