Split large unchunked contiguous HDF5 datasets into smaller chunks #84
Conversation
Codecov Report

@@            Coverage Diff             @@
##             main      #84      +/-   ##
==========================================
+ Coverage   79.43%   80.97%   +1.54%
==========================================
  Files          30       30
  Lines        2256     2423     +167
==========================================
+ Hits         1792     1962     +170
+ Misses        464      461       -3

View full report in Codecov by Sentry.
Ran into a difficulty that required a hacky solution in this commit. The problem is with the final chunk of the subdivided contiguous dataset. Unless the chunk size divides evenly into the total shape (rarely the case), the final chunk in the HDF5 file will be smaller than the chunk size reported in .zarray. That's fine for HDF5, but for Zarr all the chunk files need to be the same size, and the portion of the final chunk that extends past the array bounds is simply discarded on read. This is a problem for the reference file system, since there is no way to specify that a particular byte range (in the HDF5 file) needs to be padded with extra zeros.

After thinking about it for a while, I concluded that we either need to
(a) expand the capabilities of the reference file system to specify this extra padding (clumsy and unnatural), or
(b) have the special Zarr stores that are backed by the reference file system automatically detect when the chunk data is smaller than expected based on the .zarray metadata and, in that case, pad the chunk with the necessary zeros.

I went with (b) and needed to implement it for both of the custom Zarr stores: LindiReferenceFileSystemStore and LindiH5ZarrStore. It's hard to remember what the difference between these is, so I'll summarize:

LindiReferenceFileSystemStore: for when you have a .lindi.json file (reference file system) and you want to represent it as LINDI Zarr and then wrap that in LindiH5pyFile.

LindiH5ZarrStore: for when you have an HDF5 file (e.g., a remote file on DANDI) and you want to represent it as LINDI Zarr, usually for the purpose of generating a .lindi.json file.
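To make the padding idea concrete, here is a minimal sketch (the function name and signature are illustrative, not the actual code in either store), assuming the chunk bytes have already been fetched via the reference file system:

```python
import numpy as np

def pad_final_chunk(chunk_bytes: bytes, zarray_metadata: dict) -> bytes:
    """Zero-pad a short trailing chunk up to the size implied by .zarray metadata."""
    dtype = np.dtype(zarray_metadata["dtype"])
    expected_nbytes = int(np.prod(zarray_metadata["chunks"])) * dtype.itemsize
    if len(chunk_bytes) < expected_nbytes:
        # HDF5 only stores the valid portion of the trailing chunk, but Zarr
        # expects every (uncompressed) chunk to decode to exactly `chunks` elements.
        chunk_bytes = chunk_bytes + b"\x00" * (expected_nbytes - len(chunk_bytes))
    return chunk_bytes
```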
I ran into this problem also. If I understand your solution, you are doing the same thing I did, which is to chunk contiguous arrays along the 0th dimension. So if your array is (time × channel), you are imposing artificial chunking along time and not channels. This solution does work, but it can lead to problems when you need to index the other dimensions. E.g., if you want to access all time points for a given channel, you'll need to read the entire dataset for all channels, and there is no way to adjust the chunking to make this more efficient, since this solution only allows chunking along the 0th dimension. I think a long-term solution might be to build contiguous array support into Zarr directly. It can be expressed easily in Zarr (chunks=null, filters=null), and it should be pretty easy to implement since it's really just a simple memmap.
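For what it's worth, here is a rough sketch of the "it's just a memmap" idea for a local file (the function and its parameters are hypothetical; Zarr does not currently expose anything like this):

```python
import numpy as np

def open_contiguous_dataset(path: str, offset: int, shape: tuple, dtype: str):
    """Memory-map an uncompressed, contiguous HDF5 dataset at a known byte offset."""
    return np.memmap(path, mode="r", dtype=np.dtype(dtype), offset=offset, shape=shape)

# Strided access (e.g., one channel across all time) then touches only the pages
# it needs, with no artificial chunk boundaries:
#   data = open_contiguous_dataset("file.nwb", offset=2048, shape=(10_000_000, 384), dtype="<i2")
#   one_channel = data[:, 5]
```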
@bendichter Also, see my comment above about what I needed to do for the final chunk (so it wasn't completely straightforward, unfortunately).
Yes, I am suggesting we try to fix this on the Zarr side. I think it would be worth discussing with them and seeing if they would be open to this feature.
Focusing on the case of a remote file, where each read of a byte range requires a separate HTTP request...
OK, I see your point. When accessing data on disk you can use strides, but when accessing remote data the HTTP request needs to cover a contiguous byte range, so you'll need to download that data and bring it into RAM regardless. You might be able to do some caching so you don't need to hold the entire contiguous dataset in RAM, but you'll still be forced to download the entire dataset, which is the expensive part anyway. You could effectively download strides by sending multiple requests, but then you'd be sending too many requests. There would still be two benefits of supporting direct reads of non-chunked data in remote settings. I'm not sure either of them is a game-changer, but I thought they were worth mentioning:
This is a third attempt at #82 and #83
This solves the following important problem. Some of the large (raw ephys) datasets on DANDI are uncompressed and stored in a single very large contiguous chunk within the NWB file, for example 000935. In fact, the default in HDF5 is to create an uncompressed, contiguous (non-chunked) dataset. This is a problem for Zarr with remote files because (as far as I can see) there is no such thing as a partial read of a Zarr chunk. So if I try to randomly access a time segment within one of these arrays, the entire chunk/dataset needs to be downloaded.
This PR modifies LindiH5ZarrStore so that such datasets are effectively split into smaller chunks within the reference file system.
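For illustration, here is a simplified sketch of the splitting logic (not the actual LindiH5ZarrStore implementation; the names and the chunk-key convention are assumptions): each virtual chunk along the 0th dimension maps to an (offset, length) byte range within the HDF5 file, expressed as kerchunk-style references.

```python
import numpy as np

def virtual_chunk_refs(url: str, dataset_byte_offset: int, shape: tuple,
                       dtype: str, chunk_len: int) -> dict:
    """Split a contiguous dataset into virtual chunks along axis 0."""
    itemsize = np.dtype(dtype).itemsize
    row_nbytes = int(np.prod(shape[1:], dtype=np.int64)) * itemsize  # bytes per axis-0 slice
    n_chunks = -(-shape[0] // chunk_len)  # ceiling division
    refs = {}
    for i in range(n_chunks):
        n_rows = min(chunk_len, shape[0] - i * chunk_len)
        key = ".".join([str(i)] + ["0"] * (len(shape) - 1))  # chunk key within the dataset
        refs[key] = [url,
                     dataset_byte_offset + i * chunk_len * row_nbytes,
                     n_rows * row_nbytes]  # the final range may be shorter than a full chunk
    return refs
```

The final byte range is shorter than a full chunk whenever chunk_len does not divide shape[0], which is exactly the case the zero-padding discussed above has to handle.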