Split large unchunked contiguous HDF5 datasets into smaller chunks #84
Conversation
Codecov Report

@@            Coverage Diff             @@
##             main      #84      +/-   ##
==========================================
+ Coverage   79.43%   80.97%   +1.54%
==========================================
  Files          30       30
  Lines        2256     2423     +167
==========================================
+ Hits         1792     1962     +170
+ Misses        464      461       -3

View full report in Codecov by Sentry.
Ran into a difficulty that required a hacky solution in this commit. The problem is with the final chunk of the subdivided contiguous dataset. Unless the chunk size divides evenly into the total shape (rarely the case), the final chunk in the HDF5 file will be smaller than the chunk size reported in .zarray. That's fine for HDF5, but for Zarr all the chunk files need to be the same size, and the portion of the final chunk that extends past the array bounds is simply discarded on read. This is a problem for the reference file system, since there is no way to specify that a particular byte range (in the HDF5 file) needs to be padded with extra zeros.

After thinking about it for a while, I concluded that we either need to
(a) expand the capabilities of the reference file system to specify this extra padding (clumsy and unnatural), or
(b) have the special Zarr stores that are backed by the reference file system automatically detect when the chunk data is smaller than expected based on the .zarray metadata and, in that case, pad the chunk with the necessary zeros.

I went with (b) and needed to implement it for both of the custom Zarr stores: LindiReferenceFileSystemStore and LindiH5ZarrStore. It's hard to remember what the difference between these is, so I'll summarize:

LindiReferenceFileSystemStore: for when you have a .lindi.json file (reference file system) and you want to represent it as LINDI Zarr and then wrap that in LindiH5pyFile.

LindiH5ZarrStore: for when you have an HDF5 file (e.g., a remote file on DANDI) and you want to represent it as LINDI Zarr, usually for the purpose of generating a .lindi.json file.
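To make the padding idea concrete, here is a minimal sketch (the function name and signature are illustrative, not the actual code in either store), assuming the chunk bytes have already been fetched via the reference file system:

```python
import numpy as np

def pad_final_chunk(chunk_bytes: bytes, zarray_metadata: dict) -> bytes:
    """Zero-pad a short trailing chunk up to the size implied by .zarray metadata."""
    dtype = np.dtype(zarray_metadata["dtype"])
    expected_nbytes = int(np.prod(zarray_metadata["chunks"])) * dtype.itemsize
    if len(chunk_bytes) < expected_nbytes:
        # HDF5 only stores the valid portion of the trailing chunk, but Zarr
        # expects every (uncompressed) chunk to decode to exactly `chunks` elements.
        chunk_bytes = chunk_bytes + b"\x00" * (expected_nbytes - len(chunk_bytes))
    return chunk_bytes
```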
I ran into this problem also. If I understand your solution, you are doing the same thing I did, which is to chunk contiguous arrays along the 0th dimension. So if your array is (time × channel), you are imposing artificial chunking along time and not channels. This solution does work, but it can lead to problems when you need to index the other dimensions. E.g., if you want to access all time points for a given channel, you'll need to read the entire dataset for all channels, and there is no way to adjust the chunking to make this more efficient, since this solution only allows chunking along the 0th dimension. I think a long-term solution might be to build contiguous array support into Zarr directly. It can be expressed easily in Zarr (chunks=null, filters=null), and it should be pretty easy to implement since it's really just a simple memmap.
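For what it's worth, here is a rough sketch of the "it's just a memmap" idea for a local file (the function and its parameters are hypothetical; Zarr does not currently expose anything like this):

```python
import numpy as np

def open_contiguous_dataset(path: str, offset: int, shape: tuple, dtype: str):
    """Memory-map an uncompressed, contiguous HDF5 dataset at a known byte offset."""
    return np.memmap(path, mode="r", dtype=np.dtype(dtype), offset=offset, shape=shape)

# Strided access (e.g., one channel across all time) then touches only the pages
# it needs, with no artificial chunk boundaries:
#   data = open_contiguous_dataset("file.nwb", offset=2048, shape=(10_000_000, 384), dtype="<i2")
#   one_channel = data[:, 5]
```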
@bendichter Also, see my comment above about what I needed to do for the final chunk (so it wasn't completely straightforward, unfortunately).
Yes, I am suggesting we try to fix this on the Zarr side. I think it would be worth discussing with them and seeing if they would be open to this feature.
Focusing on the case of a remote file, where each read of a byte range requires a separate HTTP request...
OK, I see your point. When accessing data on disk you can use strides, but when accessing remote data the HTTP request needs to cover a contiguous byte range, so you'll need to download that data and bring it into RAM regardless. You might be able to do some caching so you don't need to hold the entire contiguous dataset in RAM, but you'll still be forced to download the entire dataset, which is the expensive part anyway. You could effectively download strides by sending multiple requests, but then you'd be sending too many requests. There would still be two benefits of supporting direct reads of non-chunked data in remote settings. I'm not sure either of them is a game-changer, but I thought they were worth mentioning:
This is a third attempt at #82 and #83
This solves the following important problem. Some of the large (raw ephys) datasets on DANDI are uncompressed and stored in a single very large contiguous chunk within the NWB file, for example 000935. In fact, the default in HDF5 is to create an uncompressed, contiguous (non-chunked) dataset. This is a problem for Zarr with remote files because (as far as I can see) there is no such thing as a partial read of a Zarr chunk. So if I try to randomly access a time segment within one of these arrays, the entire chunk/dataset needs to be downloaded.
This PR modifies LindiH5ZarrStore so that such datasets are effectively split into smaller chunks within the reference file system.
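For illustration, here is a simplified sketch of the splitting logic (not the actual LindiH5ZarrStore implementation; the names and the chunk-key convention are assumptions): each virtual chunk along the 0th dimension maps to an (offset, length) byte range within the HDF5 file, expressed as kerchunk-style references.

```python
import numpy as np

def virtual_chunk_refs(url: str, dataset_byte_offset: int, shape: tuple,
                       dtype: str, chunk_len: int) -> dict:
    """Split a contiguous dataset into virtual chunks along axis 0."""
    itemsize = np.dtype(dtype).itemsize
    row_nbytes = int(np.prod(shape[1:], dtype=np.int64)) * itemsize  # bytes per axis-0 slice
    n_chunks = -(-shape[0] // chunk_len)  # ceiling division
    refs = {}
    for i in range(n_chunks):
        n_rows = min(chunk_len, shape[0] - i * chunk_len)
        key = ".".join([str(i)] + ["0"] * (len(shape) - 1))  # chunk key within the dataset
        refs[key] = [url,
                     dataset_byte_offset + i * chunk_len * row_nbytes,
                     n_rows * row_nbytes]  # the final range may be shorter than a full chunk
    return refs
```

The final byte range is shorter than a full chunk whenever chunk_len does not divide shape[0], which is exactly the case the zero-padding discussed above has to handle.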