Use chunk_iter when adding chunk refs #68
Conversation
Codecov Report
Attention: Patch coverage is

Additional details and impacted files:

@@            Coverage Diff             @@
##             main      #68      +/-   ##
==========================================
- Coverage   81.35%   79.53%   -1.83%
==========================================
  Files          30       30
  Lines        2194     2238      +44
==========================================
- Hits         1785     1780       -5
- Misses        409      458      +49

☔ View full report in Codecov by Sentry. |
Pushed some small changes to keep the linter happy (pyright / flake8). Passes tests now. ... continuing my review... |
@rly This all looks great |
@magland great! thanks for the look over |
In this line, when would chunks[i] ever be 0?
When I try to create an h5py dataset with a chunk dimension length = 0, I get an error:
with h5py.File("test.h5", "w") as f:
    f.create_dataset("data", shape=(2, 2, 2), dtype="i4", chunks=(1, 1, 0))
ValueError: All chunk dimensions must be positive (all chunk dimensions must be positive) |
Maybe related, the line above says:
> root = zarr.open("test.zarr", mode="w")
> data = root.create_dataset("data", shape=(0, 0), chunks=None)
> data.chunks
(1, 1) |
Though in Zarr, you can explicitly set the chunk dimension lengths to 0...
> root = zarr.open("test.zarr", mode="w")
> data = root.create_dataset("data", shape=(0, 0), chunks=(0, 0))
> data.chunks
(0, 0) |
I have the following in a comment elsewhere in the code (it should probably be pasted above this line as well): "the shape could be zero -- for example dandiset 000559 - acquisition/depth_video/data has shape [0, 0, 0]". I'm not sure how the shape got to be that... but you can check that dandiset. |
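For illustration, a dataset with an all-zero shape is easy to create directly with h5py; this is a hypothetical reproduction for local testing, not the actual dandiset file:

import h5py

# Hypothetical reproduction of a zero-shaped dataset like
# acquisition/depth_video/data in dandiset 000559.
with h5py.File("empty_example.h5", "w") as f:
    dset = f.create_dataset("data", shape=(0, 0, 0), dtype="i4")
    print(dset.shape)   # (0, 0, 0)
    print(dset.chunks)  # None -- no chunks are ever written for an empty dataset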
I just removed that chunk of code and wanted to make sure I wasn't missing an unusual edge case. I think the modification would work for this dataset, but I'll add a test. |
I just updated it. I think a progress bar is useful. Annoyingly, ... |
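For reference, a progress bar around chunk_iter can be driven from its callback roughly like this. This is an illustrative sketch assuming tqdm and h5py >= 3.8 built against a recent HDF5, not necessarily how the PR wires it up:

import h5py
from tqdm import tqdm

def list_chunks_with_progress(dset: h5py.Dataset):
    """Collect (chunk_offset, byte_offset, size) for each stored chunk of a
    chunked dataset, updating a tqdm bar from the chunk_iter callback.
    Illustrative sketch only."""
    dsid = dset.id
    results = []
    with tqdm(total=dsid.get_num_chunks(), desc=dset.name) as pbar:
        def on_chunk(info):  # info is an h5py.h5d.StoreInfo
            results.append((info.chunk_offset, info.byte_offset, info.size))
            pbar.update(1)
        dsid.chunk_iter(on_chunk)
    return results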
This all sounds good to me. |
@rly is this ready to get merged in yet? Is there something I can help with?
I wanted to add some tests but let's go ahead and merge this so that we can use it in the benchmarks for the proposal. We can add tests in a separate PR. |
Okay, I just published 0.3.5 from the main branch.
Fix #67.
I added lindi.LindiH5ZarrStore._util._get_all_chunk_info.

LindiH5ZarrStore._get_chunk_file_bytes_data was being called in two places:
1. to_reference_file_system.process_dataset, in a loop over all chunks, to get the chunk info for each chunk to add to the refs dict
2. _get_chunk_file_bytes, which is called by __getitem__ for a single key
Because the new chunk_iter method returns the chunk info for all chunks and takes 1-5 seconds, it is best to use that in use case 1 above. The old get_chunk_info method takes about 0.2 seconds. It is best to use that for use case 2 above, when a user wants to access a single key.
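For context, here is a minimal sketch of the fast vs. slow low-level h5py calls being compared. collect_chunk_info is a hypothetical helper name, not lindi's actual code, and chunk_iter requires h5py >= 3.8 built against a recent HDF5:

import h5py

def collect_chunk_info(dset: h5py.Dataset) -> dict:
    """Map chunk_offset -> (byte_offset, size) for every written chunk.
    Hypothetical helper for illustration only."""
    out = {}
    if dset.chunks is None:
        return out  # contiguous (non-chunked) datasets have no chunk index
    dsid = dset.id
    if hasattr(dsid, "chunk_iter"):
        # Fast path: one pass over the chunk index via a callback
        def on_chunk(info):  # info is an h5py.h5d.StoreInfo
            out[info.chunk_offset] = (info.byte_offset, info.size)
        dsid.chunk_iter(on_chunk)
    else:
        # Slow path: one index lookup per chunk via get_chunk_info
        for i in range(dsid.get_num_chunks()):
            info = dsid.get_chunk_info(i)
            out[info.chunk_offset] = (info.byte_offset, info.size)
    return out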
TODO: if the user wants to access more than ~10 keys for a given dataset, e.g., they are requesting a large slice, then probably it would be best to use the chunk_iter method and cache the chunk info in LindiH5ZarrStore. Alternatively, we could just always run chunk_iter and cache the chunk info.
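A possible shape for that caching idea, purely as a sketch of the TODO and not code in this PR (class and method names are hypothetical; collect_chunk_info is the illustrative helper sketched earlier):

class ChunkInfoCache:
    """Hypothetical per-dataset cache: run chunk_iter once for a dataset,
    then answer all later single-chunk lookups from memory."""
    def __init__(self):
        self._by_dataset = {}  # dataset name -> {chunk_offset: (byte_offset, size)}

    def lookup(self, dset, chunk_offset):
        name = dset.name
        if name not in self._by_dataset:
            # First access for this dataset: pay the chunk_iter cost once
            self._by_dataset[name] = collect_chunk_info(dset)
        return self._by_dataset[name].get(chunk_offset)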
I tried putting a bunch of if/else branches in LindiH5ZarrStore._get_chunk_file_bytes_data to handle the two use cases above, but it got messy and complex. So I just created a new function _add_chunk_info_to_refs for use case 1, which has some of the same code as _get_chunk_file_bytes_data. It needs to be refactored.
This also adds tqdm as a dependency for showing a progress bar when iterating through many chunks. I think that is useful overall, but we could move that to an optional dependency if we think it is not so important.

@magland before I continue, can you take a look at this and let me know if I am understanding the code correctly and if the approach looks good?
TODO: