
Issue6: support for loading only chunks needed for a selection #58

Closed
bnlawrence wants to merge 18 commits

Conversation

bnlawrence

This code supports data access of the form dataset[4:6,12:15], reading only the chunks of a chunked dataset that are needed to satisfy the request. It does not change the behaviour of contiguous data access.

It makes use of a copy of the Zarr Indexer, pulled out of zarr to avoid a dependency on a large and active package, together with a cached copy of the b-tree index (cached per dataset). There is a new subclass of DataObjects to facilitate the data access. It is not clear whether new unit tests are needed to cover this, insofar as the existing unit tests already do a getitem on chunked datasets.
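As a rough illustration of what the indexer does conceptually (a toy sketch, not the copied Zarr code): given a tuple of slices and the dataset's chunk shape, it works out which chunk-grid coordinates are touched, and only those chunks need to be looked up in the cached b-tree index and read.

```python
import itertools

def chunks_for_selection(slices, chunk_shape):
    """Toy sketch: yield the chunk-grid coordinates a slice selection touches."""
    per_dim = []
    for sl, csz in zip(slices, chunk_shape):
        first = sl.start // csz          # first chunk overlapped in this dim
        last = (sl.stop - 1) // csz      # last chunk overlapped in this dim
        per_dim.append(range(first, last + 1))
    yield from itertools.product(*per_dim)

# With 10x10 chunks, the selection [4:6, 12:15] only touches chunk (0, 1),
# so only that chunk needs to be located in the b-tree and decoded.
print(list(chunks_for_selection((slice(4, 6), slice(12, 15)), (10, 10))))
# -> [(0, 1)]
```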

It provides a public API to the chunk addresses required by an indexed selection, to support the requirements of another package in which we are also trying to avoid dependencies on Zarr.

It addresses most of the requirements of #6, but I didn't dare address the contiguous data memory mapper ...
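For context, the intended usage looks something like the following (the file and dataset names are made up, and the chunk-address method is a hypothetical placeholder rather than the exact API added here):

```python
import pyfive

f = pyfive.File('example.h5')        # illustrative file name
ds = f['chunked_variable']           # illustrative dataset name

# Only the chunks overlapping this selection should now be read from disk;
# contiguous datasets behave exactly as before.
subset = ds[4:6, 12:15]

# Hypothetical sketch of the public chunk-address API mentioned above, so
# that other packages can plan their own I/O without depending on zarr.
# The method name and return form are placeholders, not the PR's exact API.
# chunk_info = ds.get_chunk_addresses((slice(4, 6), slice(12, 15)))
# -> e.g. a list of (chunk_offset_in_file, chunk_size, chunk_coords)
```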

@bnlawrence
Author

bnlawrence commented Mar 3, 2024

Ok, well, there's an embarrassing error in that code, insofar as the new chunking code isn't used at all, due to this line:

        data = self._dataobjects.get_data()[args]

which now should be

        data = self._dataobjects.get_data(args)

but isn't yet corrected in this pull request. When it is corrected, some unit tests start to fail, which I'll need to address before updating this pull request. Apologies.

(I guess the good news is that my expectation that the existing unit tests ought to test this code is at least partially correct.)
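To make the distinction concrete, here is a minimal sketch (not the actual pyfive source) of the difference between the two lines:

```python
class DatasetSketch:
    """Minimal sketch of a Dataset-like wrapper; names are illustrative."""

    def __init__(self, dataobjects):
        self._dataobjects = dataobjects

    def __getitem__(self, args):
        # Old line: read *every* chunk into memory, then apply the selection,
        # so the new chunk-by-chunk machinery is never exercised.
        #   data = self._dataobjects.get_data()[args]

        # Intended line: pass the selection down, so only the chunks that
        # overlap `args` are located via the cached b-tree index and read.
        data = self._dataobjects.get_data(args)
        return data
```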

…changes to actually using the filter pipeline. At this point it is failing test_reference.
@bnlawrence
Author

(Just one failing test left to investigate.)

@bnlawrence
Author

I don't understand what HDF references are, so I have bypassed the "chunk by chunk" code when the dtype is a tuple. Hopefully someone who looks at this pull request can decide whether that is the right thing to do, or can improve on what is done here.
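In other words, the guard is along these lines (an assumed sketch; the tuple dtype shown for the reference case is illustrative, not necessarily pyfive's exact representation):

```python
import numpy as np

def use_chunk_by_chunk(dtype):
    """Sketch of the bypass described above: tuple-valued dtypes (used by
    pyfive for special types such as references) skip the new chunk-by-chunk
    path and fall back to reading the whole dataset as before."""
    return not isinstance(dtype, tuple)

print(use_chunk_by_chunk(np.dtype('<f8')))    # True: ordinary numeric data
print(use_chunk_by_chunk(('REFERENCE', 8)))   # False: reference-like dtype
```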

@bmaranville
Collaborator

The Reference dtype corresponds to an address that points to an object (Group, Dataset, user-registered Type, or Link), where each address is a 64-bit unsigned int. I suppose it's probably legal to store a Dataset of References in chunked mode: it would follow the normal rules of chunked storage, but it is raw addresses that are stored in the chunks, and these should be loaded into Reference objects.

There are other dtypes that get read into a tuple in pyfive, like VLEN_STRING and VLEN_SEQUENCE (see https://github.com/jjhelmus/pyfive/blob/master/pyfive/dataobjects.py#L199), but they are currently only handled when reading Attributes, not Datasets. These are also pointer addresses, but I think the dereferencing algorithm is slightly different from the one for Reference objects (see the dereferencing code inline in the attribute reader mentioned above).
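For what it's worth, a rough sketch of what reading the raw addresses out of a decoded chunk of Reference data might look like (an assumption for illustration, not pyfive's actual dereferencing code):

```python
import numpy as np

def chunk_to_addresses(chunk_bytes):
    """Interpret a decoded chunk of a Reference dataset as raw object addresses.

    Each element is a little-endian 64-bit unsigned int giving the file
    address of the referenced object (Group, Dataset, etc.); a real reader
    would wrap each address in the library's Reference object and
    dereference it on demand.
    """
    return np.frombuffer(chunk_bytes, dtype='<u8')

# e.g. a 16-byte chunk holding two addresses:
raw = (1024).to_bytes(8, 'little') + (2048).to_bytes(8, 'little')
print(chunk_to_addresses(raw))   # -> [1024 2048]
```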

Bryan Lawrence added 2 commits March 5, 2024 07:31
@bnlawrence
Author

@bmaranville Ta. I am not sure whether it is worth persevering with that, insofar as the current implementation in this branch simply does what it did before, and it's not obvious to me that you can avoid loading the entire set of references anyway; so there is no benefit to chunk-by-chunk reading in that case?

@bmaranville
Collaborator

I agree. You can read reference datasets chunk by chunk if you want, but since each element is a pointer to another group or dataset there isn't usually much performance benefit to storing them in a chunked array (the data pointed to is probably much bigger than the pointers?).

@bnlawrence
Author

I'm going to refactor this to respect as much of the new API that h5py introduced in v3 as possible. Details are here: NCAS-CMS#4. It'd be good to know how, when that's done, we can get this into the main branch!

@bnlawrence
Author

Meanwhile, I'll close this pull request and generate a new one.

@bnlawrence bnlawrence closed this Apr 26, 2024