Speed improvements to loading HDF5 trees #25

Open
rayosborn opened this issue Mar 7, 2017 · 9 comments
@rayosborn

In the nexusformat API, we load the entire HDF5 file tree by recursively walking through the groups in h5py, without reading in data values except for scalars and small arrays. On a local file, we can load files containing hundreds of objects without a significant time delay. For example, a file with 80 objects (groups, datasets, and attributes) takes 0.05s to load on my laptop. However, on h5pyd, the same load takes over 20s.

A call to load all the items in an HDF5 group requires two GET requests, and sometimes three, for each object, so there could be an improvement if all the metadata (shape, dtype, etc.) for each object were returned in a single call, and an even more significant one if all the items in a group could be returned with one GET request. Loading one group of 10 objects took 29 requests in my tests.
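The request arithmetic above can be sketched as a small counting model. This is only an illustration: the function name, the strategy labels, and the `third_request_objects` parameter are hypothetical and not part of h5pyd or nexusformat, and the split of 29 into 2 × 10 + 9 is just one way to reach that total, since the actual breakdown was not recorded.

```python
def count_requests(n_objects, n_groups=1, strategy="per_object",
                   third_request_objects=0):
    """Estimate the GET requests needed to load the metadata of a tree.

    strategy:
      "per_object"  - two requests per object, plus one more for each
                      object that needs a third request
      "single_call" - all metadata of an object returned in one request
      "per_group"   - one request returns every item in a group
    """
    if strategy == "per_object":
        return 2 * n_objects + third_request_objects
    if strategy == "single_call":
        return n_objects
    if strategy == "per_group":
        return n_groups
    raise ValueError(f"unknown strategy: {strategy!r}")


# One group of 10 objects, assuming 9 of them need a third request
for name in ("per_object", "single_call", "per_group"):
    extra = 9 if name == "per_object" else 0
    print(name, count_requests(10, strategy=name, third_request_objects=extra))
```

Under these assumptions the same group drops from 29 requests to 10 with one call per object, and to a single request if the whole group is returned at once.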

Binary data reads are fast, though.

@jreadey
Member

jreadey commented Mar 8, 2017

I've added some caching logic to the group class. Try out this latest checkin: 1999417.

This is not a single-operation recursive load, but I saw a speedup of about 4x when walking the tree for the sample Nexus file, using the hsls.py script in the app directory.
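The commit itself is not reproduced here. As a rough illustration of the general idea of caching group lookups so that repeated access avoids a round trip, here is a minimal sketch. `CachedGroup` and `fake_fetch` are hypothetical names invented for this example and are not h5pyd's implementation.

```python
class CachedGroup:
    """Wrap a fetch function so each name is requested at most once."""

    def __init__(self, fetch):
        self._fetch = fetch  # callable: name -> metadata (one round trip)
        self._cache = {}

    def __getitem__(self, name):
        # Only go to the "server" the first time a name is requested
        if name not in self._cache:
            self._cache[name] = self._fetch(name)
        return self._cache[name]


calls = []


def fake_fetch(name):
    """Stand-in for a server request; records each call."""
    calls.append(name)
    return {"name": name}


group = CachedGroup(fake_fetch)
group["entry"]
group["entry"]  # served from the cache, no second request
group["data"]
print(len(calls))  # 2
```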

@jreadey
Member

jreadey commented Mar 9, 2017

@rayosborn - did you get a chance to try this out?

@rayosborn
Author

I have tested it, but I wasn't sure of the previous speeds because I forgot to do a proper timing before upgrading. I need to revert to the old version. However, I don't think I saw a factor of four. It might have been a factor of two.

@jreadey
Member

jreadey commented Mar 9, 2017

There will be some variability based on the latency between client and server. My testing was with a server running on the same LAN. Also, the test driver is different.
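To see why latency matters so much here, a back-of-the-envelope sketch: when requests are issued one after another, the load time is at least the number of requests times the round-trip time. The request count and both latencies below are illustrative assumptions, not measurements from this thread.

```python
def estimated_load_time(n_requests, round_trip_s):
    """Lower bound on load time for sequential requests, in seconds."""
    return n_requests * round_trip_s


# Assumed values for illustration only
N_REQUESTS = 200
for label, round_trip_s in [("same LAN", 0.001), ("remote server", 0.1)]:
    seconds = estimated_load_time(N_REQUESTS, round_trip_s)
    print(f"{label}: {seconds:.2f} s")
```

The same tree walk can therefore differ by orders of magnitude between setups, which also means a speedup measured on a LAN may not carry over to a remote server.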

Did the NeXpy GUI need a lot of mods to work with h5serv? I could set it up in my environment.

@rayosborn
Author

I haven't made any changes to the NeXpy GUI yet. In the latest development version on my own clone of the nexusformat API, the nxremote branch has an added file, which subclasses the NXFile class for remote access. I was thinking of pushing this version to PyPI, since it is a test feature that only users with h5pyd would even be able to access. I'll let you know when I've done that.

@jreadey
Member

jreadey commented Mar 9, 2017

If you push the branch to github, I can just grab from there.

How would I use it to list the contents of a Nexus file?

@rayosborn
Author

The nxremote branch has been published on my GitHub. You can load a file by typing:

>>> a=nxloadremote(filepath, domain='exfac.org', server='some.server:5000')
>>> print(a.tree)

The file path is the path relative to the data directory. The module converts that to a domain name. The top domain is currently 'exfac.org' to match the test repository.
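The conversion code is not shown in this thread. As a sketch of what such a mapping could look like, the function below assumes a DNS-style convention in which the extension is dropped and the path components are reversed in front of the top domain. `path_to_domain` is a hypothetical helper, and the actual nxremote code may differ.

```python
from pathlib import PurePosixPath


def path_to_domain(filepath, top_domain="exfac.org"):
    """Map a path relative to the data directory to a domain name.

    Assumed convention: drop the file extension, reverse the path
    components, and append the top domain.
    """
    path = PurePosixPath(filepath).with_suffix("")
    # Ignore a leading "/" so absolute-looking paths behave the same
    parts = [part for part in path.parts if part != "/"]
    return ".".join(parts[::-1] + [top_domain])


print(path_to_domain("chopper/run1.nxs"))  # run1.chopper.exfac.org
```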

@jreadey
Member

jreadey commented Dec 1, 2022

@rayosborn - some updates on this old issue...
By default, h5pyd.File(filepath) returns all the metadata for the domain in the request response. h5pyd caches this, so attribute reads and link accesses don't need to talk to the server. The server fetches at most 500 objects this way, so that the GET request doesn't take an inordinate amount of time for domains with lots of attributes and/or links.

To compare the performance not using the prefetch, you can use: h5pyd.File(filepath, use_cache=False). This will return just information on the root group.
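A comparison along these lines could be timed with a small helper like the one below. `time_call` is a hypothetical name; the h5pyd usage is shown only in comments because it needs h5pyd and a reachable server, and `filepath` and `walk_tree` there are placeholders for your own domain path and tree-walking code.

```python
import time


def time_call(func, repeats=3):
    """Return the best wall-clock time in seconds over `repeats` calls."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        func()
        best = min(best, time.perf_counter() - start)
    return best


# Stand-in workload so the helper can be tried without a server
print(time_call(lambda: sum(range(100_000))) >= 0.0)

# With h5pyd (not run here), open the file and walk the tree in each case:
#   import h5pyd
#   prefetch = time_call(lambda: walk_tree(h5pyd.File(filepath)))
#   no_prefetch = time_call(
#       lambda: walk_tree(h5pyd.File(filepath, use_cache=False)))
```

Timing the open alone would understate the difference, since without the prefetch the cost shows up later as per-object requests during the walk.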

@rayosborn
Author

Thanks, @jreadey. I can't look into this for a couple of weeks, but I plan to soon.
