Improve performance of HDF5 read/write #36
In Bob's June 12 "HSP2 Status" email, he writes in response to PaulD's issues with viewing results tables in HDFView:
From Bob's July 13 "HSP2 Status" email:
@rheaphy, thanks for all your deep sleuthing and hard work figuring out these performance issues. I just noticed that HDF5 1.10.7 was released on Sep. 15, boasting additional performance improvements and full backward compatibility back to v1.10.3. See: The last release of h5py (v2.10) was Sep. 6, 2019, but it looks like they're about to release v3.0 any day now, based on h5py/h5py#1673.
h5py 3.0 was just released, and it has a number of very nice performance features. See https://docs.h5py.org/en/latest/whatsnew/3.0.html. Performance improvements include:
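One knob worth knowing about in this context is the HDF5 raw-data chunk cache, which h5py exposes per file via the `rdcc_*` keyword arguments to `h5py.File`. A minimal sketch of tuning it for column-at-a-time access (the file name, dataset shape, and cache sizes are illustrative assumptions, not HSP2 code):

```python
import numpy as np
import h5py

# Illustrative only: enlarge the chunk cache (default 1 MiB) so repeated
# column reads/writes of a chunked dataset stay in cache.
with h5py.File("results.h5", "w",
               rdcc_nbytes=16 * 1024**2,    # 16 MiB raw-data chunk cache
               rdcc_nslots=1_000_003) as f:  # prime number of hash slots
    dset = f.create_dataset("timeseries", shape=(8760, 32),
                            chunks=(8760, 1), dtype="f8")
    dset[:, 0] = np.arange(8760, dtype="f8")  # one full chunk per write

with h5py.File("results.h5", "r") as f:
    col = f["timeseries"][:, 0]  # reads exactly one chunk
```

Choosing chunks that match the access pattern (here, one chunk per column) tends to matter as much as the cache size itself.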
So HDF5 1.10.7 could get us back closer to v1.8 performance, but v1.12 may have other advantages that make it worthwhile.
NOTE: The new h5py 3.0 might require some recoding, given this feature change:
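The change in question is h5py 3.0's string handling: data read from string datasets now comes back as `bytes` objects, and code expecting `str` must decode explicitly via the new `.asstr()` accessor. A minimal sketch of the before/after (file and dataset names are made up, not HSP2's):

```python
import h5py

# Write a small variable-length string dataset (names are illustrative).
with h5py.File("labels.h5", "w") as f:
    f["labels"] = ["PERLND", "IMPLND", "RCHRES"]

with h5py.File("labels.h5", "r") as f:
    raw = f["labels"][:]            # h5py >= 3.0: array of bytes objects
    text = f["labels"].asstr()[:]   # decode to str explicitly
```

Any HSP2 code that compares dataset contents against Python strings would need the `.asstr()` form (or an explicit `.decode()`) after upgrading.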
@rheaphy, I've recently learned that there are now several newer high-performance data storage formats with equal or better performance than HDF5 that are much better suited to cloud applications. See HDF in the Cloud: challenges and solutions for scientific data.

The most established of these is Parquet, which beats HDF5 in most metrics in this blog: The Best Format to Save Pandas Data. Parquet is integrated nicely with Pandas. However, I don't think it handles multi-dimensional data very well, although I'm not sure we strictly need that. It's even possible that breaking up the input/output into multiple Parquet files might have an advantage, including for storage on GitHub.

The Pangeo geoscience big data initiative has moved toward converting netCDF files to Zarr format. See Pangeo's Data in the Cloud page. Pangeo's Xarray library is designed as a multi-dimensional equivalent to Pandas, and it seamlessly reads/writes netCDF & Zarr formats, in addition to pulling data directly from NOAA THREDDS data servers, such as those used for distributing climate data and National Water Model outputs.

Last, we are starting to explore the use of Dask to parallelize our data engine systems. Dask is another core library of Pangeo, and it works well with Numba. I don't think this is necessarily the next step for HSP2, but I think it is where we might want to head, and it's worth mentioning sooner rather than later so that we can start moving in the right direction.
This uses anaconda's default conda channel, which has better quality control for compatibility. Also adds `conda` and `conda-build` to make it easier to update. Does not import `h5py`, which @rheaphy intends to use over `pytables`, as described in respec#36 (comment)
Merge Unit testing work from RESPEC
With the following, we believe we've addressed most of the problems described in this issue.
@rheaphy emailed his 2020-05-24 "HSP2 status update":
Some of these recent commits are cca2b0c, d154e55, and e92c035.