hdf5 file sizes are enormous #94
Comments
@rburghol, I concur that the h5 file size deserves more attention. At this stage of development I've personally focused on making sure the numbers produced by HSP2 are equivalent to those produced by HSPF (within certain thresholds, etc.), and to that end, for my test simulations, I've written all possible output timeseries to h5 (using saveall) for HSP2 and all possible output timeseries to hbn for HSPF. When doing that kind of test, the file sizes from HSP2 and HSPF are within the same order of magnitude. FWIW, I also note that importing the UCI by itself, with no timeseries data, results in a relatively large h5 file. All that's to say, focusing on the h5 file size issue would be a worthy pursuit.
Thanks for the follow-up @PaulDudaRESPEC -- we are taking a deep dive into this as we learn the system, and there appear to be some promising avenues; I will update here if we have anything of note. I concur with your assessment about just importing the UCI: the majority of the size expansion happens there in my tests as well. One thing we are going to look into is whether hdf5 has options to enable or disable indexing when creating new datasets. In my experience with other databases, having numerous indexes can sometimes multiply table storage greatly, and perhaps there are defaults that do this in the case of hsp2 storage. Or maybe it's something else :)
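For anyone who wants to poke at that indexing question: below is a minimal sketch, assuming HSP2 writes its tables through pandas/PyTables (the file and key names here are hypothetical), showing how PyTables index creation can be turned off when a table is written. Comparing the two nodes' on-disk sizes would show how much the index actually contributes.

```python
import numpy as np
import pandas as pd

# Synthetic hourly series standing in for one HSP2 output timeseries.
df = pd.DataFrame(
    {"value": np.random.rand(100_000)},
    index=pd.date_range("2001-01-01", periods=100_000, freq="h"),
)

with pd.HDFStore("indexing_test.h5", mode="w") as store:  # hypothetical file
    # index=True (the default) builds a PyTables index on the queryable columns;
    # index=False skips index creation (it can be added later with
    # store.create_table_index if needed).
    store.append("/indexed", df, index=True)
    store.append("/unindexed", df, index=False)
```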
@rburghol & @PaulDudaRESPEC, the enormous size of HDF5 files is due to the fact that @rheaphy turned off compression back in early 2020, in large part because the HDFView desktop software can't view compressed data values, but also because in early 2020 there were a number of real issues with HDF5 v1.10.x and with our PyTables library's compatibility with newer HDF5 versions. For details on that history, see: The good news is that since January 2022 we now have solutions to all those issues.
Unfortunately, our pause in funding means that we never had an opportunity to fully implement the HDF5 improvements that became available in January 2022. If you are interested in exploring the benefits of the newer PyTables, I encourage you to install from our development environment,
You should be able to leverage compression with the code in its current state. You just need to set the
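Since the comment above is cut off, here is a hedged sketch of what enabling PyTables compression generally looks like from pandas; the file and key names are hypothetical, and the option HSP2 itself exposes isn't shown in this thread, so treat this as illustrative rather than the project's actual setting.

```python
import pandas as pd

# Hypothetical key and file names; complevel (0-9) and complib pick the
# PyTables compression filter ('zlib' and 'blosc' ship with PyTables).
df = pd.read_hdf("uncompressed.h5", key="/RESULTS/some_timeseries")

df.to_hdf(
    "compressed.h5",
    key="/RESULTS/some_timeseries",
    format="table",
    complevel=9,
    complib="blosc",
)
```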
As an interesting aside, HDF5 1.12.2, released in April 2022, added substantial improvements to parallel compression performance and memory usage. I'm hoping we can leverage that one way or another.
Thanks to all, this is a really great bit of info/progress. I will keep you posted as I do testing on the file size/compression issue.
In addition to the compression issues discussed on this thread, we have now implemented an enhancement to control the output time step -- using the BINARY-INFO table to specify aggregation of the output time series to daily, monthly, or annual. The first cut of this enhancement is available in the develop branch, here:
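Independent of the BINARY-INFO syntax itself, the size benefit of that enhancement comes from storing far fewer rows; a rough pandas sketch of the kind of daily/monthly aggregation it describes (the series here is synthetic):

```python
import numpy as np
import pandas as pd

# Synthetic hourly output series standing in for one simulated result column.
hourly = pd.Series(
    np.random.rand(8760),
    index=pd.date_range("2001-01-01", periods=8760, freq="h"),
)

daily = hourly.resample("D").mean()     # 8760 rows -> 365
monthly = hourly.resample("MS").mean()  # 8760 rows -> 12
```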
@rburghol, do we need to revisit this issue? Did you have any success with compression? The new conda ... Also, the ...
Hey @aufdenkampe, I'm sorry to say that I have made zero progress on this. I'll try to get around to it, but if anybody else wants to take a stab at it I would be delighted. Currently I'm just suffering along with aggressive file maintenance.
Testing for system space requirements when running a roughly 300-1,000 reach simulation ported from an older version of HSPF. When using `hsp2 import`, all UCI files and supporting WDM data are imported into an h5 database. Before even running the model, the resulting h5 file is nearly 200x larger than the component files. See Tests 1-4 below for file details and commands. This is not insurmountable, but at this time I think the only way this simulation would be doable on our current system is to be very aggressive with disk management, essentially deleting the h5 file after running, post-processing, and extracting output data of interest into text files. Interested in your thoughts on how one might optimize this (if there is a way), @aufdenkampe.
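One generic way to shrink an already-imported h5 file (separate from whatever HSP2 itself exposes) is to rewrite it with compression enabled. Below is a hedged sketch using pandas/PyTables with hypothetical file names; it assumes every node is a pandas-readable object, and anything else would need PyTables' own tools (e.g. the ptrepack command-line utility) instead.

```python
import pandas as pd

# Rewrite every pandas-readable node of an imported h5 into a compressed copy.
# File names are hypothetical; non-pandas nodes (if any) are not handled here.
with pd.HDFStore("model.h5", mode="r") as src, \
     pd.HDFStore("model_compressed.h5", mode="w",
                 complevel=9, complib="blosc") as dst:
    for key in src.keys():
        # format="table" is queryable and compresses well; some nodes may
        # need the default fixed format instead.
        dst.put(key, src[key], format="table")
```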
Questions:
`test10.h5` from the respec repo yields 27M from 200k of source files. See Test 1 below for details.

Test 1: Importing test10.uci with hdf5 version 1.10.4
Test 2: Importing UCI from Chesapeake Bay Model 5.3.2 single RCHRES only (no PERLND). Download files to test from: https://github.com/HARPgroup/HSPsquared/tree/master/tests/test_cbp_river
Test 3: Importing UCI from Chesapeake Bay Model 5.3.2 single PERLND only (no RCHRES). Download files to test from: https://github.com/HARPgroup/HSPsquared/tree/master/tests/test_cbp_land
Test 4: Importing test10.uci with hdf5 version 1.13.1 - same file size as with version 1.10.4.
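When comparing imports against different HDF5 library versions (as in Tests 1 and 4), it can help to confirm which HDF5 build the Python stack is actually linked against; a small sketch using PyTables:

```python
import tables  # PyTables, the HDF5 backend behind pandas .h5 stores

# Prints PyTables, HDF5, NumPy, and available compressor library versions.
tables.print_versions()
print("HDF5 library in use:", tables.hdf5_version)
```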