
Abstract I/O & storage beyond HDF5 for flexibility, performance, & cloud #59

Closed
aufdenkampe opened this issue Aug 20, 2021 · 3 comments
aufdenkampe commented Aug 20, 2021

This high-level issue pulls together several past, current, and near-future efforts (and more granular issues).

The tight coupling of model input/output (I/O) with the Hierarchical Data Format v5 (HDF5) during the HSP2 runtime limits both performance (see #36) and interoperability with other data storage formats, such as the cloud-optimized Parquet and Zarr formats (see Pangeo's Data in the Cloud article), which integrate tightly with high-performance data structures from the foundational PyData libraries Pandas, Dask DataFrames, and Xarray.

Abstracting I/O using a class-based approach would also unlock capabilities for within-timestep coupling of HSP2 with other models. Specifically, HSP2 could provide upstream, time-varying boundary conditions for higher-resolution models of reaches, reservoirs, and the groundwater-surface water interface.

Our overall plan was first outlined and discussed in LimnoTech#27 (Refactor I/O to rely on DataFrames & provide storage options). In brief, we would refactor to:

  • Run HSP2 by interacting with a dictionary of Pandas or Dask DataFrames in memory
    • presently the model reads/writes to HDF5 during model execution
  • Read/write between storage and the dictionary of DataFrames with a separate set of functions.
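The two refactoring steps above could be sketched as an abstract storage interface. This is an illustrative sketch only: the class names (`TimeseriesStore`, `MemoryStore`, `ParquetStore`) are hypothetical, not HSP2's actual API.

```python
from abc import ABC, abstractmethod
from pathlib import Path

import pandas as pd


class TimeseriesStore(ABC):
    """Abstract storage backend: the model core only ever sees a
    dict of DataFrames, never the underlying file format."""

    @abstractmethod
    def read_all(self) -> dict:
        """Load every timeseries into a {name: DataFrame} dict."""

    @abstractmethod
    def write_all(self, data: dict) -> None:
        """Persist a {name: DataFrame} dict to storage."""


class MemoryStore(TimeseriesStore):
    """Trivial in-memory backend, useful for tests and model coupling."""

    def __init__(self):
        self._data = {}

    def read_all(self):
        return dict(self._data)

    def write_all(self, data):
        self._data.update(data)


class ParquetStore(TimeseriesStore):
    """One DataFrame per Parquet file under a root directory
    (requires a Parquet engine such as pyarrow)."""

    def __init__(self, root):
        self.root = Path(root)

    def read_all(self):
        return {p.stem: pd.read_parquet(p) for p in self.root.glob("*.parquet")}

    def write_all(self, data):
        self.root.mkdir(parents=True, exist_ok=True)
        for name, df in data.items():
            df.to_parquet(self.root / f"{name}.parquet")
```

Under this design the model run itself never touches a file: it reads the dict once at startup, computes in memory, and hands the dict back to whichever backend was configured, so swapping HDF5 for Parquet or Zarr becomes a one-line change.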

cc: @PaulDudaRESPEC, @ptomasula


aufdenkampe commented Sep 3, 2021

This is a great demo of performance profiling and optimization approaches, including I/O.
https://anaconda.org/TomAugspurger/pandas-performance/notebook

Here's a great blog post by the same author that discusses benchmarking in more detail: https://tomaugspurger.github.io/maintaing-performance.html

aufdenkampe referenced this issue Dec 21, 2021
…uted time series instead of going back to h5 file
@aufdenkampe

On Nov. 18, @PaulDudaRESPEC committed 0ed2302, which replaced multiple reads of data from storage with a single read into memory and subsequent data access to those in-memory objects. He shared this comment via email:

I just implemented a very significant performance improvement. Looking back at get_flows, I realized the old design was going back to the h5 file to read timeseries computed by upstream operations – I hadn’t really focused on that before – but of course very inefficient to do that reading from the file. Instead of doing that, I’ve saved those timeseries in-memory for later use. I’ve checked it into the repo if you’d like to take a closer look.

One example project used to run in 3.5 minutes, now runs in less than 1.5 minutes!

@aufdenkampe

The foundation of this work was completed and tested with:

So we'll close this issue.

We'll expand on the I/O Abstraction capabilities, including implementing additional storage formats, with new issues.
