
Abstract I/O & storage beyond HDF5 for flexibility, performance, & cloud #59

Closed
aufdenkampe opened this issue Aug 20, 2021 · 3 comments
aufdenkampe commented Aug 20, 2021

This high-level issue pulls together several past, current, and near-future efforts (and more granular issues).

The tight coupling of model input/output (I/O) with the Hierarchical Data Format v5 (HDF5) during the HSP2 runtime limits both performance (see #36) and interoperability with other data storage formats, such as the cloud-optimized Parquet and Zarr formats (see Pangeo's Data in the Cloud article), which integrate tightly with high-performance data structures from the foundational PyData libraries Pandas, Dask DataFrames, and Xarray.

Abstracting I/O using a class-based approach would also unlock capabilities for within-timestep coupling of HSP2 with other models. Specifically, HSP2 could provide upstream, time-varying boundary conditions for higher-resolution models of reaches, reservoirs, and the groundwater-surface water interface.

Our overall plan was first outlined and discussed in LimnoTech#27 (Refactor I/O to rely on DataFrames & provide storage options). In brief, we would refactor to:

  • Run HSP2 by interacting with a dictionary of Pandas or Dask DataFrames in memory
    • presently the model reads/writes to HDF5 during model execution
  • Read/write between storage and the dictionary of DataFrames with a separate set of functions.
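The two refactoring steps above could be sketched as an abstract storage interface. This is an illustrative sketch only: the class names (`TimeseriesStore`, `MemoryStore`, `ParquetStore`) are hypothetical, not HSP2's actual API.

```python
from abc import ABC, abstractmethod
from pathlib import Path

import pandas as pd


class TimeseriesStore(ABC):
    """Abstract storage backend: the model core only ever sees a
    dict of DataFrames, never the underlying file format."""

    @abstractmethod
    def read_all(self) -> dict:
        """Load every timeseries into a {name: DataFrame} dict."""

    @abstractmethod
    def write_all(self, data: dict) -> None:
        """Persist a {name: DataFrame} dict to storage."""


class MemoryStore(TimeseriesStore):
    """Trivial in-memory backend, useful for tests and model coupling."""

    def __init__(self):
        self._data = {}

    def read_all(self):
        return dict(self._data)

    def write_all(self, data):
        self._data.update(data)


class ParquetStore(TimeseriesStore):
    """One DataFrame per Parquet file under a root directory
    (requires a Parquet engine such as pyarrow)."""

    def __init__(self, root):
        self.root = Path(root)

    def read_all(self):
        return {p.stem: pd.read_parquet(p) for p in self.root.glob("*.parquet")}

    def write_all(self, data):
        self.root.mkdir(parents=True, exist_ok=True)
        for name, df in data.items():
            df.to_parquet(self.root / f"{name}.parquet")
```

Under this design the model run itself never touches a file: it reads the dict once at startup, computes in memory, and hands the dict back to whichever backend was configured, so swapping HDF5 for Parquet or Zarr becomes a one-line change.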

cc: @PaulDudaRESPEC, @ptomasula


aufdenkampe commented Sep 3, 2021

This is a great demo of performance profiling and optimization approaches, including I/O.
https://anaconda.org/TomAugspurger/pandas-performance/notebook

Here's a great blog post by the same author that discusses benchmarking in more detail: https://tomaugspurger.github.io/maintaing-performance.html

aufdenkampe referenced this issue Dec 21, 2021
…uted time series instead of going back to h5 file
@aufdenkampe

On Nov. 18, @PaulDudaRESPEC committed 0ed2302, which replaced multiple reads of data from storage with a single read into memory and subsequent data access to those in-memory objects. He shared this comment via email:

I just implemented a very significant performance improvement. Looking back at get_flows, I realized the old design was going back to the h5 file to read timeseries computed by upstream operations – I hadn’t really focused on that before – but of course very inefficient to do that reading from the file. Instead of doing that, I’ve saved those timeseries in-memory for later use. I’ve checked it into the repo if you’d like to take a closer look.

One example project used to run in 3.5 minutes, now runs in less than 1.5 minutes!

@aufdenkampe

The foundation of this work was completed and tested with:

So we'll close this issue.

We'll expand on the I/O Abstraction capabilities, including implementing additional storage formats, with new issues.
