IO Abstraction #68

ptomasula · 2021-12-20T20:20:23Z

Related Issue

Problem

As alluded to in issue #59, the model code is tightly coupled with HDF5. Specifically, IO operations occur directly within the model code and are handled through the use of a Pandas HDFStore instance. This tight coupling decreases code maintainability and interoperability (i.e. support for different data formats) because any changes to IO behavior involves altering numerous lines of code in many different modules.

Solution

This pull request implements IO abstraction through the use of a IOManager class. This pull request also establishes a set of IO Protocols to define the interface between a class implementing IO for a specific data format and the rest of the application. Additionally this pull request refactors the model code to route all IO operation through an IOManager class instance. This increases code maintainable because nearly all changes to IO behavior can now be addressed through modifications to the class implementing the various IO protocols and/or the IOManager class. Additionally the use of Protocols provide for increased interoperability by defining the expected interface between a data source and the application. Additional classes can now be developed that implement these protocols to provide support for data sources beyond HDF5. Lastly, this pull request includes a simple form of memory caching for timeseries by storing previously loaded timeseries inside of a dictionary in the IOManager class.

Testing

Automated testing was conducted using test10 and the RegressTest class. The HTML report was compared with that of one generated running test10 on the develop branch to ensure the results matched.

Next Steps

This pull request implements IO Abstraction and implementation of IO Protocols for HDF5 through the HDF5 Class. Further work is need to support more data formats through the development of additional classes implementing the IO Protocols. Additionally, a simply memory caching approach was implemented, which stores all previously loaded or written timeseries in a dictionary. A subsequent request for the same timeseries will first try to return a copy from the in memory dictionary before loading from the data source. This approach works well for smaller and short timespan models, but may grow to have a large memory footprint for the execution of larger models spending longer timespans. Further refinement to memory caching could be accomplished through some type of inventory management approach such as First-if First-out (FIFO) and the use of an OrderedDict. I purpose the development teams discuss the most appropriate inventory management system for cached timeseries.

A Protocol approach to IO abstraction will provide more flexibility for this application over the Abstract Base Class (ABC) approach I previously mapped out. Some of the IO abstractions will only support a subset of the operations (e.g. a UCI abstraction will not include read and write method for timeseries). The protocol approach allows up to more easily implement these types of IO abstractions while still defining what the interface typing should be.

Transferring over the IOAbstraction Branch from LimnoTech Repositiory

Having the main method keep track of the various IO objects seemed arduous. This allows up to keep any additional IO related implementation details (e.g. in memory caching of timeseries) away from the main model code.

Fix no return on IOManger.read_uci Fixes to type hints Remove memory manage from get_flows because this is now handled by IOManger Rename IOManger read and write methods to match protocols Added *Args **Kwargs to IOMangerMethods

Previous implementation of timeseries caching passed reference to cached timeseries. This meant that modifications to the referenced timeseries (such as applying an mfactor) would also modify the cached timeseries. Resolved by adding pd.DataFrame.copy method.

Rename IO protocol implementation with 'ts' instead of 'timeseries' to be consistent with protocol names.

Replaces 'store' reference in majority of model modules.

My previous solution for removing the 'store' (direct IO) reference from SNOW and ATEMP was to add special case code to read_ts that allowed for these timeseries to be read from IO. However, my understanding is in HSPF these blocks are hard coded and not modifiable by the users. Rather than make this a complex IO special case, I just took the same approach and added hard coded versions in utilities. Should handling with these via IO be required, I recommend handing them via the read_uci method, as my original read_ts approach was incorrect.

Implement a class to collect UCI parameters

Need to remove IO elements from RQUAL module.

ptomasula · 2021-12-20T20:38:05Z

@PaulDudaRESPEC @aufdenkampe @tredder75

With this pull request, I believe the IO abstracted version of the model is ready to merge in. As mentioned in the PR, all IO is now routed through the IOManager instance. To create a IOManager instance, it needs to be initialized with one or more class instance(s) for a data source implement various IO Protocols. In it's simplest conception though, you can just create a single HDF5 class instance and pass that to the IOManager.

  from HSP2IO.hdf import HDF5
  from HSP2IO.io import IOManager
  hdf_path = 'some_string_pathway_to_your_hdf5_file'
  hdf5_instance = HDF5(hdf_path)
  io_manager = IOManager(hdf5_instance)
  main(io_manager)

This is very similar to how executing the model currently works, but still slightly different with a few extra up front steps. This means we will also need to update the example notebooks to reflect this new approach. With the workshop in January quickly approaching, let me know your thoughts on pulling this in soon versus waiting until afterwards. I'm leaning towards pulling this is sooner so that we just teach folks the new way IO works for the model but I can see a case for waiting as well.

This was a development workable to allow me to execute the model while slowly abstracting pieces of the IO. Now that it has fully been abstracted this workaround should be removed to the merge back into the develop branch.

…uted time series instead of going back to h5 file

aufdenkampe · 2021-12-21T14:16:20Z

@ptomasula, thank you for developing this critical refactoring of the code! This work will unlock many additional opportunities to enhance performance and interoperability!

Regarding your question in #68 (comment) on whether or not we want to merge before the January workshops for ERDC, I'm strongly leaning toward moving forward with the merge now.

As @sjordan29 and I started updating the tutorial notebooks in early December (here: https://github.com/LimnoTech/HSPsquared/), I was already looking toward an immediate need to read the UCI and WDM files from our test directories yet write the output files to a temporary tutorials directory. I think this will help with that, or at least provide opportunities for us to demo the I/O approaches of our future in the workshop.

If you say it works with Test10, then I would like to merge into develop today.

PaulDudaRESPEC · 2021-12-23T21:33:04Z

@ptomasula and @aufdenkampe , I began reviewing these enhancements today -- so far they all look great! It appears that I need to start using python 3.8 (instead of 3.7) however to use Protocol from typing -- I'll make that upgrade and continue testing....

aufdenkampe · 2021-12-23T21:50:33Z

@PaulDudaRESPEC, if you have Conda working, you can create a separate Python 3.8 Conda environment for testing, while still keeping your 3.7 environment.

Try using my installation instructions and the environment_dev.yml environment file.

PaulDudaRESPEC · 2022-01-10T17:55:37Z

This all looks great, nice work. Note issue #72 was discovered while testing this enhancement, but it clearly is not a problem created by these code changes.

ptomasula added 19 commits October 21, 2021 09:46

Draft IOBase Class

2178989

Merge pull request #66 from LimnoTech/IO_Abstraction

4e16fb5

Transferring over the IOAbstraction Branch from LimnoTech Repositiory

First cut IO abstraction

33881da

Rename protocols to SupportsXXX

f2987ac

Implement IO wrapper class

9528021

Having the main method keep track of the various IO objects seemed arduous. This allows up to keep any additional IO related implementation details (e.g. in memory caching of timeseries) away from the main model code.

Fix minor bugs

bb9d71d

Fix no return on IOManger.read_uci Fixes to type hints Remove memory manage from get_flows because this is now handled by IOManger Rename IOManger read and write methods to match protocols Added *Args **Kwargs to IOMangerMethods

Bug Fix timeseries caching

1743766

Previous implementation of timeseries caching passed reference to cached timeseries. This meant that modifications to the referenced timeseries (such as applying an mfactor) would also modify the cached timeseries. Resolved by adding pd.DataFrame.copy method.

Move MF and AF manipulation within Try Except block

f71f291

Replace protocols with 'ts'

091bfaf

Rename IO protocol implementation with 'ts' instead of 'timeseries' to be consistent with protocol names.

First cut at module level IO abstraction

6bf978b

Replaces 'store' reference in majority of model modules.

UCI class

31903b3

Implement a class to collect UCI parameters

Read FTABLES as part of UCI protocol

5feb80d

IO MONTHDATA from UCI protocol

d7ff358

Need to remove IO elements from RQUAL module.

Add IO protocols for logging

3ec1832

Remove unused package references

156a442

Clean-up docstring for IOManager

8b212e7

Fix broken __del__ method for IOManager

43b98d3

ptomasula added the enhancement label Dec 20, 2021

ptomasula requested review from aufdenkampe, PaulDudaRESPEC and tredder75 December 20, 2021 20:20

Remove development work around from main

a4c6706

This was a development workable to allow me to execute the model while slowly abstracting pieces of the IO. Now that it has fully been abstracted this workaround should be removed to the merge back into the develop branch.

aufdenkampe referenced this pull request Dec 21, 2021

main.py -- performance improvement for get_flows, use catalog of comp…

0ed2302

…uted time series instead of going back to h5 file

aufdenkampe added this to the v0.10.0: WQ w/ Improved Performance, Abstracted I/O, Updated Docs milestone Dec 22, 2021

PaulDudaRESPEC merged commit c41439f into develop Jan 10, 2022

This was referenced Jan 10, 2022

Merge setup_pip in develop after testing #73

Merged

Abstract I/O & storage beyond HDF5 for flexibility, performance, & cloud #59

Closed

Improve performance of HDF5 read/write #36

Closed

PaulDudaRESPEC mentioned this pull request Feb 18, 2022

Error running intro script #87

Closed

aufdenkampe mentioned this pull request Dec 13, 2024

I/O Abstraction #182

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

IO Abstraction #68

IO Abstraction #68

ptomasula commented Dec 20, 2021

ptomasula commented Dec 20, 2021

aufdenkampe commented Dec 21, 2021 •

edited

Loading

PaulDudaRESPEC commented Dec 23, 2021

aufdenkampe commented Dec 23, 2021

PaulDudaRESPEC commented Jan 10, 2022

IO Abstraction #68

IO Abstraction #68

Conversation

ptomasula commented Dec 20, 2021

Related Issue

Problem

Solution

Testing

Next Steps

ptomasula commented Dec 20, 2021

aufdenkampe commented Dec 21, 2021 • edited Loading

PaulDudaRESPEC commented Dec 23, 2021

aufdenkampe commented Dec 23, 2021

PaulDudaRESPEC commented Jan 10, 2022

aufdenkampe commented Dec 21, 2021 •

edited

Loading