Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

IO Abstraction #68

Merged
merged 20 commits into from
Jan 10, 2022
Merged

IO Abstraction #68

merged 20 commits into from
Jan 10, 2022

Conversation

ptomasula
Copy link
Collaborator

Related Issue

#59

Problem

As alluded to in issue #59, the model code is tightly coupled with HDF5. Specifically, IO operations occur directly within the model code and are handled through the use of a Pandas HDFStore instance. This tight coupling decreases code maintainability and interoperability (i.e. support for different data formats) because any changes to IO behavior involves altering numerous lines of code in many different modules.

Solution

This pull request implements IO abstraction through the use of a IOManager class. This pull request also establishes a set of IO Protocols to define the interface between a class implementing IO for a specific data format and the rest of the application. Additionally this pull request refactors the model code to route all IO operation through an IOManager class instance. This increases code maintainable because nearly all changes to IO behavior can now be addressed through modifications to the class implementing the various IO protocols and/or the IOManager class. Additionally the use of Protocols provide for increased interoperability by defining the expected interface between a data source and the application. Additional classes can now be developed that implement these protocols to provide support for data sources beyond HDF5. Lastly, this pull request includes a simple form of memory caching for timeseries by storing previously loaded timeseries inside of a dictionary in the IOManager class.

Testing

Automated testing was conducted using test10 and the RegressTest class. The HTML report was compared with that of one generated running test10 on the develop branch to ensure the results matched.

Next Steps

This pull request implements IO Abstraction and implementation of IO Protocols for HDF5 through the HDF5 Class. Further work is need to support more data formats through the development of additional classes implementing the IO Protocols. Additionally, a simply memory caching approach was implemented, which stores all previously loaded or written timeseries in a dictionary. A subsequent request for the same timeseries will first try to return a copy from the in memory dictionary before loading from the data source. This approach works well for smaller and short timespan models, but may grow to have a large memory footprint for the execution of larger models spending longer timespans. Further refinement to memory caching could be accomplished through some type of inventory management approach such as First-if First-out (FIFO) and the use of an OrderedDict. I purpose the development teams discuss the most appropriate inventory management system for cached timeseries.

A Protocol approach to IO abstraction will provide more flexibility for this application over the Abstract Base Class (ABC) approach I previously mapped out. Some of the IO abstractions will only support a subset of the operations (e.g. a UCI abstraction will not include read and write method for timeseries). The protocol approach allows up to more easily implement these types of IO abstractions while still defining what the interface typing should be.
Transferring over the IOAbstraction Branch from LimnoTech Repositiory
Having the main method keep track of the various IO objects seemed arduous. This allows up to keep any additional IO related implementation details (e.g. in memory caching of timeseries) away from the main model code.
Fix no return on IOManger.read_uci
Fixes to type hints
Remove memory manage from get_flows because this is now handled by IOManger
Rename IOManger read and write methods to match protocols
Added *Args **Kwargs to IOMangerMethods
Previous implementation of timeseries caching passed reference to cached timeseries. This meant that modifications to the referenced timeseries (such as applying an mfactor) would also modify the cached timeseries. Resolved by adding pd.DataFrame.copy method.
Rename IO protocol implementation with 'ts' instead of 'timeseries' to be consistent with protocol names.
Replaces 'store' reference in majority of model modules.
My previous solution for removing the 'store' (direct IO) reference from SNOW and ATEMP was to add special case code to read_ts that allowed for these timeseries to be read from IO. However, my understanding is in HSPF these blocks are hard coded and not modifiable by the users. Rather than make this a complex IO special case, I just took the same approach and added hard coded versions in utilities. Should handling with these via IO be required, I recommend handing them via the read_uci method, as my original read_ts approach was incorrect.
Implement a class to collect UCI parameters
Need to remove IO elements from RQUAL module.
@ptomasula
Copy link
Collaborator Author

@PaulDudaRESPEC @aufdenkampe @tredder75

With this pull request, I believe the IO abstracted version of the model is ready to merge in. As mentioned in the PR, all IO is now routed through the IOManager instance. To create a IOManager instance, it needs to be initialized with one or more class instance(s) for a data source implement various IO Protocols. In it's simplest conception though, you can just create a single HDF5 class instance and pass that to the IOManager.

  from HSP2IO.hdf import HDF5
  from HSP2IO.io import IOManager
  hdf_path = 'some_string_pathway_to_your_hdf5_file'
  hdf5_instance = HDF5(hdf_path)
  io_manager = IOManager(hdf5_instance)
  main(io_manager)

This is very similar to how executing the model currently works, but still slightly different with a few extra up front steps. This means we will also need to update the example notebooks to reflect this new approach. With the workshop in January quickly approaching, let me know your thoughts on pulling this in soon versus waiting until afterwards. I'm leaning towards pulling this is sooner so that we just teach folks the new way IO works for the model but I can see a case for waiting as well.

This was a development workable to allow me to execute the model while slowly abstracting pieces of the IO. Now that it has fully been abstracted this workaround should be removed to the merge back into the develop branch.
aufdenkampe referenced this pull request Dec 21, 2021
…uted time series instead of going back to h5 file
@aufdenkampe
Copy link
Collaborator

aufdenkampe commented Dec 21, 2021

@ptomasula, thank you for developing this critical refactoring of the code! This work will unlock many additional opportunities to enhance performance and interoperability!

Regarding your question in #68 (comment) on whether or not we want to merge before the January workshops for ERDC, I'm strongly leaning toward moving forward with the merge now.

As @sjordan29 and I started updating the tutorial notebooks in early December (here: https://github.com/LimnoTech/HSPsquared/), I was already looking toward an immediate need to read the UCI and WDM files from our test directories yet write the output files to a temporary tutorials directory. I think this will help with that, or at least provide opportunities for us to demo the I/O approaches of our future in the workshop.

If you say it works with Test10, then I would like to merge into develop today.

@PaulDudaRESPEC
Copy link
Member

@ptomasula and @aufdenkampe , I began reviewing these enhancements today -- so far they all look great! It appears that I need to start using python 3.8 (instead of 3.7) however to use Protocol from typing -- I'll make that upgrade and continue testing....

@aufdenkampe
Copy link
Collaborator

@PaulDudaRESPEC, if you have Conda working, you can create a separate Python 3.8 Conda environment for testing, while still keeping your 3.7 environment.

Try using my installation instructions and the environment_dev.yml environment file.

@PaulDudaRESPEC PaulDudaRESPEC merged commit c41439f into develop Jan 10, 2022
@PaulDudaRESPEC
Copy link
Member

This all looks great, nice work. Note issue #72 was discovered while testing this enhancement, but it clearly is not a problem created by these code changes.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants