-
Notifications
You must be signed in to change notification settings - Fork 17
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
IO Abstraction #68
IO Abstraction #68
Conversation
A Protocol approach to IO abstraction will provide more flexibility for this application over the Abstract Base Class (ABC) approach I previously mapped out. Some of the IO abstractions will only support a subset of the operations (e.g. a UCI abstraction will not include read and write method for timeseries). The protocol approach allows up to more easily implement these types of IO abstractions while still defining what the interface typing should be.
Transferring over the IOAbstraction Branch from LimnoTech Repositiory
Having the main method keep track of the various IO objects seemed arduous. This allows up to keep any additional IO related implementation details (e.g. in memory caching of timeseries) away from the main model code.
Fix no return on IOManger.read_uci Fixes to type hints Remove memory manage from get_flows because this is now handled by IOManger Rename IOManger read and write methods to match protocols Added *Args **Kwargs to IOMangerMethods
Previous implementation of timeseries caching passed reference to cached timeseries. This meant that modifications to the referenced timeseries (such as applying an mfactor) would also modify the cached timeseries. Resolved by adding pd.DataFrame.copy method.
Rename IO protocol implementation with 'ts' instead of 'timeseries' to be consistent with protocol names.
Replaces 'store' reference in majority of model modules.
My previous solution for removing the 'store' (direct IO) reference from SNOW and ATEMP was to add special case code to read_ts that allowed for these timeseries to be read from IO. However, my understanding is in HSPF these blocks are hard coded and not modifiable by the users. Rather than make this a complex IO special case, I just took the same approach and added hard coded versions in utilities. Should handling with these via IO be required, I recommend handing them via the read_uci method, as my original read_ts approach was incorrect.
Need to remove IO elements from RQUAL module.
@PaulDudaRESPEC @aufdenkampe @tredder75 With this pull request, I believe the IO abstracted version of the model is ready to merge in. As mentioned in the PR, all IO is now routed through the IOManager instance. To create a IOManager instance, it needs to be initialized with one or more class instance(s) for a data source implement various IO Protocols. In it's simplest conception though, you can just create a single HDF5 class instance and pass that to the IOManager.
This is very similar to how executing the model currently works, but still slightly different with a few extra up front steps. This means we will also need to update the example notebooks to reflect this new approach. With the workshop in January quickly approaching, let me know your thoughts on pulling this in soon versus waiting until afterwards. I'm leaning towards pulling this is sooner so that we just teach folks the new way IO works for the model but I can see a case for waiting as well. |
This was a development workable to allow me to execute the model while slowly abstracting pieces of the IO. Now that it has fully been abstracted this workaround should be removed to the merge back into the develop branch.
…uted time series instead of going back to h5 file
@ptomasula, thank you for developing this critical refactoring of the code! This work will unlock many additional opportunities to enhance performance and interoperability! Regarding your question in #68 (comment) on whether or not we want to merge before the January workshops for ERDC, I'm strongly leaning toward moving forward with the merge now. As @sjordan29 and I started updating the tutorial notebooks in early December (here: https://github.com/LimnoTech/HSPsquared/), I was already looking toward an immediate need to read the UCI and WDM files from our test directories yet write the output files to a temporary tutorials directory. I think this will help with that, or at least provide opportunities for us to demo the I/O approaches of our future in the workshop. If you say it works with Test10, then I would like to merge into develop today. |
@ptomasula and @aufdenkampe , I began reviewing these enhancements today -- so far they all look great! It appears that I need to start using python 3.8 (instead of 3.7) however to use Protocol from typing -- I'll make that upgrade and continue testing.... |
@PaulDudaRESPEC, if you have Conda working, you can create a separate Python 3.8 Conda environment for testing, while still keeping your 3.7 environment. Try using my installation instructions and the environment_dev.yml environment file. |
This all looks great, nice work. Note issue #72 was discovered while testing this enhancement, but it clearly is not a problem created by these code changes. |
Related Issue
#59
Problem
As alluded to in issue #59, the model code is tightly coupled with HDF5. Specifically, IO operations occur directly within the model code and are handled through the use of a Pandas HDFStore instance. This tight coupling decreases code maintainability and interoperability (i.e. support for different data formats) because any changes to IO behavior involves altering numerous lines of code in many different modules.
Solution
This pull request implements IO abstraction through the use of a IOManager class. This pull request also establishes a set of IO Protocols to define the interface between a class implementing IO for a specific data format and the rest of the application. Additionally this pull request refactors the model code to route all IO operation through an IOManager class instance. This increases code maintainable because nearly all changes to IO behavior can now be addressed through modifications to the class implementing the various IO protocols and/or the IOManager class. Additionally the use of Protocols provide for increased interoperability by defining the expected interface between a data source and the application. Additional classes can now be developed that implement these protocols to provide support for data sources beyond HDF5. Lastly, this pull request includes a simple form of memory caching for timeseries by storing previously loaded timeseries inside of a dictionary in the IOManager class.
Testing
Automated testing was conducted using test10 and the RegressTest class. The HTML report was compared with that of one generated running test10 on the develop branch to ensure the results matched.
Next Steps
This pull request implements IO Abstraction and implementation of IO Protocols for HDF5 through the HDF5 Class. Further work is need to support more data formats through the development of additional classes implementing the IO Protocols. Additionally, a simply memory caching approach was implemented, which stores all previously loaded or written timeseries in a dictionary. A subsequent request for the same timeseries will first try to return a copy from the in memory dictionary before loading from the data source. This approach works well for smaller and short timespan models, but may grow to have a large memory footprint for the execution of larger models spending longer timespans. Further refinement to memory caching could be accomplished through some type of inventory management approach such as First-if First-out (FIFO) and the use of an OrderedDict. I purpose the development teams discuss the most appropriate inventory management system for cached timeseries.