Creating structure and guidelines around code organization #3136

zschira · 2023-12-08T16:03:20Z

zschira
Dec 8, 2023
Maintainer

Motivation

As we continue to accumulate more and more lines of code, I think it's worth thinking deliberately about how we are organizing all of that code. Already, the code base can be pretty daunting to learn and extend, and has frustrating entanglements in certain places that lead to things like circular imports, and certain tools being less usable than they could be (archiver and datastore for example). I don't suggest that we refactor the entire code base all at once, but if we set some intentions and guidelines, we can slowly start to migrate our code base to a new structure while we work on small parts for other projects.

PUDL code structure

Right now to understand a single table you have to look at code scattered throughout the repo in a way that's not immediately intuitive. Dagster has made following this lineage much easier, but it's still more difficult than it probably needs to be. There's also significant coupling between what could be generic "library code" and implementation logic for a specific dataset or table, with this code often living in the same modules, and in many cases being trapped inside a big asset function. I think this coupling often increases the number of circular imports we encounter, which are always a pain to work around. There are likely many patterns we could reach for, but here's one suggestion:

Proposed structure to untangle PUDL code

We should come up with a very clear boundary between what is generic tools/shared infrastructure/common operations, and the actual etl logic for different datasets. To do this, I propose that we split the repo into two separate sets of packages, one that is generic, and one that composes together that generic functionality into an actual ETL. In this organization, the following packages would be made completely generic (it might be best to rename some of these in this new structure):

src/pudl/extract/
src/pudl/transform/
src/pudl/analysis/
src/pudl/metadata/
...

Then, we could reorganize the ETL package to look something like the following:

src/pudl/etl/ferc1/{extract, transform, output}.py
src/pudl/etl/epacems/{extract, transform, output}.py
src/pudl/etl/eia/eia860/{extract, transform, output}.py
src/pudl/etl/output.py
...

This way everything pertaining to the ETL for a specific dataset will be captured in one place that should be easier to find and follow.

Other tools and repos

We should also think about the code we maintain outside of PUDL and how we want all of these to interact. It seems clear that there's desire to pull the datastore out of PUDL, as well as all the metadata to uncouple the archiver from PUDL, but broadly speaking how do we actually want these tools organized? Personally I've thought of merging the datastore and the archiver, so there's just one application responsible for all access to our archives.

I could also see merging the XBRL extractor into PUDL, as it doesn't necessarily feel like a tool we will actually use outside of that context, and then we could more nicely integrate it with the rest of the dag.

cmgosnell · 2023-12-12T15:58:20Z

cmgosnell
Dec 12, 2023
Collaborator

I like the idea of a src/pudl/etl/{dataset}/{extract, transform, output}.py rejigger. I'd be good to think about all of the shared stuff like what is currently in pudl/extract/excel. Would that all go in a higher level pudl/etl/extract? Also I don't know if I love the pudl/etl/eia/eia860 alongside the pudl/etl/{datasource}{form}. I know eia is special in that there are some shared form stuff but to me at least it would feel more consistent to do one or the other. I think i'm partial to your pudl/etl/{datasource}{form} suggestion as opposed to doing pudl/etl/{datasource}/{form} across the board bc most of the {datasource}/{form} would have only one form. But idk. I am just pro making it as consistent as possible.

I think we should think about the XBRL extractor. While I don't have the context of the current setup or the possible downsides to this suggestion, I am inclined to go the opposite direction with the extractor. Could we bubble off the XBRL and dbf extractor and treat the archived sqlites as the "raw" datasources? The dbf & xbrl -> sqlite steps just feel pretty separate, seem somewhat stable and take a ton of time. What if we just re-ran those sqlite makers on a regular basis and make draft zenodo archives with them.

2 replies

zaneselvans Dec 13, 2023
Maintainer

I've been mulling over the idea of going in the opposite direction with the XBRL & DBF to SQLite infrastructure as well. Like maybe it would be better if that code lived entirely in ferc-xbrl-extractor and ferc-dbf-extractor repositories / packages, and the ongoing testing of the XBRL and DBF to SQLite conversions were the responsibilities of those repositories rather than the main PUDL repository, since both the code and the input data change much less frequently than other things going on in PUDL, and the testing in CI is such a burden right now.

I'm hesitant to create a whole separate processing, publication, and archiving workflow for the SQLite DBs that they output though. Then those resources need to be managed separately, we need to ensure they are reproducible, make sure they keep working even as their underlying dependencies continue getting updated (e.g. Arelle), and anybody who wants to access those outputs needs to know where to find them independent of the PUDL data releases, etc. But if the processing of most of the FERC inputs only happened in the nightly builds, with just FERC 1 happening in PUDL CI, and all the others being tested in their own CI when something changed that would help speed things up and still make sure we know everything continues to function.

zaneselvans Dec 13, 2023
Maintainer

On the PUDL repo organization, I feel like I am much more frequently looking at several different ETL steps related to one data source, than a particular ETL step across several data sources, and so conceptually I think I would find it easier to have pudl/eia860/{extract, transform, output}.py with anything that's of general utility kept at a higher level like pudl/extract/excel.py But I don't feel strongly about this.

I feel like we have some top-level legacy organizational cruft that probably doesn't make sense any more.

pudl/workspace seems like it should be stripped for parts. setup.py is only there b/c of circular imports and the other two modules probably belong in the datastore package that would be split out.
The lines between pudl/{transform, analysis, output} have gotten blurry, and given the new data warehouse / DAG structure, I don't know that this organization makes any sense. I'm constantly finding myself confused about which area different kinds of assets are being generated now.
The pudl/metadata/classes.py module is kind of out of control. The metadata subpackage in general is a lot, and gosh it sure seems like we're doing things that other people must be doing elsewhere too.
I would like to look at consolidating all of our CLI tools under a single pudl command using Click's nested commands feature, and probably bringing in the devtools scripts as entrypoints too so that they can be at least nominally tested for imports continuing to work, help messages etc. We could have the CLI wrappers under a single pudl/cli subpackage that ether define the CLI and refer to the relevant modules elsewhere, or in the case of more stand-alone scripts, just contain the entire script. IIRC this could also be used to create a pudl subcommand that refers to functions from other packages too -- so if we had a coordinating xbrl2sqlite function / script in the ferc-xbrl-extractor package, it could be integrated into the pudl command through the CLI module.

rousik · 2023-12-13T21:19:22Z

rousik
Dec 13, 2023
Collaborator

ferc_to_sqlite is a strange beast. It does fairly straightforward low-level transformation of convoluted formats (dbf, xbrl) into ETL friendly formats (sqlite databases + metadata). There's assumption that this transformation is relatively fixed, implied by the fact that ferc_to_sqlite is run independetly from pudl_etl and in local environments, we assume that this can save some considerable amount of time/compute to not having to re-run this all the time. In CI/CD pipelines, we don't have any caching of this, so we're currently paying hefty fines for having to re-run this analysis every single time (40+ minutes for ci-integration step for the ferc_to_sqlite step only), with the main culprit being XBRL which seems to be transformation-unfriently format.

If the assumption holds that this process is largely static, we could consider splitting it off from the main ETL and have it as a standalone tool that can run the transformation/extraction independently and publish the results to well known location (e.g. zenodo+gcs), with the main pudl_etl process reading these transformed/sqlite-like outputs without being responsible for running this expensive transformation every single time.

To summarize, instead of loading raw archived dbf/xbrl files from zenodo and running two parts ETL one after another (ferc_to_sqlite followed by pudl_etl), we would have a process (ferc_to_sqlite) that can load raw data from zenodo and publish results in friendly formats back to zenodo, and then have pudl_etl load those friendly post-processed files from zenodo/gcs. ferc_to_sqlite would then be responsible for publishing data to zenodo and perhaps updating pudl_etl configuration to start reading from the newly published archives.

The benefits of the above scheme would be that we would run the relevant transformations only when applicable, i.e. ferc_to_sqlite would only run when there are new raw ferc files available, or when we make changes to the ferc_to_sqlite transformation logic. Once those files are published, we can kick off pudl_etl pointed at these new files to validate those changes. This could save a lot of compute that we currently spend on rerunning the same process over and over again and could clearly indicate when things change at particular points in the dependency graph. The downside is that this requires a bit more higher-level plumbing between the tools (I'm not sure what's the current state and how scalable this infrastructure for composing parts of the ecosystem are), and there's obviously some additional leg-work needed for maintaining data releases for yet another pipeline.

I do think that separating ferc_to_sqlite code out of the main pudl repo would be necessary/very useful for the above in order to be able to be able to tell when the transformation logic actually changes (otherwise, we would still have to run it on every change to pudl codebase)

0 replies

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Catalyst Cooperative

Creating structure and guidelines around code organization #3136

{{title}}

Replies: 2 comments 2 replies

{{title}}

{{title}}

{{title}}

{{title}}

{{editor}}'s edit

{{editor}}'s edit

Select a reply

Catalyst Cooperative

Creating structure and guidelines around code organization #3136

zschira Dec 8, 2023 Maintainer

Motivation

PUDL code structure

Proposed structure to untangle PUDL code

Other tools and repos

Replies: 2 comments · 2 replies

cmgosnell Dec 12, 2023 Collaborator

zaneselvans Dec 13, 2023 Maintainer

zaneselvans Dec 13, 2023 Maintainer

rousik Dec 13, 2023 Collaborator

zschira
Dec 8, 2023
Maintainer

Replies: 2 comments 2 replies

cmgosnell
Dec 12, 2023
Collaborator

zaneselvans Dec 13, 2023
Maintainer

zaneselvans Dec 13, 2023
Maintainer

rousik
Dec 13, 2023
Collaborator