-
I like the idea. I think we should think about the XBRL extractor. While I don't have the context of the current setup or the possible downsides to this suggestion, I am inclined to go in the opposite direction with the extractor. Could we bubble off the XBRL and dbf extractors and treat the archived sqlites as the "raw" datasources? The dbf & xbrl -> sqlite steps just feel pretty separate, seem somewhat stable, and take a ton of time. What if we just re-ran those sqlite makers on a regular basis and made draft zenodo archives with them?
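For concreteness, here is a rough sketch of what a scheduled "sqlite maker" job could look like. It assumes the ferc_to_sqlite entry point accepts a settings file, and it stages the outputs to a placeholder GCS bucket standing in for the draft archive location; the actual Zenodo draft-deposition step is omitted and all names are hypothetical.

```python
"""Hypothetical sketch: periodically rebuild the dbf/xbrl-derived sqlites and
stage them as a draft archive, so the main ETL can treat them as "raw" inputs.

Assumes a `ferc_to_sqlite` console script that accepts a settings file; the
settings file, output directory, and bucket name are all placeholders.
"""
import subprocess
from pathlib import Path

from google.cloud import storage  # assumes google-cloud-storage is installed

SETTINGS_FILE = "ferc_to_sqlite_full.yml"   # hypothetical settings file
OUTPUT_DIR = Path("sqlite_out")             # hypothetical output location
DRAFT_BUCKET = "example-draft-archives"     # placeholder bucket name


def rebuild_sqlites() -> None:
    """Re-run the dbf/xbrl -> sqlite extraction step on its own."""
    subprocess.run(["ferc_to_sqlite", SETTINGS_FILE], check=True)


def stage_draft_archive() -> None:
    """Upload the freshly built sqlite files to the draft archive location."""
    client = storage.Client()
    bucket = client.bucket(DRAFT_BUCKET)
    for db in OUTPUT_DIR.glob("*.sqlite"):
        bucket.blob(f"draft/{db.name}").upload_from_filename(str(db))


if __name__ == "__main__":
    rebuild_sqlites()
    stage_draft_archive()
```

A job like this could run on its own schedule, completely decoupled from the main ETL's cadence.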
-
If the assumption holds that this process is largely static, we could consider splitting it off from the main ETL and having it as a standalone tool that can run the transformation/extraction independently and publish the results to a well-known location (e.g. zenodo+gcs), with the main ETL consuming those published outputs.

To summarize: instead of loading raw archived dbf/xbrl files from zenodo and running the two parts of the ETL one after another, the main ETL would start from the already-built sqlite outputs.

The benefit of the above scheme is that we would run the relevant transformations only when applicable, i.e. only when the underlying raw data or the extraction code actually changes.

I do think that separating this step out is worth pursuing.
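On the consuming side, the main ETL's extract step would then just read tables out of the published sqlite rather than parsing dbf/xbrl itself. A minimal sketch, assuming a locally cached copy of the archive (the file and table names are placeholders):

```python
"""Hypothetical sketch: extract raw tables straight from a pre-built, published
sqlite instead of running the dbf/xbrl extraction inside the main ETL."""
import pandas as pd
import sqlalchemy as sa

# In practice this file would be fetched/cached from the published archive
# (zenodo or gcs); the path here is a placeholder.
FERC1_XBRL_DB = "ferc1_xbrl.sqlite"


def extract_raw_table(table_name: str) -> pd.DataFrame:
    """Load one raw table out of the archived sqlite."""
    engine = sa.create_engine(f"sqlite:///{FERC1_XBRL_DB}")
    with engine.connect() as conn:
        return pd.read_sql_table(table_name, conn)
```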
-
Motivation
As we continue to accumulate more and more lines of code, I think it's worth thinking deliberately about how we are organizing all of that code. Already, the code base can be pretty daunting to learn and extend, and it has frustrating entanglements in certain places that lead to things like circular imports and to certain tools being less usable than they could be (archiver and datastore, for example). I don't suggest that we refactor the entire code base all at once, but if we set some intentions and guidelines, we can slowly start to migrate our code base to a new structure while we work on small parts for other projects.

PUDL code structure
Right now to understand a single table you have to look at code scattered throughout the repo in a way that's not immediately intuitive. Dagster has made following this lineage much easier, but it's still more difficult than it probably needs to be. There's also significant coupling between what could be generic "library code" and implementation logic for a specific dataset or table, with this code often living in the same modules, and in many cases being trapped inside a big asset function. I think this coupling often increases the number of circular imports we encounter, which are always a pain to work around. There are likely many patterns we could reach for, but here's one suggestion:
Proposed structure to untangle PUDL code
We should come up with a very clear boundary between what is generic tooling/shared infrastructure/common operations and what is the actual ETL logic for different datasets. To do this, I propose that we split the repo into two separate sets of packages: one that is generic, and one that composes that generic functionality together into an actual ETL. In this organization, the following packages would be made completely generic (it might be best to rename some of these in this new structure):
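Purely as a rough illustration of what that generic layer might contain (these particular module names are assumptions, not a definitive list):

```
pudl.helpers        # generic dataframe/cleaning utilities
pudl.metadata       # schema and metadata machinery
pudl.workspace      # datastore / caching of raw inputs
pudl.settings       # run configuration
pudl.io_managers    # Dagster IO plumbing
```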
Then, we could reorganize the ETL package to look something like the following:
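Purely as a hypothetical illustration of that shape (directory and module names are made up), the per-dataset layout might look something like:

```
src/pudl/etl/
    ferc1/
        extract.py     # pull raw tables for this dataset
        transform.py   # dataset-specific cleaning logic
        assets.py      # thin Dagster asset definitions
    eia860/
        extract.py
        transform.py
        assets.py
    eia923/
        ...
```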
This way everything pertaining to the ETL for a specific dataset will be captured in one place that should be easier to find and follow.
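One way to picture that boundary in practice is to keep dataset-specific transforms as plain, importable functions and make the Dagster assets thin wrappers around them, so no logic is trapped inside a big asset function. This is a minimal hypothetical sketch (the asset, table, and column names are made up, and it assumes an upstream raw asset exists), not PUDL's actual code:

```python
import pandas as pd
from dagster import asset


def transform_steam_plants(raw: pd.DataFrame) -> pd.DataFrame:
    """Dataset-specific transform as plain library code: easy to import and unit test."""
    return (
        raw.rename(columns={"plant_name": "plant_name_ferc1"})
        .dropna(subset=["plant_name_ferc1"])
    )


@asset
def ferc1_steam_plants(raw_ferc1_steam_plants: pd.DataFrame) -> pd.DataFrame:
    """Thin orchestration layer: the asset just wires the raw input to the transform."""
    return transform_steam_plants(raw_ferc1_steam_plants)
```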
Other tools and repos
We should also think about the code we maintain outside of PUDL and how we want all of these tools to interact. It seems clear that there's a desire to pull the datastore out of PUDL, as well as all of the metadata, to uncouple the archiver from PUDL, but broadly speaking, how do we actually want these tools organized? Personally, I've thought of merging the datastore and the archiver, so there's just one application responsible for all access to our archives.

I could also see merging the XBRL extractor into PUDL, as it doesn't necessarily feel like a tool we will actually use outside of that context, and then we could more nicely integrate it with the rest of the DAG.