How to structure database, metadata, documentation to accommodate for new tables #2275
-
**Normalized and output tables**

I think it's important we separate raw and partially cleaned tables from our clean normalized and output tables. We don't want the clean stuff to get jumbled up with the dirty stuff. I think `pudl.sqlite` should continue to be our primary data product and contain our normalized and output tables.
Ideally SQLite would support schemas so we could use them to structure the database hierarchically.
SQLite doesn't support schemas, so we'd have to establish this hierarchy with a naming convention or make it clear in the documentation how tables relate to one another.

**Raw and interim tables**

The organization of the raw and interim tables isn't as clear to me. We probably want to keep the raw extracted tables separate from the interim tables. That way, people who just want access to the raw data don't have to sift through partially cleaned tables. If we mix interim and raw tables together, it might not be clear to users which tables came from the original dbf, xbrl, or excel files. I think it makes sense to create a database for each data source so we have a standard distribution format for raw data. For example, we'd have `ferc1_dbf.sqlite`, `ferc1_xbrl.sqlite`, `eia860.sqlite`, and so on.

As for interim tables... I'm not sure where these should live. Currently, the only interim tables we produce are the partially cleaned, unharvested EIA tables. Where should these live?
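A minimal sketch of the naming-convention idea, assuming hypothetical layer prefixes (`raw_`, `core_`, `out_`) and made-up table names — SQLite has no `CREATE SCHEMA`, so the hierarchy would live entirely in the table names:

```python
import sqlite3

# Hypothetical, illustration-only table names: a layer prefix in the name is one
# way to encode a raw / clean / output hierarchy inside a single SQLite file.
conn = sqlite3.connect("pudl.sqlite")
conn.executescript(
    """
    CREATE TABLE IF NOT EXISTS raw_eia860__generators (plant_id INTEGER, raw_capacity TEXT);
    CREATE TABLE IF NOT EXISTS core_eia860__generators (plant_id INTEGER, capacity_mw REAL);
    CREATE TABLE IF NOT EXISTS out_eia860__generators_denorm (plant_id INTEGER, plant_name TEXT, capacity_mw REAL);
    """
)

# The prefix also makes it easy to list one "schema" at a time:
raw_tables = conn.execute(
    "SELECT name FROM sqlite_master WHERE type = 'table' AND name LIKE 'raw_%'"
).fetchall()
print(raw_tables)
conn.close()
```

The same prefixes could double as the grouping shown in the documentation and on Datasette.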
-
I think this question has different answers depending on whether we're talking about our tables as they exist today or an ideal, reorganized set of tables that may exist in the future. I'll try to only talk about the tables that exist today.

**SQLite For The Foreseeable Future**

Maybe in the future we'll have to move away from local compute and storage, but it seems like SQLite is still up to the task — one of Simon Willison's blog posts makes this same point.
So maybe separate SQLite databases are fine? Certainly for distribution it seems fine.

**Three Tiers of Output**

In my view, there are three levels of table that are valuable to distribute to users: fully raw, warehouse (cleaned and reorganized), and analysis-ready (denormalized). Other interim tables may have internal value but don't warrant a commitment to documentation or stability like we expect for external-facing tables.

**Raw Tables**

Raw data is always valuable for analysis because it has passed through fewer rounds of potentially biased or erroneous interpretation. Reading files and transforming them into a database is one of those potentially biased or erroneous actions, so in general I don't think making our own versions of "raw" data is useful or desired. But I think FERC and EIA data is an exception -- offering compiled raw tables is valuable to users because those sources distribute their data in very inconvenient ways: obsolete databases (FERC .dbf), specialist file formats (FERC .xbrl), and a bazillion disconnected files (EIA .xls). The act of reading and compiling them is a useful service unto itself. If these institutions distributed their raw data in easy-to-read CSVs (like EPA CEMS) or a database, I don't think we should bother providing compiled raw data. I don't think it's worth committing to stability and documentation for something that has undergone so little custom work. So to save effort and avoid maintenance burden, I think we should refer people to the original documentation rather than duplicate it into our own version, and offer our internal metadata for technical users without making any promises about it.

**Warehouse Tables**

We haven't yet reached a common understanding of what a data warehouse is, but to me it means cleaned, normalized, enriched data that has undergone either uncontroversial transformations or, if there are opinionated transformations (like imputation), there is raw-er data available as a fallback. To keep this discussion focused on existing tables, I would call our current cleaned, normalized tables the warehouse.

**Analysis-ready Tables (Data Mart)**

Because this output is supposed to be the most convenient, it needs documentation and stability. These are currently the denormalized output tables.

**Separate "Analysis Tables" Shouldn't Exist**

We don't seem to have clarity about when things belong in output tables vs analysis tables vs data enrichment that belongs in the data warehouse. I think our current concept of the data warehouse needs refinement. We seem to conceive of it as only transformed external data, so a lot of our enrichment operations get pushed into output or analysis tables. To me, only two of those three things actually exist -- analysis tables existing separately from the data mart and data warehouse is indicative of a schema/organizational problem.
Much of that analysis work seems untested and unmaintained anyway. It's a dangling reputational risk that we can get rid of.
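Separate files also don't have to mean losing the ability to query across tiers: SQLite's `ATTACH DATABASE` lets one connection span several database files. A small sketch, with file, table, and column names that are placeholders rather than the real PUDL schema:

```python
import sqlite3

# Open the (hypothetical) warehouse database and attach a raw database alongside it.
conn = sqlite3.connect("pudl.sqlite")
conn.execute("ATTACH DATABASE 'ferc1_dbf.sqlite' AS raw_ferc1")

# Cross-database joins work by prefixing tables with the attached alias.
query = """
    SELECT w.plant_id_ferc1, w.capex_total, r.respondent_id
    FROM plants_steam_ferc1 AS w
    JOIN raw_ferc1.f1_steam AS r
      ON r.respondent_id = w.utility_id_ferc1
    LIMIT 5
"""
for row in conn.execute(query):
    print(row)

conn.close()
```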
-
Whew, lots of good info here. I think you're asking the right questions @bendnorman and I agree with your feedback @TrentonBush. Here's my 2 cents:

**How Many DBs should we have?**

We've already gotten some feedback from the Open Street Map community that PUDL is dumping a lot of information on users without the tools to wade through it productively. If we triple the number of tables that are easily available, we'll need to be even more cognizant of this. I like the idea of having three databases: raw, warehouse, and data mart.
I think that the landing page on our documentation (as well as Datasette) should very clearly explain the differences between said databases, and I think we should encourage folks to use `pudl.sqlite`. @bendnorman I think you're right that if we have databases labeled with the source, people will go there first, even if it's not what they want. If we do decide to go by source, I'd be in favor of adding a prefix that makes it clear those tables are raw.

**Metadata**

It's possible that we'd have to have slightly different metadata for the raw databases than for `pudl.sqlite`. I agree with @TrentonBush that we can probably forego metadata for the raw tables (unless it's easy to pull, like xbrl). Instead, we can explain some of the tricky nuances in the database pages on read the docs.
-
For the most part I agree with Trenton about eliminating analysis tables and having that code fit into either the data warehouse process or an analysis-ready data mart/output table.

To use the plant parts list as an example: last week a fairly non-technical potential user working for an environmental non-profit asked how to create and view that table. I would have thought that the actual process of creating the table would be too complicated to explain, but this user understood how and why it was created (because she had tried herself), and the biggest barrier was actually just accessing the table and having better documentation for it. I realize that the plant parts list is already in … If the raw data was in different databases, then it could maybe be called …

It does seem like SQLite will still do the trick. I guess I'm not sure how quickly we anticipate Datasette …

Personally, I think no matter what our data distribution structure decision is, we need to put more emphasis on clear instructions. To me, this means more clearly written instructions on Datasette, and our docs could be improved by reducing/clarifying the instructions on the initial pages.
-
**db org / how many dbs / how to distinguish them**

I agree with a lot of what has been said already so I'll try not to repeat things except at the high level. I like the idea of separating the dbs into the three stages Trenton suggested and Austen reiterated (raw, warehouse, mart). I could imagine publishing other stages between raw and warehouse -- like all the pre-harvest EIA tables, because people always ask about that -- but I think that would be for the intrepid user/ourselves debugging problems rather than more outward-facing.

**question for multi-dbs/schemas/datasette**

I'm hearing it would be nice structurally to have everything in the same db with some nested schemas to distinguish between these various types of data. But I'm not sure how that squares with Datasette, since SQLite doesn't actually support schemas.

**What is "analysis" / what should be warehouse vs mart?**

The main question that Trenton raised that I'd love to get more clarity on is where some of the imputations should live. What is too complicated/too opinionated? I tried to make a list of things we currently do in our output and analysis code.
Before making this list I don't think I would have said "let's move this all into the warehouse". We've been semi-religious about keeping imputations or calculations out of the database.

**Cram we are doing in the docs / guiding folks**

Hard agree with the need for very clear and heavy-guiding documentation on Datasette. Agreed that all of the ferc dbs are obviously helpful to have up there, but they really clutter it up and make it un-obvious where the PUDL data is / where I think we are mostly trying to direct folks.
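One way to frame the "where should imputations live" question, now that dagster is in the picture: keep the un-imputed table as a warehouse asset and make the imputation a separate, downstream mart asset, so the raw-er data stays available as a fallback. A sketch with hypothetical asset and column names, not the actual PUDL assets:

```python
import pandas as pd
from dagster import asset


@asset
def fuel_cost_clean() -> pd.DataFrame:
    # Hypothetical warehouse-layer asset: cleaned values, no imputation applied.
    return pd.DataFrame(
        {"plant_id": [1, 2, 3], "fuel_cost_per_mmbtu": [2.1, None, 3.4]}
    )


@asset
def fuel_cost_filled(fuel_cost_clean: pd.DataFrame) -> pd.DataFrame:
    # Hypothetical mart-layer asset: the opinionated imputation lives here,
    # downstream of the warehouse table, and gets persisted as its own table.
    out = fuel_cost_clean.copy()
    out["fuel_cost_per_mmbtu"] = out["fuel_cost_per_mmbtu"].fillna(
        out["fuel_cost_per_mmbtu"].median()
    )
    return out
```

Both assets get persisted, so anyone who doesn't trust the imputation can fall back to the un-imputed table.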
-
How does this conversation relate to and diverge from the one we had last year in #1838?
-
Relevant but siloed question: regardless of how we structure the database, we will likely have more than one SQL db. At the very least, we'll still have PUDL and the raw FERC dbs. Where in the code should / does db-level metadata live, such as descriptions of the contents of each db? Does it make sense for it to live in the metadata subpackage?
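One lightweight option (a sketch, not a claim about where this should live in pudl): keep db-level descriptions in a plain dict and render them into Datasette's `metadata.json`, which supports per-database descriptions. The module contents and wording below are hypothetical:

```python
import json

# Hypothetical db-level descriptions, one entry per distributed SQLite file.
DATABASE_DESCRIPTIONS = {
    "pudl": "Cleaned, normalized, and denormalized tables -- the primary data product.",
    "ferc1_dbf": "Raw FERC Form 1 tables extracted from the original DBF files.",
    "ferc1_xbrl": "Raw FERC Form 1 tables extracted from the XBRL filings.",
}


def build_datasette_metadata() -> dict:
    """Render the descriptions into the structure Datasette's metadata.json expects."""
    return {
        "title": "PUDL Data",
        "databases": {
            name: {"description": desc} for name, desc in DATABASE_DESCRIPTIONS.items()
        },
    }


if __name__ == "__main__":
    print(json.dumps(build_datasette_metadata(), indent=2))
```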
-
Now that we are using dagster we can persist interim, output, and analysis tables to a database so folks can access all of our data! This is great but poses some questions about how we want to structure our data products and documentation. Currently, our four types of data are distributed and documented in different ways:

- Normalized tables are stored in `pudl.sqlite` and documented in our data dictionary.
- Output tables are created by the `PudlTabl` class, which requires you to install the pudl package. Some table- and column-level documentation exists in docstrings, and some column-level descriptions can also be referenced in the pudl data dictionary.
- Raw FERC tables are stored in the `ferc1.sqlite` and `ferc*_xbrl.sqlite` databases and are partially documented in the ferc db data dictionary.

In theory, all of these tables can now be written to `pudl.sqlite`. This approach will likely overwhelm our users and clutter the database. The same goes for our data dictionaries. Should we instead create separate databases for the raw data and reserve `pudl.sqlite` for our normalized and output tables? For example, we would have `ferc1_dbf.sqlite`, `ferc1_xbrl.sqlite`, `eia860.sqlite`, `eia861.sqlite`, ..., `pudl.sqlite`.

**Metadata**

Currently, we only have resource metadata for our normalized tables. We'll need to create resource metadata for any new tables we want to save to the database so constraints are checked and dtypes remain consistent as the tables move to and from the database.

How do we want to structure our resource metadata sub-package to accommodate all of these new tables? We could add the raw and partially cleaned tables' metadata to the datasource's module in the resource subpackage and categorize the table type using the `etl_group` key. We could create a new `metadata.resources.outputs` module to store all of the output table metadata.
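For the metadata question, a rough sketch of what an entry for a raw or output table might look like under the `etl_group` approach — the table names, fields, and key values below are hypothetical, and the real `pudl.metadata.resources` schema may differ:

```python
# Hypothetical entries for a metadata.resources.outputs module (or a raw
# datasource module), shown only to illustrate tagging table type via etl_group.
RESOURCE_METADATA = {
    "raw_eia860__generators": {
        "description": "Raw generator attributes extracted from the EIA-860 spreadsheets.",
        "schema": {
            "fields": ["plant_id_eia", "generator_id", "report_year", "capacity_mw"],
            "primary_key": ["plant_id_eia", "generator_id", "report_year"],
        },
        "sources": ["eia860"],
        "etl_group": "eia860_raw",  # hypothetical value marking this as a raw table
    },
    "denorm_generators_eia": {
        "description": "Denormalized output table combining generator and plant attributes.",
        "schema": {
            "fields": ["plant_id_eia", "generator_id", "report_year", "plant_name_eia", "capacity_mw"],
            "primary_key": ["plant_id_eia", "generator_id", "report_year"],
        },
        "sources": ["eia860"],
        "etl_group": "outputs",  # hypothetical etl_group for output/mart tables
    },
}
```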