How to structure database, metadata, documentation to accommodate for new tables #2275
-
**Normalized and output tables**

I think it's important we separate raw and partially cleaned tables from our clean normalized and output tables. We don't want the clean stuff to get jumbled up with the dirty stuff. I think `pudl.sqlite` should continue to be our primary data product and contain our normalized and output tables.
Ideally SQLite would support schemas so we could use them to structure the database hierarchically.
SQLite doesn't support schemas, so we'd have to establish this hierarchy with a naming convention or make it clear in the documentation how tables relate to one another.

**Raw and interim tables**

The organization of the raw and interim tables isn't as clear to me. We probably want to keep the raw extracted tables separate from the interim tables. That way, people who just want access to the raw data don't have to sift through partially cleaned tables. If we mix interim and raw tables together, it might not be clear to users which tables came from the original dbf, xbrl, or excel files. I think it makes sense to create a database for each data source so we have a standard distribution format for raw data. For example, we'd have `ferc1_dbf.sqlite`, `ferc1_xbrl.sqlite`, `eia860.sqlite`, and so on.

As for interim tables... I'm not sure where these should live. Currently, the only interim tables we produce are the partially cleaned, unharvested EIA tables. Where should these live?
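A minimal sketch of the naming-convention idea, assuming hypothetical layer prefixes (`raw_`, `core_`, `out_`) and made-up table names — SQLite has no `CREATE SCHEMA`, so the hierarchy would live entirely in the table names:

```python
import sqlite3

# Hypothetical, illustration-only table names: a layer prefix in the name is one
# way to encode a raw / clean / output hierarchy inside a single SQLite file.
conn = sqlite3.connect("pudl.sqlite")
conn.executescript(
    """
    CREATE TABLE IF NOT EXISTS raw_eia860__generators (plant_id INTEGER, raw_capacity TEXT);
    CREATE TABLE IF NOT EXISTS core_eia860__generators (plant_id INTEGER, capacity_mw REAL);
    CREATE TABLE IF NOT EXISTS out_eia860__generators_denorm (plant_id INTEGER, plant_name TEXT, capacity_mw REAL);
    """
)

# The prefix also makes it easy to list one "schema" at a time:
raw_tables = conn.execute(
    "SELECT name FROM sqlite_master WHERE type = 'table' AND name LIKE 'raw_%'"
).fetchall()
print(raw_tables)
conn.close()
```

The same prefixes could double as the grouping shown in the documentation and on Datasette.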
-
I think this question has different answers depending on whether we're talking about our tables as they exist today or an ideal, reorganized set of tables that may exist in the future. I'll try to only talk about the tables that exist today.

**SQLite For The Foreseeable Future**

Maybe in the future we'll have to move away from local compute and storage, but it seems like SQLite is still up to the task — one of Simon Willison's blog posts makes this same point.
So maybe separate SQLite databases are fine? Certainly for distribution it seems fine.

**Three Tiers of Output**

In my view, there are three levels of table that are valuable to distribute to users: fully raw, warehouse (cleaned and reorganized), and analysis-ready (denormalized). Other interim tables may have internal value but don't warrant a commitment to documentation or stability like we expect for external-facing tables.

**Raw Tables**

Raw data is always valuable for analysis because it has passed through fewer rounds of potentially biased or erroneous interpretation. Reading files and transforming them into a database is one of those potentially biased or erroneous actions, so in general I don't think making our own versions of "raw" data is useful or desired. But I think FERC and EIA data is an exception -- offering compiled raw tables is valuable to users because those sources distribute their data in very inconvenient ways: obsolete databases (FERC .dbf), specialist file formats (FERC .xbrl), and a bazillion disconnected files (EIA .xls). The act of reading and compiling them is a useful service unto itself. If these institutions distributed their raw data in easy-to-read CSVs (like EPA CEMS) or a database, I don't think we should bother providing compiled raw data. I don't think it's worth committing to stability and documentation for something that has undergone so little custom work. So to save effort and avoid maintenance burden, I think we should refer people to the original documentation rather than duplicate it into our own version, and offer our internal metadata for technical users without making any promises about it.

**Warehouse Tables**

We haven't yet reached a common understanding of what a data warehouse is, but to me it means cleaned, normalized, enriched data that has undergone either uncontroversial transformations or, if there are opinionated transformations (like imputation), there is raw-er data available as a fallback. To keep this discussion focused on existing tables, I would call our current cleaned, normalized tables the warehouse.

**Analysis-ready Tables (Data Mart)**

Because this output is supposed to be the most convenient, it needs documentation and stability. These are currently the denormalized output tables.

**Separate "Analysis Tables" Shouldn't Exist**

We don't seem to have clarity about when things belong in output tables vs analysis tables vs data enrichment that belongs in the data warehouse. I think our current concept of the data warehouse needs refinement. We seem to conceive of it as only transformed external data, so a lot of our enrichment operations get pushed into output or analysis tables. To me, only two of those three things actually exist -- analysis tables existing separately from the data mart and data warehouse is indicative of a schema/organizational problem.
Much of that analysis work seems untested and unmaintained anyway. It's a dangling reputational risk that we can get rid of.
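Separate files also don't have to mean losing the ability to query across tiers: SQLite's `ATTACH DATABASE` lets one connection span several database files. A small sketch, with file, table, and column names that are placeholders rather than the real PUDL schema:

```python
import sqlite3

# Open the (hypothetical) warehouse database and attach a raw database alongside it.
conn = sqlite3.connect("pudl.sqlite")
conn.execute("ATTACH DATABASE 'ferc1_dbf.sqlite' AS raw_ferc1")

# Cross-database joins work by prefixing tables with the attached alias.
query = """
    SELECT w.plant_id_ferc1, w.capex_total, r.respondent_id
    FROM plants_steam_ferc1 AS w
    JOIN raw_ferc1.f1_steam AS r
      ON r.respondent_id = w.utility_id_ferc1
    LIMIT 5
"""
for row in conn.execute(query):
    print(row)

conn.close()
```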
-
Whew, lots of good info here. I think you're asking the right questions @bendnorman and I agree with your feedback @TrentonBush. Here's my 2 cents:

**How Many DBs should we have?**

We've already gotten some feedback from the Open Street Map community that PUDL is dumping a lot of information on users without the tools to wade through it productively. If we triple the number of tables that are easily available, we'll need to be even more cognizant of this. I like the idea of having three databases: raw, warehouse, and data mart.
I think that the landing page on our documentation (as well as Datasette) should very clearly explain the differences between said databases, and I think we should encourage folks to use `pudl.sqlite`. @bendnorman I think you're right that if we have databases labeled with the source, people will go there first, even if it's not what they want. If we do decide to go by source, I'd be in favor of adding a prefix that makes it clear those tables are raw.

**Metadata**

It's possible that we'd have to have slightly different metadata for the raw databases than for `pudl.sqlite`. I agree with @TrentonBush that we can probably forego metadata for the raw tables (unless it's easy to pull, like xbrl). Instead, we can explain some of the tricky nuances in the database pages on read the docs.
-
For the most part I agree with Trenton about eliminating analysis tables and having that code fit into either the data warehouse process or an analysis-ready data mart/output table.

To use the plant parts list as an example: last week a fairly non-technical potential user working for an environmental non-profit asked how to create and view that table. I would have thought that the actual process of creating the table would be too complicated to explain, but this user understood how and why it was created (because she had tried herself), and the biggest barrier was actually just accessing the table and having better documentation for it. I realize that the plant parts list is already in … If the raw data was in different databases, then it could maybe be called …

It does seem like SQLite will still do the trick. I guess I'm not sure how quickly we anticipate Datasette …

Personally, I think no matter what our data distribution structure decision is, we need to put more emphasis on clear instructions. To me, this means more clearly written instructions on Datasette, and our docs could be improved by reducing/clarifying the instructions on the initial pages.
-
**db org / how many dbs / how to distinguish them**

I agree with a lot of what has been said already so I'll try not to repeat things except at the high level. I like the idea of separating the dbs into the three stages Trenton suggested and Austen reiterated (raw, warehouse, mart). I could imagine publishing other stages between raw and warehouse -- like all the pre-harvest EIA tables, because people always ask about that -- but I think that would be for the intrepid user/ourselves debugging problems rather than more outward-facing.

**question for multi-dbs/schemas/datasette**

I'm hearing it would be nice structurally to have everything in the same db with some nested schemas to distinguish between these various types of data. But I'm not sure how that squares with Datasette, since SQLite doesn't actually support schemas.

**What is "analysis" / what should be warehouse vs mart?**

The main question that Trenton raised that I'd love to get more clarity on is where some of the imputations should live. What is too complicated/too opinionated? I tried to make a list of things we currently do in our output and analysis code.
Before making this list I don't think I would have said "let's move this all into the warehouse". We've been semi-religious about keeping imputations or calculations out of the database.

**Cram we are doing in the docs / guiding folks**

Hard agree with the need for very clear and heavy-guiding documentation on Datasette. Agreed that all of the ferc dbs are obviously helpful to have up there, but they really clutter it up and make it un-obvious where the PUDL data is / where I think we are mostly trying to direct folks.
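One way to frame the "where should imputations live" question, now that dagster is in the picture: keep the un-imputed table as a warehouse asset and make the imputation a separate, downstream mart asset, so the raw-er data stays available as a fallback. A sketch with hypothetical asset and column names, not the actual PUDL assets:

```python
import pandas as pd
from dagster import asset


@asset
def fuel_cost_clean() -> pd.DataFrame:
    # Hypothetical warehouse-layer asset: cleaned values, no imputation applied.
    return pd.DataFrame(
        {"plant_id": [1, 2, 3], "fuel_cost_per_mmbtu": [2.1, None, 3.4]}
    )


@asset
def fuel_cost_filled(fuel_cost_clean: pd.DataFrame) -> pd.DataFrame:
    # Hypothetical mart-layer asset: the opinionated imputation lives here,
    # downstream of the warehouse table, and gets persisted as its own table.
    out = fuel_cost_clean.copy()
    out["fuel_cost_per_mmbtu"] = out["fuel_cost_per_mmbtu"].fillna(
        out["fuel_cost_per_mmbtu"].median()
    )
    return out
```

Both assets get persisted, so anyone who doesn't trust the imputation can fall back to the un-imputed table.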
-
How does this conversation relate to and diverge from the one we had last year in #1838?
-
Relevant but siloed question: regardless of how we structure the database, we will likely have more than one SQL db. At the very least, we'll still have PUDL and the raw FERC dbs. Where in the code should / does db-level metadata live, such as descriptions of the contents of each db? Does it make sense for it to live in the metadata subpackage?
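One lightweight option (a sketch, not a claim about where this should live in pudl): keep db-level descriptions in a plain dict and render them into Datasette's `metadata.json`, which supports per-database descriptions. The module contents and wording below are hypothetical:

```python
import json

# Hypothetical db-level descriptions, one entry per distributed SQLite file.
DATABASE_DESCRIPTIONS = {
    "pudl": "Cleaned, normalized, and denormalized tables -- the primary data product.",
    "ferc1_dbf": "Raw FERC Form 1 tables extracted from the original DBF files.",
    "ferc1_xbrl": "Raw FERC Form 1 tables extracted from the XBRL filings.",
}


def build_datasette_metadata() -> dict:
    """Render the descriptions into the structure Datasette's metadata.json expects."""
    return {
        "title": "PUDL Data",
        "databases": {
            name: {"description": desc} for name, desc in DATABASE_DESCRIPTIONS.items()
        },
    }


if __name__ == "__main__":
    print(json.dumps(build_datasette_metadata(), indent=2))
```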
-
Now that we are using dagster we can persist interim, output, and analysis tables to a database so folks can access all of our data! This is great but poses some questions about how we want to structure our data products and documentation. Currently, our four types of data are distributed and documented in different ways:

- Normalized tables are stored in `pudl.sqlite` and documented in our data dictionary.
- Output tables are created by the `PudlTabl` class, which requires you to install the pudl package. Some table- and column-level documentation exists in docstrings, and some column-level descriptions can also be referenced in the pudl data dictionary.
- Raw FERC tables are stored in the `ferc1.sqlite` and `ferc*_xbrl.sqlite` databases and are partially documented in the ferc db data dictionary.

In theory, all of these tables can now be written to `pudl.sqlite`. This approach will likely overwhelm our users and clutter the database. The same goes for our data dictionaries. Should we instead create separate databases for the raw data and reserve `pudl.sqlite` for our normalized and output tables? For example, we would have `ferc1_dbf.sqlite`, `ferc1_xbrl.sqlite`, `eia860.sqlite`, `eia861.sqlite`, ..., `pudl.sqlite`.

**Metadata**

Currently, we only have resource metadata for our normalized tables. We'll need to create resource metadata for any new tables we want to save to the database so constraints are checked and dtypes remain consistent as the tables move to and from the database.

How do we want to structure our resource metadata sub-package to accommodate all of these new tables? We could add the raw and partially cleaned tables' metadata to the datasource's module in the resource subpackage and categorize the table type using the `etl_group` key. We could create a new `metadata.resources.outputs` module to store all of the output table metadata.
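For the metadata question, a rough sketch of what an entry for a raw or output table might look like under the `etl_group` approach — the table names, fields, and key values below are hypothetical, and the real `pudl.metadata.resources` schema may differ:

```python
# Hypothetical entries for a metadata.resources.outputs module (or a raw
# datasource module), shown only to illustrate tagging table type via etl_group.
RESOURCE_METADATA = {
    "raw_eia860__generators": {
        "description": "Raw generator attributes extracted from the EIA-860 spreadsheets.",
        "schema": {
            "fields": ["plant_id_eia", "generator_id", "report_year", "capacity_mw"],
            "primary_key": ["plant_id_eia", "generator_id", "report_year"],
        },
        "sources": ["eia860"],
        "etl_group": "eia860_raw",  # hypothetical value marking this as a raw table
    },
    "denorm_generators_eia": {
        "description": "Denormalized output table combining generator and plant attributes.",
        "schema": {
            "fields": ["plant_id_eia", "generator_id", "report_year", "plant_name_eia", "capacity_mw"],
            "primary_key": ["plant_id_eia", "generator_id", "report_year"],
        },
        "sources": ["eia860"],
        "etl_group": "outputs",  # hypothetical etl_group for output/mart tables
    },
}
```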