How to access full pudl.sqlite in a standard GitHub Runner
#3071
Replies: 4 comments 16 replies
-
I'm not sure this is a good idea, but I could imagine an additional idea being:

6. Distribute the smaller database we already build locally and for CI, which only covers a subset of years/partitions.

I personally think 1 or 6 sounds the most ideal. 1 because that might be a nice way for most users to access our data regardless of GitHub runners, but this could pose a problem if anyone wants to rely on the tidy normalized tables. And 6 because we already consistently build this db locally and for our CI, so it sounds straightforward, but it would mean downstream users' tests would be locked to whatever years/partitions we are currently testing.
-
@bendnorman you mention in 3):

> One potential issue with this solution is that users might end up pulling parquet files from different data versions.

How are data versions tracked right now? When I look at the nightly build or Datasette tables there doesn't seem to be an identifier with data versioning. Is it just that a complete ...

I have users that access a handful of PUDL tables for things like unit definitions, capacity, heat rates, historical capacity factor, etc. I'm only using the annual EIA data and supplementing it with 860m from my own processing pipeline. At the moment I ask users to download PUDL from Zenodo, and they probably use the same database until a change in table/column names prompts me to tell everyone to download a new version (this may have only happened once...). My ideal workflow would be fetching data as users need it, to minimize the up-front data compilation needs. Each time a user runs my software it would run a quick query (count of rows or max ...) to check whether the local data needs to be refreshed.
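A minimal sketch of what that quick query could look like, assuming a local copy of `pudl.sqlite`; the path, table name, and date column below are illustrative, not actual PUDL conventions:

```python
import sqlite3
from pathlib import Path

# Hypothetical local path to the downloaded database.
PUDL_DB = Path("~/pudl_data/pudl.sqlite").expanduser()


def data_fingerprint(table: str, date_col: str = "report_date") -> tuple[int, str]:
    """Return (row count, max report date) as a cheap data-version check."""
    with sqlite3.connect(PUDL_DB) as conn:
        n_rows, max_date = conn.execute(
            f"SELECT COUNT(*), MAX({date_col}) FROM {table}"
        ).fetchone()
    return n_rows, max_date


# Compare against the fingerprint recorded when the data was last fetched;
# if it differs from the current release, prompt the user to re-download.
print(data_fingerprint("generators_eia860"))  # illustrative table name
```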
-
It sounds like there's appetite for distributing a pile of parquet files, while punting any decisions about DuckDB usage on the Catalyst side - people are obviously free to use whatever tooling fits the bill (ha) for accessing data in parquet files. It also sounds like we should continue distributing the full SQLite file for people who want SQL constraints, and certainly continue using it internally so we can use those constraints while writing the dang data. I'm hearing a little less support for the "pre-subsetted" SQLite files.

I sort of think all of these are pretty easy to generate from our existing uncompressed complete SQLite output, and the hard thing is "figure out if we want to move pudl.sqlite completely over to duckdb." So my proposal is to do the three easy things but not the hard thing. In terms of ordering, I think we should do the Parquet stuff first since that seems to solve more downstream problems than the other two.
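As a rough illustration of the "easy to generate" claim, here is a sketch of producing per-table Parquet files from the existing uncompressed SQLite output; this is not how PUDL's ETL actually writes its outputs, and the paths plus the pandas/pyarrow dependency are assumptions:

```python
import sqlite3
from pathlib import Path

import pandas as pd  # assumes pandas with pyarrow installed for Parquet support

PUDL_DB = Path("pudl.sqlite")       # existing uncompressed SQLite output
PARQUET_DIR = Path("parquet_out")   # hypothetical output directory
PARQUET_DIR.mkdir(exist_ok=True)

with sqlite3.connect(PUDL_DB) as conn:
    # List user tables, skipping SQLite's internal tables.
    tables = [
        row[0]
        for row in conn.execute(
            "SELECT name FROM sqlite_master "
            "WHERE type='table' AND name NOT LIKE 'sqlite_%'"
        )
    ]
    for table in tables:
        df = pd.read_sql_query(f"SELECT * FROM {table}", conn)
        # Columnar compression typically makes these far smaller than the SQLite pages.
        df.to_parquet(PARQUET_DIR / f"{table}.parquet", index=False)
```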
-
I ran a disk utilization report on pudl.sqlite to see how the space breaks down across tables, including the intermediate output tables (denoted by `_out_*`). Sounds like we've settled on compressing the SQLite db and distributing collections of Parquet files, but I thought this was mildly helpful information. I've attached the full disk utilization report:
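For anyone wanting to generate a similar per-table report, one possible approach (an assumption, not necessarily what was used above) is SQLite's `dbstat` virtual table, which requires a SQLite build compiled with `SQLITE_ENABLE_DBSTAT_VTAB`; the `sqlite3_analyzer` tool is an alternative:

```python
import sqlite3

# Requires a SQLite build with the dbstat virtual table enabled
# (SQLITE_ENABLE_DBSTAT_VTAB); the sqlite3_analyzer CLI is an alternative.
with sqlite3.connect("pudl.sqlite") as conn:
    rows = conn.execute(
        """
        SELECT name, SUM(pgsize) AS n_bytes
        FROM dbstat
        GROUP BY name
        ORDER BY n_bytes DESC
        LIMIT 20
        """
    ).fetchall()

for name, n_bytes in rows:
    note = " (intermediate output)" if name.startswith("_out_") else ""
    print(f"{n_bytes / 1e9:6.2f} GB  {name}{note}")
```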
-
Since converting `PudlTabl` methods to dagster assets that write data to `pudl.sqlite`, `pudl.sqlite` has grown by about 10x. While having all of our data in a database rather than installing the `pudl` python package is more convenient for users, downloading an 11 GB database is cumbersome. This is especially a problem if you want to incorporate the full `pudl.sqlite` file into CI running on a standard GitHub Runner, which only has 14 GB of disk space.

There are a few future changes that should reduce the size of the database:

- Remove `_out_*` assets from the database. These assets are intermediate steps to a more complete asset users should be interacting with. There are some `_out_*` assets in the database because they were available in `PudlTabl` and we want to continue to support `PudlTabl` until users migrate to pulling data directly from the database.

Even if we reduce `pudl.sqlite`'s size by making these changes, there's no guarantee we keep the size down as we add more data sources. Here are a few ideas on how we can make this "medium" sized data easier to access:

1. A `pudl.sqlite` "lite" database which only includes `out_*` tables that include all of the most useful information.
2. Access `pudl.sqlite` using the datasette API. Not the easiest method for pulling data into dataframes. We also don't have a great means of versioning the API.
3. Distribute parquet files of the tables in `pudl.sqlite`. This would allow users to easily pull very space-efficient files for individual tables into DataFrames. One potential issue with this solution is that users might end up pulling parquet files from different data versions. (See the first sketch after this list.)
4. Convert `pudl.sqlite` to a duckdb file. Given the design goals of duckdb it seems like an ideal tool for our type of data. However, I don't think duckdb's compression is as efficient as parquet's, the tooling only supports databases created with the same version, and it is an additional tool users will have to become familiar with. (See the second sketch after this list.)
5. Zip `pudl.sqlite`. This will make it easier for folks to download but will still consume most of the GitHub runner disk space when unzipped.
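To make option 3 a bit more concrete, here is a minimal sketch of what pulling a single table could look like for a downstream user, assuming the Parquet files were published at stable per-table URLs; the URL, table name, and nightly-build layout below are hypothetical:

```python
import pandas as pd  # assumes pandas + pyarrow, plus fsspec/aiohttp for reading over HTTP

# Hypothetical URL for a single-table Parquet file in a versioned nightly build;
# the real hostname and layout would be up to Catalyst.
URL = "https://example.com/pudl/nightly/2023-10-01/generators_eia860.parquet"

# Only this one table is downloaded, instead of the full 11 GB pudl.sqlite.
gens = pd.read_parquet(URL)
print(gens.shape)
```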
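And a rough sketch of option 4, using DuckDB's sqlite extension to copy tables out of the existing `pudl.sqlite` into a native DuckDB file; the table name is illustrative, and this is just one possible conversion path rather than a settled approach:

```python
import duckdb  # assumes the duckdb Python package is installed

con = duckdb.connect("pudl.duckdb")  # new native DuckDB file
con.execute("INSTALL sqlite;")
con.execute("LOAD sqlite;")
con.execute("ATTACH 'pudl.sqlite' AS pudl_sqlite (TYPE SQLITE);")

# Copy one table (or loop over all of them) into DuckDB storage.
con.execute(
    "CREATE TABLE generators_eia860 AS "
    "SELECT * FROM pudl_sqlite.generators_eia860;"
)
con.close()
```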