How to access full pudl.sqlite in a standard GitHub Runner
#3071
Replies: 4 comments 16 replies
-
I'm not sure this is a good idea, but I could imagine an additional idea being:

6. Distribute the smaller database we already build locally and for CI, which only covers a subset of years/partitions.

I personally think 1 or 6 sounds the most ideal. 1 because that might be a nice way for most users to access our data regardless of GitHub runners, but this could pose a problem if anyone wants to rely on the tidy normalized tables. And 6 because we already consistently build this db locally and for our CI, so it sounds straightforward, but it would mean downstream users' tests would be locked to whatever years/partitions we are currently testing.
-
@bendnorman you mention in 3):

> One potential issue with this solution is that users might end up pulling parquet files from different data versions.

How are data versions tracked right now? When I look at the nightly build or Datasette tables there doesn't seem to be an identifier with data versioning. Is it just that a complete ...

I have users that access a handful of PUDL tables for things like unit definitions, capacity, heat rates, historical capacity factor, etc. I'm only using the annual EIA data and supplementing it with 860m from my own processing pipeline. At the moment I ask users to download PUDL from Zenodo, and they probably use the same database until a change in table/column names prompts me to tell everyone to download a new version (this may have only happened once...). My ideal workflow would be fetching data as users need it, to minimize the up-front data compilation needs. Each time a user runs my software it would run a quick query (count of rows or max ...) to check whether the local data needs to be refreshed.
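A minimal sketch of what that quick query could look like, assuming a local copy of `pudl.sqlite`; the path, table name, and date column below are illustrative, not actual PUDL conventions:

```python
import sqlite3
from pathlib import Path

# Hypothetical local path to the downloaded database.
PUDL_DB = Path("~/pudl_data/pudl.sqlite").expanduser()


def data_fingerprint(table: str, date_col: str = "report_date") -> tuple[int, str]:
    """Return (row count, max report date) as a cheap data-version check."""
    with sqlite3.connect(PUDL_DB) as conn:
        n_rows, max_date = conn.execute(
            f"SELECT COUNT(*), MAX({date_col}) FROM {table}"
        ).fetchone()
    return n_rows, max_date


# Compare against the fingerprint recorded when the data was last fetched;
# if it differs from the current release, prompt the user to re-download.
print(data_fingerprint("generators_eia860"))  # illustrative table name
```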
-
It sounds like there's appetite for distributing a pile of parquet files, while punting any decisions about DuckDB usage on the Catalyst side - people are obviously free to use whatever tooling fits the bill (ha) for accessing data in parquet files. It also sounds like we should continue distributing the full SQLite file for people who want SQL constraints, and certainly continue using it internally so we can use those constraints while writing the dang data. I'm hearing a little less support for the "pre-subsetted" SQLite files.

I sort of think all of these are pretty easy to generate from our existing uncompressed complete SQLite output, and the hard thing is "figure out if we want to move pudl.sqlite completely over to duckdb." So my proposal is to do the three easy things but not the hard thing. In terms of ordering, I think we should do the Parquet stuff first since that seems to solve more downstream problems than the other two.
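As a rough illustration of the "easy to generate" claim, here is a sketch of producing per-table Parquet files from the existing uncompressed SQLite output; this is not how PUDL's ETL actually writes its outputs, and the paths plus the pandas/pyarrow dependency are assumptions:

```python
import sqlite3
from pathlib import Path

import pandas as pd  # assumes pandas with pyarrow installed for Parquet support

PUDL_DB = Path("pudl.sqlite")       # existing uncompressed SQLite output
PARQUET_DIR = Path("parquet_out")   # hypothetical output directory
PARQUET_DIR.mkdir(exist_ok=True)

with sqlite3.connect(PUDL_DB) as conn:
    # List user tables, skipping SQLite's internal tables.
    tables = [
        row[0]
        for row in conn.execute(
            "SELECT name FROM sqlite_master "
            "WHERE type='table' AND name NOT LIKE 'sqlite_%'"
        )
    ]
    for table in tables:
        df = pd.read_sql_query(f"SELECT * FROM {table}", conn)
        # Columnar compression typically makes these far smaller than the SQLite pages.
        df.to_parquet(PARQUET_DIR / f"{table}.parquet", index=False)
```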
-
I ran a disk utilization report on pudl.sqlite to see how the space breaks down across tables, including the intermediate output tables (denoted by `_out_*`). Sounds like we've settled on compressing the SQLite db and distributing collections of Parquet files, but I thought this was mildly helpful information. I've attached the full disk utilization report:
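For anyone wanting to generate a similar per-table report, one possible approach (an assumption, not necessarily what was used above) is SQLite's `dbstat` virtual table, which requires a SQLite build compiled with `SQLITE_ENABLE_DBSTAT_VTAB`; the `sqlite3_analyzer` tool is an alternative:

```python
import sqlite3

# Requires a SQLite build with the dbstat virtual table enabled
# (SQLITE_ENABLE_DBSTAT_VTAB); the sqlite3_analyzer CLI is an alternative.
with sqlite3.connect("pudl.sqlite") as conn:
    rows = conn.execute(
        """
        SELECT name, SUM(pgsize) AS n_bytes
        FROM dbstat
        GROUP BY name
        ORDER BY n_bytes DESC
        LIMIT 20
        """
    ).fetchall()

for name, n_bytes in rows:
    note = " (intermediate output)" if name.startswith("_out_") else ""
    print(f"{n_bytes / 1e9:6.2f} GB  {name}{note}")
```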
-
Since converting `PudlTabl` methods to dagster assets that write data to `pudl.sqlite`, `pudl.sqlite` has grown by about 10x. While having all of our data in a database rather than installing the `pudl` python package is more convenient for users, downloading an 11 GB database is cumbersome. This is especially a problem if you want to incorporate the full `pudl.sqlite` file into CI running on a standard GitHub Runner, which only has 14 GB of disk space.

There are a few future changes that should reduce the size of the database:

- Remove `_out_*` assets from the database. These assets are intermediate steps to a more complete asset users should be interacting with. There are some `_out_*` assets in the database because they were available in `PudlTabl` and we want to continue to support `PudlTabl` until users migrate to pulling data directly from the database.

Even if we reduce `pudl.sqlite`'s size by making these changes, there's no guarantee we keep the size down as we add more data sources. Here are a few ideas on how we can make this "medium" sized data easier to access:

1. A `pudl.sqlite` "lite" database which only includes `out_*` tables that include all of the most useful information.
2. Access `pudl.sqlite` using the datasette API. Not the easiest method for pulling data into dataframes. We also don't have a great means of versioning the API.
3. Distribute parquet files of the tables in `pudl.sqlite`. This would allow users to easily pull very space-efficient files for individual tables into DataFrames. One potential issue with this solution is that users might end up pulling parquet files from different data versions. (See the first sketch after this list.)
4. Convert `pudl.sqlite` to a duckdb file. Given the design goals of duckdb it seems like an ideal tool for our type of data. However, I don't think duckdb's compression is as efficient as parquet's, the tooling only supports databases created with the same version, and it is an additional tool users will have to become familiar with. (See the second sketch after this list.)
5. Zip `pudl.sqlite`. This will make it easier for folks to download but will still consume most of the GitHub runner disk space when unzipped.
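To make option 3 a bit more concrete, here is a minimal sketch of what pulling a single table could look like for a downstream user, assuming the Parquet files were published at stable per-table URLs; the URL, table name, and nightly-build layout below are hypothetical:

```python
import pandas as pd  # assumes pandas + pyarrow, plus fsspec/aiohttp for reading over HTTP

# Hypothetical URL for a single-table Parquet file in a versioned nightly build;
# the real hostname and layout would be up to Catalyst.
URL = "https://example.com/pudl/nightly/2023-10-01/generators_eia860.parquet"

# Only this one table is downloaded, instead of the full 11 GB pudl.sqlite.
gens = pd.read_parquet(URL)
print(gens.shape)
```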
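And a rough sketch of option 4, using DuckDB's sqlite extension to copy tables out of the existing `pudl.sqlite` into a native DuckDB file; the table name is illustrative, and this is just one possible conversion path rather than a settled approach:

```python
import duckdb  # assumes the duckdb Python package is installed

con = duckdb.connect("pudl.duckdb")  # new native DuckDB file
con.execute("INSTALL sqlite;")
con.execute("LOAD sqlite;")
con.execute("ATTACH 'pudl.sqlite' AS pudl_sqlite (TYPE SQLITE);")

# Copy one table (or loop over all of them) into DuckDB storage.
con.execute(
    "CREATE TABLE generators_eia860 AS "
    "SELECT * FROM pudl_sqlite.generators_eia860;"
)
con.close()
```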