Larger data distribution #2770
Replies: 3 comments 7 replies
-
Are these systems separate from the rest of PUDL, or is everything going to be in one place?
-
Is it not possible/workable to query Parquet files in cloud storage directly with DuckDB using HTTPFS? I think it can use the Parquet metadata to greatly reduce the amount of data that needs to be transferred for a given query, assuming we can identify the most useful columns to index or partition on. Is the EQR really hundreds of GB even when it's compressed? I don't know why I find this so hard to believe.
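Something like this is what I'm imagining; the URL and column names below are just placeholders, not real PUDL paths:

```python
# Rough sketch: query a remote Parquet file over HTTPS with DuckDB's httpfs
# extension. DuckDB fetches the Parquet footer/metadata first, then only the
# row groups and columns needed for the projection and filter.
# NOTE: the URL and column names are hypothetical placeholders.
import duckdb

con = duckdb.connect()
con.execute("INSTALL httpfs")
con.execute("LOAD httpfs")

df = con.execute(
    """
    SELECT seller_company_name, SUM(transaction_charge) AS total_charge
    FROM read_parquet('https://storage.example.com/pudl/eqr/eqr_2022q4.parquet')
    WHERE product_name = 'energy'
    GROUP BY seller_company_name
    ORDER BY total_charge DESC
    LIMIT 10
    """
).df()
print(df)
```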
-
I'm all for reducing hosting/cloud costs via "data at rest" architectures, but I have some concerns.
I'm not familiar with the DuckDB setup, so maybe it's fantastic and I'm just a stick-in-the-mud, but my ignorant intuition is that it will mean more labor hours for us and far less support from other tools like visualization layers, etc. I'd be really excited to get PUDL into a hosted database so that we can finally build other stuff on top of it.
-
I've been thinking about how to distribute FERC EQR, which will probably be > 100 GB of parquet files. Ideally, there'd be a solution that:
This is hard! Either users will have to be comfortable downloading > 100 GB of data and working with tools like Dask and DuckDB, or be comfortable setting up, managing, and using cloud infrastructure. Here are some options:
- DuckDB WASM on Cloud Run
- DuckDB local + s3 bucket (see the sketch after this list)
- Snowflake
- AWS Redshift + s3 bucket
- Google BigQuery Public Data
- Other ideas
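For the "DuckDB local + s3 bucket" option, here's a rough sketch of what the user side could look like. The bucket name, partition scheme, and column names are all hypothetical:

```python
# Hypothetical "DuckDB local + s3 bucket" workflow: local compute, Parquet at
# rest in S3, hive-style partitioning by year/quarter so that filters on the
# partition columns prune whole files before anything is downloaded.
# NOTE: bucket name, partition layout, and column names are illustrative only.
import duckdb

con = duckdb.connect()
con.execute("INSTALL httpfs")
con.execute("LOAD httpfs")
con.execute("SET s3_region = 'us-west-2'")  # credentials would be needed for a non-public bucket

result = con.execute(
    """
    SELECT report_quarter, COUNT(*) AS n_transactions
    FROM read_parquet(
        's3://example-pudl-bucket/eqr/report_year=*/report_quarter=*/*.parquet',
        hive_partitioning = true
    )
    WHERE report_year = 2022
    GROUP BY report_quarter
    ORDER BY report_quarter
    """
).df()
print(result)
```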
My dream would be to host a DuckDB WASM interface with the data pre-loaded, though based on my understanding of WASM, I don't think it can use local compute on remote data. The 2 GB memory limit is also a bummer.
The answer to this question probably depends on how people are interested in using the data. If folks just want some seller information for a given quarter, they can easily download one quarter's worth of data and explore it using Dask or DuckDB. If they want to analyze all 10 years of data, they'll probably need a cloud provider.
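For the single-quarter case, something like this would probably be enough; the file name and column names are made up:

```python
# Minimal sketch of the "download one quarter and explore locally" path using
# Dask. The file name and column names are hypothetical stand-ins for whatever
# the real EQR outputs end up being called.
import dask.dataframe as dd

# Only read the columns of interest to keep memory use down.
eqr = dd.read_parquet(
    "eqr_2022q4.parquet",
    columns=["seller_company_name", "product_name", "transaction_charge"],
)

# Top ten sellers by total charges for the quarter.
top_sellers = (
    eqr.groupby("seller_company_name")["transaction_charge"]
    .sum()
    .nlargest(10)
    .compute()
)
print(top_sellers)
```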
Another consideration here is our storage limit on the AWS bucket. We only get 100 GB of storage, though I think they will grant us 1 TB. Even with 1 TB of storage, we could only keep about 4 data release versions at a time if we're distributing EQR, which might end up being about 100-200 GB per release.