Larger data distribution #2770
Replies: 3 comments 7 replies
-
Are these systems separate from the rest of PUDL, or is everything going to be in one place?
-
Is it not possible/workable to query Parquet files in cloud storage directly with DuckDB using HTTPFS? I think it can use the Parquet metadata to greatly reduce the amount of data that needs to be transferred for a given query, assuming we can identify the most useful columns to index or partition on. Is the EQR really hundreds of GB even when it's compressed? I don't know why I find this so hard to believe.
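Something like this is what I'm imagining; the URL and column names below are just placeholders, not real PUDL paths:

```python
# Rough sketch: query a remote Parquet file over HTTPS with DuckDB's httpfs
# extension. DuckDB fetches the Parquet footer/metadata first, then only the
# row groups and columns needed for the projection and filter.
# NOTE: the URL and column names are hypothetical placeholders.
import duckdb

con = duckdb.connect()
con.execute("INSTALL httpfs")
con.execute("LOAD httpfs")

df = con.execute(
    """
    SELECT seller_company_name, SUM(transaction_charge) AS total_charge
    FROM read_parquet('https://storage.example.com/pudl/eqr/eqr_2022q4.parquet')
    WHERE product_name = 'energy'
    GROUP BY seller_company_name
    ORDER BY total_charge DESC
    LIMIT 10
    """
).df()
print(df)
```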
-
I'm all for reducing hosting/cloud costs via "data at rest" architectures, but I have some concerns.
I'm not familiar with the DuckDB setup, so maybe it's fantastic and I'm just a stick-in-the-mud, but my ignorant intuition is that it will mean more labor hours for us and far less support from other tools like visualization layers, etc. I'd be really excited to get PUDL into a hosted database so that we can finally build other stuff on top of it.
-
I've been thinking about how to distribute FERC EQR, which will probably be > 100 GB of parquet files. Ideally, there'd be a solution that:
This is hard! Either users will have to be comfortable downloading > 100 GB of data and working with tools like Dask and DuckDB, or be comfortable setting up, managing, and using cloud infrastructure. Here are some options:
- DuckDB WASM on Cloud Run
- DuckDB local + s3 bucket (see the sketch after this list)
- Snowflake
- AWS Redshift + s3 bucket
- Google BigQuery Public Data
- Other ideas
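For the "DuckDB local + s3 bucket" option, here's a rough sketch of what the user side could look like. The bucket name, partition scheme, and column names are all hypothetical:

```python
# Hypothetical "DuckDB local + s3 bucket" workflow: local compute, Parquet at
# rest in S3, hive-style partitioning by year/quarter so that filters on the
# partition columns prune whole files before anything is downloaded.
# NOTE: bucket name, partition layout, and column names are illustrative only.
import duckdb

con = duckdb.connect()
con.execute("INSTALL httpfs")
con.execute("LOAD httpfs")
con.execute("SET s3_region = 'us-west-2'")  # credentials would be needed for a non-public bucket

result = con.execute(
    """
    SELECT report_quarter, COUNT(*) AS n_transactions
    FROM read_parquet(
        's3://example-pudl-bucket/eqr/report_year=*/report_quarter=*/*.parquet',
        hive_partitioning = true
    )
    WHERE report_year = 2022
    GROUP BY report_quarter
    ORDER BY report_quarter
    """
).df()
print(result)
```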
My dream would be to host a DuckDB WASM interface with the data pre-loaded, though based on my understanding of WASM, I don't think it can use local compute on remote data. The 2 GB memory limit is also a bummer.
The answer to this question probably depends on how people are interested in using the data. If folks just want some seller information for a given quarter, they can easily download one quarter's worth of data and explore it using Dask or DuckDB. If they want to analyze all 10 years of data, they'll probably need a cloud provider.
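For the single-quarter case, something like this would probably be enough; the file name and column names are made up:

```python
# Minimal sketch of the "download one quarter and explore locally" path using
# Dask. The file name and column names are hypothetical stand-ins for whatever
# the real EQR outputs end up being called.
import dask.dataframe as dd

# Only read the columns of interest to keep memory use down.
eqr = dd.read_parquet(
    "eqr_2022q4.parquet",
    columns=["seller_company_name", "product_name", "transaction_charge"],
)

# Top ten sellers by total charges for the quarter.
top_sellers = (
    eqr.groupby("seller_company_name")["transaction_charge"]
    .sum()
    .nlargest(10)
    .compute()
)
print(top_sellers)
```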
Another consideration here is our storage limit on the AWS bucket. We only get 100 GB of storage, though I think they will grant us 1 TB. Even with 1 TB of storage, we could only keep about 4 data release versions at a time if we're distributing EQR, which might end up being about 100-200 GB per release.