Create pudl.duckdb from parquet files #3741
base: main
Conversation
@@ -641,7 +641,7 @@ def to_pandas_dtype(self, compact: bool = False) -> str | pd.CategoricalDtype:
     def to_sql_dtype(self) -> type:  # noqa: A003
         """Return SQLAlchemy data type."""
         if self.constraints.enum and self.type == "string":
-            return sa.Enum(*self.constraints.enum)
+            return sa.Enum(*self.constraints.enum, name=f"{self.name}_enum")
DuckDB enum types require a name.
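Presumably this is because DuckDB enums are standalone named types created with CREATE TYPE, so an anonymous SQLAlchemy Enum has nothing to name the type after. A tiny illustration (the table and values here are made up, not real PUDL metadata):

import duckdb

con = duckdb.connect()
# DuckDB enums are named types that must be created before they can be used
# in a column definition.
con.execute("CREATE TYPE sector_enum AS ENUM ('commercial', 'industrial', 'residential')")
con.execute("CREATE TABLE sectors (code sector_enum)")
con.execute("INSERT INTO sectors VALUES ('commercial')")
print(con.execute("SELECT * FROM sectors").fetchall())  # [('commercial',)]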
"core_eia__codes_sector_consolidated": {"code": {"type": "integer"}}, | ||
"core_eia__codes_steam_plant_types": {"code": {"type": "integer"}}, | ||
"core_eia__codes_wind_quality_class": {"code": {"type": "integer"}}, |
There were some type mismatches in these code tables.
devtools/parquet_to_duckdb.py
Outdated
return

# create duck db schema from pudl package
resource_ids = (r.name for r in PUDL_PACKAGE.resources if len(r.name) <= 63)
This will be removed once we rename the tables to be less than 63 characters
# TODO: Not supporting enums with duckdb right now because
# code tables primary keys are VARCHARS but the tables that
# reference them have ENUM types. Duckdb throws an error
# because of the mismatched types. It is ok for now because
# the enums are checked in SQLite and the foreign key relationship
# acts as an enum constraint.
Are the ENUM types in the tables that reference the coding tables over-specifying the values? Like are they doubly constrained, by both the ENUM and by the FK? I saw some ENUMs like that creep in a while ago but it seemed like they probably shouldn't be there -- just having the FK relationship should be sufficient to ensure that all the values are valid.
Yes, they are doubly constrained. This wasn't an issue with SQLite because the columns in both tables are TEXT types and we add a CHECK constraint to the enum columns to make sure all the values are valid. Our Field.to_sql_dtype() method creates a sa.Enum type. Maybe we can just make enums an sa.String type? Do you know if we have any enum columns that aren't validated by a foreign key constraint?
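Not the actual PUDL code, but a minimal sketch of the String fallback idea, assuming the dtype helper knows which dialect it's targeting:

import sqlalchemy as sa

def enum_or_string(enum_values: list[str], column_name: str, dialect: str) -> sa.types.TypeEngine:
    # Sketch: for DuckDB, lean on the FK to the coding table instead of an ENUM;
    # other dialects keep the named ENUM (and SQLite its CHECK constraint).
    if dialect == "duckdb":
        return sa.String()
    return sa.Enum(*enum_values, name=f"{column_name}_enum")

# e.g. enum_or_string(["gas", "coal"], "fuel_type_code", "duckdb") -> VARCHAR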
Hmm. IIRC there's a big space savings for using ENUM in Parquet (maybe in DuckDB too?) in the big tables (where strings get dictionary encoded as ints). Maybe we just want to make both the PK and FK columns into ENUMs to satisfy DuckDB's (very reasonable) nitpickiness around dtypes? Even though that'll be over-constrained in terms of limiting the allowed values? Not sure what the best approach is.
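A rough sketch of that over-constrained-but-consistent option at the SQLAlchemy level (table, column, and enum values are illustrative, not the real PUDL metadata):

import sqlalchemy as sa

metadata = sa.MetaData()
# One named ENUM reused on both sides of the relationship, so DuckDB sees
# identical types on the coding table's primary key and the referencing column.
fuel_enum = sa.Enum("coal", "gas", "wind", name="fuel_type_code_enum")

fuel_codes = sa.Table(
    "core_codes_fuel_types",
    metadata,
    sa.Column("code", fuel_enum, primary_key=True),
)
generators = sa.Table(
    "core_generators",
    metadata,
    sa.Column("generator_id", sa.Integer, primary_key=True),
    sa.Column("fuel_type_code", fuel_enum, sa.ForeignKey("core_codes_fuel_types.code")),
)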
Do you have ideas on how we can make the code field in the code tables an enum type using our metadata tooling? It feels tricky because the code field is shared among all the code tables, but they all have different enum values.

The parquet_to_duckdb script freezes after loading a few dozen tables. I removed the foreign key constraints from the database schema, and the data loaded just fine. I'm assuming the slowdown is happening because it's spending a bunch of time checking the sometimes 5-6 foreign key constraints on a table. I wonder if enabling the enums will help with this issue?
Can we define the various different ENUMs for the code columns in all the different coding tables on a per-resource basis in the special cases at the bottom of the resource metadata files?
devtools/parquet_to_duckdb.py
Outdated
def convert_parquet_to_duckdb(parquet_dir: str, duckdb_path: str):
    """Convert a directory of Parquet files to a DuckDB database.

    Args:
        parquet_dir: Path to a directory of parquet files.
        duckdb_path: Path to the new DuckDB database file (should not exist).

    Example:
        python parquet_to_duckdb.py /path/to/parquet/directory duckdb.db
    """
Further down the road do we want this functionality to stay outside of the core package? Or do we think we'll want to run it at the end of the ETL all the time? Would it make sense to put it in say pudl.convert.parquet_to_duckdb, and add the CLI to our entry points in pyproject.toml so we can test it along with all the other CLIs?
Yeah, let's add it to pudl.convert so we can test it with our other CLIs.
src/pudl/metadata/classes.py
Outdated
@@ -2028,6 +2057,7 @@ def to_sql(
             metadata,
             check_types=check_types,
             check_values=check_values,
+            dialect=dialect,
Note that there are a number of resources (all the hourly ones which are currently parquet-only, so in CEMS, EIA-930, FERC-714, GridPathRAToolkit) that have create_database_schema==False, but for the DuckDB output it seems like we probably do want all of those tables included in the unified database file, since it's compressed and performant and would really be an all-in-one resource.
Ah, good reminder. I'll add some logic to to_sql to only filter out those tables if we're using sqlite.
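Something like this hypothetical filter inside to_sql() would keep the parquet-only hourly resources for DuckDB while preserving the current SQLite behavior:

def resources_for_dialect(package, dialect: str) -> list:
    # Sketch: only SQLite drops resources flagged with create_database_schema=False;
    # DuckDB (and anything else) gets every table in the package.
    return [
        resource
        for resource in package.resources
        if resource.create_database_schema or dialect != "sqlite"
    ]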
Add option to disable foreign keys because it slows down the duckdb loading time
@@ -282,6 +289,9 @@ if [[ $ETL_SUCCESS == 0 ]]; then
     # Distribute Parquet outputs to a private bucket
     distribute_parquet 2>&1 | tee -a "$LOGFILE"
     DISTRIBUTE_PARQUET_SUCCESS=${PIPESTATUS[0]}
+    # Create duckdb file from parquet files
+    create_duckdb 2>&1 | tee -a "$LOGFILE"
create_duckdb() or create_and_distribute_duckdb()?

(I have been loving the ShellCheck linter for VSCode.)
name="parquet_to_duckdb", | ||
context_settings={"help_option_names": ["-h", "--help"]}, | ||
) | ||
@click.argument("parquet_dir", type=click.Path(exists=True, resolve_path=True)) |
I think you can require that it be a directory here too.
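If I'm reading the click docs right, file_okay=False is enough to reject anything that isn't a directory:

import click

@click.command()
@click.argument(
    "parquet_dir",
    # dir_okay=True is the default; file_okay=False makes click reject paths
    # that exist but are not directories.
    type=click.Path(exists=True, file_okay=False, dir_okay=True, resolve_path=True),
)
def cli(parquet_dir: str):
    click.echo(parquet_dir)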
)
@click.argument("parquet_dir", type=click.Path(exists=True, resolve_path=True))
@click.argument(
    "duckdb_path", type=click.Path(resolve_path=True, writable=True, allow_dash=False)
What's the expected / desired behavior if the pudl.duckdb file already exists?
Oh I see below. I think you can require that it not already exist in this check too?
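As far as I know click.Path doesn't have a "must not exist" flag, so requiring non-existence at argument-parsing time would probably need a small callback, something like:

from pathlib import Path

import click

def _must_not_exist(ctx, param, value):
    # Fail fast instead of writing into an existing database file.
    if Path(value).exists():
        raise click.BadParameter(f"{value} already exists; refusing to overwrite it.")
    return value

@click.command()
@click.argument(
    "duckdb_path",
    type=click.Path(resolve_path=True, writable=True, allow_dash=False),
    callback=_must_not_exist,
)
def cli(duckdb_path: str):
    click.echo(duckdb_path)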
        check_fks: If true, enable foreign keys in the database. Currently,
            the parquet load process freezes up when foreign keys are enabled.
I dunno if you checked which tables were grinding to a halt, but I'm curious what happens if you don't load any of the hourly tables, which have 10s to 100s of millions of rows -- especially core_epacems__hourly_emissions, which is nearly 1 billion rows and definitely has several implied FK relationships.
I'm poking around at potential DB documentation tools in #3746 and noticed that the table and column comments don't seem to be included in the DuckDB schema that's currently getting produced. After running:

parquet_to_duckdb --check-fks --no-load ../pudl-output/parquet ../pudl-output/pudl.duckdb

I opened up the DuckDB CLI and:

SELECT COMMENT FROM duckdb_tables() LIMIT 20;
SELECT COMMENT FROM duckdb_columns() LIMIT 20;

just gives emptiness. And yet this does find the comments:

import pudl

pkg = pudl.metadata.classes.Package.from_resource_ids()
dude = pkg.to_sql(dialect="duckdb")
print(dude.tables["out_ferc714__summarized_demand"].comment)
# Compile FERC 714 annualized, categorized respondents and summarize values.
print(dude.tables["out_ferc714__summarized_demand"].columns[0].comment)
# Date reported.
In the hopes of being able to more easily build database documentation programmatically (#3746, e.g. PUDL on dbdocs.io) I changed the DuckDB loading script to explicitly include table and column comments. Unfortunately they don't automatically come along for the ride with:

import sqlalchemy as sa
from sqlalchemy.schema import CreateTable
from sqlalchemy.schema import SetColumnComment, SetTableComment

pudl_duckdb_engine = sa.create_engine("duckdb:////Users/zane/code/catalyst/pudl-output/pudl.duckdb")
md = sa.MetaData()
md.reflect(bind=pudl_duckdb_engine)
sql_statements = [
    "CREATE SCHEMA information_schema",
    "CREATE SCHEMA pg_catalog",
]
for table in md.sorted_tables:
    sql_statements.append(str(CreateTable(table)))
    sql_statements.append(str(SetTableComment(table)))
    for col in table.columns:
        sql_statements.append(str(SetColumnComment(col)))
with open("schema.sql", "w") as f:
    for statement in sql_statements:
        f.write(statement.strip() + ";\n")

Made some small feature requests that would make propagating the comments easier:
sql_command = f""" | ||
COPY {table.name} FROM '{parquet_file_path}' (FORMAT PARQUET); | ||
""" |
Even without the foreign keys turned on, I got an out-of-memory error while it was attempting to load the EPA CEMS parquet (5GB on disk). Do we need to do some kind of chunking? Maybe by row-groups?
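One possible shape for that, reading the parquet file one row group at a time with pyarrow and inserting each chunk (an untested sketch; the real script may need different table or connection handling):

import duckdb
import pyarrow.parquet as pq

def copy_parquet_in_row_groups(duckdb_conn, table_name: str, parquet_path: str) -> None:
    # Load one row group at a time instead of handing DuckDB the whole multi-GB
    # file, so peak memory stays roughly one row group's worth of data.
    parquet_file = pq.ParquetFile(parquet_path)
    for i in range(parquet_file.num_row_groups):
        chunk = parquet_file.read_row_group(i)  # a pyarrow.Table
        duckdb_conn.register("chunk", chunk)
        duckdb_conn.execute(f"INSERT INTO {table_name} SELECT * FROM chunk")
        duckdb_conn.unregister("chunk")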
    metadata.create_all(engine)

    if not no_load:
        with duckdb.connect(database=str(duckdb_path)) as duckdb_conn:
I messed around with this loop over the weekend to explicitly add the row and table comments to the schema, but later realized that:

- there was a bug in Resource.to_sql() in that it didn't set the comment field.
- the first and last columns that show up when you SELECT comment FROM duckdb_columns() aren't our columns -- they're duckdb internal stuff... and they have no comments. So it looked like there were no comments on the columns, but there actually were... so my explicit adding was unnecessary.
Thus, the only change that stuck around was switching to using a context manager.
What a journey - at least we learned something 😅, thanks for doing the digging!
        return sa.Table(
            self.name, metadata, *columns, *constraints, comment=self.description
        )
Previously the comment field on the Table was not being set, so there were no Table level descriptions.
When I try and load all the parquet files into DuckDB with:

parquet_to_duckdb ../pudl-output/parquet ../pudl-output/pudl.duckdb

I consistently get a memory allocation error when it's trying to load the EPA CEMS. @bendnorman you said you've been able to get it to load everything successfully? I tried shutting down all my memory intensive programs and it seemed to sit at ~16GB of memory for a while, creating temporary files alongside the database file, and then suddenly crashed, even though I never saw the memory usage get close to the maximum available, so I don't know what's going on there. Maybe it spiked so quickly I didn't see it in my system monitor? It says 25.5GB/25.5GB used.
It turns out I have my DuckDB memory limit set to 25.5 GB (not sure why) so it's crashing when it gets there, rather than grinding the machine to a halt. But with memory usage that high it seems like maybe it is not splitting the big parquet file up into smaller pieces for loading, which seems a little surprising. I came across this issue and so tried setting the memory limit to 10GB instead of the default 25.5GB (80% of RAM) and it still failed, at about the same place, but with 10GB of memory usage instead of 25GB. The temporary files it created were smaller than before.
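In case it helps anyone reproducing this, the limit can also be set explicitly on the connection before loading (the values here are just examples, not a recommendation):

import duckdb

con = duckdb.connect("pudl.duckdb")
con.execute("SET memory_limit = '10GB'")  # DuckDB's default is 80% of RAM
con.execute("SET temp_directory = '/tmp/duckdb_spill'")  # where spill files get written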
Overview
Closes #3739.
What problem does this address?
What did you change?
Testing
How did you make sure this worked? How can a reviewer verify this?
To-do list