Create pudl.duckdb from parquet files #3741

Draft · wants to merge 17 commits into base: main

Conversation


@bendnorman bendnorman commented Jul 26, 2024

Overview

Closes #3739 .

What problem does this address?

What did you change?

Testing

How did you make sure this worked? How can a reviewer verify this?

To-do list

@bendnorman bendnorman self-assigned this Jul 26, 2024
@bendnorman bendnorman added the parquet, nightly-builds, and duckdb labels Jul 26, 2024
@@ -641,7 +641,7 @@ def to_pandas_dtype(self, compact: bool = False) -> str | pd.CategoricalDtype:
     def to_sql_dtype(self) -> type:  # noqa: A003
         """Return SQLAlchemy data type."""
         if self.constraints.enum and self.type == "string":
-            return sa.Enum(*self.constraints.enum)
+            return sa.Enum(*self.constraints.enum, name=f"{self.name}_enum")
Member Author

DuckDB enum types require a name.

Comment on lines +4998 to +5000
"core_eia__codes_sector_consolidated": {"code": {"type": "integer"}},
"core_eia__codes_steam_plant_types": {"code": {"type": "integer"}},
"core_eia__codes_wind_quality_class": {"code": {"type": "integer"}},
Member Author

There were some type mismatches in these code tables.

return

# create duck db schema from pudl package
resource_ids = (r.name for r in PUDL_PACKAGE.resources if len(r.name) <= 63)
Member Author

This will be removed once we rename the tables to be shorter than 63 characters.

Comment on lines +712 to +717
# TODO: Not supporting enums with duckdb right now because
# code tables primary keys are VARCHARS but the tables that
# reference them have ENUM types. Duckdb throws an error
# because of the mismatched types. It is ok for now because
# the enums are checked in SQLite and the foreign key relationship
# acts as an enum constraint.
Member

Are the ENUM types in the tables that reference the coding tables over-specifying the values? Like are they doubly constrained, by both the ENUM and by the FK? I saw some ENUMs like that creep in a while ago but it seemed like they probably shouldn't be there -- just having the FK relationship should be sufficient to ensure that all the values are valid.

@bendnorman bendnorman (Member Author) Jul 26, 2024

Yes, they are doubly constrained. This wasn't an issue with SQLite because the columns in both tables are TEXT types, and we add a CHECK constraint to the enum columns to make sure all the values are valid.

Our Field.to_sql_dtype() method creates a sa.Enum type. Maybe we can just make enums an sa.String type? Do you know if we have any enum columns that aren't validated by a foreign key constraint?

Member

Hmm. IIRC there's a big space savings for using ENUM in Parquet (maybe in DuckDB too?) in the big tables (where strings get dictionary encoded as ints). Maybe we just want to make both the PK and FK columns into ENUMs to satisfy DuckDB's (very reasonable) nitpickiness around dtypes? Even though that'll be over-constrained in terms of limiting the allowed values? Not sure what the best approach is.

Member Author

Do you have ideas on how we can make the code field in the code tables an enum type using our metadata tooling? It feels tricky because the code field is shared among all the code tables, but they all have different enum values.

The parquet_to_duckdb script freezes after loading a few dozen tables. I removed the foreign key constraints from the database schema, and the data loaded just fine. I'm assuming the slowdown is happening because it's spending a bunch of time checking the sometimes 5-6 foreign key constraints on a table. I wonder if enabling the enums would help with this issue?

Member

Can we define the various different ENUMs for the code columns in all the different coding tables on a per-resource basis, in the special cases at the bottom of the resource metadata files?
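For illustration only, a hypothetical special-case entry in that style, modeled on the integer code overrides quoted earlier; the key structure and enum values here are placeholders, not actual PUDL metadata:

# Hypothetical sketch of per-resource special cases: alongside the existing
# type overrides, each coding table could also declare the enum for its
# "code" column. Names and values below are placeholders.
CODE_COLUMN_SPECIAL_CASES = {
    "core_eia__codes_sector_consolidated": {"code": {"type": "integer"}},
    "core_eia__codes_example": {  # hypothetical coding table
        "code": {"constraints": {"enum": ["A", "B", "C"]}},  # placeholder values
    },
}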

Comment on lines 23 to 32
def convert_parquet_to_duckdb(parquet_dir: str, duckdb_path: str):
    """Convert a directory of Parquet files to a DuckDB database.

    Args:
        parquet_dir: Path to a directory of parquet files.
        duckdb_path: Path to the new DuckDB database file (should not exist).

    Example:
        python parquet_to_duckdb.py /path/to/parquet/directory duckdb.db
    """
Member

Further down the road, do we want this functionality to stay outside of the core package? Or do we think we'll want to run it at the end of the ETL all the time? Would it make sense to put it in, say, pudl.convert.parquet_to_duckdb and add the CLI to our entry points in pyproject.toml, so we can test it along with all the other CLIs?

Member Author

Yeah, let's add it to pudl.convert so we can test it with our other CLIs.

@@ -2028,6 +2057,7 @@ def to_sql(
     metadata,
     check_types=check_types,
     check_values=check_values,
+    dialect=dialect,
Member

Note that there are a number of resources (all the hourly ones, which are currently parquet-only: CEMS, EIA-930, FERC-714, GridPathRAToolkit) that have create_database_schema==False. For the DuckDB output it seems like we probably do want all of those tables included in the unified database file, since it's compressed and performant and would really be an all-in-one resource.

Member Author

Ah good reminder. I'll add some logic to to_sql to only filter out those tables if we're using sqlite.
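For reference, a minimal sketch of that idea (not the PR's actual change; the create_database_schema attribute name comes from the comment above, and the helper name is made up):

# Sketch: only skip resources flagged create_database_schema=False when the
# target dialect is SQLite, so the DuckDB schema still includes the hourly
# parquet-only tables.
def resources_for_dialect(package, dialect: str):
    """Yield the resources whose tables belong in a database for this dialect."""
    for resource in package.resources:
        if dialect == "sqlite" and not resource.create_database_schema:
            continue  # hourly/parquet-only tables stay out of SQLite
        yield resource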

@@ -282,6 +289,9 @@ if [[ $ETL_SUCCESS == 0 ]]; then
     # Distribute Parquet outputs to a private bucket
     distribute_parquet 2>&1 | tee -a "$LOGFILE"
     DISTRIBUTE_PARQUET_SUCCESS=${PIPESTATUS[0]}
+    # Create duckdb file from parquet files
+    create_duckdb 2>&1 | tee -a "$LOGFILE"
Member

create_duckdb() or create_and_distribute_duckdb()?

(I have been loving the ShellCheck linter for VSCode)


name="parquet_to_duckdb",
context_settings={"help_option_names": ["-h", "--help"]},
)
@click.argument("parquet_dir", type=click.Path(exists=True, resolve_path=True))
Member

I think you can require that it be a directory here too.
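For example, a minimal sketch of that suggestion (the stub command is just for illustration; click.Path does support these flags):

# Sketch: require that parquet_dir is an existing directory, not a file.
import click

@click.command()
@click.argument(
    "parquet_dir",
    type=click.Path(exists=True, file_okay=False, dir_okay=True, resolve_path=True),
)
def stub_cli(parquet_dir):
    """Stub command that only exists to illustrate the click.Path options."""
    click.echo(parquet_dir)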

)
@click.argument("parquet_dir", type=click.Path(exists=True, resolve_path=True))
@click.argument(
    "duckdb_path", type=click.Path(resolve_path=True, writable=True, allow_dash=False)
Member

What's the expected / desired behavior if the pudl.duckdb file already exists?

Member

Oh I see below. I think you can require that it not already exist in this check too?
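A small sketch of one way to enforce that, since click.Path has no "must not exist" flag (callbacks are a standard click feature; the helper name is made up and would be attached to the duckdb_path argument via callback=_require_new_path):

# Sketch: reject the duckdb_path argument if the file already exists.
from pathlib import Path

import click

def _require_new_path(ctx, param, value):
    """Fail argument parsing if the target file already exists."""
    if Path(value).exists():
        raise click.BadParameter(f"{value} already exists; refusing to overwrite it.")
    return value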

Comment on lines +51 to +52
check_fks: If true, enable foreign keys in the database. Currently,
the parquet load process freezes up when foreign keys are enabled.
Member

I dunno if you checked which tables were grinding to a halt, but I'm curious what happens if you don't load any of the hourly tables, which have 10s to 100s of millions of rows -- especially core_epacems__hourly_emissions which is nearly 1 billion rows and definitely has several implied FK relationships.


zaneselvans commented Jul 28, 2024

I'm poking around at potential DB documentation tools in #3746 and noticed that the table and column comments don't seem to be included in the DuckDB schema that's currently getting produced.

After running:

parquet_to_duckdb --check-fks --no-load ../pudl-output/parquet ../pudl-output/pudl.duckdb

I opened up the DuckDB CLI and:

SELECT COMMENT FROM duckdb_tables() LIMIT 20;
SELECT COMMENT FROM duckdb_columns() LIMIT 20;

just gives emptiness.

And yet this does find the comments:

import pudl

pkg = pudl.metadata.classes.Package.from_resource_ids()
dude = pkg.to_sql(dialect="duckdb")
print(dude.tables["out_ferc714__summarized_demand"].comment)
# Compile FERC 714 annualized, categorized respondents and summarize values.

print(dude.tables["out_ferc714__summarized_demand"].columns[0].comment)
# Date reported.

@zaneselvans zaneselvans linked an issue Jul 28, 2024 that may be closed by this pull request

zaneselvans commented Jul 28, 2024

In the hopes of being able to more easily build database documentation programmatically (#3746, e.g. PUDL on dbdocs.io), I changed the DuckDB loading script to explicitly include table and column comments. Unfortunately they don't automatically come along for the ride with EXPORT DATABASE, so I'm generating a schema.sql separately like:

import sqlalchemy as sa
from sqlalchemy.schema import CreateTable
from sqlalchemy.schema import SetColumnComment, SetTableComment

pudl_duckdb_engine = sa.create_engine("duckdb:////Users/zane/code/catalyst/pudl-output/pudl.duckdb")
md = sa.MetaData()
md.reflect(bind=pudl_duckdb_engine)

sql_statements = [
    "CREATE SCHEMA information_schema",
    "CREATE SCHEMA pg_catalog",
]
for table in md.sorted_tables:
    sql_statements.append(str(CreateTable(table)))
    sql_statements.append(str(SetTableComment(table)))
    for col in table.columns:
        sql_statements.append(str(SetColumnComment(col)))

with open("schema.sql", "w") as f:
    for statement in sql_statements:
        f.write(statement.strip() + ";\n")

Made some small feature requests that would make propagating the comments easier:

Comment on lines +86 to +88
sql_command = f"""
COPY {table.name} FROM '{parquet_file_path}' (FORMAT PARQUET);
"""
Member

Even without the foreign keys turned on, I got an out-of-memory error while it was attempting to load the EPA CEMS parquet (5GB on disk). Do we need to do some kind of chunking? Maybe by row-groups?
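A rough sketch of what row-group chunking could look like (not part of the PR; it assumes the target table has already been created in the DuckDB file, and uses pyarrow to read one row group at a time):

# Sketch: load a large parquet file one row group at a time instead of a
# single COPY ... FROM of the whole file, to bound memory usage.
import duckdb
import pyarrow.parquet as pq

def load_parquet_by_row_group(duckdb_path: str, table_name: str, parquet_path: str) -> None:
    parquet_file = pq.ParquetFile(parquet_path)
    with duckdb.connect(database=duckdb_path) as conn:
        for i in range(parquet_file.num_row_groups):
            batch = parquet_file.read_row_group(i)  # one row group as a pyarrow.Table
            conn.register("row_group_batch", batch)  # expose the Arrow table to SQL
            conn.execute(f"INSERT INTO {table_name} SELECT * FROM row_group_batch")
            conn.unregister("row_group_batch")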

metadata.create_all(engine)

if not no_load:
    with duckdb.connect(database=str(duckdb_path)) as duckdb_conn:
Member

I messed around with this loop over the weekend to explicitly add the row and table comments to the schema, but later realized that:

  • there was a bug in Resource.to_sql() in that it didn't set the comment field.
  • the first and last columns that show up when you SELECT comment FROM duckdb_columns() aren't our columns -- they're duckdb internal stuff... and they have no comments. So it looked like there were no comments on the columns, but there actually were... so my explicit adding was unnecessary.

Thus, the only change that stuck around was switching to using a context manager.

Member

What a journey - at least we learned something 😅, thanks for doing the digging!

Comment on lines +1511 to +1513
return sa.Table(
    self.name, metadata, *columns, *constraints, comment=self.description
)
Member

Previously the comment field on the Table was not being set, so there were no table-level descriptions.


zaneselvans commented Jul 30, 2024

When I try and load all the parquet files into DuckDB with:

parquet_to_duckdb ../pudl-output/parquet ../pudl-output/pudl.duckdb

I consistently get a memory allocation error when it's trying to load the EPA CEMS. @bendnorman you said you've been able to get it to load everything successfully?

I tried shutting down all my memory intensive programs and it seemed to sit at ~16GB of memory for a while, creating temporary files alongside the database file, and then suddenly crashed, even though I never saw the memory usage get close to the maximum available, so I don't know what's going on there. Maybe it spiked so quickly I didn't see it in my system monitor? It says 25.5GB/25.5GB used.

INFO:pudl.convert.parquet_to_duckdb:Loading parquet file: /Users/zane/code/catalyst/pudl-output/parquet/core_epacems__hourly_emissions.parquet into /Users/zane/code/catalyst/pudl-output/pudl.duckdb
Traceback (most recent call last):
  File "/Users/zane/miniforge3/envs/pudl-dev/bin/parquet_to_duckdb", line 8, in <module>
    sys.exit(parquet_to_duckdb())
             ^^^^^^^^^^^^^^^^^^^
  File "/Users/zane/miniforge3/envs/pudl-dev/lib/python3.12/site-packages/click/core.py", line 1157, in __call__
    return self.main(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/zane/miniforge3/envs/pudl-dev/lib/python3.12/site-packages/click/core.py", line 1078, in main
    rv = self.invoke(ctx)
         ^^^^^^^^^^^^^^^^
  File "/Users/zane/miniforge3/envs/pudl-dev/lib/python3.12/site-packages/click/core.py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/zane/miniforge3/envs/pudl-dev/lib/python3.12/site-packages/click/core.py", line 783, in invoke
    return __callback(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/zane/code/catalyst/pudl/src/pudl/convert/parquet_to_duckdb.py", line 76, in parquet_to_duckdb
    with duckdb.connect(database=str(duckdb_path)) as duckdb_conn:
  File "/Users/zane/code/catalyst/pudl/src/pudl/convert/parquet_to_duckdb.py", line 89, in parquet_to_duckdb
    duckdb_cursor.execute(sql_command)
duckdb.duckdb.OutOfMemoryException: Out of Memory Error: could not allocate block of size 256.0 KiB (25.5 GiB/25.5 GiB used)

It turns out I have my DuckDB memory limit set to 25.5 GB (not sure why) so it's crashing when it gets there, rather than grinding the machine to a halt. But with memory usage that high it seems like maybe it is not splitting the big parquet file up into smaller pieces for loading, which seems a little surprising.

I came across this issue and so tried setting the memory limit to 10GB instead of the default 25.5GB (80% of RAM) and it still failed, at about the same place, but with 10GB of memory usage instead of 25GB. The temporary files it created were smaller than before.
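For anyone following along, the limit being adjusted here is DuckDB's memory_limit setting (a standard DuckDB setting; the values and temp path below are just examples, roughly what's discussed in this thread):

# Example: cap DuckDB's memory usage and point its spill files somewhere
# specific. The 10GB value and the temp path are illustrative only.
import duckdb

with duckdb.connect(database="pudl.duckdb") as conn:
    conn.execute("SET memory_limit = '10GB'")  # default is ~80% of system RAM
    conn.execute("SET temp_directory = '/tmp/duckdb_spill'")  # where temporary files go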

Labels
duckdb: Issues referring to duckdb, the embedded OLAP database
nightly-builds: Anything having to do with nightly builds or continuous deployment.
parquet: Issues related to the Apache Parquet file format which we use for long tables.
Projects
Status: Backlog
Development

Successfully merging this pull request may close these issues.

Load parquet files to a duckdb file
3 participants