Create pudl.duckdb from parquet files #3741

Draft · wants to merge 17 commits into base: main

Conversation


@bendnorman bendnorman commented Jul 26, 2024

Overview

Closes #3739 .

What problem does this address?

What did you change?

Testing

How did you make sure this worked? How can a reviewer verify this?

To-do list

@bendnorman bendnorman self-assigned this Jul 26, 2024
@bendnorman bendnorman added the parquet, nightly-builds, and duckdb labels Jul 26, 2024
@@ -641,7 +641,7 @@ def to_pandas_dtype(self, compact: bool = False) -> str | pd.CategoricalDtype:
     def to_sql_dtype(self) -> type:  # noqa: A003
         """Return SQLAlchemy data type."""
         if self.constraints.enum and self.type == "string":
-            return sa.Enum(*self.constraints.enum)
+            return sa.Enum(*self.constraints.enum, name=f"{self.name}_enum")
Member Author

DuckDB enum types require a name.

Comment on lines +4998 to +5000
"core_eia__codes_sector_consolidated": {"code": {"type": "integer"}},
"core_eia__codes_steam_plant_types": {"code": {"type": "integer"}},
"core_eia__codes_wind_quality_class": {"code": {"type": "integer"}},
Member Author

There were some type mismatches in these code tables.

return

# create duck db schema from pudl package
resource_ids = (r.name for r in PUDL_PACKAGE.resources if len(r.name) <= 63)
Member Author

This will be removed once we rename the tables to be shorter than 63 characters.

Comment on lines +712 to +717
# TODO: Not supporting enums with duckdb right now because
# code tables primary keys are VARCHARS but the tables that
# reference them have ENUM types. Duckdb throws an error
# because of the mismatched types. It is ok for now because
# the enums are checked in SQLite and the foreign key relationship
# acts as an enum constraint.
Member

Are the ENUM types in the tables that reference the coding tables over-specifying the values? Like are they doubly constrained, by both the ENUM and by the FK? I saw some ENUMs like that creep in a while ago but it seemed like they probably shouldn't be there -- just having the FK relationship should be sufficient to ensure that all the values are valid.

@bendnorman bendnorman (Member Author) Jul 26, 2024

Yes, they are doubly constrained. This wasn't an issue with SQLite because the columns in both tables are TEXT types, and we add a CHECK constraint to the enum columns to make sure all the values are valid.

Our Field.to_sql_dtype() method creates a sa.Enum type. Maybe we can just make enums an sa.String type? Do you know if we have any enum columns that aren't validated by a foreign key constraint?

Member

Hmm. IIRC there's a big space savings for using ENUM in Parquet (maybe in DuckDB too?) in the big tables (where strings get dictionary encoded as ints). Maybe we just want to make both the PK and FK columns into ENUMs to satisfy DuckDB's (very reasonable) nitpickiness around dtypes? Even though that'll be over-constrained in terms of limiting the allowed values? Not sure what the best approach is.

Member Author

Do you have ideas on how we can make the code field in the code tables an enum type using our metadata tooling? It feels tricky because the code field is shared among all the code tables, but they all have different enum values.

The parquet_to_duckdb script freezes after loading a few dozen tables. I removed the foreign key constraints from the database schema, and the data loaded just fine. I'm assuming the slowdown is happening because it's spending a bunch of time checking the sometimes 5-6 foreign key constraints on a table. I wonder if enabling the enums would help with this issue?

Member

Can we define the various different ENUMs for the code columns in all the different coding tables on a per-resource basis, in the special cases at the bottom of the resource metadata files?
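For illustration only, a hypothetical special-case entry in that style, modeled on the integer code overrides quoted earlier; the key structure and enum values here are placeholders, not actual PUDL metadata:

# Hypothetical sketch of per-resource special cases: alongside the existing
# type overrides, each coding table could also declare the enum for its
# "code" column. Names and values below are placeholders.
CODE_COLUMN_SPECIAL_CASES = {
    "core_eia__codes_sector_consolidated": {"code": {"type": "integer"}},
    "core_eia__codes_example": {  # hypothetical coding table
        "code": {"constraints": {"enum": ["A", "B", "C"]}},  # placeholder values
    },
}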

Comment on lines 23 to 32
def convert_parquet_to_duckdb(parquet_dir: str, duckdb_path: str):
    """Convert a directory of Parquet files to a DuckDB database.

    Args:
        parquet_dir: Path to a directory of parquet files.
        duckdb_path: Path to the new DuckDB database file (should not exist).

    Example:
        python parquet_to_duckdb.py /path/to/parquet/directory duckdb.db
    """
Member

Further down the road, do we want this functionality to stay outside of the core package? Or do we think we'll want to run it at the end of the ETL all the time? Would it make sense to put it in, say, pudl.convert.parquet_to_duckdb and add the CLI to our entry points in pyproject.toml, so we can test it along with all the other CLIs?

Member Author

Yeah, let's add it to pudl.convert so we can test it with our other CLIs.

@@ -2028,6 +2057,7 @@ def to_sql(
     metadata,
     check_types=check_types,
     check_values=check_values,
+    dialect=dialect,
Member

Note that there are a number of resources (all the hourly ones, which are currently parquet-only: CEMS, EIA-930, FERC-714, GridPathRAToolkit) that have create_database_schema==False. For the DuckDB output it seems like we probably do want all of those tables included in the unified database file, since it's compressed and performant and would really be an all-in-one resource.

Member Author

Ah good reminder. I'll add some logic to to_sql to only filter out those tables if we're using sqlite.
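For reference, a minimal sketch of that idea (not the PR's actual change; the create_database_schema attribute name comes from the comment above, and the helper name is made up):

# Sketch: only skip resources flagged create_database_schema=False when the
# target dialect is SQLite, so the DuckDB schema still includes the hourly
# parquet-only tables.
def resources_for_dialect(package, dialect: str):
    """Yield the resources whose tables belong in a database for this dialect."""
    for resource in package.resources:
        if dialect == "sqlite" and not resource.create_database_schema:
            continue  # hourly/parquet-only tables stay out of SQLite
        yield resource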

@@ -282,6 +289,9 @@ if [[ $ETL_SUCCESS == 0 ]]; then
     # Distribute Parquet outputs to a private bucket
     distribute_parquet 2>&1 | tee -a "$LOGFILE"
     DISTRIBUTE_PARQUET_SUCCESS=${PIPESTATUS[0]}
+    # Create duckdb file from parquet files
+    create_duckdb 2>&1 | tee -a "$LOGFILE"
Member

create_duckdb() or create_and_distribute_duckdb()?

(I have been loving the ShellCheck linter for VSCode)


name="parquet_to_duckdb",
context_settings={"help_option_names": ["-h", "--help"]},
)
@click.argument("parquet_dir", type=click.Path(exists=True, resolve_path=True))
Member

I think you can require that it be a directory here too.
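For example, a minimal sketch of that suggestion (the stub command is just for illustration; click.Path does support these flags):

# Sketch: require that parquet_dir is an existing directory, not a file.
import click

@click.command()
@click.argument(
    "parquet_dir",
    type=click.Path(exists=True, file_okay=False, dir_okay=True, resolve_path=True),
)
def stub_cli(parquet_dir):
    """Stub command that only exists to illustrate the click.Path options."""
    click.echo(parquet_dir)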

)
@click.argument("parquet_dir", type=click.Path(exists=True, resolve_path=True))
@click.argument(
    "duckdb_path", type=click.Path(resolve_path=True, writable=True, allow_dash=False)
Member

What's the expected / desired behavior if the pudl.duckdb file already exists?

Member

Oh I see below. I think you can require that it not already exist in this check too?
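A small sketch of one way to enforce that, since click.Path has no "must not exist" flag (callbacks are a standard click feature; the helper name is made up and would be attached to the duckdb_path argument via callback=_require_new_path):

# Sketch: reject the duckdb_path argument if the file already exists.
from pathlib import Path

import click

def _require_new_path(ctx, param, value):
    """Fail argument parsing if the target file already exists."""
    if Path(value).exists():
        raise click.BadParameter(f"{value} already exists; refusing to overwrite it.")
    return value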

Comment on lines +51 to +52
check_fks: If true, enable foreign keys in the database. Currently,
the parquet load process freezes up when foreign keys are enabled.
Member

I dunno if you checked which tables were grinding to a halt, but I'm curious what happens if you don't load any of the hourly tables, which have 10s to 100s of millions of rows -- especially core_epacems__hourly_emissions which is nearly 1 billion rows and definitely has several implied FK relationships.


zaneselvans commented Jul 28, 2024

I'm poking around at potential DB documentation tools in #3746 and noticed that the table and column comments don't seem to be included in the DuckDB schema that's currently getting produced.

After running:

parquet_to_duckdb --check-fks --no-load ../pudl-output/parquet ../pudl-output/pudl.duckdb

I opened up the DuckDB CLI and:

SELECT COMMENT FROM duckdb_tables() LIMIT 20;
SELECT COMMENT FROM duckdb_columns() LIMIT 20;

just gives emptiness.

And yet this does find the comments:

import pudl

pkg = pudl.metadata.classes.Package.from_resource_ids()
dude = pkg.to_sql(dialect="duckdb")
print(dude.tables["out_ferc714__summarized_demand"].comment)
# Compile FERC 714 annualized, categorized respondents and summarize values.

print(dude.tables["out_ferc714__summarized_demand"].columns[0].comment)
# Date reported.

@zaneselvans zaneselvans linked an issue Jul 28, 2024 that may be closed by this pull request

zaneselvans commented Jul 28, 2024

In the hopes of being able to more easily build database documentation programmatically (#3746, e.g. PUDL on dbdocs.io), I changed the DuckDB loading script to explicitly include table and column comments. Unfortunately they don't automatically come along for the ride with EXPORT DATABASE, so I'm generating a schema.sql separately like:

import sqlalchemy as sa
from sqlalchemy.schema import CreateTable
from sqlalchemy.schema import SetColumnComment, SetTableComment

pudl_duckdb_engine = sa.create_engine("duckdb:////Users/zane/code/catalyst/pudl-output/pudl.duckdb")
md = sa.MetaData()
md.reflect(bind=pudl_duckdb_engine)

sql_statements = [
    "CREATE SCHEMA information_schema",
    "CREATE SCHEMA pg_catalog",
]
for table in md.sorted_tables:
    sql_statements.append(str(CreateTable(table)))
    sql_statements.append(str(SetTableComment(table)))
    for col in table.columns:
        sql_statements.append(str(SetColumnComment(col)))

with open("schema.sql", "w") as f:
    for statement in sql_statements:
        f.write(statement.strip() + ";\n")

Made some small feature requests that would make propagating the comments easier:

Comment on lines +86 to +88
sql_command = f"""
COPY {table.name} FROM '{parquet_file_path}' (FORMAT PARQUET);
"""
Member

Even without the foreign keys turned on, I got an out-of-memory error while it was attempting to load the EPA CEMS parquet (5GB on disk). Do we need to do some kind of chunking? Maybe by row-groups?
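A rough sketch of what row-group chunking could look like (not part of the PR; it assumes the target table has already been created in the DuckDB file, and uses pyarrow to read one row group at a time):

# Sketch: load a large parquet file one row group at a time instead of a
# single COPY ... FROM of the whole file, to bound memory usage.
import duckdb
import pyarrow.parquet as pq

def load_parquet_by_row_group(duckdb_path: str, table_name: str, parquet_path: str) -> None:
    parquet_file = pq.ParquetFile(parquet_path)
    with duckdb.connect(database=duckdb_path) as conn:
        for i in range(parquet_file.num_row_groups):
            batch = parquet_file.read_row_group(i)  # one row group as a pyarrow.Table
            conn.register("row_group_batch", batch)  # expose the Arrow table to SQL
            conn.execute(f"INSERT INTO {table_name} SELECT * FROM row_group_batch")
            conn.unregister("row_group_batch")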

metadata.create_all(engine)

if not no_load:
    with duckdb.connect(database=str(duckdb_path)) as duckdb_conn:
Member

I messed around with this loop over the weekend to explicitly add the row and table comments to the schema, but later realized that:

  • there was a bug in Resource.to_sql() in that it didn't set the comment field.
  • the first and last columns that show up when you SELECT comment FROM duckdb_columns() aren't our columns -- they're duckdb internal stuff... and they have no comments. So it looked like there were no comments on the columns, but there actually were... so my explicit adding was unnecessary.

Thus, the only change that stuck around was switching to using a context manager.

Member

What a journey - at least we learned something 😅, thanks for doing the digging!

Comment on lines +1511 to +1513
return sa.Table(
    self.name, metadata, *columns, *constraints, comment=self.description
)
Member

Previously the comment field on the Table was not being set, so there were no table-level descriptions.


zaneselvans commented Jul 30, 2024

When I try and load all the parquet files into DuckDB with:

parquet_to_duckdb ../pudl-output/parquet ../pudl-output/pudl.duckdb

I consistently get a memory allocation error when it's trying to load the EPA CEMS. @bendnorman you said you've been able to get it to load everything successfully?

I tried shutting down all my memory intensive programs and it seemed to sit at ~16GB of memory for a while, creating temporary files alongside the database file, and then suddenly crashed, even though I never saw the memory usage get close to the maximum available, so I don't know what's going on there. Maybe it spiked so quickly I didn't see it in my system monitor? It says 25.5GB/25.5GB used.

INFO:pudl.convert.parquet_to_duckdb:Loading parquet file: /Users/zane/code/catalyst/pudl-output/parquet/core_epacems__hourly_emissions.parquet into /Users/zane/code/catalyst/pudl-output/pudl.duckdb
Traceback (most recent call last):
  File "/Users/zane/miniforge3/envs/pudl-dev/bin/parquet_to_duckdb", line 8, in <module>
    sys.exit(parquet_to_duckdb())
             ^^^^^^^^^^^^^^^^^^^
  File "/Users/zane/miniforge3/envs/pudl-dev/lib/python3.12/site-packages/click/core.py", line 1157, in __call__
    return self.main(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/zane/miniforge3/envs/pudl-dev/lib/python3.12/site-packages/click/core.py", line 1078, in main
    rv = self.invoke(ctx)
         ^^^^^^^^^^^^^^^^
  File "/Users/zane/miniforge3/envs/pudl-dev/lib/python3.12/site-packages/click/core.py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/zane/miniforge3/envs/pudl-dev/lib/python3.12/site-packages/click/core.py", line 783, in invoke
    return __callback(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/zane/code/catalyst/pudl/src/pudl/convert/parquet_to_duckdb.py", line 76, in parquet_to_duckdb
    with duckdb.connect(database=str(duckdb_path)) as duckdb_conn:
  File "/Users/zane/code/catalyst/pudl/src/pudl/convert/parquet_to_duckdb.py", line 89, in parquet_to_duckdb
    duckdb_cursor.execute(sql_command)
duckdb.duckdb.OutOfMemoryException: Out of Memory Error: could not allocate block of size 256.0 KiB (25.5 GiB/25.5 GiB used)

It turns out I have my DuckDB memory limit set to 25.5 GB (not sure why) so it's crashing when it gets there, rather than grinding the machine to a halt. But with memory usage that high it seems like maybe it is not splitting the big parquet file up into smaller pieces for loading, which seems a little surprising.

I came across this issue and so tried setting the memory limit to 10GB instead of the default 25.5GB (80% of RAM) and it still failed, at about the same place, but with 10GB of memory usage instead of 25GB. The temporary files it created were smaller than before.
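For anyone following along, the limit being adjusted here is DuckDB's memory_limit setting (a standard DuckDB setting; the values and temp path below are just examples, roughly what's discussed in this thread):

# Example: cap DuckDB's memory usage and point its spill files somewhere
# specific. The 10GB value and the temp path are illustrative only.
import duckdb

with duckdb.connect(database="pudl.duckdb") as conn:
    conn.execute("SET memory_limit = '10GB'")  # default is ~80% of system RAM
    conn.execute("SET temp_directory = '/tmp/duckdb_spill'")  # where temporary files go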

Labels
duckdb: Issues referring to duckdb, the embedded OLAP database
nightly-builds: Anything having to do with nightly builds or continuous deployment.
parquet: Issues related to the Apache Parquet file format which we use for long tables.
Projects
Status: Backlog
Development

Successfully merging this pull request may close these issues.

Load parquet files to a duckdb file
3 participants