Create pudl.duckdb from parquet files #3741

Draft
wants to merge 17 commits into
base: main
Changes from 5 commits
67 changes: 67 additions & 0 deletions devtools/parquet_to_duckdb.py
@@ -0,0 +1,67 @@
#! /usr/bin/env python
"""Script that creates a DuckDB database from a collection of PUDL Parquet files."""

import logging
from pathlib import Path

import click
import sqlalchemy as sa

from pudl.metadata import PUDL_PACKAGE
from pudl.metadata.classes import Package

# Configure logging
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)


@click.command()
@click.argument("parquet_dir", type=click.Path(exists=True, resolve_path=True))
@click.argument(
    "duckdb_path", type=click.Path(resolve_path=True, writable=True, allow_dash=False)
)
def convert_parquet_to_duckdb(parquet_dir: str, duckdb_path: str):
    """Convert a directory of Parquet files to a DuckDB database.

    Args:
        parquet_dir: Path to a directory of parquet files.
        duckdb_path: Path to the new DuckDB database file (should not exist).

    Example:
        python parquet_to_duckdb.py /path/to/parquet/directory duckdb.db
    """
Member

Further down the road, do we want this functionality to stay outside of the core package, or do we think we'll want to run it at the end of the ETL all the time? Would it make sense to put it in, say, pudl.convert.parquet_to_duckdb, and add the CLI to our entry points in pyproject.toml so we can test it along with all the other CLIs?

Member Author

Yeah let's add it to pudl.convert so we can test it with our other CLIs.

    parquet_dir = Path(parquet_dir)
    duckdb_path = Path(duckdb_path)

    # Check if DuckDB file already exists
    if duckdb_path.exists():
        click.echo(
            f"Error: DuckDB file '{duckdb_path}' already exists. Please provide a new filename."
        )
        return

    # create duck db schema from pudl package
    resource_ids = (r.name for r in PUDL_PACKAGE.resources if len(r.name) <= 63)
Member Author

This filter will be removed once we rename the tables so that every name is 63 characters or fewer.
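To see which tables the length filter currently drops, a small check along these lines could help. This is a minimal sketch and not part of this PR: `split_by_name_length` and `MAX_NAME_LENGTH` are hypothetical names, and the example table names are made up. In practice the names would come from `PUDL_PACKAGE.resources`.

```python
MAX_NAME_LENGTH = 63  # Same threshold as the filter in the diff (assumed to stay at 63).


def split_by_name_length(names, max_length=MAX_NAME_LENGTH):
    """Partition table names into (kept, dropped) based on their length.

    Hypothetical helper: names with len(name) <= max_length are kept,
    mirroring the condition used to build resource_ids in the script.
    """
    kept = [name for name in names if len(name) <= max_length]
    dropped = [name for name in names if len(name) > max_length]
    return kept, dropped


if __name__ == "__main__":
    # Made-up names for illustration only.
    example_names = ["short_table_name", "x" * 64]
    kept, dropped = split_by_name_length(example_names)
    print(f"kept {len(kept)} table(s), dropped {len(dropped)} table(s)")
    for name in dropped:
        print(f"  too long ({len(name)} chars): {name}")
```

Logging the dropped names would make it obvious which tables are absent from the resulting database until the rename lands.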

    package = Package.from_resource_ids(resource_ids)

    metadata = package.to_sql(dialect="duckdb")
    engine = sa.create_engine(f"duckdb:///{duckdb_path}")
    metadata.create_all(engine)

    # Iterate through the tables in order of foreign key dependency
    for table in metadata.sorted_tables:
        parquet_file_path = parquet_dir / f"{table.name}.parquet"
        logger.info(f"Loading table: {table.name} into DuckDB")
        if parquet_file_path.exists():
            sql_command = f"""
                COPY {table.name} FROM '{parquet_file_path}' (FORMAT PARQUET);
            """
            with engine.connect() as conn:
                conn.execute(sa.text(sql_command))
        else:
            print("File not found: ", parquet_file_path)
            # raise FileNotFoundError("Parquet file not found for: ", table.name)


if __name__ == "__main__":
    convert_parquet_to_duckdb()
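The loading loop in this script prints a message and moves on when a Parquet file is missing, and it interpolates the table name and file path directly into the `COPY` statement. One possible alternative is sketched below, assuming we would rather collect every missing file up front and quote both the identifier and the path. `plan_copy_statements`, `quote_identifier`, and `quote_literal` are hypothetical helpers, not part of this PR; the quoting relies on standard SQL escaping (doubling `"` inside identifiers and `'` inside string literals).

```python
from pathlib import Path


def quote_identifier(name: str) -> str:
    """Wrap a name in double quotes, doubling any embedded double quotes."""
    return '"' + name.replace('"', '""') + '"'


def quote_literal(value: str) -> str:
    """Wrap a value in single quotes, doubling any embedded single quotes."""
    return "'" + value.replace("'", "''") + "'"


def plan_copy_statements(
    parquet_dir: Path, table_names: list[str]
) -> tuple[list[str], list[Path]]:
    """Build one COPY statement per table and list the Parquet files not found.

    Hypothetical helper: table_names is expected in foreign-key order, for
    example [t.name for t in metadata.sorted_tables].
    """
    statements: list[str] = []
    missing: list[Path] = []
    for name in table_names:
        path = parquet_dir / f"{name}.parquet"
        if path.exists():
            statements.append(
                f"COPY {quote_identifier(name)} "
                f"FROM {quote_literal(str(path))} (FORMAT PARQUET);"
            )
        else:
            missing.append(path)
    return statements, missing
```

A caller could raise `FileNotFoundError` when `missing` is non-empty, which is roughly what the commented-out `raise` in the loop hints at, or log the paths and continue if partial databases are acceptable.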
47 changes: 24 additions & 23 deletions environments/conda-linux-64.lock.yml
