Use dbt test and duckdb to validate our SQLite and Apache Parquet outputs #2477
zaneselvans started this conversation in Ideas
One of the many uses of `dbt` is running data validations, including checking for referential integrity in OLAP databases, which don't typically maintain foreign key constraints the way a transactional database would. (Explicit referential integrity checks are in beta.) `dbt` comes with a built-in suite of tests, and community-maintained packages add many more. The `dbt_utils` and `dbt_expectations` packages look particularly relevant.

The `dbt_expectations` package also contains some useful statistical checks, identical to the ones that we are already using in our home-brewed data validations. For example, there's `expect_column_quantile_values_to_be_between`, which allows a group-by. That's required to check things like whether the reported heat content per physical fuel unit is within expected ranges when a single table contains many different kinds of fuels.

It seems like we could potentially migrate all of our data validations and referential integrity checks to use `dbt`, and speed them up dramatically by using `duckdb` as the query engine (even if we're not using `duckdb` for output), since it can read directly from both Apache Parquet and SQLite. Here's a demonstration that uses `dbt test` and `duckdb` for data validations.

Using
`dbt` + `duckdb` like this would:

- bring the referential integrity checks currently done by `pudl_check_fks` into the DAG.

There's also been some recent work on improving the performance of SQLite on analytic workloads. See this post from Simon and this talk presenting the same paper.