Clone incremental models as the first step of your CI job #4359

graciegoheen · 2023-10-27T16:51:41Z

Contributions

I have read the contribution docs, and understand what's expected of me.

Link to the page on docs.getdbt.com requiring updates

Probably this one makes the most sense https://docs.getdbt.com/docs/deploy/ci-jobs

Could also be a blog post!

Or part of some best practice guide?

Main content

Imagine that you've created a Slim CI job in dbt Cloud.

Your CI job:

defers to your production environment
runs the command dbt build --select state:modified+ to run and test all of the models you've modified and their downstream dependencies
is triggered whenever a developer on your team opens a PR against the main branch

Now imagine your dbt project looks like this:

When you open a PR that modifies dim_wizards, your CI job will kick off and build only the modified models and their downstream dependencies (in this case: dim_wizards and fct_orders) into a temporary schema that's unique to your PR.

This build mimics the behavior of what will happen once the PR is merged into the main branch (so you have confidence that you're not introducing breaking changes), without requiring a build of your entire dbt project.

But what happens when one of the modified models (or one of their downstream dependencies) is an incremental model?

Because your CI job builds modified models into a PR-specific schema, the first execution of dbt build --select state:modified+ builds the modified incremental model in its entirety: the model does not yet exist in the PR-specific schema, so is_incremental is false. In effect, you're running in full-refresh mode.
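To make that concrete, here is a minimal Python sketch of the conditions under which is_incremental is true: the relation already exists as a table, the model is materialized as incremental, and no full refresh was requested. The RunContext class and its field names are hypothetical, made up for illustration. This is a simplified stand-in, not dbt's actual implementation.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class RunContext:
    materialized: str                 # the model's configured materialization
    existing_relation: Optional[str]  # "table", "view", or None if absent from the target schema
    full_refresh: bool = False        # True when --full-refresh is passed


def is_incremental(ctx: RunContext) -> bool:
    """Simplified stand-in for the is_incremental check (assumption, not dbt source)."""
    return (
        ctx.existing_relation == "table"
        and ctx.materialized == "incremental"
        and not ctx.full_refresh
    )


# First CI run: the PR-specific schema is empty, so the model is built from scratch.
first_ci_run = RunContext(materialized="incremental", existing_relation=None)
# Production run: the table already exists, so only new rows are processed.
prod_run = RunContext(materialized="incremental", existing_relation="table")

print(is_incremental(first_ci_run))  # False
print(is_incremental(prod_run))      # True
```

The same check suggests why a relation that exists only as a view would not put the model into incremental mode.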

This can be suboptimal because:

typically incremental models are your largest datasets, so they take a long time to build in their entirety, which can slow down development and increase warehouse costs
there are situations where a full-refresh of the incremental model passes successfully in your CI job but an incremental build of that same table in prod would fail when the PR is merged into main (think schema drift where on_schema_change config is set to fail)

We can alleviate the above problems by zero-copy cloning the relevant, pre-existing incremental models into our PR-specific schema as the first step of the CI job using the dbt clone command. This way, the incremental models already exist in the PR-specific schema when you first execute the command dbt build --select state:modified+, so the is_incremental flag will be true.

Now, we'll have 2 commands for our dbt Cloud CI check to execute:

Clone all of the pre-existing incremental models that have been modified or are downstream of another model that has been modified -> dbt clone --select state:modified+,config.materialized:incremental,state:old
Build all of the models that have been modified and their downstream dependencies -> dbt build --select state:modified+
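The selection logic of the two commands can be sketched with plain Python sets. In a dbt selector, a comma means intersection, so the clone step picks only models that match all three criteria. The toy project below is an assumption for illustration: stg_wizards and fct_spells are hypothetical models, and the choice of which models are incremental is invented. The sketch models the selection semantics only and does not call dbt.

```python
# Toy project: model name -> (materialization, parent models).
# Assumptions: fct_orders is incremental; fct_spells is a brand-new incremental model.
pr_project = {
    "stg_wizards": ("view", []),
    "dim_wizards": ("table", ["stg_wizards"]),
    "fct_orders": ("incremental", ["dim_wizards"]),
    "fct_spells": ("incremental", ["dim_wizards"]),
}
prod_models = {"stg_wizards", "dim_wizards", "fct_orders"}  # models known to the deferred state
modified = {"dim_wizards", "fct_spells"}  # changed or added in the PR


def with_descendants(project, roots):
    """Model of `state:modified+`: the roots plus everything downstream of them."""
    selected = set(roots)
    changed = True
    while changed:
        changed = False
        for name, (_, parents) in project.items():
            if name not in selected and any(p in selected for p in parents):
                selected.add(name)
                changed = True
    return selected


# Step 2 selection: dbt build --select state:modified+
build_selection = with_descendants(pr_project, modified)

# Step 1 selection: state:modified+ AND config.materialized:incremental AND state:old
incremental = {name for name, (mat, _) in pr_project.items() if mat == "incremental"}
clone_selection = build_selection & incremental & prod_models

print(sorted(clone_selection))  # ['fct_orders']
print(sorted(build_selection))  # ['dim_wizards', 'fct_orders', 'fct_spells']
```

In this toy project only fct_orders is cloned. The brand-new fct_spells is excluded by the state:old criterion, so it is still built from scratch in the second step.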

Because of our first clone step, the incremental models selected in our dbt build in the second step will run in incremental mode.

Your CI jobs will run faster, and you're more accurately mimicking the behavior of "what will happen once the PR has been merged into main".

Disclaimers:

dbt clone is only available with dbt version 1.6+
this strategy only works for warehouses that support zero-copy cloning (otherwise dbt clone will just create pointer views)
some teams may want to test that their incremental models run in both incremental mode and full-refresh mode

Additional information

Relevant slack thread: https://dbt-labs.slack.com/archives/C05FWBP9X1U/p1692830261651829

From my "Better CI for better data quality" Coalesce talk:

If you use the incremental materialization in your dbt project, you should consider cloning your relevant, pre-existing incremental models into your PR-specific schema as the first step of your CI check. This will force your second step to run in incremental mode (where is_incremental is true) because now the models already exist in your PR-specific schema (via cloning). This is beneficial because it more accurately mimics what will happen when you merge your changes into production and it will save time and money by not rebuilding your incremental models (which are often large data sets) from scratch for every PR that modifies them.

Expansion on "think schema drift where on_schema_change config is set to fail" from above:
Let’s imagine you have an incremental model my_incremental_model with the following config:

{{
    config(
        materialized='incremental',
        unique_key='unique_id',
        on_schema_change='fail'
    )
}}

Now, let’s say I open up a PR that adds a new column to my_incremental_model - in this case:

an incremental build will fail
a full-refresh will succeed
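The two outcomes above can be illustrated with a small Python model. Everything here is hypothetical: the build function, the SchemaChangeError exception, and the column names amount and new_column are invented for illustration, and only the 'fail' mode is modeled. It is a sketch of the behavior described, not dbt's implementation.

```python
class SchemaChangeError(Exception):
    """Raised when an incremental run sees columns that differ from the existing table."""


def build(model_columns, existing_columns, on_schema_change="ignore", full_refresh=False):
    """Return the table's columns after a build (simplified model of the behavior)."""
    if existing_columns is None or full_refresh:
        # The table is created or replaced from the model definition.
        return list(model_columns)
    if set(model_columns) != set(existing_columns) and on_schema_change == "fail":
        raise SchemaChangeError("model columns no longer match the existing table")
    # Incremental run; other on_schema_change modes are not modeled here.
    return list(existing_columns)


prod_table = ["unique_id", "amount"]                # hypothetical existing columns
pr_model = ["unique_id", "amount", "new_column"]    # the PR adds a column

# CI without cloning: no existing table, so the build succeeds.
print(build(pr_model, existing_columns=None, on_schema_change="fail"))

# Incremental build against the existing table: fails on the new column.
try:
    build(pr_model, existing_columns=prod_table, on_schema_change="fail")
except SchemaChangeError as err:
    print("incremental build failed:", err)

# Full refresh against the existing table: succeeds.
print(build(pr_model, existing_columns=prod_table, on_schema_change="fail", full_refresh=True))
```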

If you have a daily production job that just executes a dbt build (without a --full-refresh flag), once the PR is merged into main and the job kicks off, you will get a failure. So the question is - what do you want to happen in CI?

Do you want to also get a failure in CI, so that you know that once this PR is merged into main you need to immediately execute a dbt build --full-refresh --select my_incremental_model in production in order to avoid a failure in prod? This will block your CI check from passing.
Do you want your CI check to succeed, because once you do run a full-refresh for this model in prod you will be in a successful state? This may lead to you being surprised that your production job is suddenly failing when you merge this PR into main because you didn’t realize you would need to execute a dbt build --full-refresh --select my_incremental_model in production.

Probably not a perfect solution here, it’s all just tradeoffs! Personally, I'd rather have the failing CI job and have to manually override the blocking branch protection rule so that I'm not surprised and can proactively run the appropriate command in production once I merge the PR in.

Expansion on "why state:old":
For brand new incremental models, we actually want a full-refresh run in CI, because that is how they will run in production when the PR is merged into main: they don't exist yet in the production environment either... they're brand new!
If you don't specify this, you won't get an error, just a “No relation found in state manifest for…” message, so it technically works with or without specifying state:old. But adding state:old is more explicit and means it won't even try to clone the brand new incremental models.


runleonarun · 2023-10-27T17:46:40Z

@graciegoheen I think this information would make a great Best Practices guide. @joellabes would you be able to review once the docs team gets it ready?

Like you suggested, I also think we should talk about how dbt clone can help handle incremental models in CI in the "Continuous Integration jobs in dbt Cloud" page and possibly in the dbt clone command page.

matt-winkler · 2023-10-29T19:23:29Z

Just a comment that I read this excellent post and fully agree with all points made.

joellabes · 2023-10-30T01:06:10Z

Yes lmk when you're ready for me! excited to get this up into the world

## What are you changing in this pull request and why?

Closes issue #4359

## Checklist

- [x] Review the [Content style guide](https://github.com/dbt-labs/docs.getdbt.com/blob/current/contributing/content-style-guide.md) so my content adheres to these guidelines.
- [x] For [docs versioning](https://github.com/dbt-labs/docs.getdbt.com/blob/current/contributing/single-sourcing-content.md#about-versioning), review how to [version a whole page](https://github.com/dbt-labs/docs.getdbt.com/blob/current/contributing/single-sourcing-content.md#adding-a-new-version) and [version a block of content](https://github.com/dbt-labs/docs.getdbt.com/blob/current/contributing/single-sourcing-content.md#versioning-blocks-of-content).
- [x] Add a checklist item for anything that needs to happen before this PR is merged, such as "needs technical review" or "change base branch."

Adding new pages (delete if not applicable):

- [x] Add page to `website/sidebars.js`
- [x] Provide a unique filename for the new page

graciegoheen · 2024-05-21T19:58:01Z

@matthewshaver is this closed?

mirnawong1 · 2024-07-01T09:56:27Z

hey team, looks like this was addressed in #4542. closing this issue but let me know if this is wrong! thank you!

graciegoheen added the content and improvement labels Oct 27, 2023

graciegoheen mentioned this issue Oct 27, 2023

Blog post: to defer or to clone (Publish Tues 10/31) #4288

Merged

2 tasks

runleonarun assigned nghi-ly and runleonarun Oct 27, 2023

matthewshaver self-assigned this Oct 31, 2023

runleonarun unassigned nghi-ly Nov 3, 2023

runleonarun added the priority: high label Nov 15, 2023

runleonarun removed their assignment Nov 27, 2023

runleonarun added the best practice label Nov 27, 2023

matthewshaver mentioned this issue Nov 28, 2023

New best practice guide for clone #4542

Merged

5 tasks

mirnawong1 closed this as completed Jul 1, 2024
