
Incremental Models - unique_key possibly has different behavior than stated #4355

Closed
fathom-parth opened this issue Oct 27, 2023 · 9 comments
Labels
  • content: Improvements or additions to content
  • improvement: Use this when an area of the docs needs improvement as it's currently unclear

Comments

@fathom-parth

Contributions

  • I have read the contribution docs, and understand what's expected of me.

Link to the page on docs.getdbt.com requiring updates

https://docs.getdbt.com/docs/build/incremental-models#defining-a-unique-key-optional

What part(s) of the page would you like to see updated?

Context

I'd like to use incremental models to replace all records with a specific timestamp (an ingest timestamp) with the new rows coming in.

When reading over the dbt incremental docs, it seems the merge strategy would be the best for this; however, this strategy requires a unique_key.

The Problem

In this case, there's no unique_key I want to define per record; I'd rather have the incremental strategy remove every row that matches a specific column value (in my case, an ingest timestamp).

I asked this question on Slack here:
https://getdbt.slack.com/archives/CBSQTAPLG/p1698417729305499

Some people mentioned that they do this by defining a unique_key that's not actually unique per record, and that this works fine for them.
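
Schematically, that suggestion amounts to a config along these lines (a hedged sketch; the model, column, and filter names are illustrative, not taken from the Slack thread):

-- my_model.sql: sketch of a merge with a deliberately non-unique "unique" key
-- ('ingest_ts' and 'my_source_model' are illustrative names)
{{
    config(
        materialized='incremental',
        incremental_strategy='merge',
        unique_key='ingest_ts'
    )
}}

SELECT
    ingest_ts,
    col_1,
    col_2
FROM {{ ref('my_source_model') }}
{% if is_incremental() %}
WHERE ingest_ts >= (SELECT MAX(ingest_ts) FROM {{ this }})
{% endif %}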

One of the users on Slack also mentioned:

The docs are misleading.. and the naming of the config is also confusing. I’m sure if you search slack for unique_key you’ll see many people mention what Daron and I suggested.

However, the documentation seems to conflict with the above suggestion by stating:

The optional unique_key parameter specifies a field (or combination of fields) that define the grain of your model. That is, the field(s) identify a single unique row. You can define unique_key in a configuration block at the top of your model, and it can be a single column name or a list of column names.

and

Please note that if there's a unique_key with more than one row in either the existing target table or the new incremental rows, the incremental model may fail depending on your database and incremental strategy.

This conflict is confusing to me, and I'd like to know whether the docs need to be updated or whether the behavior those users rely on is undefined.

Additional information

For reference, I'm using the bigquery adapter.

@fathom-parth added the content and improvement labels on Oct 27, 2023
@fathom-parth
Author

@mirnawong1 Is this the right place for me to open this issue? Is there a different repo that would get this more traction? I'm trying to see if my organization can use this feature or if this is undefined behavior.

@mirnawong1
Contributor

Hey @fathom-parth, this is definitely the right place. I'm going to consult with our developer experience team to see how best to address this for you. Thank you for flagging this, and apologies for the delay here!

@dbeatty10
Contributor

Take a look at the insert_overwrite strategy

@fathom-parth If you are using dbt-bigquery, what you are describing sounds like a good use-case for the insert_overwrite strategy.
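
A minimal sketch of that strategy on dbt-bigquery might look like the following (the partition field and model names are illustrative, not taken from this issue):

-- my_model.sql: illustrative insert_overwrite sketch for dbt-bigquery
{{
    config(
        materialized='incremental',
        incremental_strategy='insert_overwrite',
        partition_by={
            'field': 'ingest_ts',
            'data_type': 'datetime',
            'granularity': 'day'
        }
    )
}}

SELECT
    ingest_ts,
    col_1,
    col_2
FROM {{ ref('my_source_model') }}

With this strategy, dbt replaces the whole target partitions that appear in the new data rather than matching individual rows on a key.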

You can read commentary about the insert_overwrite strategy from other practitioners here:

Reports from other users on Slack

We'd need to learn more about their specific details before making updates to our documentation.

I don't know one way or the other how defining a unique_key that's not actually unique would behave without knowing which dbt adapter they are using and seeing an example of their source data + model configs + model definitions.

Our docs for dbt-snowflake say that:

Snowflake's merge statement fails with a "nondeterministic merge" error if the unique_key specified in your model config is not actually unique. If you encounter this error, you can instruct dbt to use a two-step incremental approach by setting the incremental_strategy config for your model to delete+insert.

So possibly they are using the delete+insert strategy with a platform that supports it. Or possibly the platform they are using has a merge statement that allows for non-unique keys.

@fathom-parth
Author

fathom-parth commented Dec 13, 2023

@dbeatty10
Thanks for all the links!

I did look into the insert_overwrite strategy, but that wouldn't work for us since it overwrites entire partitions.

We currently partition on the ingest timestamp column (a datetime); however, the partitioning is set to a granularity of one month in accordance with BQ's best practices, so we don't run into the maximum number of allowed partitions and we stay within BQ's guidelines for optimal partition size.

Overwriting an entire month of data would be expensive and far from ideal. Ideally we'd only overwrite rows with the same exact timestamp.
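
For concreteness, month-granularity partitioning of that kind corresponds roughly to a config like this (the column and model names are illustrative):

-- monthly partitioning on the ingest timestamp; 'ingest_ts' is an illustrative name
{{
    config(
        materialized='incremental',
        partition_by={
            'field': 'ingest_ts',
            'data_type': 'datetime',
            'granularity': 'month'
        }
    )
}}

SELECT ingest_ts, col_1, col_2
FROM {{ ref('my_source_model') }}

With insert_overwrite, the unit of replacement would be one of these month partitions, not the individual timestamp.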

Lmk if I’ve grossly misunderstood insert overwrite or if I’m missing something obvious!!

Edited at 2023-12-13 11:13am ET for clarity (wrote the above half-asleep)

@daronjp
Contributor

daronjp commented Dec 13, 2023

@dbeatty10 responding to your question from Slack:

Using the Snowflake adapter

-- my_model.sql
{{
    config(
        ...
        materialized='incremental',
        incremental_strategy='delete+insert',
        unique_key='data_date'
        ...
    )
}}

SELECT
    data_date,
    col_1,
    col_2
FROM {{ ref('my_intermediate_model') }}
WHERE
    ...
    {% if is_incremental() %}
        AND data_date >= CURRENT_DATE - 10
    {% endif %}

@fathom-parth
Author

Interesting, I guess the BQ-specific documentation doesn't mention the unique_key needing to be unique:
https://docs.getdbt.com/reference/resource-configs/bigquery-configs#the-merge-strategy

@fathom-parth
Author

Ah, though looking through how the merge statement gets generated by the BQ adapter (it seems to use the default merge statement from dbt-core) and how BQ handles merge statements, I'm pretty sure it'll fail if the predicate matches multiple rows.
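
For reference, the statement produced by dbt-core's default merge macro has roughly this shape (the DBT_INTERNAL_* aliases come from that macro; the table and column names here are illustrative):

-- rough shape of the generated merge (illustrative names)
merge into `my_project.my_dataset.my_model` as DBT_INTERNAL_DEST
using (
    select * from `my_project.my_dataset.my_model__dbt_tmp`
) as DBT_INTERNAL_SOURCE
on DBT_INTERNAL_SOURCE.ingest_ts = DBT_INTERNAL_DEST.ingest_ts
when matched then update set
    ingest_ts = DBT_INTERNAL_SOURCE.ingest_ts,
    col_1 = DBT_INTERNAL_SOURCE.col_1
when not matched then insert (ingest_ts, col_1)
    values (DBT_INTERNAL_SOURCE.ingest_ts, DBT_INTERNAL_SOURCE.col_1)

If more than one source row matches the same target row on that predicate, BigQuery rejects the merge with an error along the lines of "UPDATE/MERGE must match at most one source row for each target row".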

@dbeatty10 changed the title from "Incremental Models - merge strategy with unique_key possibly has different behavior than stated" to "Incremental Models - unique_key possibly has different behavior than stated" on Dec 13, 2023
@dbeatty10
Contributor

There are two different use cases in which unique_key is utilized today for incremental materializations:

  1. “merge on an actually unique key”
  2. “replace partitions”

For (1), the upstream source would benefit from adding unique and not_null checks since it must be unique to behave correctly.

For (2), there should not be any uniqueness check, because we'd expect many rows to have the same value for unique_key. This case is a flagrant misnomer, of course 😱
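
In config terms, the two cases look roughly like this (a sketch; the model names, column names, and strategy choices are illustrative):

-- use case (1): merge on an actually unique key; 'id' must identify a single row
-- model_one.sql
{{ config(materialized='incremental', incremental_strategy='merge', unique_key='id') }}
SELECT id, payload FROM {{ ref('stg_events') }}

-- use case (2): "replace partitions"; every row sharing a batch_date is replaced together,
-- e.g. via delete+insert on adapters that support it (such as dbt-snowflake, per the docs quoted above)
-- model_two.sql
{{ config(materialized='incremental', incremental_strategy='delete+insert', unique_key='batch_date') }}
SELECT batch_date, payload FROM {{ ref('stg_events') }}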

It would be nice to formally distinguish between each of those use-cases, and dbt-labs/dbt-core#9490 might be a good way to accomplish that.

@matthewshaver
Contributor

It appears this may have been resolved via other methods. If any updates to the docs are required, please feel free to re-open.
