-
Notifications
You must be signed in to change notification settings - Fork 42
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Bug] Microbatch: Using backfill results in inefficient delete statements (Snowflake adapter) #364
Comments
Thanks for reporting this @brian-franklin ! Based on the generated SQL that you shared, this looks like it is coming from here, so I'm going to transfer this to the This looks to me like the same thing reported in #150. If I remember correctly, we'd like to review and merge #110 prior to #151. |
dbt debug log.txt
However, despite there being records during this timeframe that were inserted the previous day, the result of this delete is 0 records updated. From the subsequent log: This is resulting in the records being inserted again duplicating the data. I noticed that if I remove the "TIMESTAMP" (before the string values) from the delete statement and cast the strings, then the query deletes records as expected. Here is the revised SQL I tested:
I believe the timezone is affecting how the values are being selected by the delete, but I cannot confirm. |
@brian-franklin @dbeatty10
So that the query above would look like
|
@dbeatty10 Yes, however looking at the dbt logs I can see that he code generated by dbt for the delete to handle the backfill is handling the timestamps differently than when it is generating the select for new records to insert. The proposed solution is to remove the TIMESTAMP from the backfill delete so that the 2 statements operate using the same timestamp logic. See below: Here is the logic to create the insert. Notice there is no special transformation of the landed_timestamp selection in the where clause in the source CTE.
Here is the logic generated for the delete to handle the specified backfill configured as 1 day in my case. The use of TIMESTAMP in front of the timestamp values is resulting in a selection of different records due to the timezone. In my table, the landed_timestamp is of type TIMESTAMP_TZ.
|
@brian-franklin thank you for mentioning the issue with using the That looks like the issue reported in dbt-labs/dbt-snowflake#1256 and fixed in dbt-labs/dbt-snowflake#1257. Wanna try with the latest release of dbt v1.9 and see if it resolves that particular piece? |
Hey @dbeatty10. I am happy to report that after installing the latest versions of core and snowflake adapter that this issue is resolved. Thank you for the assist. |
A team I'm working with also confirmed their load times dropped from 16 minutes to under 1 minute with this change 👍 Thanks team |
Is this a new bug in dbt-core?
Current Behavior
When running a model configured with the new microbatch strategy to backfill older dates with some previously loaded data, dbt produces a delete statement to remove records for the periods being backfilled. The delete statement produces a SQL statement for the Snowflake adapter that results in a cartesian product causing the statement to run for much longer than is necessary.
Expected Behavior
Existing records should be deleted efficiently without a cartesian join.
SQL to delete existing records should not be generated with a "using" clause with the temp table. (see steps to reproduce)
Steps To Reproduce
dbt run -s <model>
dbt run -s <model> --event-time-start "<start timestamp>" --event-time-end "<end timestamp>"
Example model config:
Example dbt commands:
dbt generated SQL to delete existing records:
The using with the temp table is unnecessary and causes a cartesian join.
Relevant log output
Environment
Which database adapter are you using with dbt?
snowflake
Additional Context
The text was updated successfully, but these errors were encountered: