---
title: Putting Your DAG on the internet
description: "Use dbt and Snowflake’s external access integrations to allow Snowflake Python models to access the internet."
slug: dag-on-the-internet

authors: [ernesto_ongaro, sebastian_stan, filip_byrén]

tags: [analytics craft, APIs, data ecosystem]
hide_table_of_contents: false

date: 2024-06-14
is_featured: true
---

**New in dbt: allow Snowflake Python models to access the internet**

With dbt 1.8, dbt released support for Snowflake’s [external access integrations](https://docs.snowflake.com/en/developer-guide/external-network-access/external-network-access-overview), further enabling the use of dbt + AI to enrich your data. This allows dbt Python models to query external APIs, functionality that dbt Cloud customer [EQT AB](https://eqtgroup.com/) needed. Learn why they needed it and how they helped build the feature and get it shipped!

<!--truncate-->
## Why did EQT require this functionality?

by Filip Bryén, VP and Software Architect (EQT) and Sebastian Stan, Data Engineer (EQT)

_EQT AB is a global investment organization and, as a long-term customer of dbt Cloud, has presented at dbt’s Coalesce [2020](https://www.getdbt.com/coalesce-2020/seven-use-cases-for-dbt) and [2023](https://www.youtube.com/watch?v=-9hIUziITtU)._

_Motherbrain Labs is EQT’s bespoke AI team, primarily focused on accelerating our portfolio companies’ roadmaps through hands-on data and AI work. Due to the high demand for our time, we are constantly exploring mechanisms for simplifying our processes and increasing our own throughput. Integration of workflow components directly in dbt has been a major efficiency gain and helped us rapidly deliver across a global portfolio._

Motherbrain Labs is focused on creating measurable AI impact in our portfolio. We work hand-in-hand with our deal teams and portfolio company leadership, but our starting approach is always the same: identify which data matters.

While we have access to reams of proprietary information, we believe the greatest effect happens when we combine that information with external datasets like geolocation, demographics, or competitor traction.

These valuable datasets often come from third-party vendors who operate on a pay-per-use model: a single charge for every piece of information we want. To avoid overspending, we focus on enriching only the specific subset of data that is relevant to an individual company’s strategic question.

In response to this recurring need, we have partnered with Snowflake and dbt to introduce new functionality that facilitates communication with external endpoints and manages secrets within dbt. This new integration enables us to incorporate enrichment processes directly into our DAGs, similar to how Python models are currently used within dbt environments. We’ve found that this augmented approach allows us to reduce complexity and enable external communications before materialization.

## An example with Carbon Intensity: How does it work?

In this section, we will demonstrate how to integrate an external API to retrieve the current carbon intensity of the UK power grid. The goal is to illustrate how the feature works, and perhaps to explore how scheduling data transformations at lower-intensity times could reduce their carbon footprint, making them a greener choice. We will be leveraging the API from the [UK National Grid ESO](https://www.nationalgrideso.com/) to achieve this.

To start, we need to set up a network rule (Snowflake instructions [here](https://docs.snowflake.com/en/user-guide/network-rules)) to allow access to the external API. Specifically, we’ll create an egress rule that permits Snowflake to communicate with api.carbonintensity.org.uk.

Next, to access network locations outside of Snowflake, you first need to define an external access integration, then reference it within a dbt Python model. You can find an overview of Snowflake’s external network access [here](https://docs.snowflake.com/en/developer-guide/external-network-access/external-network-access-overview).
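
The pre-hook example below creates these objects from dbt at run time. If you would rather create them once, directly in a Snowflake worksheet, a minimal sketch of the equivalent DDL (using the same illustrative names as the pre-hook example) looks like this:

```sql
-- Create the egress rule and the integration once, outside of dbt.
-- Names match the pre-hook example below and are illustrative.
create or replace network rule test_network_rule
  type = host_port
  mode = egress
  value_list = ('api.carbonintensity.org.uk:443');

create or replace external access integration test_external_access_integration
  allowed_network_rules = (test_network_rule)
  enabled = true;
```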

This particular API is open, but if an API requires a key, you can handle it the same way you manage other secrets in Snowflake. More information on API authentication in Snowflake is available [here](https://docs.snowflake.com/en/user-guide/api-authentication).
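
As a hedged sketch only (this open API needs no key, and the secret name and value below are hypothetical, not part of the original example), storing a key as a Snowflake secret and allowing the integration to use it might look like:

```sql
-- Hypothetical: store an API key as a Snowflake secret.
create or replace secret my_api_key_secret
  type = generic_string
  secret_string = '<your-api-key>';

-- Recreate the integration so it may use the secret for authentication.
create or replace external access integration test_external_access_integration
  allowed_network_rules = (test_network_rule)
  allowed_authentication_secrets = (my_api_key_secret)
  enabled = true;
```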

For simplicity’s sake, we will show how to create both objects using [pre-hooks](/reference/resource-configs/pre-hook-post-hook) in a model configuration YAML file:

```yaml
models:
  - name: external_access_sample
    config:
      pre_hook:
        - "create or replace network rule test_network_rule type = host_port mode = egress value_list = ('api.carbonintensity.org.uk:443');"
        - "create or replace external access integration test_external_access_integration allowed_network_rules = (test_network_rule) enabled = true;"
```
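
Because these pre-hooks issue `create or replace` statements, the role dbt runs with needs the privileges to create network rules and integrations. To confirm the objects exist after a run, a quick check in a Snowflake worksheet (assuming your role can see them) is:

```sql
-- Verify the objects created by the pre-hooks above
show network rules like 'test_network_rule';
show external access integrations like 'test_external_access_integration';
```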

Then we can simply use the new `external_access_integrations` configuration parameter to use our network rule within a Python model (called `external_access_sample.py`):

```python
import snowflake.snowpark as snowpark


def model(dbt, session: snowpark.Session):
    dbt.config(
        materialized="table",
        # Attach the integration created in the pre-hooks above
        external_access_integrations=["test_external_access_integration"],
        packages=["httpx==0.26.0"],
    )
    import httpx

    # Call the API from inside Snowflake and store the raw JSON response
    return session.create_dataframe(
        [{"carbon_intensity": httpx.get(url="https://api.carbonintensity.org.uk/intensity").text}]
    )
```
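
With the YAML and Python files in place, building the model is a normal dbt invocation, for example `dbt run --select external_access_sample`: the pre-hooks create the network rule and integration, then the Python model calls the API from inside Snowflake.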

The result is a model with some JSON I can parse, for example in a SQL model, to extract some information:
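
For reference, the endpoint returns a payload shaped roughly like this (values are illustrative), which is what the SQL below flattens and casts:

```json
{
  "data": [
    {
      "from": "2024-06-12T11:30Z",
      "to": "2024-06-12T12:00Z",
      "intensity": { "forecast": 150, "actual": 145, "index": "moderate" }
    }
  ]
}
```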

```sql
{{
    config(
        materialized='incremental',
        unique_key='dbt_invocation_id'
    )
}}

with raw as (
    select parse_json(carbon_intensity) as carbon_intensity_json
    from {{ ref('external_access_sample') }}
)

select
    '{{ invocation_id }}' as dbt_invocation_id,
    value:from::TIMESTAMP_NTZ as start_time,
    value:to::TIMESTAMP_NTZ as end_time,
    value:intensity.actual::NUMBER as actual_intensity,
    value:intensity.forecast::NUMBER as forecast_intensity,
    value:intensity.index::STRING as intensity_index
from raw,
    lateral flatten(input => raw.carbon_intensity_json:data)
```

The result is an incremental model that keeps track of dbt invocations and the UK carbon intensity level at the time of each run.

<Lightbox src="/img/blog/2024-06-12-putting-your-dag-on-the-internet/image1.png" title="Preview of the output in the dbt Cloud IDE" />

## dbt best practices

This is a very new area for Snowflake and dbt. Something special about SQL and dbt is that they are very resistant to external entropy; the second we rely on API calls, Python packages, and other external dependencies, we open ourselves up to much more of it. APIs will change and break, and your models could fail.

Traditionally, dbt is the T in ELT (dbt overview [here](https://docs.getdbt.com/terms/elt)), and this functionality unlocks brand-new EL capabilities for which best practices do not yet exist. What’s clear is that EL workloads should be separated from T workloads, perhaps in a different modeling layer. Note also that unless you use incremental models, your historical API responses can easily be lost, since a table materialization is rebuilt on every run. dbt has seen a lot of use cases for this, including the AI example outlined in this external [engineering blog post](https://klimmy.hashnode.dev/enhancing-your-dbt-project-with-large-language-models).

**A few words about the power of Commercial Open Source Software**

In order to get this functionality shipped quickly, EQT opened a pull request, Snowflake helped with some problems we had in CI, and a member of dbt Labs helped write the tests and merge the code in!

This functionality is available in dbt 1.8+ and with the “Keep on latest version” option in dbt Cloud (overview [here](/docs/dbt-versions/upgrade-dbt-version-in-cloud#keep-on-latest-version)).

dbt Labs staff and community members would love to chat more about it in the [#db-snowflake](https://getdbt.slack.com/archives/CJN7XRF1B) Slack channel.