
Commit

Merge branch 'current' into test-vale
mirnawong1 authored Jun 17, 2024
2 parents 08d82db + 3e2251e commit b05396c
Showing 114 changed files with 819 additions and 289 deletions.
4 changes: 2 additions & 2 deletions website/blog/2021-02-05-dbt-project-checklist.md
@@ -173,8 +173,8 @@ This post is the checklist I created to guide our internal work, and I’m shari

Useful Links

* [FAQs for documentation](/docs/collaborate/documentation#faqs)
* [Doc blocks](/docs/collaborate/documentation#using-docs-blocks)
* [FAQs for documentation](/docs/build/documentation#faqs)
* [Doc blocks](/docs/build/documentation#using-docs-blocks)

## ✅ dbt Cloud specifics
----------------------------------------------------------------------------------------------------------------------------------------------------------
@@ -87,7 +87,7 @@ The most important thing we’re introducing when your project is an infant is t

* Introduce modularity with [{{ ref() }}](/reference/dbt-jinja-functions/ref) and [{{ source() }}](/reference/dbt-jinja-functions/source)

* [Document](/docs/collaborate/documentation) and [test](/docs/build/data-tests) your first models
* [Document](/docs/build/documentation) and [test](/docs/build/data-tests) your first models

![image alt text](/img/blog/building-a-mature-dbt-project-from-scratch/image_3.png)

2 changes: 1 addition & 1 deletion website/blog/2022-09-28-analyst-to-ae.md
@@ -133,7 +133,7 @@ It’s much easier to keep to a naming guide when the writer has a deep understa
If we want to know how certain logic was built technically, then we can reference the SQL code in dbt docs. If we want to know *why* a certain logic was built into that specific model, then that’s where we’d turn to the documentation.

- Example of not-so-helpful documentation ([dbt docs can](https://docs.getdbt.com/docs/collaborate/documentation) build this dynamically):
- Example of not-so-helpful documentation ([dbt docs can](https://docs.getdbt.com/docs/build/documentation) build this dynamically):
- `Case when Zone = 1 and Level like 'A%' then 'True' else 'False' end as GroupB`
- Example of better, more descriptive documentation (add to your dbt markdown file or column descriptions):
- Group B is defined as Users in Zone 1 with a Level beginning with the letter 'A'. These users are accessing our new add-on product that began in Beta in August 2022. It's recommended to filter them out of the main Active Users metric.
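
For reference, a minimal sketch of where a description like this could live in a model properties file; the `dim_users` model and `group_b` column names here are hypothetical:

```yaml
version: 2

models:
  - name: dim_users
    columns:
      - name: group_b
        description: >
          Group B is defined as users in Zone 1 with a Level beginning with the
          letter 'A'. These users are accessing our new add-on product that began
          in Beta in August 2022. Recommended to filter them out of the main
          Active Users metric.
```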
4 changes: 2 additions & 2 deletions website/blog/2023-02-14-passing-the-dbt-certification-exam.md
@@ -25,7 +25,7 @@ In this article, two Montreal Analytics consultants, Jade and Callie, discuss th

**J:** To prepare for the exam, I built up a practice dbt project. All consultants do this as part of Montreal Analytics’ onboarding process, and this project allowed me to practice implementing sources and tests, refactoring SQL models, and debugging plenty of error messages. Additionally, I reviewed the [Certification Study Guide](https://www.getdbt.com/assets/uploads/dbt_certificate_study_guide.pdf) and attended group learning sessions.

**C:** To prepare for the exam I reviewed the official dbt Certification Study Guide and the [official dbt docs](https://docs.getdbt.com/), and attended group study and learning sessions that were hosted by Montreal Analytics for all employees interested in taking the exam. As a group, we prioritized subjects that we felt less familiar with; for the first cohort of test takers this was mainly newer topics that haven’t yet become integral to a typical dbt project, such as [doc blocks](https://docs.getdbt.com/docs/collaborate/documentation#using-docs-blocks) and [configurations versus properties](https://docs.getdbt.com/reference/configs-and-properties). These sessions mainly covered the highlights and common “gotchas” that are experienced using these techniques. The sessions were moderated by a team member who had already successfully completed the dbt Certification, but operated in a very collaborative environment, so everyone could provide additional information, ask questions to the group, and provide feedback to other members of our certification taking group.
**C:** To prepare for the exam I reviewed the official dbt Certification Study Guide and the [official dbt docs](https://docs.getdbt.com/), and attended group study and learning sessions that were hosted by Montreal Analytics for all employees interested in taking the exam. As a group, we prioritized subjects that we felt less familiar with; for the first cohort of test takers this was mainly newer topics that haven’t yet become integral to a typical dbt project, such as [doc blocks](https://docs.getdbt.com/docs/build/documentation#using-docs-blocks) and [configurations versus properties](https://docs.getdbt.com/reference/configs-and-properties). These sessions mainly covered the highlights and common “gotchas” that are experienced using these techniques. The sessions were moderated by a team member who had already successfully completed the dbt Certification, but operated in a very collaborative environment, so everyone could provide additional information, ask questions to the group, and provide feedback to other members of our certification taking group.

I felt comfortable with the breadth of my dbt knowledge and had familiarity with most topics. However, in my day-to-day implementation, I am often reliant on documentation or on copying and pasting specific configurations to get the correct settings. Therefore, my focus was on memorizing important criteria for *how to use* certain features, particularly the order and nesting of how the key YAML files are configured (dbt_project.yml, table.yml, source.yml).

@@ -75,4 +75,4 @@ Now, the first thing you must do when you’ve passed a test is to get yourself
Standards and best practices are very important, but a test is a measure at a single point in time of a rapidly evolving industry. It’s also a measure of my test-taking abilities, my stress levels, and other things unrelated to my skill in data modeling; I wouldn’t be a good analyst if I didn’t recognize the faults of a measurement. I’m glad to have this check mark completed, but I will continue to stay up to date with changes, learn new data skills and techniques, and find ways to continue being a holistically helpful teammate to my colleagues and clients.


You can learn more about the dbt Certification [here](https://www.getdbt.com/blog/dbt-certification-program/).
You can learn more about the dbt Certification [here](https://www.getdbt.com/blog/dbt-certification-program/).
2 changes: 1 addition & 1 deletion website/blog/2023-05-04-generating-dynamic-docs.md
@@ -215,7 +215,7 @@ Which in turn can be copy-pasted into a new `.yml` file. In our example, we writ

## Create docs blocks for the new columns

[Docs blocks](https://docs.getdbt.com/docs/collaborate/documentation#using-docs-blocks) can be utilized to write more DRY and robust documentation. To use docs blocks, update your folder structure to contain a `.md` file. Your file structure should now look like this:
[Docs blocks](https://docs.getdbt.com/docs/build/documentation#using-docs-blocks) can be utilized to write more DRY and robust documentation. To use docs blocks, update your folder structure to contain a `.md` file. Your file structure should now look like this:

```
models/core/activity_based_interest
```
119 changes: 119 additions & 0 deletions website/blog/2024-06-12-putting-your-dag-on-the-internet.md
@@ -0,0 +1,119 @@
---
title: Putting Your DAG on the internet
description: "Use dbt and Snowflake's external access integrations to allow Snowflake Python models access the internet."
slug: dag-on-the-internet

authors: [ernesto_ongaro, sebastian_stan, filip_byrén]

tags: [analytics craft, APIs, data ecosystem]
hide_table_of_contents: false

date: 2024-06-14
is_featured: true
---

**New in dbt: allow Snowflake Python models to access the internet**

With dbt 1.8, dbt released support for Snowflake’s [external access integrations](https://docs.snowflake.com/en/developer-guide/external-network-access/external-network-access-overview), further enabling the use of dbt + AI to enrich your data. This allows dbt Python models to query external APIs, functionality that dbt Cloud customer [EQT AB](https://eqtgroup.com/) required. Learn why they needed it and how they helped build the feature and get it shipped!

<!--truncate-->
## Why did EQT require this functionality?
by Filip Byrén, VP and Software Architect (EQT), and Sebastian Stan, Data Engineer (EQT)

_EQT AB is a global investment organization and, as a long-term customer of dbt Cloud, has presented at dbt’s Coalesce [2020](https://www.getdbt.com/coalesce-2020/seven-use-cases-for-dbt) and [2023](https://www.youtube.com/watch?v=-9hIUziITtU)._

_Motherbrain Labs is EQT’s bespoke AI team, primarily focused on accelerating our portfolio companies' roadmaps through hands-on data and AI work. Due to the high demand for our time, we are constantly exploring mechanisms for simplifying our processes and increasing our own throughput. Integration of workflow components directly in dbt has been a major efficiency gain and helped us rapidly deliver across a global portfolio._

Motherbrain Labs is focused on creating measurable AI impact in our portfolio. We work hand-in-hand with our deal teams and portfolio company leadership, but our starting approach is always the same: identify which data matters.

While we have access to reams of proprietary information, we believe the greatest effect happens when we combine that information with external datasets like geolocation, demographics, or competitor traction.

These valuable datasets often come from third-party vendors who operate on a pay-per-use model: a single charge for every piece of information we want. To avoid overspending, we focus on enriching only the specific subset of data that is relevant to an individual company's strategic question.

In response to this recurring need, we have partnered with Snowflake and dbt to introduce new functionality that facilitates communication with external endpoints and manages secrets within dbt. This new integration enables us to incorporate enrichment processes directly into our DAGs, similar to how current Python models are utilized within dbt environments. We’ve found that this augmented approach allows us to reduce complexity and enable external communications before materialization.

## An example with Carbon Intensity: How does it work?

In this section, we will demonstrate how to integrate an external API to retrieve the current Carbon Intensity of the UK power grid. The goal is to illustrate how the feature works and to explore how scheduling data transformations at different times could reduce their carbon footprint, making them a greener choice. We will be leveraging the API from the [UK National Grid ESO](https://www.nationalgrideso.com/) to achieve this.

To start, we need to set up a network rule (Snowflake instructions [here](https://docs.snowflake.com/en/user-guide/network-rules)) to allow access to the external API. Specifically, we'll create an egress rule to permit Snowflake to communicate with api.carbonintensity.org.uk.

Next, to access network locations outside of Snowflake, you need to define an external access integration first and reference it within a dbt Python model. You can find an overview of Snowflake's external network access [here](https://docs.snowflake.com/en/developer-guide/external-network-access/external-network-access-overview).

This API is open and doesn’t require an API key; if your API does need one, handle the key the same way you manage other secrets (a sketch follows the pre-hook example below). More information on API authentication in Snowflake is available [here](https://docs.snowflake.com/en/user-guide/api-authentication).

For simplicity’s sake, we will show how to create both using [pre-hooks](/reference/resource-configs/pre-hook-post-hook) in a model configuration YAML file:


```yaml
models:
  - name: external_access_sample
    config:
      pre_hook:
        - "create or replace network rule test_network_rule type = host_port mode = egress value_list= ('api.carbonintensity.org.uk:443');"
        - "create or replace external access integration test_external_access_integration allowed_network_rules = (test_network_rule) enabled = true;"
```
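
The Carbon Intensity API is open, so no key is involved here, but if your endpoint did require one, a possible approach is to create a Snowflake secret and allow it on the same integration. This is only a sketch: the `carbon_api_key` name and placeholder value are hypothetical, and in practice you would create the secret outside of version-controlled dbt code rather than inlining it in a pre-hook. How the Python model then reads the secret follows Snowflake's API authentication docs linked above.

```yaml
models:
  - name: external_access_sample
    config:
      pre_hook:
        # Hypothetical: store the key as a Snowflake secret (better created outside dbt)
        - "create or replace secret carbon_api_key type = generic_string secret_string = '<your-api-key>';"
        - "create or replace network rule test_network_rule type = host_port mode = egress value_list= ('api.carbonintensity.org.uk:443');"
        # Allow the secret on the integration so it can be used for authentication
        - "create or replace external access integration test_external_access_integration allowed_network_rules = (test_network_rule) allowed_authentication_secrets = (carbon_api_key) enabled = true;"
```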

Then we can simply use the new `external_access_integrations` configuration parameter to use our integration (and its network rule) within a Python model (called `external_access_sample.py`):


```python
import snowflake.snowpark as snowpark

def model(dbt, session: snowpark.Session):
    dbt.config(
        materialized="table",
        external_access_integrations=["test_external_access_integration"],
        packages=["httpx==0.26.0"]
    )
    import httpx
    return session.create_dataframe(
        [{"carbon_intensity": httpx.get(url="https://api.carbonintensity.org.uk/intensity").text}]
    )
```


The result is a model with some JSON that I can parse, for example, in a SQL model to extract some information:


```sql
{{
    config(
        materialized='incremental',
        unique_key='dbt_invocation_id'
    )
}}

with raw as (
    select parse_json(carbon_intensity) as carbon_intensity_json
    from {{ ref('external_access_sample') }}
)

select
    '{{ invocation_id }}' as dbt_invocation_id,
    value:from::TIMESTAMP_NTZ as start_time,
    value:to::TIMESTAMP_NTZ as end_time,
    value:intensity.actual::NUMBER as actual_intensity,
    value:intensity.forecast::NUMBER as forecast_intensity,
    value:intensity.index::STRING as intensity_index
from raw,
    lateral flatten(input => raw.carbon_intensity_json:data)
```


The result is a model that keeps track of each dbt invocation and the UK carbon intensity levels at the time it ran.

<Lightbox src="/img/blog/2024-06-12-putting-your-dag-on-the-internet/image1.png" title="Preview in dbt Cloud IDE of output" />

## dbt best practices

This is a very new area for Snowflake and dbt. Something special about SQL and dbt is that they’re very resistant to external entropy; the second we rely on API calls, Python packages, and other external dependencies, we open ourselves up to a lot more of it. APIs will change or break, and your models could fail.

Traditionally dbt is the T in ELT (dbt overview [here](https://docs.getdbt.com/terms/elt)), and this functionality unlocks brand new EL capabilities for which best practices do not yet exist. What’s clear is that EL workloads should be separated from T workloads, perhaps in a different modeling layer. Note also that unless you use incremental models, the historical data you pull can easily be lost when a model is rebuilt. dbt has seen a lot of use cases for this, including an AI example outlined in this external [engineering blog post](https://klimmy.hashnode.dev/enhancing-your-dbt-project-with-large-language-models).
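
One lightweight way to keep EL workloads separated from T workloads is to tag the models that call external APIs so they can be scheduled, or excluded, on their own. A minimal sketch, with an illustrative `external_api` tag name:

```yaml
models:
  - name: external_access_sample
    config:
      tags: ['external_api']  # marks this model as an EL workload that calls out to the internet
```

You could then run these models on their own cadence with `dbt build --select tag:external_api` and leave them out of regular transformation jobs with `--exclude tag:external_api`.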

**A few words about the power of Commercial Open Source Software**

To get this functionality shipped quickly, EQT opened a pull request, Snowflake helped with some problems we had with CI, and a member of dbt Labs helped write the tests and merge the code in!

This functionality is now available in dbt 1.8+ and with the “Keep on latest version” option in dbt Cloud (overview [here](/docs/dbt-versions/upgrade-dbt-version-in-cloud#keep-on-latest-version)).

dbt Labs staff and community members would love to chat more about it in the [#db-snowflake](https://getdbt.slack.com/archives/CJN7XRF1B) Slack channel.
27 changes: 27 additions & 0 deletions website/blog/authors.yml
@@ -614,3 +614,30 @@ anders_swanson:
  links:
    - icon: fa-linkedin
      url: https://www.linkedin.com/in/andersswanson

ernesto_ongaro:
  image_url: /img/blog/authors/ernesto-ongaro.png
  job_title: Senior Solutions Architect
  name: Ernesto Ongaro
  organization: dbt Labs
  links:
    - icon: fa-linkedin
      url: https://www.linkedin.com/in/eongaro

sebastian_stan:
  image_url: /img/blog/authors/sebastian-eqt.png
  job_title: Data Engineer
  name: Sebastian Stan
  organization: EQT Group
  links:
    - icon: fa-linkedin
      url: https://www.linkedin.com/in/sebastian-lindblom/

filip_byrén:
  image_url: /img/blog/authors/filip-eqt.png
  job_title: VP and Software Architect
  name: Filip Byrén
  organization: EQT Group
  links:
    - icon: fa-linkedin
      url: https://www.linkedin.com/in/filip-byr%C3%A9n/
@@ -3,7 +3,7 @@ title: "Marts for the Semantic Layer"
id: "5-semantic-layer-marts"
---

The Semantic Layer alters some fundamental principles of how you organize your project. Using dbt without the Semantic Layer necessitates creating the most useful combinations of your building block components into wide, denormalized marts. On the other hand, the Semantic Layer leverages MetricFlow to denormalize every possible combination of components we've encoded dynamically. As such we're better served to bring more normalized models through from the logical layer into the Semantic Layer to maximize flexibility. This section will assume familiarity with the best practices laid out in the [How we build our metrics](/best-practices/how-we-build-our-metrics/semantic-layer-1-intro) guide, so check that out first for a more hands-on introduction to the Semantic Layer.
The [dbt Semantic Layer](/docs/use-dbt-semantic-layer/dbt-sl) alters some fundamental principles of how you organize your project. Using dbt without the Semantic Layer necessitates creating the most useful combinations of your building block components into wide, denormalized marts. On the other hand, the Semantic Layer leverages MetricFlow to denormalize every possible combination of components we've encoded dynamically. As such we're better served to bring more normalized models through from the logical layer into the Semantic Layer to maximize flexibility. This section will assume familiarity with the best practices laid out in the [How we build our metrics](/best-practices/how-we-build-our-metrics/semantic-layer-1-intro) guide, so check that out first for a more hands-on introduction to the Semantic Layer.

## Semantic Layer: Files and folders

@@ -36,6 +36,40 @@ models
```
└── stg_supplies.yml
```

## Semantic Layer: Where and why?

- 📂 **Directory structure**: Add your semantic models to `models/semantic_models` with directories corresponding to the models/marts files. This type of organization makes it easier to search and find what you can join. It also supports better maintenance and reduces repeated code.

<File name='models/marts/sem_orders.yml'>

```yaml
semantic_models:
  - name: orders
    defaults:
      agg_time_dimension: order_date
    description: |
      Order fact table. This table’s grain is one row per order.
    model: ref('fct_orders')
    entities:
      - name: order_id
        type: primary
      - name: customer_id
        type: foreign
    dimensions:
      - name: order_date
        type: time
        type_params:
          time_granularity: day
```
</File>
## Naming convention
- 🏷️ **Semantic model names**: Use the `sem_` prefix for semantic model names, such as `sem_cloud_user_account_activity`. This follows the same pattern as other naming conventions like `fct_` for fact tables and `dim_` for dimension tables.
- 🧩 **Entity names**: Don't use prefixes for entity names within the semantic model. This keeps the names clear and focused on their specific purpose.
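
A minimal sketch of both conventions together; the underlying table and entity names below are illustrative rather than taken from the example project above:

```yaml
semantic_models:
  - name: sem_cloud_user_account_activity   # semantic model name takes the `sem_` prefix
    model: ref('fct_cloud_user_account_activity')
    entities:
      - name: account_id                    # entity names stay unprefixed
        type: primary
```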

This guidance helps you make sure your dbt project is organized, maintainable, and scalable, allowing you to take full advantage of the capabilities offered by the dbt Semantic Layer.

## When to make a mart

- ❓ If we can go directly to staging models and it's better to serve normalized models to the Semantic Layer, then when, where, and why would we make a mart?
