dbt Core is a popular open-source command-line tool for orchestrating and managing data transformation and modeling in your data warehouse. It is particularly well suited to analytics teams that build SQL-based transformations in data warehousing environments. Here are some reasons to consider using dbt in your data workflows (a short model sketch after the list illustrates a few of them):
- Modularity and Reusability
- Version Control
- Code Reviews and Collaboration
- Data Testing
- Incremental Processing
- Easy Deployment
- Documentation
- Data Lineage
- Elastic Compute Scaling
- Community and Ecosystem
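To make the first few points concrete, here is a minimal sketch of a dbt model. The model name, source name, and date filter are hypothetical; the `incremental` materialization and the Jinja macros are standard dbt features:

```sql
-- models/stg_orders.sql (illustrative sketch, not part of this repo)
-- assumes a source named 'raw' pointing at the raw_orders table is
-- declared in a schema .yml file
{{ config(materialized='incremental') }}

select
    id as order_id,
    user_id as customer_id,
    order_date,
    status
from {{ source('raw', 'raw_orders') }}

{% if is_incremental() %}
  -- on incremental runs, only pick up rows newer than what is already loaded
  where order_date > (select max(order_date) from {{ this }})
{% endif %}
```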
In this repo we will explore dbt Core with Apache Spark.
- Python 3.10 (version 3.8 or higher works)
- Spark (version 3.1.1) Thrift Server
- Linux OS (to run the commands below directly)
- Create a Spark Thrift Server using Docker and run it
- Copy the CSV files to the Docker container
- Connect to the Spark container
- Run beeline and connect to Hive
- Create tables from the CSV files
- Create a venv on your local machine
- Install dbt
- Initialize and configure dbt
- Check the configuration using dbt debug
- Create SQL and YML files for the transformations
- Run dbt
- Create a DBeaver connection and inspect the data
This docker-compose file starts a Spark Thrift Server and a Postgres database as the Hive Metastore backend.
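For reference, the compose file (taken from the dbt-spark repository, see the links below) defines two services roughly like this; treat it as an illustrative excerpt reconstructed from the container listing further down, not the full file:

```yaml
version: "3.7"
services:
  dbt-spark3-thrift:
    image: godatadriven/spark:3.1.1
    ports:
      - "10000:10000"  # Thrift JDBC endpoint used by beeline and dbt
      - "4040:4040"    # Spark UI
  dbt-hive-metastore:
    image: postgres:9.6.17-alpine
    ports:
      - "5433:5432"    # Postgres backing the Hive Metastore
```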
docker-compose up -d
docker ps
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
0594d2d79ef6 godatadriven/spark:3.1.1 "/scripts/entrypoint…" 3 weeks ago Up 3 hours 0.0.0.0:4040->4040/tcp, 0.0.0.0:10000->10000/tcp dbt_core-dbt-spark3-thrift-1
ef7fb9edd3aa postgres:9.6.17-alpine "docker-entrypoint.s…" 3 weeks ago Up 3 hours 0.0.0.0:5433->5432/tcp dbt_core-dbt-hive-metastore-1
Change "your_path" and "container_id" as your info.
cd /*your_path*/dbt_core/seeds
docker cp raw_customers.csv *container_id*:/opt
docker cp raw_orders.csv *container_id*:/opt
docker cp raw_payments.csv *container_id*:/opt
docker exec -it *container_id* bash
/usr/spark/bin/beeline
!connect jdbc:hive2://localhost:10000
Example Output:
root@0594d2d79ef6:/# /usr/spark/bin/beeline
log4j:WARN No appenders could be found for logger (org.apache.hadoop.util.Shell).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.
Beeline version 2.3.7 by Apache Hive
beeline> !connect jdbc:hive2://localhost:10000
Connecting to jdbc:hive2://localhost:10000
Enter username for jdbc:hive2://localhost:10000: train
Enter password for jdbc:hive2://localhost:10000: ********
Connected to: Spark SQL (version 3.1.1)
Driver: Hive JDBC (version 2.3.7)
Transaction isolation: TRANSACTION_REPEATABLE_READ
0: jdbc:hive2://localhost:10000>
CREATE TABLE default.raw_customers (
id int,
first_name string,
last_name string
)
USING CSV
OPTIONS (path "/opt/raw_customers.csv",
delimiter ",",
header "true");
CREATE TABLE default.raw_orders (
id int,
user_id int,
order_date date,
status string
)
USING CSV
OPTIONS (path "/opt/raw_orders.csv",
delimiter ",",
header "true");
CREATE TABLE default.raw_payments (
id int,
order_id int,
payment_method string,
amount bigint
)
USING CSV
OPTIONS (path "/opt/raw_payments.csv",
delimiter ",",
header "true");
python3 -m venv dbt-env
source dbt-env/bin/activate
pip install "dbt-spark[PyHive]"
Note: the sasl dependency can be a problem when installing the package. If you are using Linux, you can follow the steps here.
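Once the install succeeds, you can confirm that dbt and the Spark adapter are available:

```bash
dbt --version
```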
dbt init tutorial_spark
After initialization, dbt creates the project folder structure.
cd tutorial_spark
dbt debug
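dbt debug validates the connection settings stored in ~/.dbt/profiles.yml. For this Thrift setup, a working profile looks roughly like the following (field names follow the dbt-spark documentation; the user and schema values are assumptions matching the beeline session above):

```yaml
tutorial_spark:
  target: dev
  outputs:
    dev:
      type: spark
      method: thrift      # connect via the Spark Thrift Server
      host: localhost
      port: 10000
      user: train         # same user as in the beeline example
      schema: default     # database where dbt will build models
      threads: 1
```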
Visit https://github.com/dbt-labs/jaffle_shop and copy the files from its "models" folder into your tutorial_spark/models folder.
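The jaffle_shop models folder also ships a schema.yml that declares tests on the models; a representative excerpt looks like this (see the jaffle_shop repo for the full file):

```yaml
version: 2

models:
  - name: customers
    columns:
      - name: customer_id
        tests:
          - unique
          - not_null
```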
dbt run
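Since the copied models include tests (see the schema.yml excerpt above), you can also run them after the build:

```bash
dbt test
```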
Download DBeaver: https://dbeaver.io/download/
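In DBeaver, create a new connection pointing at jdbc:hive2://localhost:10000 (using the Apache Hive driver here is an assumption that works with the Spark Thrift Server), then run a quick query against one of the models dbt built, such as the jaffle_shop customers model:

```sql
SELECT * FROM default.customers LIMIT 10;
```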
- What is dbt?: https://docs.getdbt.com/docs/introduction
- Apache Spark Setup for dbt: https://docs.getdbt.com/docs/core/connect-data-platform/spark-setup
- Configuration and Properties of dbt: https://docs.getdbt.com/reference/configs-and-properties
- Docker compose file source: https://github.com/dbt-labs/dbt-spark
- Understanding Spark-SQL: https://spark.apache.org/docs/3.1.1/sql-distributed-sql-engine.html
- Data Source: https://github.com/dbt-labs/jaffle_shop
- Solving csv header problem: https://kb.databricks.com/sql/query-not-skip-header-ext-table