This product uses the TMDB API but is not endorsed or certified by TMDB.
This project leverages The Movie Database (TMDB) API to extract and load data into Google BigQuery, transforms the data with dbt, and visualizes insights with Evidence. The goal is to surface key movie and TV series trends that help media professionals and enthusiasts understand what content captures viewers' interest.
This pipeline streamlines data extraction, loading, transformation, and reporting, using modern data engineering tools and practices to ensure scalability and reproducibility.
- dlt (Data Load Tool): For extracting and loading data into Google BigQuery.
- dbt (Data Build Tool): For transforming data within BigQuery.
- Evidence.dev: Code-driven alternative to drag-and-drop BI tools.
- Docker: For containerization of the pipeline.
- Prefect: For workflow orchestration.
- Terraform: For Infrastructure as Code (IaC).
- Google BigQuery: The Data Warehouse.
- DuckDB: For local testing.
I know using GCS is part of the evaluation criteria; however, I intentionally did not include it in this project for the following reasons:
- Data Volume: The data volume from the TMDB API is manageable within BigQuery without the need for intermediate storage.
- Complexity and Cost: Avoiding GCS simplifies the architecture and reduces costs associated with storage and data transfer, especially beneficial for small to medium datasets.
- Misconceptions about the "Data Lake": New data engineers often believe that integrating cloud storage such as Google Cloud Storage (GCS) or AWS S3 is a mandatory step in data pipelines. However, this is not always necessary and can introduce unnecessary complexity and cost. When data can be ingested and processed directly by a data warehouse like BigQuery, bypassing intermediate cloud storage streamlines workflows and reduces overhead (see the sketch after this list).
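To make the direct-ingestion point concrete, here is a minimal, hypothetical dlt sketch that loads a TMDB endpoint straight into BigQuery with no GCS staging in between. The resource name, dataset name, and endpoint are illustrative assumptions, not this project's actual code; the real API key belongs in `.dlt/secrets.toml`.

```python
import dlt
import requests

TMDB_API_KEY = "<your-tmdb-api-key>"  # placeholder; real key lives in .dlt/secrets.toml


@dlt.resource(name="trending_movies", write_disposition="replace")
def trending_movies():
    # Hypothetical example endpoint: TMDB "trending movies" for the current week
    url = "https://api.themoviedb.org/3/trending/movie/week"
    response = requests.get(url, params={"api_key": TMDB_API_KEY}, timeout=30)
    response.raise_for_status()
    yield response.json()["results"]


pipeline = dlt.pipeline(
    pipeline_name="tmdb",
    destination="bigquery",   # dlt creates the dataset and tables in BigQuery directly
    dataset_name="tmdb_raw",  # assumed dataset name
)

if __name__ == "__main__":
    load_info = pipeline.run(trending_movies())
    print(load_info)
```

Swapping `destination="bigquery"` for `destination="duckdb"` is how the same script can be tested locally, which is why DuckDB appears below as an optional prerequisite.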
- Docker installed
- Python installed
- Terraform installed
- Make: While `make` is readily available and commonly used on Linux and macOS, it is not included by default on Windows. It can be installed easily with Chocolatey (a package manager for Windows): `choco install make`
- Node.js installed: This is required to run Evidence.dev ("Build polished data products with SQL").
- dlt credentials: Click here for instructions on how to add credentials under `.dlt/secrets.toml` (an illustrative example is sketched after this list).
- Evidence.dev credentials: Click here for instructions on connecting your local development environment to BigQuery.
- Create Google Cloud Project
- Google Cloud Platform Credentials JSON
- DuckDB: This is completely optional, but if you want to test your dlt Python script locally, install DuckDB.
- Generate an API key from TMDB (themoviedb.org)
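For reference, here is a minimal sketch of what `.dlt/secrets.toml` could look like for this setup. The `[sources.tmdb]` section name is an assumption made for illustration; the `[destination.bigquery.credentials]` keys follow dlt's documented BigQuery service-account layout. All values are placeholders.

```toml
# Placeholder values only -- never commit real secrets
[sources.tmdb]                      # assumed section name for the TMDB source
api_key = "<your-tmdb-api-key>"

[destination.bigquery.credentials]
project_id = "<your-gcp-project-id>"
private_key = "<service-account-private-key>"
client_email = "<service-account-email>"
```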
Be sure to create a `.env` file, and ensure it is configured correctly for your dbt `profiles.yml`.
- dbt Configuration: Ensure `~/.dbt/profiles.yml` is correctly set up to connect to your BigQuery instance.
- dlt Configuration: Update `secrets.toml` under `.dlt/` with your keys from themoviedb.org and Google BigQuery.
- Prefect Configuration: In `prefect.yaml`, make sure to change `prefect.deployments.steps.set_working_directory` to your local project path.
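For orientation, a minimal `~/.dbt/profiles.yml` sketch for a BigQuery connection is shown below; the profile name, dataset, and key path are assumptions and must match your own project settings.

```yaml
tmdb_project:          # assumed profile name; must match the profile in dbt_project.yml
  target: dev
  outputs:
    dev:
      type: bigquery
      method: service-account
      project: <your-gcp-project-id>
      dataset: tmdb_dbt                       # assumed dataset name
      keyfile: /path/to/gcp-credentials.json  # assumed path to your service account key
      threads: 4
```

And the relevant `prefect.yaml` step, with the directory shown as a placeholder path:

```yaml
pull:
  - prefect.deployments.steps.set_working_directory:
      directory: /path/to/de-zoomcamp-project  # replace with your local clone path
```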
I want to clarify the purpose and setup of Terraform within this project. The configuration files in the terraform folder primarily ensure that the environment is correctly prepared, especially regarding the credentials file; in practice, it just verifies that your keys are correct. That's it.
Fortunately, dlt handles the creation of the necessary datasets, and given the simplicity of this project, Terraform isn't essential; it simply helps ensure that all system components are properly configured before running the pipeline.
If you decide to test it, update the default path of the `credentials_file` variable in `terraform/variables.tf`.
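The variable in question looks roughly like the sketch below; the description and default path are placeholders, not the repository's exact contents:

```hcl
variable "credentials_file" {
  description = "Path to the GCP service account key file"
  type        = string
  default     = "/path/to/gcp-credentials.json"  # update this to your own key path
}
```

Then run the standard Terraform workflow: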
```bash
# Move to terraform folder
cd terraform/

# init project
terraform init

# plan
terraform plan

# apply
terraform apply
```
You can refer to the `help` command for guidance on what commands are available and what each command does:

```bash
make help
```
Output:

```text
Usage:
make setup_uv - Instructions of uv using a script, system package manager, or pipx
make install_dependencies - Installs python dependencies using uv
make create_venv - Creates a virtual environment using uv
make activate_venv - Instructions to activate python virtual environment
make run_prefect_server - Runs prefect localhost server
make deploy_prefect - Deploys Prefect flows
make start_evidence - Sets up and runs the evidence.dev project
```
This command will display all available options and their descriptions, allowing you to easily understand how to interact with your project using the make commands.
- Clone the repository and use Terraform:

```bash
git clone git@github.com:theDataFixer/de-zoomcamp-project.git
cd de-zoomcamp-project
```
- Install uv (an extremely fast Python package installer and resolver, written in Rust), and activate the virtual environment:

```bash
make setup_uv
make create_venv
make activate_venv
```
- Install Python dependencies:

```bash
make install_dependencies
```
- Start the Prefect server:

```bash
make run_prefect_server
```
- Deploy the Prefect flows:

```bash
make deploy_prefect
```

After deploying, you'll get a message in the terminal to start a worker with your chosen pool name; then go to localhost:4200 and run the workflow.
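For example, assuming a work pool named `default-pool` (yours will likely differ), the worker command looks like this:

```bash
# Start a worker that polls the work pool referenced by the deployment
prefect worker start --pool "default-pool"
```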
- Start and use Evidence.dev:

```bash
make start_evidence
```
If you get the Prefect error `sqlalchemy.exc.OperationalError: (sqlite3.OperationalError) database is locked`, you should switch the Prefect database to PostgreSQL. Instructions here.
In short, run:

```bash
docker run -d --name prefect-postgres -v prefectdb:/var/lib/postgresql/data -p 5432:5432 -e POSTGRES_USER=postgres -e POSTGRES_PASSWORD=yourTopSecretPassword -e POSTGRES_DB=prefect postgres:latest
```

And then:

```bash
prefect config set PREFECT_API_DATABASE_CONNECTION_URL="postgresql+asyncpg://postgres:yourTopSecretPassword@localhost:5432/prefect"
```
Feel free to reach out to me if you have any questions, comments, suggestions, or feedback: theDataFixer.xyz