Skip to content

Latest commit

 

History

History
194 lines (155 loc) · 9.27 KB

File metadata and controls

194 lines (155 loc) · 9.27 KB

Data Pipeline Project for Book Recommendation

Table of Contents
  1. Introduction
  2. Project Architecture
  3. Getting Started
  4. Data Ingestion
  5. Data Transformation
  6. Data Visualization
  7. Contact
  8. Acknowledgments

Introduction

This project is part of the Data Engineering Zoomcamp. As part of the project, I developed a data pipeline to load and process data from a Kaggle dataset containing bookstore information for a book recommendation system. The dataset can be accessed on this Kaggle.

This dataset offers book ratings from users of various ages and geographical locations. It comprises three files: User.csv, which includes age and location data of bookstore users; Books.csv, containing information such as authors, titles, and ISBNs; and Ratings.csv, which details the ratings given by users for each book. Additional information about the dataset is available on Kaggle.

The primary objective of this project is to establish a streamlined data pipeline for obtaining, storing, cleansing, and visualizing data automatically. This pipeline aims to address various queries, such as identifying top-rated publishers and authors, and analyzing ratings based on geographical location.

Given that the data is static, the data pipeline operates as a one-time process.

Built With

Project Architecture

architecture Cloud infrastructure is set up with Terraform.

Airflow is run on a local docker container.

Getting Started

Prerequisites

  1. A Google Cloud Platform account.
  2. A kaggle account.
  3. Install VSCode or Zed or any other IDE that works for you.
  4. Install Terraform
  5. Install Docker Desktop
  6. Install Google Cloud SDK
  7. Clone this repository onto your local machine.

Create a Google Cloud Project

  • Go to Google Cloud and create a new project.
  • Get the project ID and define the environment variables GCP_PROJECT_ID in the .env file located in the root directory
  • Create a Service account with the following roles:
    • BigQuery Admin
    • Storage Admin
    • Storage Object Admin
    • Viewer
  • Download the Service Account credentials and store it in $HOME/.google/credentials/.
  • You need to activate the following APIs here
    • Cloud Storage API
    • BigQuery API
  • Assign the GOOGLE_APPLICATION_CREDENTIALS environment variable to the path of your JSON credentials file, such that GOOGLE_APPLICATION_CREDENTIALS will be $HOME/.google/credentials/<authkeys_filename>.json
    • add this line to the end of the .bashrc file
    export GOOGLE_APPLICATION_CREDENTIALS=${HOME}/.google/google_credentials.json
    • Activate the enviroment variable by runing source .bashrc

Set up kaggle

  • A detailed description on how to authenicate is found here
  • Define the environment variables KAGGLE_USER and KAGGLE_TOKEN in the .env file located in the root directory. Note: KAGGLE_TOKEN is the same as KAGGLE_KEY

Set up the infrastructure on GCP with Terraform

  • Using Zed or VSCode, open the cloned project DE-2024-project-bookrecommendation.
  • To customize the default values of variable "project" and variable "region" to your preferred project ID and region, you have two options: either edit the variables.tf file in Terraform directly and modify the values, or set the environment variables TF_VAR_project and TF_VAR_region.
  • Open the terminal to the root project.
  • Navigate to the root directory of the project in the terminal and then change the directory to the terraform folder using the command cd terraform.
  • Set an alias alias tf='terraform'
  • Initialise Terraform: tf init
  • Plan the infrastructure: tf plan
  • Apply the changes: tf apply

Set up Airflow and Metabase

  • Please confirm that the following environment variables are configured in .env in the root directory of the project.
    • AIRFLOW_UID. The default value is 50000
    • KAGGLE_USERNAME. This should be set from Set up kaggle section.
    • KAGGLE_TOKEN. This should be set from Set up kaggle section too
    • GCP_PROJECT_ID. This should be set from Create a Google Cloud Project section
    • GCP_BOOK_RECOMMENDATION_BUCKET=book_recommendation_datalake_<GCP project id>
    • GCP_BOOK_RECOMMENDATION_WH_DATASET=book_recommendation_analytics
    • GCP_BOOK_RECOMMENDATION_WH_EXT_DATASET=book_recommendataion_wh
  • Run docker-compose up.
  • Access the Airflow dashboard by visiting http://localhost:8080/ in your web browser. The interface will resemble the following. Use the username and password airflow to log in.

Airflow

  • Visit http://localhost:1460 in your web browser to access the Metabase dashboard. The interface will resemble the following. You will need to sign up to use the UI.

Metabase

Data Ingestion

Once you've completed all the steps outlined in the previous section, you should now be able to view the Airflow dashboard in your web browser. Below will display as list of DAGs DAGS Below is the DAG's graph. DAG Graph To run the DAG, Click on the play button(Figure 1) Run Graph

Data Transformation

  • Navigate to the root directory of the project in the terminal and then change the directory to the terraform folder using the command cd data_dbt.
  • Generate a profiles.yml file within ${HOME}/.dbt, followed by defining a profile for this project as instructed below.
data_dbt_book_recommendation:
  outputs:
    dev:
      dataset: book_recommendation_analytics
      fixed_retries: 1
      keyfile: <location_google_auth_key>
      location: <preferred project region>
      method: service-account
      priority: interactive
      project: <preferred project id>
      threads: 6
      timeout_seconds: 300
      type: bigquery
  target: dev
  • To run all models, run dbt run -t dev
  • Navigate to your Google BigQuery project by clicking on this link. There, you'll find all the tables and views created by DBT. Big Query

Data Visualization

Please watch the provided video tutorial to configure your Metabase database connection with BigQuery.You have the flexibility to customize your dashboard according to your preferences. Additionally, this PDF linked below contains the complete screenshot of the dashboard I created.

Dashboard

Contact

Twitter: @iamraphson

Acknowledgments

I would like to extend my heartfelt gratitude to the organizers of the Data Engineering Zoomcamp for providing such a valuable course. The insights I gained have been instrumental in broadening my understanding of the field of Data Engineering. Additionally, I want to express my appreciation to my fellow colleague with whom I took the course. Thank you all for your support and collaboration throughout this journey.

🦅