Huyen-P/DE_DWH_AWS_S3_RedShift

Building ETL pipelines to migrate music JSON data and metadata files (semi-structured data) into a relational database hosted on an AWS Redshift cluster.


01 - Topic Introduction

  • A music streaming startup needs to scale up its user base and song database. To achieve this, it is looking to migrate its processes and data onto the cloud. The data currently resides in an AWS S3 bucket containing two folders: one with JSON files recording user activity within the app, and another with JSON files containing metadata for all the available songs.
  • The task is to build an ETL pipeline that extracts the data from S3, stages it in Amazon Redshift, and transforms it into a set of dimensional and fact tables for the analytics team to keep finding insights into which songs users are listening to.

02 - Dataset Overview

2.1 Data Sample - Song Data Path → s3://udacity-dend/song_data

{
  "num_songs": 1, 
  "artist_id": "ARJIE2Y1187B994AB7", 
  "artist_latitude": null, 
  "artist_longitude": null, 
  "artist_location": "", 
  "artist_name": "Line Renaud", 
  "song_id": "SOUPIRU12A6D4FA1E1",   
  "title": "Der Kleine Dompfaff", 
  "duration": 152.92036, 
  "year": 0
}

2.2 Data Sample - Log Data Path → s3://udacity-dend/log_data

{
  "artist":null,
  "auth":"LoggedIn",
  "firstName":"Walter",
  "gender":"M",
  "itemInSession":0,
  "lastName":"Frye",
  "length":null,
  "level":"free",
  "location":"San Francisco-Oakland-Hayward,CA",
  "method":"GET",
  "page":"Home",
  "registration":1540919166796.0,
  "sessionId":38,
  "song":null,
  "status":200,
  "ts":1541105830796,
  "userAgent":"\"Mozilla\/5.0 (Macintosh; Intel Mac OS X 10_9_4) AppleWebKit\/537.36 (KHTML, like Gecko) Chrome\/36.0.1985.143   Safari\/537.36\"",
  "userId":"39"
}

2.3 Log Data JSON Path → s3://udacity-dend/log_json_path.json (a JSONPaths file that tells the Redshift COPY command how to map the fields of each log record onto the staging table columns)

03 - Main Idea & Development of the Solution

3.1 - Schema Design for Song Play Analysis

  • A star schema design is chosen to optimize song play queries: the data is denormalized into one fact table surrounded by dimension tables, which keeps the model simple, limits the number of joins, and improves query performance.


Fact Table

  • songplays - records in the event data associated with song plays, i.e. records with page = NextSong. Columns: songplay_id, start_time, user_id, level, song_id, artist_id, session_id, location, user_agent

Dimension Tables

  • users - users in the app. Columns: user_id, first_name, last_name, gender, level

  • songs - songs in the music database. Columns: song_id, title, artist_id, year, duration

  • artists - artists in the music database. Columns: artist_id, name, location, latitude, longitude

  • time - timestamps of records in songplays broken down into specific units. Columns: start_time, hour, day, week, month, year, weekday
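
As a sketch of how these definitions could look as Python strings in sql_queries2.py, the fact table and one dimension might be written like the following (column names follow the listing above; the data types and the IDENTITY surrogate key are assumptions, not the project's exact DDL):

songplay_table_create = """
    CREATE TABLE IF NOT EXISTS songplays (
        songplay_id INT IDENTITY(0,1) PRIMARY KEY,  -- surrogate key generated by Redshift
        start_time  TIMESTAMP NOT NULL,
        user_id     INT NOT NULL,
        level       VARCHAR,
        song_id     VARCHAR,
        artist_id   VARCHAR,
        session_id  INT,
        location    VARCHAR,
        user_agent  VARCHAR
    );
"""

user_table_create = """
    CREATE TABLE IF NOT EXISTS users (
        user_id    INT PRIMARY KEY,
        first_name VARCHAR,
        last_name  VARCHAR,
        gender     VARCHAR,
        level      VARCHAR
    );
"""

The remaining dimension tables (songs, artists, time) follow the same pattern with the columns listed above.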

3.2 - Create Table Schema

  • Instead of reading data from the S3 bucket directly into the final database, this project uses staging tables as an intermediary between the S3 bucket and the final tables.

  • There are two staging tables, staging_events and staging_songs (sketched below). They temporarily hold the raw data from the S3 bucket before it is transformed and inserted into the primary analytics tables.
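
A minimal sketch of one staging table, mirroring the raw song-metadata JSON fields from section 2.1 (staging_events would mirror the log-event fields from section 2.2 in the same way; the types are assumptions):

staging_songs_table_create = """
    CREATE TABLE IF NOT EXISTS staging_songs (
        num_songs        INT,
        artist_id        VARCHAR,
        artist_latitude  FLOAT,
        artist_longitude FLOAT,
        artist_location  VARCHAR,
        artist_name      VARCHAR,
        song_id          VARCHAR,
        title            VARCHAR,
        duration         FLOAT,
        year             INT
    );
"""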


Steps

  1. Write a SQL CREATE statement for each of these tables in sql_queries2.py
  2. Complete the logic in create_tables.py to connect to the database and create these tables (a sketch follows this list)
  3. Write SQL DROP statements to drop tables at the beginning of create_tables.py if the tables already exist. This way, you can run create_tables.py whenever you want to reset your database and test your ETL pipeline.
  4. Launch a Redshift cluster and create an IAM role that has read access to S3.
  5. Add the Redshift database and IAM role info to dwhhuyen.cfg.
  6. Test by running create_tables.py and checking the table schemas in your Redshift database.
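
Steps 2 and 3 can be as small as the sketch below, assuming sql_queries2.py exposes drop_table_queries and create_table_queries lists (the list names are illustrative):

import configparser
import psycopg2
from sql_queries2 import create_table_queries, drop_table_queries  # assumed list names

def main():
    # Read cluster credentials from the [CLUSTER] section of dwhhuyen.cfg
    config = configparser.ConfigParser()
    config.read('dwhhuyen.cfg')
    conn = psycopg2.connect(
        "host={} dbname={} user={} password={} port={}".format(*config['CLUSTER'].values())
    )
    cur = conn.cursor()

    # Drop first so the script can be rerun to reset the database, then create
    for query in drop_table_queries + create_table_queries:
        cur.execute(query)
        conn.commit()

    conn.close()

if __name__ == "__main__":
    main()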

3.3 - Build ETL Pipeline

  1. Implement the logic in etl.py to load data from S3 into the staging tables on Redshift.
  2. Implement the logic in etl.py to load data from the staging tables into the analytics tables on Redshift (see the sketch after this list).
  3. Test by running etl.py after create_tables.py, then run the analytic queries on your Redshift database and compare your results with the expected results.
  4. Delete your Redshift cluster when finished.
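
A hedged sketch of the two etl.py phases - a COPY from S3 into a staging table, then an INSERT ... SELECT into the star schema. The statement text is illustrative (the REGION value and the song/artist join condition are assumptions), and the real statements would live in sql_queries2.py:

import configparser
import psycopg2

config = configparser.ConfigParser()
config.read('dwhhuyen.cfg')

# Phase 1: bulk-load the raw JSON from S3 into staging (REGION is an assumption).
staging_events_copy = """
    COPY staging_events FROM {}
    IAM_ROLE {}
    JSON {}
    REGION 'us-west-2';
""".format(config['S3']['LOG_DATA'], config['IAM_ROLE']['ARN'], config['S3']['LOG_JSONPATH'])

# Phase 2: transform staging rows into the songplays fact table,
# keeping only NextSong events and converting the epoch-millisecond ts column.
songplay_table_insert = """
    INSERT INTO songplays (start_time, user_id, level, song_id, artist_id,
                           session_id, location, user_agent)
    SELECT TIMESTAMP 'epoch' + e.ts / 1000 * INTERVAL '1 second',
           e.userId, e.level, s.song_id, s.artist_id,
           e.sessionId, e.location, e.userAgent
    FROM staging_events e
    LEFT JOIN staging_songs s
      ON e.song = s.title AND e.artist = s.artist_name
    WHERE e.page = 'NextSong';
"""

def main():
    conn = psycopg2.connect(
        "host={} dbname={} user={} password={} port={}".format(*config['CLUSTER'].values())
    )
    cur = conn.cursor()
    for query in (staging_events_copy, songplay_table_insert):
        cur.execute(query)
        conn.commit()
    conn.close()

if __name__ == "__main__":
    main()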

3.4 - Tools Used

  • AWS Redshift
  • AWS VPC
  • SQL
  • Python
  • Anaconda Prompt
  • Visual Studio Code

04 - Processing Steps

4.1 - Configure the AWS CLI (connect your local machine to AWS)


4.2 - Create an IAM role and attach the needed permission policies

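A boto3 sketch of this step; the role name dwhRole is hypothetical - use whatever name you later reference as DWH_IAM_ROLE_NAME:

import json
import boto3

iam = boto3.client('iam')

# Create a role that the Redshift service is allowed to assume
iam.create_role(
    RoleName='dwhRole',
    Description='Allows Redshift clusters to call AWS services on your behalf',
    AssumeRolePolicyDocument=json.dumps({
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Principal": {"Service": "redshift.amazonaws.com"},
            "Action": "sts:AssumeRole"
        }]
    })
)

# Attach read-only S3 access so COPY can read the source bucket
iam.attach_role_policy(
    RoleName='dwhRole',
    PolicyArn='arn:aws:iam::aws:policy/AmazonS3ReadOnlyAccess'
)

# The role ARN goes into the [IAM_ROLE] section of dwhhuyen.cfg
print(iam.get_role(RoleName='dwhRole')['Role']['Arn'])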

4.3 - Create the Redshift Cluster

  • Using AWS CloudShell:
aws redshift create-cluster --node-type ra3.xlplus --number-of-nodes 2 --master-username adminuser --master-user-password TopSecret1 --cluster-identifier mycluster
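
The same cluster can be created from Python with boto3 instead of the CLI; a sketch in which DBName is an assumption (it should match DB_NAME in dwhhuyen.cfg) and IamRoles takes the role ARN from step 4.2:

import boto3

redshift = boto3.client('redshift')

redshift.create_cluster(
    ClusterIdentifier='mycluster',
    ClusterType='multi-node',
    NodeType='ra3.xlplus',
    NumberOfNodes=2,
    DBName='dwh',                                  # assumption: match DB_NAME in dwhhuyen.cfg
    MasterUsername='adminuser',
    MasterUserPassword='TopSecret1',
    IamRoles=['<IAM role ARN from step 4.2>']
)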

4.4 - Authorize security group access to the cluster's default TCP port - AWS VPC configuration

VPC Review

Internet Gateway - attached to the VPC

Route Tables

Security Group

outbound rules

inbound rules
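
The inbound rule can also be added with boto3; a sketch that opens the default Redshift port 5439 (the security group ID and the wide-open CIDR are placeholders - restrict the source range outside of a demo):

import boto3

ec2 = boto3.client('ec2')

ec2.authorize_security_group_ingress(
    GroupId='<security group id attached to the cluster>',
    IpPermissions=[{
        'IpProtocol': 'tcp',
        'FromPort': 5439,          # default Redshift port
        'ToPort': 5439,
        'IpRanges': [{'CidrIp': '0.0.0.0/0', 'Description': 'demo only'}],
    }]
)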

4.5 - Set up the main dwhhuyen.cfg

[CLUSTER]
HOST=
DB_NAME=
DB_USER=
DB_PASSWORD=
DB_PORT=

[IAM_ROLE]
ARN='IAM Role arn'

[S3]
LOG_DATA='s3://udacity-dend/log_data'
LOG_JSONPATH='s3://udacity-dend/log_json_path.json'
SONG_DATA='s3://udacity-dend/song_data'

[AWS]
KEY=
SECRET=
REGION_NAME=

4.6 - Run the create_tables.py script to set up the staging and analytical tables in the database


4.7 - Run the etl.py script to extract data from the files in S3, stage it in Redshift, and finally store it in the fact and dimension tables


4.8 - Clean up resources

# Assumes boto3 clients `redshift` and `iam`, plus the identifiers used in the config.
# Delete the cluster without taking a final snapshot.
redshift.delete_cluster(ClusterIdentifier=DWH_CLUSTER_IDENTIFIER, SkipFinalClusterSnapshot=True)

# Check the cluster status until it reports "deleting" and finally disappears
# (prettyRedshiftProps is a small helper that tabulates the cluster properties).
myClusterProps = redshift.describe_clusters(ClusterIdentifier=DWH_CLUSTER_IDENTIFIER)['Clusters'][0]
prettyRedshiftProps(myClusterProps)

# Detach the S3 read-only policy and delete the IAM role.
iam.detach_role_policy(RoleName=DWH_IAM_ROLE_NAME, PolicyArn="arn:aws:iam::aws:policy/AmazonS3ReadOnlyAccess")
iam.delete_role(RoleName=DWH_IAM_ROLE_NAME)

05 - Results & Recommendations

5.1 - Results

  • Number of rows in each table

  • Sample rows from each of the loaded tables: artists, songplays, songs, staging_events, staging_songs, time, users
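
With the star schema loaded, the analytics team can answer song-play questions with simple joins. A hedged example, using the column names from section 3.1, that ranks the ten most-played songs:

import configparser
import psycopg2

config = configparser.ConfigParser()
config.read('dwhhuyen.cfg')
conn = psycopg2.connect(
    "host={} dbname={} user={} password={} port={}".format(*config['CLUSTER'].values())
)
cur = conn.cursor()

# Join the fact table to its song and artist dimensions
cur.execute("""
    SELECT s.title, a.name AS artist, COUNT(*) AS plays
    FROM songplays sp
    JOIN songs   s ON sp.song_id   = s.song_id
    JOIN artists a ON sp.artist_id = a.artist_id
    GROUP BY s.title, a.name
    ORDER BY plays DESC
    LIMIT 10;
""")
for title, artist, plays in cur.fetchall():
    print(f"{plays:6d}  {title} - {artist}")
conn.close()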

5.2 - Recommendations

  • Try a snowflake schema design (further normalizing the dimension tables to reduce data redundancy and storage, at the cost of extra joins).

  • Try building the pipeline with AWS Glue.
