01 - Topic Introduction
- A music streaming startup needs to scale up its user base and song database. To achieve this, it is migrating its processes and data onto the cloud. The data currently resides in an AWS S3 bucket containing two folders: one with JSON files recording user activity within the app, and another with JSON files containing metadata for all available songs.
- The task is to build an ETL pipeline that extracts the data from S3, stages it in Amazon Redshift, and transforms it into a set of dimensional and fact tables so the analytics team can keep finding insights into what songs their users are listening to.
02 - Dataset Overview
2.1 Data Sample - Song Data Path → s3://udacity-dend/song_data
{
"num_songs": 1,
"artist_id": "ARJIE2Y1187B994AB7",
"artist_latitude": null,
"artist_longitude": null,
"artist_location": "",
"artist_name": "Line Renaud",
"song_id": "SOUPIRU12A6D4FA1E1",
"title": "Der Kleine Dompfaff",
"duration": 152.92036,
"year": 0
}
2.2 Data Sample - Log Data Path → s3://udacity-dend/log_data
{
"artist":null,
"auth":"LoggedIn",
"firstName":"Walter",
"gender":"M",
"itemInSession":0,
"lastName":"Frye",
"length":null,
"level":"free",
"location":"San Francisco-Oakland-Hayward,CA",
"method":"GET",
"page":"Home",
"registration":1540919166796.0,
"sessionId":38,
"song":null,
"status":200,
"ts":1541105830796,
"userAgent":"\"Mozilla\/5.0 (Macintosh; Intel Mac OS X 10_9_4) AppleWebKit\/537.36 (KHTML, like Gecko) Chrome\/36.0.1985.143 Safari\/537.36\"",
"userId":"39"
}
2.3 Log Data JSON Path → s3://udacity-dend/log_json_path.json
03 - Main Idea - Development of the Solution
3.1 - Schema Design for Song Play Analysis
- A star schema design was chosen to optimize song play queries: it denormalizes the data into one fact table and a few dimension tables, which keeps queries simple, minimizes joins, and improves query performance.
Fact Table
- songplays - records in event data associated with song plays, i.e. records with page NextSong: songplay_id, start_time, user_id, level, song_id, artist_id, session_id, location, user_agent
Dimension Tables
- users - users in the app: user_id, first_name, last_name, gender, level
- songs - songs in the music database: song_id, title, artist_id, year, duration
- artists - artists in the music database: artist_id, name, location, latitude, longitude
- time - timestamps of records in songplays broken down into specific units: start_time, hour, day, week, month, year, weekday
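For illustration, here is a minimal sketch of how the fact table and one dimension table could be declared in sql_queries2.py; the column types and the IDENTITY surrogate key are assumptions, not the project's final definitions.

songplay_table_create = ("""
CREATE TABLE IF NOT EXISTS songplays (
    songplay_id INT IDENTITY(0,1) PRIMARY KEY,
    start_time  TIMESTAMP NOT NULL,
    user_id     INT NOT NULL,
    level       VARCHAR,
    song_id     VARCHAR,
    artist_id   VARCHAR,
    session_id  INT,
    location    VARCHAR,
    user_agent  VARCHAR
);
""")

user_table_create = ("""
CREATE TABLE IF NOT EXISTS users (
    user_id    INT PRIMARY KEY,
    first_name VARCHAR,
    last_name  VARCHAR,
    gender     VARCHAR,
    level      VARCHAR
);
""")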
3.2 - Create Table Schema
- Instead of reading data directly from the S3 bucket into the final database, this project uses staging tables as an intermediary between the S3 bucket and the final database.
- There are two staging tables, staging_events and staging_songs. They temporarily hold data from the S3 bucket before it is transformed and inserted into the final analytics tables. A sketch of their definitions follows.
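A possible shape for the two staging tables, mirroring the raw JSON fields shown in section 02; the column types are assumptions and can be loosened to whatever the COPY loads cleanly.

staging_events_table_create = ("""
CREATE TABLE IF NOT EXISTS staging_events (
    artist        VARCHAR,
    auth          VARCHAR,
    firstName     VARCHAR,
    gender        VARCHAR,
    itemInSession INT,
    lastName      VARCHAR,
    length        FLOAT,
    level         VARCHAR,
    location      VARCHAR,
    method        VARCHAR,
    page          VARCHAR,
    registration  FLOAT,
    sessionId     INT,
    song          VARCHAR,
    status        INT,
    ts            BIGINT,
    userAgent     VARCHAR,
    userId        INT
);
""")

staging_songs_table_create = ("""
CREATE TABLE IF NOT EXISTS staging_songs (
    num_songs        INT,
    artist_id        VARCHAR,
    artist_latitude  FLOAT,
    artist_longitude FLOAT,
    artist_location  VARCHAR,
    artist_name      VARCHAR,
    song_id          VARCHAR,
    title            VARCHAR,
    duration         FLOAT,
    year             INT
);
""")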
Steps
- Write a SQL CREATE statement for each of these tables in sql_queries2.py.
- Complete the logic in create_tables.py to connect to the database and create these tables (a sketch follows these steps).
- Write SQL DROP statements to drop the tables at the beginning of create_tables.py if they already exist. This way, you can run create_tables.py whenever you want to reset your database and test your ETL pipeline.
- Launch a Redshift cluster and create an IAM role that has read access to S3.
- Add the Redshift database and IAM role info to dwhhuyen.cfg.
- Test by running create_tables.py and checking the table schemas in your Redshift database.
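A minimal sketch of what the create_tables.py logic could look like, assuming sql_queries2.py exposes drop_table_queries and create_table_queries lists (those names are an assumption based on the steps above).

import configparser
import psycopg2
from sql_queries2 import create_table_queries, drop_table_queries

def drop_tables(cur, conn):
    # Drop every table first so the script can be rerun to reset the database.
    for query in drop_table_queries:
        cur.execute(query)
        conn.commit()

def create_tables(cur, conn):
    # Create the staging, fact, and dimension tables.
    for query in create_table_queries:
        cur.execute(query)
        conn.commit()

def main():
    config = configparser.ConfigParser()
    config.read('dwhhuyen.cfg')

    # Relies on the key order HOST, DB_NAME, DB_USER, DB_PASSWORD, DB_PORT in [CLUSTER].
    conn = psycopg2.connect("host={} dbname={} user={} password={} port={}".format(*config['CLUSTER'].values()))
    cur = conn.cursor()

    drop_tables(cur, conn)
    create_tables(cur, conn)
    conn.close()

if __name__ == "__main__":
    main()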
3.3 - Build ETL Pipeline
- Implement the logic in etl.py to load data from S3 into the staging tables on Redshift.
- Implement the logic in etl.py to load data from the staging tables into the analytics tables on Redshift (see the sketch after these steps).
- Test by running etl.py after create_tables.py, then run the analytic queries on your Redshift database and compare the results with the expected results.
- Delete your Redshift cluster when finished.
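The loading logic referenced above follows one pattern: COPY the raw JSON from S3 into the staging tables, then INSERT ... SELECT from staging into the analytics tables. The snippet below is only a sketch of that pattern; the query-list names, the region, and the join details are assumptions.

import configparser

config = configparser.ConfigParser()
config.read('dwhhuyen.cfg')

# COPY the event logs into staging_events, using the JSONPaths file to map the fields.
staging_events_copy = ("""
COPY staging_events FROM {}
IAM_ROLE {}
REGION 'us-west-2'
FORMAT AS JSON {};
""").format(config['S3']['LOG_DATA'], config['IAM_ROLE']['ARN'], config['S3']['LOG_JSONPATH'])

# COPY the song metadata into staging_songs; 'auto' maps JSON keys to column names.
staging_songs_copy = ("""
COPY staging_songs FROM {}
IAM_ROLE {}
REGION 'us-west-2'
FORMAT AS JSON 'auto';
""").format(config['S3']['SONG_DATA'], config['IAM_ROLE']['ARN'])

# Build the fact table from both staging tables, keeping only NextSong events.
songplay_table_insert = ("""
INSERT INTO songplays (start_time, user_id, level, song_id, artist_id, session_id, location, user_agent)
SELECT TIMESTAMP 'epoch' + e.ts / 1000 * INTERVAL '1 second',
       e.userId, e.level, s.song_id, s.artist_id, e.sessionId, e.location, e.userAgent
FROM staging_events e
LEFT JOIN staging_songs s
  ON e.song = s.title AND e.artist = s.artist_name
WHERE e.page = 'NextSong';
""")

copy_table_queries = [staging_events_copy, staging_songs_copy]
insert_table_queries = [songplay_table_insert]  # plus the dimension-table inserts

def load_staging_tables(cur, conn):
    for query in copy_table_queries:
        cur.execute(query)
        conn.commit()

def insert_tables(cur, conn):
    for query in insert_table_queries:
        cur.execute(query)
        conn.commit()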
3.4 - Tool Use
- AWS Redshift
- AWS VPC
- SQL
- Python
- Anaconda Prompt
- Visual Studio Code
04 - Processing Steps
4.3 - Create AWS Cluster
- Using Cloud Shell
aws redshift create-cluster --node-type ra3.xplus --number-of-nodes 2 --master-username adminuser --master-user-password TopSecret1 --cluster-identifier mycluster
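After the create-cluster call returns, the cluster takes a few minutes to become available. Below is a small boto3 sketch to wait for it and read its endpoint (which becomes HOST in dwhhuyen.cfg); the region and cluster identifier are assumptions matching the CLI call above.

import boto3

# Assumes AWS credentials and the same region as the CLI call are configured.
redshift = boto3.client('redshift', region_name='us-west-2')

# Block until the cluster reaches the 'available' state.
redshift.get_waiter('cluster_available').wait(ClusterIdentifier='mycluster')

# Read the endpoint address: this value goes into HOST in dwhhuyen.cfg.
props = redshift.describe_clusters(ClusterIdentifier='mycluster')['Clusters'][0]
print(props['ClusterStatus'], props['Endpoint']['Address'])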
4.4 - Authorize Inbound TCP Access on the Cluster's Security Group - AWS VPC Configuration
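This step opens the Redshift port on the cluster's VPC security group so that create_tables.py and etl.py can connect from outside the VPC. A minimal boto3 sketch; the security group ID, CIDR range, and port 5439 (the Redshift default) are placeholders to adapt.

import boto3

ec2 = boto3.client('ec2', region_name='us-west-2')

# Allow inbound TCP traffic on the Redshift port from the given CIDR range.
ec2.authorize_security_group_ingress(
    GroupId='sg-0123456789abcdef0',  # the cluster's VPC security group (placeholder)
    IpProtocol='tcp',
    FromPort=5439,
    ToPort=5439,
    CidrIp='0.0.0.0/0',              # convenient for a short-lived test cluster; restrict otherwise
)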
4.5 - Set up the main dwhhuyen.cfg
[CLUSTER]
HOST=
DB_NAME=
DB_USER=
DB_PASSWORD=
DB_PORT=
[IAM_ROLE]
ARN='IAM Role arn'
[S3]
LOG_DATA='s3://udacity-dend/log_data'
LOG_JSONPATH='s3://udacity-dend/log_json_path.json'
SONG_DATA='s3://udacity-dend/song_data'
[AWS]
KEY=
SECRET=
REGION_NAME=
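create_tables.py and etl.py read this file with Python's configparser; a minimal sketch of how the values are pulled in (note that the quoted values in [IAM_ROLE] and [S3] keep their quotes, which is convenient when formatting them into the Redshift COPY statements).

import configparser

config = configparser.ConfigParser()
config.read('dwhhuyen.cfg')

HOST     = config.get('CLUSTER', 'HOST')
DB_NAME  = config.get('CLUSTER', 'DB_NAME')
ARN      = config.get('IAM_ROLE', 'ARN')
LOG_DATA = config.get('S3', 'LOG_DATA')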
4.7 - Run the etl.py script to extract data from the files in S3, stage it in Redshift, and finally store it in the fact and dimension tables.
4.8 - Clean up resources
# The redshift/iam clients and DWH_* identifiers come from the cluster setup script.
redshift.delete_cluster(ClusterIdentifier=DWH_CLUSTER_IDENTIFIER, SkipFinalClusterSnapshot=True)
# Check deletion progress; ClusterStatus shows 'deleting' until the cluster is gone.
myClusterProps = redshift.describe_clusters(ClusterIdentifier=DWH_CLUSTER_IDENTIFIER)['Clusters'][0]
prettyRedshiftProps(myClusterProps)
# Detach the S3 read-only policy from the IAM role, then delete the role.
iam.detach_role_policy(RoleName=DWH_IAM_ROLE_NAME, PolicyArn="arn:aws:iam::aws:policy/AmazonS3ReadOnlyAccess")
iam.delete_role(RoleName=DWH_IAM_ROLE_NAME)
05 - Results & Recommendations
5.1 - Results
- Number of rows loaded into each staging, fact, and dimension table (see the check sketch below).
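The row counts can be produced with a small check script against the cluster; a sketch assuming the connection details come from dwhhuyen.cfg as above.

import configparser
import psycopg2

config = configparser.ConfigParser()
config.read('dwhhuyen.cfg')

# Relies on the key order HOST, DB_NAME, DB_USER, DB_PASSWORD, DB_PORT in [CLUSTER].
conn = psycopg2.connect("host={} dbname={} user={} password={} port={}".format(*config['CLUSTER'].values()))
cur = conn.cursor()

for table in ['staging_events', 'staging_songs', 'songplays', 'users', 'songs', 'artists', 'time']:
    cur.execute("SELECT COUNT(*) FROM {};".format(table))
    print(table, cur.fetchone()[0])

conn.close()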
5.2 - Recommendations
- Try a snowflake schema design (normalizes the dimension tables to reduce data redundancy, at the cost of extra joins in queries).
- Try building the pipeline with AWS Glue.