01 - Topic Introduction
- A music streaming startup needs to scale up its user base and song database. To achieve this, it is migrating its processes and data onto the cloud. The data currently resides in an AWS S3 bucket containing two folders: one with JSON files recording user activity within the app, and another with JSON files containing metadata for all available songs.
- The task is to build an ETL pipeline that extracts the data from S3, stages it in Amazon Redshift, and transforms it into a set of dimensional and fact tables so the analytics team can keep finding insights into what songs their users are listening to.
02 - Dataset Overview
2.1 Data Sample - Song Data Path → s3://udacity-dend/song_data
{
"num_songs": 1,
"artist_id": "ARJIE2Y1187B994AB7",
"artist_latitude": null,
"artist_longitude": null,
"artist_location": "",
"artist_name": "Line Renaud",
"song_id": "SOUPIRU12A6D4FA1E1",
"title": "Der Kleine Dompfaff",
"duration": 152.92036,
"year": 0
}
2.2 Data Sample - Log Data Path → s3://udacity-dend/log_data
{
"artist":null,
"auth":"LoggedIn",
"firstName":"Walter",
"gender":"M",
"itemInSession":0,
"lastName":"Frye",
"length":null,
"level":"free",
"location":"San Francisco-Oakland-Hayward,CA",
"method":"GET",
"page":"Home",
"registration":1540919166796.0,
"sessionId":38,
"song":null,
"status":200,
"ts":1541105830796,
"userAgent":"\"Mozilla\/5.0 (Macintosh; Intel Mac OS X 10_9_4) AppleWebKit\/537.36 (KHTML, like Gecko) Chrome\/36.0.1985.143 Safari\/537.36\"",
"userId":"39"
}
2.3 Log Data JSON Path → s3://udacity-dend/log_json_path.json
03 - Main Idea - Development of the Solution
3.1 - Schema Design for Song Play Analysis
- A star schema design was chosen to optimize song play queries: it denormalizes the data into one fact table and a few dimension tables, which keeps queries simple, minimizes joins, and improves query performance.
Fact Table
- songplays - records in event data associated with song plays, i.e. records with page NextSong: songplay_id, start_time, user_id, level, song_id, artist_id, session_id, location, user_agent
Dimension Tables
- users - users in the app: user_id, first_name, last_name, gender, level
- songs - songs in the music database: song_id, title, artist_id, year, duration
- artists - artists in the music database: artist_id, name, location, latitude, longitude
- time - timestamps of records in songplays broken down into specific units: start_time, hour, day, week, month, year, weekday
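For illustration, here is a minimal sketch of how the fact table and one dimension table could be declared in sql_queries2.py; the column types and the IDENTITY surrogate key are assumptions, not the project's final definitions.

songplay_table_create = ("""
CREATE TABLE IF NOT EXISTS songplays (
    songplay_id INT IDENTITY(0,1) PRIMARY KEY,
    start_time  TIMESTAMP NOT NULL,
    user_id     INT NOT NULL,
    level       VARCHAR,
    song_id     VARCHAR,
    artist_id   VARCHAR,
    session_id  INT,
    location    VARCHAR,
    user_agent  VARCHAR
);
""")

user_table_create = ("""
CREATE TABLE IF NOT EXISTS users (
    user_id    INT PRIMARY KEY,
    first_name VARCHAR,
    last_name  VARCHAR,
    gender     VARCHAR,
    level      VARCHAR
);
""")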
3.2 - Create Table Schema
- Instead of reading data directly from the S3 bucket into the final database, this project uses staging tables as an intermediary between the S3 bucket and the final database.
- There are two staging tables, staging_events and staging_songs. They temporarily hold data from the S3 bucket before it is transformed and inserted into the final analytics tables. A sketch of their definitions follows.
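A possible shape for the two staging tables, mirroring the raw JSON fields shown in section 02; the column types are assumptions and can be loosened to whatever the COPY loads cleanly.

staging_events_table_create = ("""
CREATE TABLE IF NOT EXISTS staging_events (
    artist        VARCHAR,
    auth          VARCHAR,
    firstName     VARCHAR,
    gender        VARCHAR,
    itemInSession INT,
    lastName      VARCHAR,
    length        FLOAT,
    level         VARCHAR,
    location      VARCHAR,
    method        VARCHAR,
    page          VARCHAR,
    registration  FLOAT,
    sessionId     INT,
    song          VARCHAR,
    status        INT,
    ts            BIGINT,
    userAgent     VARCHAR,
    userId        INT
);
""")

staging_songs_table_create = ("""
CREATE TABLE IF NOT EXISTS staging_songs (
    num_songs        INT,
    artist_id        VARCHAR,
    artist_latitude  FLOAT,
    artist_longitude FLOAT,
    artist_location  VARCHAR,
    artist_name      VARCHAR,
    song_id          VARCHAR,
    title            VARCHAR,
    duration         FLOAT,
    year             INT
);
""")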
Steps
- Write a SQL CREATE statement for each of these tables in sql_queries2.py.
- Complete the logic in create_tables.py to connect to the database and create these tables (a sketch follows these steps).
- Write SQL DROP statements to drop the tables at the beginning of create_tables.py if they already exist. This way, you can run create_tables.py whenever you want to reset your database and test your ETL pipeline.
- Launch a Redshift cluster and create an IAM role that has read access to S3.
- Add the Redshift database and IAM role info to dwhhuyen.cfg.
- Test by running create_tables.py and checking the table schemas in your Redshift database.
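A minimal sketch of what the create_tables.py logic could look like, assuming sql_queries2.py exposes drop_table_queries and create_table_queries lists (those names are an assumption based on the steps above).

import configparser
import psycopg2
from sql_queries2 import create_table_queries, drop_table_queries

def drop_tables(cur, conn):
    # Drop every table first so the script can be rerun to reset the database.
    for query in drop_table_queries:
        cur.execute(query)
        conn.commit()

def create_tables(cur, conn):
    # Create the staging, fact, and dimension tables.
    for query in create_table_queries:
        cur.execute(query)
        conn.commit()

def main():
    config = configparser.ConfigParser()
    config.read('dwhhuyen.cfg')

    # Relies on the key order HOST, DB_NAME, DB_USER, DB_PASSWORD, DB_PORT in [CLUSTER].
    conn = psycopg2.connect("host={} dbname={} user={} password={} port={}".format(*config['CLUSTER'].values()))
    cur = conn.cursor()

    drop_tables(cur, conn)
    create_tables(cur, conn)
    conn.close()

if __name__ == "__main__":
    main()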
3.3 - Build ETL Pipeline
- Implement the logic in etl.py to load data from S3 into the staging tables on Redshift.
- Implement the logic in etl.py to load data from the staging tables into the analytics tables on Redshift (see the sketch after these steps).
- Test by running etl.py after create_tables.py, then run the analytic queries on your Redshift database and compare the results with the expected results.
- Delete your Redshift cluster when finished.
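The loading logic referenced above follows one pattern: COPY the raw JSON from S3 into the staging tables, then INSERT ... SELECT from staging into the analytics tables. The snippet below is only a sketch of that pattern; the query-list names, the region, and the join details are assumptions.

import configparser

config = configparser.ConfigParser()
config.read('dwhhuyen.cfg')

# COPY the event logs into staging_events, using the JSONPaths file to map the fields.
staging_events_copy = ("""
COPY staging_events FROM {}
IAM_ROLE {}
REGION 'us-west-2'
FORMAT AS JSON {};
""").format(config['S3']['LOG_DATA'], config['IAM_ROLE']['ARN'], config['S3']['LOG_JSONPATH'])

# COPY the song metadata into staging_songs; 'auto' maps JSON keys to column names.
staging_songs_copy = ("""
COPY staging_songs FROM {}
IAM_ROLE {}
REGION 'us-west-2'
FORMAT AS JSON 'auto';
""").format(config['S3']['SONG_DATA'], config['IAM_ROLE']['ARN'])

# Build the fact table from both staging tables, keeping only NextSong events.
songplay_table_insert = ("""
INSERT INTO songplays (start_time, user_id, level, song_id, artist_id, session_id, location, user_agent)
SELECT TIMESTAMP 'epoch' + e.ts / 1000 * INTERVAL '1 second',
       e.userId, e.level, s.song_id, s.artist_id, e.sessionId, e.location, e.userAgent
FROM staging_events e
LEFT JOIN staging_songs s
  ON e.song = s.title AND e.artist = s.artist_name
WHERE e.page = 'NextSong';
""")

copy_table_queries = [staging_events_copy, staging_songs_copy]
insert_table_queries = [songplay_table_insert]  # plus the dimension-table inserts

def load_staging_tables(cur, conn):
    for query in copy_table_queries:
        cur.execute(query)
        conn.commit()

def insert_tables(cur, conn):
    for query in insert_table_queries:
        cur.execute(query)
        conn.commit()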
3.4 - Tool Use
- AWS Redshift
- AWS VPC
- SQL
- Python
- Anaconda Prompt
- Visual Studio Code
04 - Processing Steps
4.3 - Create AWS Cluster
- Using Cloud Shell
aws redshift create-cluster --node-type ra3.xplus --number-of-nodes 2 --master-username adminuser --master-user-password TopSecret1 --cluster-identifier mycluster
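After the create-cluster call returns, the cluster takes a few minutes to become available. Below is a small boto3 sketch to wait for it and read its endpoint (which becomes HOST in dwhhuyen.cfg); the region and cluster identifier are assumptions matching the CLI call above.

import boto3

# Assumes AWS credentials and the same region as the CLI call are configured.
redshift = boto3.client('redshift', region_name='us-west-2')

# Block until the cluster reaches the 'available' state.
redshift.get_waiter('cluster_available').wait(ClusterIdentifier='mycluster')

# Read the endpoint address: this value goes into HOST in dwhhuyen.cfg.
props = redshift.describe_clusters(ClusterIdentifier='mycluster')['Clusters'][0]
print(props['ClusterStatus'], props['Endpoint']['Address'])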
4.4 - Authorize Inbound TCP Access on the Cluster's Security Group - AWS VPC Configuration
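This step opens the Redshift port on the cluster's VPC security group so that create_tables.py and etl.py can connect from outside the VPC. A minimal boto3 sketch; the security group ID, CIDR range, and port 5439 (the Redshift default) are placeholders to adapt.

import boto3

ec2 = boto3.client('ec2', region_name='us-west-2')

# Allow inbound TCP traffic on the Redshift port from the given CIDR range.
ec2.authorize_security_group_ingress(
    GroupId='sg-0123456789abcdef0',  # the cluster's VPC security group (placeholder)
    IpProtocol='tcp',
    FromPort=5439,
    ToPort=5439,
    CidrIp='0.0.0.0/0',              # convenient for a short-lived test cluster; restrict otherwise
)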
4.5 - Set up the main dwhhuyen.cfg
[CLUSTER]
HOST=
DB_NAME=
DB_USER=
DB_PASSWORD=
DB_PORT=
[IAM_ROLE]
ARN='IAM Role arn'
[S3]
LOG_DATA='s3://udacity-dend/log_data'
LOG_JSONPATH='s3://udacity-dend/log_json_path.json'
SONG_DATA='s3://udacity-dend/song_data'
[AWS]
KEY=
SECRET=
REGION_NAME=
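create_tables.py and etl.py read this file with Python's configparser; a minimal sketch of how the values are pulled in (note that the quoted values in [IAM_ROLE] and [S3] keep their quotes, which is convenient when formatting them into the Redshift COPY statements).

import configparser

config = configparser.ConfigParser()
config.read('dwhhuyen.cfg')

HOST     = config.get('CLUSTER', 'HOST')
DB_NAME  = config.get('CLUSTER', 'DB_NAME')
ARN      = config.get('IAM_ROLE', 'ARN')
LOG_DATA = config.get('S3', 'LOG_DATA')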
4.7 - Run the etl.py script to extract data from the files in S3, stage it in Redshift, and finally store it in the fact and dimension tables.
4.8 - Clean up resources
# The redshift/iam clients and DWH_* identifiers come from the cluster setup script.
redshift.delete_cluster(ClusterIdentifier=DWH_CLUSTER_IDENTIFIER, SkipFinalClusterSnapshot=True)
# Check deletion progress; ClusterStatus shows 'deleting' until the cluster is gone.
myClusterProps = redshift.describe_clusters(ClusterIdentifier=DWH_CLUSTER_IDENTIFIER)['Clusters'][0]
prettyRedshiftProps(myClusterProps)
# Detach the S3 read-only policy from the IAM role, then delete the role.
iam.detach_role_policy(RoleName=DWH_IAM_ROLE_NAME, PolicyArn="arn:aws:iam::aws:policy/AmazonS3ReadOnlyAccess")
iam.delete_role(RoleName=DWH_IAM_ROLE_NAME)
05 - Results & Recommendations
5.1 - Results
- Number of rows loaded into each staging, fact, and dimension table (see the check sketch below).
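The row counts can be produced with a small check script against the cluster; a sketch assuming the connection details come from dwhhuyen.cfg as above.

import configparser
import psycopg2

config = configparser.ConfigParser()
config.read('dwhhuyen.cfg')

# Relies on the key order HOST, DB_NAME, DB_USER, DB_PASSWORD, DB_PORT in [CLUSTER].
conn = psycopg2.connect("host={} dbname={} user={} password={} port={}".format(*config['CLUSTER'].values()))
cur = conn.cursor()

for table in ['staging_events', 'staging_songs', 'songplays', 'users', 'songs', 'artists', 'time']:
    cur.execute("SELECT COUNT(*) FROM {};".format(table))
    print(table, cur.fetchone()[0])

conn.close()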
5.2 - Recommendations
- Try a snowflake schema design (normalizes the dimension tables to reduce data redundancy, at the cost of extra joins in queries).
- Try building the pipeline with AWS Glue.