This project transforms raw song and user data and loads them into a Redshift cluster for later analysis. It also serves as the Data Warehouse project under the Data Engineer Nanodegree Program.
You will need the following to run the project:

- Python3
- Python virtual environment (aka `venv`)
- AWS credentials/config files under the `~/.aws` directory (see more: https://boto3.amazonaws.com/v1/documentation/api/latest/guide/credentials.html)
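For reference, the shared credentials file described at the link above follows this layout (the values are placeholders, not real keys):

```
[default]
aws_access_key_id = <your_access_key_id>
aws_secret_access_key = <your_secret_access_key>
```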
- Bootstrap the virtual environment with dependencies:

```
$ python3 -m venv ./venv
$ source ./venv/bin/activate
$ pip install -r requirements.txt
```
- Copy the config template `template.dwh.cfg` to `dwh.cfg`:

```
$ cp ./template.dwh.cfg ./dwh.cfg
```
- Fill in the `CLUSTER` and `MANIFEST` sections of `dwh.cfg`.
- The `CLUSTER` section is used to construct the Redshift cluster from scratch, so we are free to choose our own values. Here are possible values:

```
[CLUSTER]
DB_NAME=dwh
DB_USER=dwhuser
DB_PASSWORD=<choose_whatever_you_want>
DB_PORT=5439
CLUSTER_TYPE=multi-node
NUM_NODES=4
NODE_TYPE=dc2.large
CLUSTER_IDENTIFIER=dwhCluster
IAM_ROLE_NAME=dwhRole
```
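The project scripts can pick these values up with Python's standard `configparser`; here is a minimal sketch (the variable names are illustrative, not necessarily those used by the actual scripts):

```
import configparser

# Parse dwh.cfg from the project root.
config = configparser.ConfigParser()
config.read("dwh.cfg")

# Values from the CLUSTER section drive cluster creation.
cluster_id = config.get("CLUSTER", "CLUSTER_IDENTIFIER")
num_nodes = config.getint("CLUSTER", "NUM_NODES")
node_type = config.get("CLUSTER", "NODE_TYPE")
```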
- The `MANIFEST` section refers to another S3 bucket that stores the Redshift manifest files we will create later. Here are possible values:

```
[MANIFEST]
BUCKET_NAME=sample-bucket-for-udacity-data-warehouse-project
EVENT_DATA_KEY=sample-path/sample-log-data-manifest.json
SONG_DATA_KEY=sample-path/sample-song-data-manifest.json
```
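For context, a Redshift manifest is a small JSON document listing the S3 objects a `COPY` command should load; the bucket and key names below are placeholders:

```
{
  "entries": [
    {"url": "s3://<source-bucket>/song_data/part-0000.json", "mandatory": true},
    {"url": "s3://<source-bucket>/song_data/part-0001.json", "mandatory": true}
  ]
}
```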
- Prepare manifest files:

```
$ python prepare_manifest.py
```
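The gist of this step, sketched with `boto3` (this is an outline, not the actual script; the `udacity-dend` source bucket and `song_data` prefix are assumptions about where the raw data lives):

```
import configparser
import json

import boto3

config = configparser.ConfigParser()
config.read("dwh.cfg")

s3 = boto3.client("s3")

# Walk the raw song data and wrap every object key in a manifest entry.
# "udacity-dend"/"song_data" are assumed; adjust to your source location.
entries = []
paginator = s3.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket="udacity-dend", Prefix="song_data"):
    for obj in page.get("Contents", []):
        entries.append({"url": f"s3://udacity-dend/{obj['Key']}", "mandatory": True})

# Upload the manifest to the bucket/key configured in the MANIFEST section.
s3.put_object(
    Bucket=config.get("MANIFEST", "BUCKET_NAME"),
    Key=config.get("MANIFEST", "SONG_DATA_KEY"),
    Body=json.dumps({"entries": entries}),
)
```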
- Spin up the Redshift cluster:

```
$ python spin_dwh_up.py
```
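Under the hood, spinning up a cluster boils down to a `boto3` call along these lines; the region is an assumption, and the real script additionally has to create the IAM role named by `IAM_ROLE_NAME` and attach its ARN:

```
import configparser

import boto3

config = configparser.ConfigParser()
config.read("dwh.cfg")

redshift = boto3.client("redshift", region_name="us-west-2")  # region is assumed

redshift.create_cluster(
    ClusterIdentifier=config.get("CLUSTER", "CLUSTER_IDENTIFIER"),
    ClusterType=config.get("CLUSTER", "CLUSTER_TYPE"),
    NodeType=config.get("CLUSTER", "NODE_TYPE"),
    NumberOfNodes=config.getint("CLUSTER", "NUM_NODES"),
    DBName=config.get("CLUSTER", "DB_NAME"),
    MasterUsername=config.get("CLUSTER", "DB_USER"),
    MasterUserPassword=config.get("CLUSTER", "DB_PASSWORD"),
    Port=config.getint("CLUSTER", "DB_PORT"),
    # IamRoles=[...] would carry the ARN of the role named by IAM_ROLE_NAME,
    # created separately and granted read access to S3.
)

# Block until the cluster reports "available".
redshift.get_waiter("cluster_available").wait(
    ClusterIdentifier=config.get("CLUSTER", "CLUSTER_IDENTIFIER")
)
```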
- Create tables and run the ETL:

```
$ python create_tables.py
$ python etl.py
```
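The core of the ETL is Redshift's `COPY` command reading the manifests from S3. A hedged sketch with `psycopg2` follows; the `staging_songs` table, the cluster endpoint, and the role ARN are illustrative placeholders, not names taken from the actual scripts:

```
import configparser

import psycopg2

config = configparser.ConfigParser()
config.read("dwh.cfg")

# The host is the cluster endpoint AWS reports after spin-up.
conn = psycopg2.connect(
    host="<cluster_endpoint>",
    dbname=config.get("CLUSTER", "DB_NAME"),
    user=config.get("CLUSTER", "DB_USER"),
    password=config.get("CLUSTER", "DB_PASSWORD"),
    port=config.get("CLUSTER", "DB_PORT"),
)

# Load everything listed in the song-data manifest into a staging table.
bucket = config.get("MANIFEST", "BUCKET_NAME")
key = config.get("MANIFEST", "SONG_DATA_KEY")
copy_sql = f"""
    COPY staging_songs
    FROM 's3://{bucket}/{key}'
    IAM_ROLE '<iam_role_arn>'
    JSON 'auto'
    MANIFEST;
"""

with conn, conn.cursor() as cur:
    cur.execute(copy_sql)
conn.close()
```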
- When finished using the Redshift cluster, tear it down:

```
$ python tear_dwh_down.py
```
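Teardown amounts to deleting the cluster, along the lines of this sketch (again, the region is an assumption; skipping the final snapshot makes the deletion irreversible):

```
import configparser

import boto3

config = configparser.ConfigParser()
config.read("dwh.cfg")

redshift = boto3.client("redshift", region_name="us-west-2")  # region is assumed

# Delete without a final snapshot; export any query results you still
# need before running this, since the data cannot be recovered afterwards.
redshift.delete_cluster(
    ClusterIdentifier=config.get("CLUSTER", "CLUSTER_IDENTIFIER"),
    SkipFinalClusterSnapshot=True,
)
```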