This project leverages AWS Cloud Services to build an ETL pipeline that transforms YouTube video statistics data. The data is downloaded from Kaggle, uploaded to an S3 bucket, and cataloged with AWS Glue for querying in Amazon Athena. AWS Lambda converts the raw data to Parquet format and stores it in a cleansed S3 bucket. Amazon QuickSight then visualizes the materialized data, providing insights into YouTube video performance.
Overview • Tools • Architecture • Dashboard • Screenshots • Support • License
This project uses AWS Cloud Services to build an efficient ETL pipeline for processing YouTube video statistics data. The dataset, available here, is downloaded from Kaggle and uploaded to an S3 bucket. AWS Glue catalogs the data, enabling seamless querying with Amazon Athena. The pipeline processes both JSON and CSV data, converting each to Parquet format: JSON data is transformed by AWS Lambda functions with an AWS Data Wrangler layer, while CSV data is processed through visual ETL jobs in AWS Glue.
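For a concrete picture of the JSON path, here is a minimal sketch of what a handler like `scripts/lambda_function.py` can look like with AWS Data Wrangler. The environment variable names, Glue database, and table names are illustrative assumptions, not the project's exact values:

```python
import os
import urllib.parse

import awswrangler as wr
import pandas as pd

def lambda_handler(event, context):
    # Bucket and key of the raw JSON file that triggered this invocation
    bucket = event['Records'][0]['s3']['bucket']['name']
    key = urllib.parse.unquote_plus(event['Records'][0]['s3']['object']['key'])

    # Read the raw JSON reference file and flatten its nested "items" array
    df_raw = wr.s3.read_json(f's3://{bucket}/{key}')
    df_items = pd.json_normalize(df_raw['items'].tolist())

    # Write Parquet to the cleansed bucket and register the table in the
    # Glue Data Catalog (environment variable names are illustrative)
    return wr.s3.to_parquet(
        df=df_items,
        path=os.environ['S3_CLEANSED_PATH'],
        dataset=True,
        database=os.environ['GLUE_DATABASE'],
        table=os.environ['GLUE_TABLE'],
        mode='append',
    )
```

Writing with `dataset=True` keeps the Glue catalog in sync with each Parquet write, which is what makes the cleansed table immediately queryable from Athena.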
Data is first stored in a raw S3 bucket, then cleaned and organized in a cleansed bucket, and finally joined and stored in an analytics (materialized) bucket. Automated ETL jobs run daily via AWS Glue workflows, keeping the processed data up to date. A simple QuickSight dashboard visualizes the cleansed data, providing insights into YouTube video performance across regions. This setup yields a scalable, efficient data-processing workflow that supports detailed analysis and reporting.
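The join step can be pictured as a small Glue Spark job along the lines of `scripts/etl_pipeline_materialised_view.py`; this is only a sketch, and the catalog database, table names, join key, and output path below are assumed for illustration:

```python
import sys

from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

# Standard AWS Glue job setup
args = getResolvedOptions(sys.argv, ['JOB_NAME'])
glue_context = GlueContext(SparkContext())
job = Job(glue_context)
job.init(args['JOB_NAME'], args)

# Read the two cleansed tables from the Glue Data Catalog (names are illustrative)
stats_df = glue_context.create_dynamic_frame.from_catalog(
    database='db_youtube_cleansed', table_name='raw_statistics').toDF()
ref_df = glue_context.create_dynamic_frame.from_catalog(
    database='db_youtube_cleansed', table_name='reference_data').toDF()

# Join video statistics to their category reference data and write the
# materialized view to the analytics bucket, partitioned by region
joined = stats_df.join(ref_df, stats_df['category_id'] == ref_df['id'], 'inner')
(joined.write
       .mode('overwrite')
       .partitionBy('region')
       .parquet('s3://<analytics-bucket>/youtube/'))

job.commit()
```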
The repository directory structure is as follows:
```
├── assets/                                 <- Assets for the repo.
│   └── (images: architecture diagram and QuickSight dashboard)
│
├── data/                                   <- Data used and produced by the project.
│   ├── raw/                                <- Raw data files (not included due to large file sizes).
│   ├── cleansed/                           <- Cleansed data files.
│   └── analytics/                          <- Materialized view for analytics and reporting.
│
├── docs/                                   <- Documentation for the project.
│   └── solution methodology.pdf            <- Detailed project documentation.
│
├── scripts/                                <- Python scripts for the ETL pipeline.
│   ├── etl_pipeline_csv_to_parquet.py      <- Glue script for the CSV-to-Parquet pipeline.
│   ├── lambda_function.py                  <- Lambda function code (JSON to Parquet).
│   └── etl_pipeline_materialised_view.py   <- Glue script for the materialized-view pipeline.
│
├── README.md                               <- The top-level README for developers using this project.
```
To build this project, the following tools were used:
- AWS S3
- AWS Glue
- AWS Lambda/Layers
- Amazon Athena
- Amazon QuickSight
- AWS Data Wrangler
- Amazon CloudWatch
- AWS IAM
- Python
- Pandas
- Spark
- Git
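With the data cataloged in Glue, the cleansed tables can also be queried ad hoc through Athena, for example from Python via AWS Data Wrangler. A minimal sketch, with illustrative database, table, and column names:

```python
import awswrangler as wr

# Aggregate view counts per region via Athena
# (database, table, and column names are illustrative)
df = wr.athena.read_sql_query(
    sql="""
        SELECT region, COUNT(*) AS videos, SUM(views) AS total_views
        FROM raw_statistics
        GROUP BY region
        ORDER BY total_views DESC
    """,
    database='db_youtube_cleansed',
)
print(df.head())
```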
The project architecture is shown below.
Access the simplified dashboard here.
Below are project execution screenshots from the AWS console.
If you have any doubts, queries, or suggestions, please connect with me on any of the following platforms:
This license allows reusers to distribute, remix, adapt, and build upon the material in any medium or format for noncommercial purposes only, and only so long as attribution is given to the creator. If you remix, adapt, or build upon the material, you must license the modified material under identical terms.