This repository contains the code for an ETL (Extract, Transform, Load) pipeline designed to extract data from Reddit, perform necessary transformations, and upload the data to AWS S3. The pipeline is implemented using Apache Airflow, Docker, and Python.
To use this ETL pipeline, follow these steps:
- Clone this repository to your local machine.
- Install Docker Desktop.
- Create a virtual environment using Python:

      python -m venv venv

- Activate the virtual environment:
  - On Windows:

        venv\Scripts\activate

  - On macOS and Linux:

        source venv/bin/activate

- Install the required Python dependencies:

      pip install -r requirements.txt

- Obtain your Reddit API key and AWS secret key. These keys are required for accessing the Reddit API and AWS S3, respectively.
- Update the config file config.conf with your Reddit and AWS credentials (a sketch of reading and verifying them follows this list).
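What config.conf actually contains depends on this repository's code, so the following is only a minimal sketch: the `[api_keys]` and `[aws]` section and key names are assumptions, as are the `praw` and `boto3` dependencies (check requirements.txt). It shows one way the credentials could be read and smoke-tested before running the pipeline:

```python
# check_credentials.py -- hedged sketch; the config sections/keys below are
# assumptions, not necessarily the names this repository's code expects.
import configparser

import boto3  # AWS SDK (assumed dependency)
import praw   # Reddit API client (assumed dependency)

config = configparser.ConfigParser()
config.read("config/config.conf")  # path assumed from the repo's config/ folder

# Hypothetical layout of config.conf:
#   [api_keys]
#   reddit_client_id = <your Reddit app client id>
#   reddit_secret_key = <your Reddit app secret>
#   [aws]
#   aws_access_key_id = <your AWS access key id>
#   aws_secret_access_key = <your AWS secret key>
#   aws_bucket_name = <target S3 bucket>

# Authenticate against the Reddit API in read-only mode.
reddit = praw.Reddit(
    client_id=config["api_keys"]["reddit_client_id"],
    client_secret=config["api_keys"]["reddit_secret_key"],
    user_agent="reddit-etl-pipeline",
)
print("Reddit read-only auth OK:", reddit.read_only)

# Confirm the AWS credentials can reach the target bucket.
s3 = boto3.client(
    "s3",
    aws_access_key_id=config["aws"]["aws_access_key_id"],
    aws_secret_access_key=config["aws"]["aws_secret_access_key"],
)
s3.head_bucket(Bucket=config["aws"]["aws_bucket_name"])  # raises ClientError on failure
print("S3 bucket reachable.")
```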
To run the ETL pipeline:
- Start the Docker containers:

      docker-compose up -d --build

- Trigger the Airflow DAG etl_reddit_pipeline either manually from the Airflow UI at http://localhost:8080/ or via the Airflow CLI (see the command after this list).
- Monitor the progress of the DAG execution in the Airflow UI.
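For the CLI route, Airflow 2.x provides `airflow dags trigger`. Because Airflow runs inside Docker here, the command has to be executed in one of the Airflow containers; the service name `airflow-scheduler` below is an assumption about this docker-compose file, so substitute the name from your own docker-compose.yml:

    docker-compose exec airflow-scheduler airflow dags trigger etl_reddit_pipeline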
There are five main places to make changes:
- dags
- etls
- pipelines
- config
- utils
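For orientation, the DAG in `dags/` typically just wires together callables defined under `pipelines/` and `etls/`. The sketch below is hypothetical: the module paths (`pipelines.reddit_pipeline`, `pipelines.aws_s3_pipeline`) and function names are assumptions rather than this repository's actual identifiers, but the Airflow 2.x wiring pattern is standard:

```python
# dags/etl_reddit_pipeline.py -- hypothetical sketch; module and task names
# are assumptions, not this repository's actual code.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

from pipelines.reddit_pipeline import reddit_pipeline      # assumed: extract + transform
from pipelines.aws_s3_pipeline import upload_s3_pipeline   # assumed: load to S3

with DAG(
    dag_id="etl_reddit_pipeline",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    # Pull posts from Reddit and write the transformed output locally.
    extract = PythonOperator(
        task_id="reddit_extraction",
        python_callable=reddit_pipeline,
    )

    # Push the transformed file to the configured S3 bucket.
    upload = PythonOperator(
        task_id="s3_upload",
        python_callable=upload_s3_pipeline,
    )

    extract >> upload
```

In this layout, changes to scheduling and task ordering would live in `dags/`, while the extraction and transformation logic itself would live in `etls/` and `pipelines/`; `config/` presumably holds config.conf, and `utils/` shared helper functions.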