This project builds an event-driven data pipeline on AWS infrastructure.
- Where I got the data
  - I found the dataset through the Registry of Open Data on AWS, which exists to help people discover and share datasets available via AWS resources.
  - I chose the New York City Taxi and Limousine Commission (TLC) Trip Record Data because it is well documented and covers 2009 to 2023.
  - The data is offered in Parquet format (a quick PySpark read is sketched below).
  - The fields of the data are described in the data dictionary: https://www.nyc.gov/assets/tlc/downloads/pdf/data_dictionary_trip_records_yellow.pdf
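For a quick look at the data, one of the monthly yellow-taxi Parquet files can be opened with PySpark; the file name below is illustrative, and the column names follow the data dictionary linked above.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("tlc-peek").getOrCreate()

# Illustrative path: any monthly yellow-taxi Parquet file from the dataset
df = spark.read.parquet("yellow_tripdata_2023-01.parquet")

df.printSchema()  # field names and types match the TLC data dictionary
df.select(
    "tpep_pickup_datetime",   # pickup timestamp
    "tpep_dropoff_datetime",  # dropoff timestamp
    "trip_distance",          # miles
    "total_amount",           # dollars
).show(5)
```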
- nyc-taxi-project (S3 bucket)
  - data: contains the ingest and processed folders
    - ingest: holds the raw NYC taxi Parquet files
    - processed: holds the data transformed by the Spark jobs
  - log: EMR cluster log output is written here after a job is submitted
  - python: the PySpark code is stored in this folder
- Lambda
  - Triggers when a file lands in nyc-taxi-project/data/ingest.
  - Calls the Airflow API, passing the bucket name and key of the ingested file (a minimal handler sketch follows).
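A minimal sketch of the handler, assuming the Airflow 2 stable REST API; the AIRFLOW_URL environment variable and the nyc_taxi_etl DAG id are placeholders for this deployment, and authentication is omitted:

```python
import json
import os
import urllib3

http = urllib3.PoolManager()

AIRFLOW_URL = os.environ["AIRFLOW_URL"]            # e.g. http://<ec2-host>:8080 (placeholder)
DAG_ID = os.environ.get("DAG_ID", "nyc_taxi_etl")  # hypothetical DAG id

def handler(event, context):
    # S3 put event fired when a file lands in nyc-taxi-project/data/ingest
    record = event["Records"][0]["s3"]
    bucket = record["bucket"]["name"]
    key = record["object"]["key"]

    # Trigger a DAG run through the Airflow REST API, passing bucket/key as conf
    resp = http.request(
        "POST",
        f"{AIRFLOW_URL}/api/v1/dags/{DAG_ID}/dagRuns",
        body=json.dumps({"conf": {"bucket": bucket, "key": key}}),
        headers={"Content-Type": "application/json"},  # real deployments also need auth
    )
    return {"status": resp.status}
```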
- Airflow
  - Runs as a Dockerized container on EC2.
  - Receives the bucket name and key from Lambda and calls the EMR operator, which submits the Spark job with spark-submit (a DAG sketch follows).
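A sketch of the DAG, assuming the Amazon provider's EmrAddStepsOperator and an already-running EMR cluster; the etl.py script name and the cluster id are placeholders:

```python
from datetime import datetime

from airflow import DAG
from airflow.providers.amazon.aws.operators.emr import EmrAddStepsOperator

with DAG("nyc_taxi_etl", start_date=datetime(2023, 1, 1), schedule=None) as dag:
    # The EMR step wraps spark-submit via command-runner.jar; bucket and key
    # come from the DAG-run conf sent by the Lambda trigger.
    step = {
        "Name": "nyc-taxi-etl",
        "ActionOnFailure": "CONTINUE",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": [
                "spark-submit",
                "s3://nyc-taxi-project/python/etl.py",  # placeholder script name
                "{{ dag_run.conf['bucket'] }}",
                "{{ dag_run.conf['key'] }}",
            ],
        },
    }

    submit_spark_job = EmrAddStepsOperator(
        task_id="submit_spark_job",
        job_flow_id="j-XXXXXXXXXXXXX",  # placeholder cluster id
        steps=[step],
    )
```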
- Superset with Athena
  - Runs as a Dockerized container on EC2.
  - Visualizes the dataset through AWS Athena (a connection sketch follows).
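Superset connects to Athena through a PyAthena SQLAlchemy URI entered in its database form. A sketch of that URI, tested here outside Superset and assuming PyAthena[SQLAlchemy] is installed; the region, schema, and staging directory are placeholders:

```python
from sqlalchemy import create_engine, text

# The same URI goes into Superset's "Connect a database" form
engine = create_engine(
    "awsathena+rest://@athena.us-east-1.amazonaws.com:443/"
    "nyc_taxi?s3_staging_dir=s3://nyc-taxi-project/athena-results/"
)

with engine.connect() as conn:
    print(conn.execute(text("SELECT 1")).scalar())  # sanity check
```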
- EMR
  - Runs the Spark job, performing ETL on the given file.
  - spark-submit is used to trigger the job.
  - The Spark code is stored under the python folder in the S3 bucket (a job sketch follows).
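A sketch of the kind of PySpark job stored in that folder; the bucket and key arrive as command-line arguments from the Airflow step, and the transformation shown is only illustrative:

```python
import sys

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

bucket, key = sys.argv[1], sys.argv[2]  # passed by the Airflow EMR step

spark = SparkSession.builder.appName("nyc-taxi-etl").getOrCreate()

df = spark.read.parquet(f"s3://{bucket}/{key}")

# Illustrative ETL: drop zero-distance trips and derive a trip-duration column
out = df.filter(F.col("trip_distance") > 0).withColumn(
    "trip_minutes",
    (F.unix_timestamp("tpep_dropoff_datetime")
     - F.unix_timestamp("tpep_pickup_datetime")) / 60,
)

out.write.mode("overwrite").parquet(f"s3://{bucket}/data/processed/")
```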
- Glue
  - The crawler is run manually to update the catalogue for Athena (a scripted alternative is sketched below).
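The manual step can also be scripted with boto3; the crawler name below is a placeholder:

```python
import boto3

glue = boto3.client("glue")

# Equivalent to pressing "Run crawler" in the Glue console
glue.start_crawler(Name="nyc-taxi-crawler")  # placeholder crawler name
```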
- Athena
  - Queries can be run through Superset or manually in the AWS console (a programmatic example follows).
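A sketch of submitting a query programmatically with boto3; the nyc_taxi database, processed table, and results location are placeholders for whatever the Glue crawler catalogued:

```python
import boto3

athena = boto3.client("athena")

resp = athena.start_query_execution(
    QueryString="""
        SELECT date_trunc('day', tpep_pickup_datetime) AS day,
               count(*)           AS trips,
               avg(trip_distance) AS avg_miles
        FROM nyc_taxi.processed
        GROUP BY 1
        ORDER BY 1
    """,
    QueryExecutionContext={"Database": "nyc_taxi"},  # placeholder database
    ResultConfiguration={
        "OutputLocation": "s3://nyc-taxi-project/athena-results/"  # placeholder
    },
)
print(resp["QueryExecutionId"])  # results land in the output location above
```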
- Ingestion and Airflow
  - ingestion-airflow.mov
- EMR
  - emr.mov
- Crawling with Athena