College Scorecard Data Pipeline

Data pipeline for analysis of postsecondary education data. Data is sourced from the College Scorecard API and processed through a Python-based ETL pipeline orchestrated with Airflow.

(Project cover image: github_cover)

Table of Contents

  • Goals
  • Architecture
  • ETL Overview
  • Project folder structure
  • Project Setup
  • Running Airflow in Docker
  • References

Goals

This project uses publicly available postsecondary education data from the College Scorecard API to investigate the relationships between demographics and tuition costs. Some other potential research questions that could be explored with these data:

  • Are secular institutions more racially diverse than religious institutions?
  • What are the historical trends in enrollment at male-only and female-only institutions?
  • In what regions or communities in the United States are for-profit institutions most common?

Architecture

(Architecture diagram: college_scorecard_pipeline_architecture)

ETL Overview

DAG tasks, in order of execution:

  1. Extract data from the U.S. Department of Education College Scorecard API
  2. Serialize the raw data as JSON to /data/raw/ in the project directory
  3. Upload the raw file to the AWS S3 raw bucket
  4. Transform the data with pandas and serialize the cleaned CSV file to /data/clean/
  5. Upload the clean file to the AWS S3 clean bucket
  6. Load the clean data into an AWS RDS instance

(Airflow DAG graph: airflow_dag)
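To make the task ordering concrete, here is a minimal sketch of how these steps could be wired together in dags/dag.py. The task IDs, callables, and schedule below are illustrative assumptions, not the exact code in this repository:

# dags/dag.py -- illustrative sketch only; task names, imports, and schedule are assumptions
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

from dag_functions.extract import extract_scorecard_data      # assumed callable
from dag_functions.transform import transform_scorecard_data  # assumed callable

with DAG(
    dag_id="college_scorecard_pipeline",
    start_date=datetime(2022, 1, 1),
    schedule_interval="@monthly",
    catchup=False,
) as dag:
    extract = PythonOperator(task_id="extract_raw_data", python_callable=extract_scorecard_data)
    transform = PythonOperator(task_id="transform_data", python_callable=transform_scorecard_data)
    # The S3 uploads and the RDS load would follow the same pattern
    # (e.g. boto3 for the S3 buckets, SQLAlchemy/psycopg2 for RDS).
    extract >> transform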

Project folder structure

β”œβ”€β”€ dags
β”‚   β”œβ”€β”€ dag.py               <- Airflow DAG
β”‚   └── dag_functions        
β”‚       β”œβ”€β”€ extract.py       <- API extraction function
β”‚       └── transform.py     <- data processing/cleaning function
β”œβ”€β”€ data
β”‚   β”œβ”€β”€ raw                  <- raw data pull from College Scorecard API
β”‚   └── clean                <- processed data in CSV format
β”œβ”€β”€ db_build
β”‚   β”œβ”€β”€ create_tables.SQL    <- create table shells
β”‚   └── create_views.SQL     <- create table views
β”œβ”€β”€ dashboard.py             <- Plotly dashboard app
β”œβ”€β”€ LICENSE                  <- MIT license
β”œβ”€β”€ README.md                <- Top-level project README
└── docker-compose.yaml      <- Docker-Compose file w/ Airflow config
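As a rough illustration of what the two dag_functions modules do, here are hedged sketches of the extraction and transformation steps. The endpoint parameters, field list, and cleaning rules are assumptions for illustration, not the repository's actual code:

# dags/dag_functions/extract.py -- illustrative sketch; fields and paths are assumptions
import json
import os

import requests

BASE_URL = "https://api.data.gov/ed/collegescorecard/v1/schools"

def extract_scorecard_data(out_path="data/raw/scorecard_raw.json"):
    """Pull a page of school records from the College Scorecard API and save the raw JSON."""
    params = {
        "api_key": os.environ["API_KEY"],
        "fields": "id,school.name,school.ownership",  # example fields only
        "per_page": 100,
    }
    resp = requests.get(BASE_URL, params=params, timeout=30)
    resp.raise_for_status()
    os.makedirs(os.path.dirname(out_path), exist_ok=True)
    with open(out_path, "w") as f:
        json.dump(resp.json(), f)
    return out_path

# dags/dag_functions/transform.py -- illustrative sketch; cleaning rules are assumptions
import json
import pandas as pd

def transform_scorecard_data(raw_path="data/raw/scorecard_raw.json",
                             out_path="data/clean/scorecard_clean.csv"):
    """Flatten the raw API response and write a cleaned CSV to /data/clean/."""
    with open(raw_path) as f:
        payload = json.load(f)
    df = pd.json_normalize(payload["results"])              # the API wraps records in "results"
    df = df.rename(columns=lambda c: c.replace(".", "_"))   # friendlier column names
    df.to_csv(out_path, index=False)
    return out_path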

Project Setup

⚠️ Note: This project is no longer maintained and is potentially unstable, so running it is not advised at this time. Some preliminary instructions for Docker and Airflow configuration are provided below.

To execute the DAG, you'll need to store some credentials in a .ENV file in the top-level project directory. Be sure to add .ENV to the project's .gitignore file before publishing anything to GitHub!

It should look something like this:

API_KEY=[insert College Scorecard API key here]
AWS_ACCESS_KEY_ID=[insert AWS Access Key ID here]
AWS_SECRET_ACCESS_KEY=[insert AWS Secret Access Key here]
AIRFLOW_UID=501

There is no need to change AIRFLOW_UID for this setup; it sets the user ID used by the Airflow containers.
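For illustration, assuming the python-dotenv package is available (an assumption, not a stated project dependency), the extraction code could read these values like so:

# Sketch: loading credentials from the .ENV file (python-dotenv is an assumed dependency)
import os
from dotenv import load_dotenv

load_dotenv(".ENV")  # populates the environment with API_KEY and the AWS variables

API_KEY = os.environ["API_KEY"]
AWS_ACCESS_KEY_ID = os.environ["AWS_ACCESS_KEY_ID"]
AWS_SECRET_ACCESS_KEY = os.environ["AWS_SECRET_ACCESS_KEY"]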

Running Airflow in Docker

Refer to the official Airflow docs for more information.

  1. Install Docker and Docker Compose first if you don't already have them.
  2. Open a terminal in your project directory and execute the command below to download docker-compose.yaml:
curl -LfO 'https://airflow.apache.org/docs/apache-airflow/2.3.3/docker-compose.yaml'
  3. Make sure you've set up your .ENV file properly, then initialize three folders in your top-level directory: /dags/, /logs/, and /plugins/.
  4. With the Docker application running on your computer, execute docker-compose up airflow-init in the terminal. This initializes the Airflow instance and creates an admin login with username and password both set to "airflow" by default.
  5. Finally, execute docker-compose up. This runs everything specified in docker-compose.yaml. You can check the health of your containers by opening a new terminal in the same directory and executing docker ps. You should now be able to open your web browser and go to localhost:8080 to log in to the Airflow web client.

Execute docker-compose down --volumes --rmi all to stop and delete all running containers, remove the volumes containing database data, and delete downloaded images.

References

Major shout-out to Amanda Jayapurna for designing the cover image!
