Skip to content

Latest commit

 

History

History
131 lines (100 loc) · 5.67 KB

README.md

File metadata and controls

131 lines (100 loc) · 5.67 KB

County Health Dashboard

This repo was created for a Udacity Data Engineering course. This repo is my final project for that course.

image info


NOTE: For the UI to properly show up the user will need to create an AWS RDS postgres db and run the datapipeline to popluate their database. The user will also need to start the NodeJS server. After the pipeline has been run and the server has been started the user can then start the client/UI. If you don't follow this order the UI will be blank and errors will show in the browser console.


Datapipeline

The datapipeline folder holds all the code responsible for pulling data from the datapipeline/datasets folder into the AWSRDS Postgres DB
The fast food data came from usda.gov and can be seen here: https://www.ers.usda.gov/data-products/food-environment-atlas/go-to-the-atlas/
The mortality data came from the CDC wonder database and can be seen here: https://wonder.cdc.gov/Deaths-by-Underlying-Cause.html
The mortality data was too large and can't be uploaded to github so the etl.py file references only a subset of the original data.

If you want to run the datapipeline into your own AWS RDS DB you can follow the below steps:



  1. Add a rds.cfg file to the datapipeline/etl folder with the below details:
[DBCONNECT]
HOST=mydbendpoint
DB_NAME=mydbname
DB_USER=myuser
DB_PASSWORD=mypassword
DB_PORT=mydbport
  1. Open a cmd prompt to the datapipeline folder and enter the below to setup your virtual environment
env_setup.cmd
  1. Activate your virtualenvironment
env\scripts\activate
  1. Run create_tables.py
python etl/create_tables.py
  1. Run etl.py
python etl/etl.py

Server

The server folder holds all the back end nodejs code.
To get the back end up and running locally first add a .env file to the server folder and add the below details

HOST="mydbendpoint"
DB_NAME="mydbname"
DB_USER="myuser"
DB_PASSWORD="mypassword"
DB_PORT=mydbport

Next open a cmd prompt and cd into the folder

cd server

Then run npm i

npm i

Then run npm start

npm start

Client

The client folder holds all the front end code.
To get the front end up and running locally first cd into the folder

cd client

Then run npm i

npm i

Then run npm start

npm start

ERD

image info image info


How the Project Could be Altered


If the data was increased by 100x.

  • If the data was increased by 100x then analysis would need to be done to make sure the current datapipeline can handle such an amount of data. Currently the dataset pulls around 2 million rows from the CDC database and transforms the datasets to around 80,000 rows. The datapipeline, backend, and UI are all extremely performant with this amount of data. If the data were increased by 100x we would also need to change the tech stack that handles the data. AWS Redshift is a datawarehouse product and is able to handle large datasets much better than a traditional database. We would need to implement AWS Redshift.

If the pipelines were run on a daily basis by 7am.

  • If the pipelines were to be run on a daily basis at 7am the project would need to add Apache Airflow. After adding Airflow we could set up dags to run the data pipeline daily ay 7am.

If the database needed to be accessed by 100+ people.

  • If the database needed to be accessed by 100+ people this would be easily possible as the database is currently AWS RDS. One thing we would need to be cognizant of if we went with this route is permissions - who do we want to be able to access what data? Or do we want all of our business users to be able to access all the data? These are questions we would work with our stakeholders to answer.

Choice of Technologies and Tools

  • REACT: I chose react because I have experience with it and enjoy getting better at it. It also renders incredibly quickly which is perfect for a dashboard application.

  • NODEJS: I chose NodeJS because I have experience with it but am always trying to improve. NodeJS makes for a great reliable backend/server.

  • PostgreSQL: I chose Postgres as a DB because we previously used it in the Data Engineering course and I knew it was capable of handling the amount of data for this dashboard (~80,000 rows). Postgres is a more traditional Relational Database which is what I had in mind for this project.

  • The data in this project is intended to run once and only once. The data I gathered from the USDA on fast food restaurants is only available for the years 2011 and 2016. If someone wanted to run the data pipeline more consistently they would have to work with the USDA to get them to provide updated data on a consistent schedule. If the user wanted to they could however pull the mortality data in the database on a monthly or yearly schedule as the CDC updates there WONDER database frequently.

  • SCHEMA: I chose to have the datapipeline stream the data into two tables (fast_food and mortality) as that makes the most sense. The datapipeline creates a fast_food table of which the fips column is the primary key, the fips column is the county identifier code. The fips column is the foreign key on the mortality table. The fips column being the connecting column was done on purpose as the map that the UI has on it uses a fips code to identify each county on the map. When a specific county is clicked on the map it allows for the UI to easily pass along that code and query the backend/db for applicable data.