Repo for the Capstone Project of the Data Science at Scale course offered by the University of Washington on Coursera.
Average Blight Risk Visualization
Work with real data collected in Detroit to help urban planners predict blight (the deterioration and decay of buildings and older areas of large cities, due to neglect, crime, or lack of economic support).
- Filter NAs and invalid coordinates (outside the bounds of Detroit)
- Extract latitude/longitude pairs and addresses (as raw text) from 4 files
- Concatenate them into one data frame
- Clean up the address field (extract numbers, drop symbols, normalize spelling, expand abbreviations, etc.)
- Cluster geolocations by fuzzy matching on address field and incident proximities (`eps = 0.000075`).
- Represent each building with a rectangle centered at average coordinates.
- DBSCAN on coordinates alone did not give good clusters.
- DBSCAN on a combination of coordinates and address fields is not feasible without rewriting the algorithm, because of the way feature distances are computed.
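The filtering, address cleanup, and clustering steps above can be sketched roughly as follows. This is a minimal illustration using only the standard library, not the project's actual code: the bounding box, the abbreviation table, the `min_ratio` threshold, and the greedy grouping strategy are all assumptions made for the example. Only `eps = 0.000075` comes from the notes above.

```python
import math
import re
from difflib import SequenceMatcher

# Assumed rough bounding box for Detroit; not taken from the project.
LAT_MIN, LAT_MAX = 42.25, 42.46
LON_MIN, LON_MAX = -83.29, -82.90
EPS = 0.000075  # proximity threshold in degrees, as quoted in the notes

# Hypothetical abbreviation table; a real one would be much longer.
ABBREVIATIONS = {"st": "street", "ave": "avenue", "rd": "road", "blvd": "boulevard"}


def is_valid(lat, lon):
    """Reject missing values and points outside the assumed bounding box."""
    if lat is None or lon is None:
        return False
    if math.isnan(lat) or math.isnan(lon):
        return False
    return LAT_MIN <= lat <= LAT_MAX and LON_MIN <= lon <= LON_MAX


def normalize_address(raw):
    """Lowercase, drop symbols, and expand a few common abbreviations."""
    tokens = re.sub(r"[^a-z0-9 ]", " ", raw.lower()).split()
    return " ".join(ABBREVIATIONS.get(t, t) for t in tokens)


def same_building(a, b, min_ratio=0.9):
    """Match if both coordinates differ by at most EPS and addresses are similar."""
    close = abs(a["lat"] - b["lat"]) <= EPS and abs(a["lon"] - b["lon"]) <= EPS
    ratio = SequenceMatcher(None, a["address"], b["address"]).ratio()
    return close and ratio >= min_ratio


def cluster(incidents):
    """Greedy grouping: attach each incident to the first matching building."""
    buildings = []
    for inc in incidents:
        for group in buildings:
            if same_building(group[0], inc):
                group.append(inc)
                break
        else:
            buildings.append([inc])
    return buildings


def centroid(group):
    """Average coordinates, used as the centre of the building rectangle."""
    n = len(group)
    return (sum(i["lat"] for i in group) / n, sum(i["lon"] for i in group) / n)


# Made-up sample records: (latitude, longitude, raw address).
raw = [
    (42.331400, -83.045800, "123 Main St."),
    (42.331430, -83.045820, "123 MAIN STREET"),
    (42.360000, -83.100000, "77 Oak Ave"),
    (0.0, 0.0, "nowhere"),
]
incidents = [
    {"lat": lat, "lon": lon, "address": normalize_address(addr)}
    for lat, lon, addr in raw
    if is_valid(lat, lon)
]
buildings = cluster(incidents)
```

Comparing each incident only against the first member of a group keeps the sketch short. A fuller version would likely compare against the running centroid or use a spatial index to avoid the quadratic scan.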
- Map demolition permits to buildings, derive positive labels.
- Randomly sample an equal number of buildings with negative labels.
- Concatenate them into a "training" set.
This "training" set will later be split into a (real) training set and a validation set. In this task it does not make much sense to use the remaining data as a "testing" set (at least not in the traditional sense), because the remaining buildings are simply those not on the demolition list, and there is no way to determine their true labels. So this part is a little like semi-supervised learning: I'll evaluate the model on the validation set and use the remaining data for visualization and drawing conclusions. This is also what the task requires.
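As a rough illustration of the labeling and splitting described above, here is a minimal sketch in plain Python. The function names, the integer building IDs, the fixed seed, and the 20% validation fraction are assumptions for the example, not values from the project.

```python
import random


def build_training_set(building_ids, demolished_ids, seed=0):
    """Label demolished buildings 1, then sample an equal number of others as 0."""
    rng = random.Random(seed)
    positives = [b for b in building_ids if b in demolished_ids]
    candidates = [b for b in building_ids if b not in demolished_ids]
    negatives = rng.sample(candidates, k=len(positives))
    labelled = [(b, 1) for b in positives] + [(b, 0) for b in negatives]
    rng.shuffle(labelled)
    return labelled


def split(labelled, validation_fraction=0.2):
    """Hold out the last part of the shuffled list as the validation set."""
    n_val = round(len(labelled) * validation_fraction)
    cut = len(labelled) - n_val
    return labelled[:cut], labelled[cut:]


# Made-up example: 100 buildings, 10 of which have demolition permits.
building_ids = list(range(100))
demolished_ids = set(range(10))

labelled = build_training_set(building_ids, demolished_ids)
train, validation = split(labelled)

# Buildings outside the balanced sample have no trustworthy label;
# they are kept only for visualization.
sampled = {b for b, _ in labelled}
remaining = [b for b in building_ids if b not in sampled]
```

Sampling negatives at a 1:1 ratio keeps the classes balanced, at the cost of treating "not on the demolition list" as a true negative, which is the caveat discussed above.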
I believe it's OK to jump right to Step 4.
- Derive features from `violations.csv`, `calls.csv` and `crimes.csv`. Basically counts of one-hot-encoded categorical variables.
- Examine feature importance using random forest. Got a ~0.83 AUC score on OOB data.
Counts of violations and crimes are the simplest yet most important features. I hadn't even included a decaying propagation effect of bad incidents.
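To make "counts of one-hot-encoded categorical variables" concrete, here is a small sketch that counts incident categories per building and assembles them into rows. The record layout `(building_id, category)`, the category names, and the `<source>_<category>` column naming are hypothetical; the actual columns in the CSV files may differ.

```python
from collections import Counter, defaultdict


def count_features(records, source):
    """Count incidents per building, per category, plus a per-source total."""
    features = defaultdict(Counter)
    for building_id, category in records:
        features[building_id][f"{source}_{category}"] += 1
        features[building_id][f"{source}_total"] += 1
    return features


def to_matrix(feature_maps):
    """Merge per-source counts into rows with a shared, sorted column order."""
    merged = defaultdict(Counter)
    for fmap in feature_maps:
        for building_id, counts in fmap.items():
            merged[building_id].update(counts)
    columns = sorted({name for counts in merged.values() for name in counts})
    # A Counter returns 0 for missing keys, which fills the absent categories.
    rows = {b: [counts[c] for c in columns] for b, counts in merged.items()}
    return columns, rows


# Made-up records already mapped to building IDs.
violations = [(1, "weeds"), (1, "weeds"), (2, "debris")]
crimes = [(1, "assault")]

columns, rows = to_matrix(
    [count_features(violations, "violation"), count_features(crimes, "crime")]
)
```

Summing one-hot indicators per building is equivalent to counting occurrences of each category, so the per-source totals here correspond to the plain violation and crime counts mentioned above.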
- Trained an XGBoost model, got a ~0.85 AUC score on OOB data.
- Simplified the model and still got a ~0.849 AUC score.
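For readers unfamiliar with the metric quoted above, AUC can be read as the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one, with ties counted as half. The sketch below computes it directly from that definition. It is an illustration only, with made-up labels and scores, and is not how the scores reported here were produced.

```python
def roc_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a correct ranking
    return wins / (len(pos) * len(neg))


# Made-up validation labels and predicted blight probabilities.
labels = [1, 1, 0, 0]
scores = [0.9, 0.4, 0.5, 0.1]
auc = roc_auc(labels, scores)
```

The pairwise loop is quadratic, which is fine for a demonstration; library implementations use a sort-based approach instead.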
Present a summary with some visualizations.
- Explain the model.
- Make a choropleth map of blight risks on out-of-sample data.
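A choropleth needs one value per region, so the predicted risks have to be aggregated first. The sketch below averages per-building risk over a regular latitude/longitude grid. The grid itself, the `cell_size` of 0.01 degrees, and the sample points are assumptions for illustration; the project may well have used other regions, such as neighborhoods or census tracts.

```python
import math
from collections import defaultdict


def average_risk_by_cell(points, cell_size=0.01):
    """Average predicted risk per grid cell; points are (lat, lon, risk)."""
    sums = defaultdict(lambda: [0.0, 0])
    for lat, lon, risk in points:
        key = (math.floor(lat / cell_size), math.floor(lon / cell_size))
        sums[key][0] += risk
        sums[key][1] += 1
    return {key: total / count for key, (total, count) in sums.items()}


# Made-up out-of-sample buildings with predicted blight risk.
points = [
    (42.331, -83.046, 0.2),
    (42.334, -83.044, 0.6),
    (42.361, -83.101, 0.9),
]
cell_risk = average_risk_by_cell(points)
```

Each key in `cell_risk` identifies one grid cell, and its value is the shade that cell would get on the map.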