To help us figure out what was causing the errors in Zestimates, we explore variables that will help us reduce the logerror.

- First, clone this repo.
- Create an env.py file with the following information for connecting to the SQL server (not part of the repo):
  - password
  - username
  - host

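
As a rough illustration of the env.py file described above, the sketch below stores the three credentials and builds a connection string from them. The placeholder values, the `get_db_url` helper, and the `mysql+pymysql` URL scheme are assumptions for illustration; they are not taken from the repo.

```python
# env.py (keep this file out of version control)
# All three values below are placeholders, not real credentials.
host = "your-host"
username = "your-username"
password = "your-password"


def get_db_url(db_name: str) -> str:
    """Build a SQLAlchemy-style MySQL URL from the credentials above.

    Hypothetical helper: the name and the URL scheme are assumptions.
    """
    return f"mysql+pymysql://{username}:{password}@{host}/{db_name}"
```

A string of this shape can be passed to `pandas.read_sql` together with a query, provided SQLAlchemy and a MySQL driver are installed.
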
acquire.py
- Must include the env.py file in the same directory.
- This file brings in the data from the MySQL server where it is stored.

prep.py
- Handles the following:
  - Data types
  - Missing values
  - Outliers
  - Erroneous columns/data
  - Creating new features

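
One common way to handle the outliers mentioned above is an interquartile-range (IQR) filter. The sketch below is a minimal example of that technique; the function name, the 1.5 multiplier, and the choice of the IQR rule itself are assumptions, since the README does not say which method prep.py uses.

```python
import pandas as pd


def remove_outliers_iqr(df: pd.DataFrame, columns: list, k: float = 1.5) -> pd.DataFrame:
    """Keep rows whose values lie within [Q1 - k*IQR, Q3 + k*IQR] for every listed column.

    Rows with a missing value in any listed column are dropped as well,
    because `between` evaluates to False for NaN.
    """
    mask = pd.Series(True, index=df.index)
    for col in columns:
        q1, q3 = df[col].quantile([0.25, 0.75])
        iqr = q3 - q1
        mask &= df[col].between(q1 - k * iqr, q3 + k * iqr)
    return df[mask]
```

For example, `remove_outliers_iqr(df, ["taxvaluedollarcnt", "lotsizesquarefeet"])` would filter on two of the columns listed in the data dictionary.
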
preprocessing.py
- Feature engineering
- Splits data into train and test
- Scales numeric data

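
The split-and-scale steps above could look roughly like the following. This is a sketch under assumptions: the function name, the 80/20 split, the random seed, and the use of `MinMaxScaler` are illustrative choices, not details confirmed by the repo.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler


def split_and_scale(df, numeric_cols, test_size=0.2, seed=123):
    """Split into train and test, then scale numeric columns using train-only statistics."""
    train, test = train_test_split(df, test_size=test_size, random_state=seed)
    train, test = train.copy(), test.copy()

    # Fit on train only so no information from test leaks into the scaling.
    scaler = MinMaxScaler().fit(train[numeric_cols])
    train[numeric_cols] = scaler.transform(train[numeric_cols])
    test[numeric_cols] = scaler.transform(test[numeric_cols])
    return train, test, scaler
```

Returning the fitted scaler makes it possible to apply the same transformation to new data later.
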
explore.py
- Functions for:
  - Finding the optimal k value for KMeans
  - Elbow plotting
  - Clustering features
  - Statistical testing
  - Visualizations

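
The elbow approach listed above fits KMeans for several values of k and looks for the point where inertia stops dropping sharply. A minimal sketch, assuming scikit-learn's `KMeans`; the function name, seed, and `n_init` value are hypothetical choices.

```python
from sklearn.cluster import KMeans


def inertia_by_k(X, k_values, seed=123):
    """Fit KMeans once per k and return a {k: inertia} mapping for elbow plotting."""
    return {
        k: KMeans(n_clusters=k, n_init=10, random_state=seed).fit(X).inertia_
        for k in k_values
    }
```

Plotting the returned values against k (for example with matplotlib) gives the elbow chart; for location clustering, `X` would be the scaled latitude and longitude columns.
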
- All files necessary to recreate our findings and models
- Report with analysis, clustering and modeling in .ipynb format
- GitHub repo containing all files
T-Test for K of 5 on location clustering
- $H_0$ = There is no difference between the mean logerror scores for cluster 0 and the overall mean logerror
- $H_0$ = There is no difference between the mean logerror scores for cluster 1 and the overall mean logerror
- $H_0$ = There is no difference between the mean logerror scores for cluster 2 and the overall mean logerror
- $H_0$ = There is no difference between the mean logerror scores for cluster 3 and the overall mean logerror
- $H_0$ = There is no difference between the mean logerror scores for cluster 4 and the overall mean logerror

T-Test for K of 5 on location clustering
- $H_0$ = There is no difference between the mean logerror scores for cluster 0 and the overall mean logerror
- $H_0$ = There is no difference between the mean logerror scores for cluster 1 and the overall mean logerror
- $H_0$ = There is no difference between the mean logerror scores for cluster 2 and the overall mean logerror
- $H_0$ = There is no difference between the mean logerror scores for cluster 3 and the overall mean logerror
- $H_0$ = There is no difference between the mean logerror scores for cluster 4 and the overall mean logerror
- $H_0$ = There is no difference between the mean logerror scores for cluster 6 and the overall mean logerror

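
Hypotheses of this form, comparing one cluster's mean logerror against the overall mean, can be checked with a one-sample t-test per cluster. The sketch below assumes `scipy.stats.ttest_1samp`, a cluster label column named `cluster`, and a significance level of 0.05; all three are illustrative assumptions rather than details from the repo.

```python
import pandas as pd
from scipy import stats


def cluster_ttests(df, cluster_col="cluster", target="logerror", alpha=0.05):
    """Run a one-sample t-test of each cluster's target values against the overall mean."""
    overall_mean = df[target].mean()
    rows = []
    for label, group in df.groupby(cluster_col):
        t_stat, p_value = stats.ttest_1samp(group[target], overall_mean)
        rows.append({
            "cluster": label,
            "t": t_stat,
            "p": p_value,
            "reject_h0": p_value < alpha,  # True when the cluster mean differs significantly
        })
    return pd.DataFrame(rows)
```

Each row of the result corresponds to one $H_0$ listed above.
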
Columns | Definition |
---|---|
bathroomcnt | number of bathrooms |
bedroomcnt | number of bedrooms |
calculatedfinishedsquarefeet | SqFt of total living area |
finishedsquarefeet12 | SqFt of finished living area |
latitude | latitude of the middle of the property |
longitude | longitude of the middle of the property |
lotsizesquarefeet | SqFt of the lot |
yearbuilt | Year home was built |
structuretaxvaluedollarcnt | Assessed value of the home structure |
taxvaluedollarcnt | Assessed home value |
taxamount | tax amount of the home |
logerror | logarithmic error of housing price predictions |
transactiondate | date sold |
extras | describes if home has a garage, pool, or neither |
County | State County the home is located in |
room_count | Combines bathroomcnt and bedroomcnt into one variable |
acres | Gives lot acreage size |
dollar_per_sqft_land | cost of land per sqft |
tax_rate | percentage rate for taxes |
avg_sqft_per_room | average sqft per room |
dollar_per_sqft_home | cost of structure space per sqft |
trans_month | transaction month |
trans_day | transaction day |
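
Several of the engineered columns in the table can be derived directly from the raw ones. The sketch below shows one plausible way to compute a few of them; the function name and the exact formulas (in particular `tax_rate` as `taxamount / taxvaluedollarcnt`, expressed as a fraction rather than a percentage) are assumptions, not the repo's confirmed definitions.

```python
import pandas as pd


def add_features(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with a few derived columns (formulas are assumptions)."""
    out = df.copy()
    out["room_count"] = out["bathroomcnt"] + out["bedroomcnt"]
    out["acres"] = out["lotsizesquarefeet"] / 43560  # 43,560 square feet per acre
    out["tax_rate"] = out["taxamount"] / out["taxvaluedollarcnt"]

    # Parse the date once, then pull out the month and day components.
    dates = pd.to_datetime(out["transactiondate"])
    out["trans_month"] = dates.dt.month
    out["trans_day"] = dates.dt.day
    return out
```
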
- Location clusters provided the most value
- Clustering size did not perform well in modeling
- Try to find individual features for each cluster that will help it perform better
- Python (including internal and third party libraries)
- SQL
- Hypothesis testing
- Linear Regression, KMeans
Link may be found HERE