This is my first end-to-end ML project implementation, covering all required stages, with guidance from the book "Hands-On Machine Learning"
- Big Picture
- No Data Snooping
- Exploratory Data Analysis
- Data Preparation
- Model Selection and Training
- Model Fine Tuning
- Evaluation on Test Set
Problem Statement
- Welcome to Machine Learning Housing Corporation!
- Organization Objective: Replacing expensive, time-consuming, and less effective manual prediction techniques with Machine Learning
Framing the problem
- Data: California Census Data
- A typical univariate multiple regression task (many input features, one output value).
- The training set is labelled, hence "Supervised Learning"
- The data is small, hence we opt for "Batch Learning"
Performance Measure
- Root Mean Square Error (RMSE) - the L2 norm
- Mean Absolute Percentage Error (MAPE)
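Both measures can be sketched in NumPy (toy numbers for illustration, not project results):

```python
import numpy as np

# Toy labels and predictions purely for illustration
y_true = np.array([100000., 200000., 300000.])
y_pred = np.array([110000., 190000., 330000.])

# RMSE: L2-norm-based measure, penalizes large errors more heavily
rmse = np.sqrt(np.mean((y_true - y_pred) ** 2))

# MAPE: average absolute error relative to the true value
mape = np.mean(np.abs((y_true - y_pred) / y_true))

print(round(rmse, 2), round(mape, 4))
```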
- Get the Data
- Overview and Primary Understanding
- Test Set
- Firstly employed Simple Random Sampling to draw a train & test set using Scikit-Learn
- Secondly utilized Stratified Sampling by categorizing the whole dataset on `median_income`
- Later we compared the Sampling Bias from both sampling techniques
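The two sampling approaches can be sketched as follows (synthetic stand-in data; the income buckets follow the book's approach, names are assumptions):

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split, StratifiedShuffleSplit

# Hypothetical stand-in for the census data: only median_income matters here
rng = np.random.default_rng(42)
housing = pd.DataFrame({"median_income": rng.uniform(0.5, 15.0, size=1000)})

# Bucket median_income into 5 categories to stratify on
housing["income_cat"] = pd.cut(housing["median_income"],
                               bins=[0., 1.5, 3.0, 4.5, 6.0, np.inf],
                               labels=[1, 2, 3, 4, 5])

# 1) Simple random sampling
train_set, test_set = train_test_split(housing, test_size=0.2, random_state=42)

# 2) Stratified sampling on the income categories
splitter = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
for train_idx, test_idx in splitter.split(housing, housing["income_cat"]):
    strat_train = housing.iloc[train_idx]
    strat_test = housing.iloc[test_idx]

# Compare category proportions against the full dataset to check sampling bias
print(strat_test["income_cat"].value_counts(normalize=True).sort_index())
```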
- Creating Visualizations
- There is a clear depiction of clusters in and around San Diego, Los Angeles, San Francisco, etc.
- From the above figure we can see that `ocean_proximity` seems to be associated with `median_house_value`
- But there are still exceptions in Northern California, so we have to deploy some feature engineering here as well.
- Features such as proximity to clusters can also be checked.
- Correlation Matrix and Scatter Plot
- In general, it shows a strong positive trend.
- The straight line at $500,000 re-emphasizes the price cap.
- A concern is a few straight lines in and around $450,000, $350,000, $280,000, $230,000, and so on below. We may remove the concerned districts.
Data Cleaning
- Missing values of `total_bedrooms` (1.01%) have been treated using the SimpleImputer class of Scikit-Learn.
Handling Text Attribute
- OneHotEncoding is used to handle the `ocean_proximity` column
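The encoding step can be sketched like this (a few illustrative category values, not the full column):

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# A few illustrative categories from ocean_proximity
cat = pd.DataFrame({"ocean_proximity": ["INLAND", "NEAR BAY", "INLAND", "<1H OCEAN"]})

# One binary column per category; exactly one 1 per row
encoder = OneHotEncoder()
onehot = encoder.fit_transform(cat).toarray()

print(encoder.categories_[0])
print(onehot)
```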
Feature Scaling and Transformation Pipeline
- Laid down a single transformation pipeline to transform both numeric and categorical attributes
- The StandardScaler() and ColumnTransformer() classes have been utilized.
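A sketch of such a combined pipeline, assuming a couple of numeric columns and the categorical `ocean_proximity` (column names and toy values are assumptions):

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder

# Toy frame standing in for the housing data
df = pd.DataFrame({
    "median_income": [2.5, 3.8, np.nan, 5.1],
    "total_bedrooms": [400.0, np.nan, 350.0, 250.0],
    "ocean_proximity": ["INLAND", "NEAR BAY", "INLAND", "<1H OCEAN"],
})

num_attribs = ["median_income", "total_bedrooms"]
cat_attribs = ["ocean_proximity"]

# Numeric branch: impute medians, then standardize
num_pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler()),
])

# One ColumnTransformer routes each branch to its columns
full_pipeline = ColumnTransformer([
    ("num", num_pipeline, num_attribs),
    ("cat", OneHotEncoder(), cat_attribs),
])

prepared = full_pipeline.fit_transform(df)
print(prepared.shape)  # 2 scaled numeric columns + 3 one-hot columns
```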
Model Selection
- `Linear Regression`, `Decision Tree Regressor`, and `Random Forest Regressor` models have been fitted on the training set.
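Fitting the three candidate models can be sketched on a synthetic regression problem standing in for the prepared data:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor

# Small synthetic regression problem (illustrative only)
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=200)

models = {
    "lin": LinearRegression(),
    "tree": DecisionTreeRegressor(random_state=42),
    "forest": RandomForestRegressor(n_estimators=30, random_state=42),
}
for name, model in models.items():
    model.fit(X, y)
    # Training-set R^2; an unrestricted tree memorizes the training data
    print(name, round(model.score(X, y), 3))
```

Note that the unrestricted decision tree scores perfectly on its own training set, which is exactly why cross-validation is needed for a fair comparison.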
Model Evaluation
- "K-Fold Cross Validation" yields RMSE scores of 68973.97, 69919.68, and 50631.51 respectively.
- Random Forest Regressor looks very promising.
- Note: The score on the training set is still much lower than on the validation sets, which means the model is still overfitting the training set.
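The K-fold evaluation can be sketched with `cross_val_score` (synthetic data again; scikit-learn maximizes scores, so RMSE comes back negated):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data, purely for illustration
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=200)

forest = RandomForestRegressor(n_estimators=30, random_state=42)

# 10-fold cross-validation; negate the scores to get positive RMSE values
scores = cross_val_score(forest, X, y,
                         scoring="neg_root_mean_squared_error", cv=10)
rmse_scores = -scores
print(rmse_scores.mean(), rmse_scores.std())
```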
Grid Search Cross Validation
- Deployed `GridSearchCV()` to fine-tune hyperparameters
- Got `RandomForestRegressor(max_features=6, n_estimators=30)`
- RMSE score slightly improved from 50631.51 to 50586.27 (K-Fold Cross Validation)
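A minimal sketch of the grid search, with a smaller hypothetical grid than the real run and synthetic data:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data, purely for illustration
rng = np.random.default_rng(0)
X = rng.normal(size=(150, 4))
y = X @ np.array([2.0, -1.0, 0.5, 0.0]) + rng.normal(scale=0.1, size=150)

# Hypothetical grid, kept cheap for illustration
param_grid = {"n_estimators": [10, 30], "max_features": [2, 4]}

grid_search = GridSearchCV(RandomForestRegressor(random_state=42),
                           param_grid, cv=3,
                           scoring="neg_root_mean_squared_error")
grid_search.fit(X, y)
print(grid_search.best_params_)  # best combination found on this toy data
```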
- K-Fold Cross Validation (RMSE) score of 46811.29, with a [44833.99542182, 48708.38642949] confidence interval at a 5% level of significance.
- Mean Absolute Percentage Error (MAPE) score of 0.1771.
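A confidence interval like the one above can be sketched with a t-interval over per-sample squared errors (synthetic errors here, not the project's numbers):

```python
import numpy as np
from scipy import stats

# Hypothetical per-sample squared errors (illustrative only)
rng = np.random.default_rng(1)
squared_errors = rng.uniform(0, 5e9, size=500)

confidence = 0.95
# t-interval on the mean squared error, then square roots give RMSE bounds
interval = stats.t.interval(confidence, len(squared_errors) - 1,
                            loc=squared_errors.mean(),
                            scale=stats.sem(squared_errors))
lo, hi = np.sqrt(interval)
print(lo, hi)
```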