This is my first end-to-end ML project implementation, covering all required stages, with guidance from the book "Hands-On Machine Learning"
- Big Picture
- No Data Snooping
- Exploratory Data Analysis
- Data Preparation
- Model Selection and Training
- Model Fine Tuning
- Evaluation on Test Set
Problem Statement
- Welcome to Machine Learning Housing Corporation!
- Organization Objective: Replacing expensive, time-consuming, and less effective manual prediction techniques with Machine Learning
Framing the problem
- Data: California Census Data
- A typical univariate multiple regression task (many input features, one output value).
- The training set is labelled, hence "Supervised Learning"
- The data is small, hence we opt for "Batch Learning"
Performance Measure
- Root Mean Square Error (RMSE) - the L2 norm
- Mean Absolute Percentage Error (MAPE)
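Both measures can be sketched in NumPy (toy numbers for illustration, not project results):

```python
import numpy as np

# Toy labels and predictions purely for illustration
y_true = np.array([100000., 200000., 300000.])
y_pred = np.array([110000., 190000., 330000.])

# RMSE: L2-norm-based measure, penalizes large errors more heavily
rmse = np.sqrt(np.mean((y_true - y_pred) ** 2))

# MAPE: average absolute error relative to the true value
mape = np.mean(np.abs((y_true - y_pred) / y_true))

print(round(rmse, 2), round(mape, 4))
```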
- Get the Data
- Overview and Primary Understanding
- Test Set
- Firstly employed Simple Random Sampling to draw a train & test set using Scikit-Learn
- Secondly utilized Stratified Sampling by categorizing the whole dataset on `median_income`
- Later we compared the Sampling Bias from both sampling techniques
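The two sampling approaches can be sketched as follows (synthetic stand-in data; the income buckets follow the book's approach, names are assumptions):

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split, StratifiedShuffleSplit

# Hypothetical stand-in for the census data: only median_income matters here
rng = np.random.default_rng(42)
housing = pd.DataFrame({"median_income": rng.uniform(0.5, 15.0, size=1000)})

# Bucket median_income into 5 categories to stratify on
housing["income_cat"] = pd.cut(housing["median_income"],
                               bins=[0., 1.5, 3.0, 4.5, 6.0, np.inf],
                               labels=[1, 2, 3, 4, 5])

# 1) Simple random sampling
train_set, test_set = train_test_split(housing, test_size=0.2, random_state=42)

# 2) Stratified sampling on the income categories
splitter = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
for train_idx, test_idx in splitter.split(housing, housing["income_cat"]):
    strat_train = housing.iloc[train_idx]
    strat_test = housing.iloc[test_idx]

# Compare category proportions against the full dataset to check sampling bias
print(strat_test["income_cat"].value_counts(normalize=True).sort_index())
```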
- Creating Visualizations
- There is a clear depiction of clusters in and around San Diego, Los Angeles, San Francisco, etc.
- From the above figure we can see that `ocean_proximity` seems to be associated with `median_house_value`
- But there are still exceptions in Northern California, so we have to deploy some feature engineering here as well.
- Features such as proximity to clusters can also be checked.
- Correlation Matrix and Scatter Plot
- In general, it shows a strong positive trend.
- The straight line at $500,000 re-emphasizes the price cap.
- A concern is a few straight lines in and around $450,000, $350,000, $280,000, $230,000, and so on below. We may remove the concerned districts.
Data Cleaning
- Missing values of `total_bedrooms` (1.01%) have been treated using the SimpleImputer class of Scikit-Learn.
Handling Text Attribute
- OneHotEncoding is used to handle the `ocean_proximity` column
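The encoding step can be sketched like this (a few illustrative category values, not the full column):

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# A few illustrative categories from ocean_proximity
cat = pd.DataFrame({"ocean_proximity": ["INLAND", "NEAR BAY", "INLAND", "<1H OCEAN"]})

# One binary column per category; exactly one 1 per row
encoder = OneHotEncoder()
onehot = encoder.fit_transform(cat).toarray()

print(encoder.categories_[0])
print(onehot)
```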
Feature Scaling and Transformation Pipeline
- Laid down a single transformation pipeline to transform both numeric and categorical attributes
- The StandardScaler() and ColumnTransformer() classes have been utilized.
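A sketch of such a combined pipeline, assuming a couple of numeric columns and the categorical `ocean_proximity` (column names and toy values are assumptions):

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder

# Toy frame standing in for the housing data
df = pd.DataFrame({
    "median_income": [2.5, 3.8, np.nan, 5.1],
    "total_bedrooms": [400.0, np.nan, 350.0, 250.0],
    "ocean_proximity": ["INLAND", "NEAR BAY", "INLAND", "<1H OCEAN"],
})

num_attribs = ["median_income", "total_bedrooms"]
cat_attribs = ["ocean_proximity"]

# Numeric branch: impute medians, then standardize
num_pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler()),
])

# One ColumnTransformer routes each branch to its columns
full_pipeline = ColumnTransformer([
    ("num", num_pipeline, num_attribs),
    ("cat", OneHotEncoder(), cat_attribs),
])

prepared = full_pipeline.fit_transform(df)
print(prepared.shape)  # 2 scaled numeric columns + 3 one-hot columns
```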
Model Selection
- `Linear Regression`, `Decision Tree Regressor`, and `Random Forest Regressor` models have been fitted on the training set.
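Fitting the three candidate models can be sketched on a synthetic regression problem standing in for the prepared data:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor

# Small synthetic regression problem (illustrative only)
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=200)

models = {
    "lin": LinearRegression(),
    "tree": DecisionTreeRegressor(random_state=42),
    "forest": RandomForestRegressor(n_estimators=30, random_state=42),
}
for name, model in models.items():
    model.fit(X, y)
    # Training-set R^2; an unrestricted tree memorizes the training data
    print(name, round(model.score(X, y), 3))
```

Note that the unrestricted decision tree scores perfectly on its own training set, which is exactly why cross-validation is needed for a fair comparison.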
Model Evaluation
- "K-Fold Cross Validation" yields RMSE scores of 68973.97, 69919.68, and 50631.51 respectively.
- Random Forest Regressor looks very promising.
- Note: The score on the training set is still much lower than on the validation sets, which means the model is still overfitting the training set.
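The K-fold evaluation can be sketched with `cross_val_score` (synthetic data again; scikit-learn maximizes scores, so RMSE comes back negated):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data, purely for illustration
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=200)

forest = RandomForestRegressor(n_estimators=30, random_state=42)

# 10-fold cross-validation; negate the scores to get positive RMSE values
scores = cross_val_score(forest, X, y,
                         scoring="neg_root_mean_squared_error", cv=10)
rmse_scores = -scores
print(rmse_scores.mean(), rmse_scores.std())
```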
Grid Search Cross Validation
- Deployed `GridSearchCV()` to fine-tune hyperparameters
- Got `RandomForestRegressor(max_features=6, n_estimators=30)`
- RMSE score slightly improved from 50631.51 to 50586.27 (K-Fold Cross Validation)
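A minimal sketch of the grid search, with a smaller hypothetical grid than the real run and synthetic data:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data, purely for illustration
rng = np.random.default_rng(0)
X = rng.normal(size=(150, 4))
y = X @ np.array([2.0, -1.0, 0.5, 0.0]) + rng.normal(scale=0.1, size=150)

# Hypothetical grid, kept cheap for illustration
param_grid = {"n_estimators": [10, 30], "max_features": [2, 4]}

grid_search = GridSearchCV(RandomForestRegressor(random_state=42),
                           param_grid, cv=3,
                           scoring="neg_root_mean_squared_error")
grid_search.fit(X, y)
print(grid_search.best_params_)  # best combination found on this toy data
```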
- K-Fold Cross Validation (RMSE) score of 46811.29, with a [44833.99542182, 48708.38642949] confidence interval at a 5% level of significance.
- Mean Absolute Percentage Error (MAPE) score of 0.1771.
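A confidence interval like the one above can be sketched with a t-interval over per-sample squared errors (synthetic errors here, not the project's numbers):

```python
import numpy as np
from scipy import stats

# Hypothetical per-sample squared errors (illustrative only)
rng = np.random.default_rng(1)
squared_errors = rng.uniform(0, 5e9, size=500)

confidence = 0.95
# t-interval on the mean squared error, then square roots give RMSE bounds
interval = stats.t.interval(confidence, len(squared_errors) - 1,
                            loc=squared_errors.mean(),
                            scale=stats.sem(squared_errors))
lo, hi = np.sqrt(interval)
print(lo, hi)
```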