California_House-Price-Prediction

This is my first end-to-end ML project, covering all the major stages of a machine-learning workflow, with guidance from the book "Hands-On Machine Learning".

Table of Contents

Big Picture

  1. Problem Statement

    • Welcome to Machine Learning Housing Corporation!
    • Organization objective: replace expensive, time-consuming, and less effective manual prediction techniques with Machine Learning
  2. Framing the problem

    • Data: California Census Data
    • A typical multiple regression task (several input features) producing a single output value per district (univariate regression).
    • The training set is labelled, hence "Supervised Learning"
    • The data is small enough to fit in memory, hence we opt for "Batch Learning"
  3. Performance Measure

    • Root Mean Square Error (RMSE), corresponding to the ℓ2 norm
    • Mean Absolute Percentage Error (MAPE); see the sketch below
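
A minimal sketch of both metrics computed directly with NumPy; the `y_true` / `y_pred` arrays below are hypothetical values used only for illustration:

```python
import numpy as np

# Hypothetical actual and predicted median house values (illustration only).
y_true = np.array([250000.0, 310000.0, 180000.0, 420000.0])
y_pred = np.array([240000.0, 330000.0, 200000.0, 400000.0])

# RMSE: the l2 norm of the error vector divided by sqrt(n).
rmse = np.sqrt(np.mean((y_true - y_pred) ** 2))

# MAPE: mean absolute error expressed as a fraction of the true values.
mape = np.mean(np.abs((y_true - y_pred) / y_true))

print(f"RMSE = {rmse:.2f}, MAPE = {mape:.4f}")
```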

No Data Snooping

  1. Get the Data
    • Overview and primary understanding of the data
  2. Test Set
    • Firstly employed simple random sampling to draw train & test sets using Scikit-Learn
    • Secondly utilized stratified sampling by categorizing the whole dataset on median_income
    • Later compared the sampling bias of both techniques (see the sketch below)
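
A minimal sketch of the two splits, assuming the census data has been loaded into a DataFrame called `housing`; the variable names, income bins, and the 0.2 test ratio are assumptions:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Assumes `housing` is the California census DataFrame with a median_income column.
# Bin median_income into income categories so the strata can be compared later.
housing["income_cat"] = pd.cut(housing["median_income"],
                               bins=[0.0, 1.5, 3.0, 4.5, 6.0, np.inf],
                               labels=[1, 2, 3, 4, 5])

# 1) Simple random sampling
train_set, test_set = train_test_split(housing, test_size=0.2, random_state=42)

# 2) Stratified sampling on the income categories
strat_train_set, strat_test_set = train_test_split(
    housing, test_size=0.2, stratify=housing["income_cat"], random_state=42)

# Compare sampling bias: income-category proportions in each test set vs. the full data.
print(housing["income_cat"].value_counts(normalize=True).sort_index())
print(test_set["income_cat"].value_counts(normalize=True).sort_index())
print(strat_test_set["income_cat"].value_counts(normalize=True).sort_index())
```

The closer the stratified test set's proportions are to the full dataset's, the smaller its sampling bias.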

Exploratory Data Analysis

  1. Creating Visualizations

Screenshot from 2024-06-06 10-24-13: geographic scatter plot of the housing data


  • There is a clear depiction of clusters in and around San Diego, Los Angeles, San Francisco, etc.
  • From the above figure we can also see that ocean_proximity seems to be associated with median_house_value.
  • But there are still exceptions in Northern California, so we have to deploy some feature engineering here as well.
  • Features such as proximity to these clusters can also be checked (see the sketch below).
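
A minimal sketch of the kind of plot shown above, assuming the training DataFrame is called `housing`; the plot parameters (alpha, colormap, figure size) are assumptions:

```python
import matplotlib.pyplot as plt

# Assumes `housing` is the training DataFrame from the previous step.
# Geographic scatter: point size ~ district population, colour ~ median house value.
housing.plot(kind="scatter", x="longitude", y="latitude", alpha=0.4,
             s=housing["population"] / 100, label="population",
             c="median_house_value", cmap="jet", colorbar=True,
             figsize=(10, 7))
plt.legend()
plt.show()
```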

  2. Correlation Matrix and Scatter Plot

Screenshot from 2024-06-06 19-53-02: correlation scatter plot


  • In general, it shows a strong positive trend between median_income and median_house_value.
  • The straight line at $500,000 re-emphasizes the price cap.
  • A concern is the few faint straight lines around $450,000, $350,000, $280,000, $230,000, and so on below it.
  • We may remove the affected districts (see the sketch below).
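
A minimal sketch of how such a correlation check can be produced, assuming the same `housing` DataFrame; the `numeric_only=True` flag simply skips the text column:

```python
import matplotlib.pyplot as plt

# Assumes `housing` is the training DataFrame (including median_house_value).
# Correlation of every numeric attribute with the target.
corr_matrix = housing.corr(numeric_only=True)
print(corr_matrix["median_house_value"].sort_values(ascending=False))

# Scatter plot of the strongest predictor against the target;
# the horizontal streaks around the capped prices show up here.
housing.plot(kind="scatter", x="median_income", y="median_house_value", alpha=0.1)
plt.show()
```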

Data Preparation

  1. Data Cleaning

    • Missing values in total_bedrooms (1.01% of rows) have been treated using Scikit-Learn's SimpleImputer class (see the sketch below).

Screenshot from 2024-06-09 11-19-36: missing-value treatment with SimpleImputer
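
A minimal sketch of the imputation step, assuming the training DataFrame is called `housing`; the `median` strategy is an assumption (a common choice for this dataset):

```python
import pandas as pd
from sklearn.impute import SimpleImputer

# Assumes `housing` is the training DataFrame; median strategy is an assumption.
housing_num = housing.drop(columns=["ocean_proximity"])   # numeric attributes only
imputer = SimpleImputer(strategy="median")
housing_filled = pd.DataFrame(imputer.fit_transform(housing_num),
                              columns=housing_num.columns,
                              index=housing_num.index)
```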


  2. Handling the Text Attribute

    • OneHotEncoder is used to encode the ocean_proximity column
  3. Feature Scaling and Transformation Pipeline

    • Laid down a single transformation pipeline to transform both numeric and categorical attributes (see the sketch below)
    • The StandardScaler() and ColumnTransformer() classes have been utilized.
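
A minimal sketch of such a combined pipeline, assuming the training features live in a DataFrame called `housing`; the imputation strategy and the `handle_unknown` setting are assumptions:

```python
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder

# Numeric attributes of the California housing data plus the single text attribute.
num_attribs = ["longitude", "latitude", "housing_median_age", "total_rooms",
               "total_bedrooms", "population", "households", "median_income"]
cat_attribs = ["ocean_proximity"]

# Numeric branch: impute missing values, then standardize.
num_pipeline = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
])

# A single ColumnTransformer handles both the numeric and the categorical branch.
preprocessing = ColumnTransformer([
    ("num", num_pipeline, num_attribs),
    ("cat", OneHotEncoder(handle_unknown="ignore"), cat_attribs),
])

housing_prepared = preprocessing.fit_transform(housing)
```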

Model Selection and Training

  1. Model Selection

    • Linear Regression, Decision Tree Regressor, and Random Forest Regressor models have been fitted on the training set (see the sketch after this list).
  2. Model Evaluation

    • "K-Fold Cross Validation" depicts RSME scores of 68973.97, 69919.68, 50631.51 respectively.
    • Random Forest Regressor looks very promising.
    • Note: The score on the training set is still much lower than on validation sets, which means still overfitting the training set.
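
A minimal sketch of the fitting and cross-validation loop, assuming `housing_prepared` and `housing_labels` come from the preparation step above; the 10-fold setting and random seeds are assumptions:

```python
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

# Assumes `housing_prepared` (transformed features) and `housing_labels`
# (median_house_value targets) exist from the preparation step.
models = {
    "Linear Regression": LinearRegression(),
    "Decision Tree": DecisionTreeRegressor(random_state=42),
    "Random Forest": RandomForestRegressor(random_state=42),
}

for name, model in models.items():
    # 10-fold cross-validation; scikit-learn maximizes scores, so RMSE comes back negated.
    scores = cross_val_score(model, housing_prepared, housing_labels,
                             scoring="neg_root_mean_squared_error", cv=10)
    rmse_scores = -scores
    print(f"{name}: mean RMSE = {rmse_scores.mean():.2f} (std = {rmse_scores.std():.2f})")
```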

Model Fine Tuning

  1. Grid Search Cross Validation

    • Deployed GridSearchCV() to fine-tune hyperparameters
    • Best estimator found: RandomForestRegressor(max_features=6, n_estimators=30)
    • RMSE score slightly improved from 50631.51 to 50586.27 (K-Fold Cross Validation); see the sketch below

Screenshot from 2024-06-16 12-25-50: GridSearchCV results
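
A minimal sketch of the grid search, assuming `housing_prepared` and `housing_labels` from the earlier steps; the parameter grid and cross-validation settings below are assumptions (the actual search selected max_features=6 and n_estimators=30):

```python
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Hypothetical parameter grid for the Random Forest.
param_grid = [
    {"n_estimators": [3, 10, 30], "max_features": [2, 4, 6, 8]},
    {"bootstrap": [False], "n_estimators": [3, 10], "max_features": [2, 3, 4]},
]

grid_search = GridSearchCV(RandomForestRegressor(random_state=42), param_grid,
                           cv=5, scoring="neg_root_mean_squared_error",
                           return_train_score=True)
grid_search.fit(housing_prepared, housing_labels)

print(grid_search.best_params_)
print(grid_search.best_estimator_)
```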


Evaluation of Test Set

  • K-Fold Cross Validation RMSE score of 46811.29, with a [44833.99542182, 48708.38642949] confidence interval at a 5% level of significance (see the sketch below).
  • Mean Absolute Percentage Error (MAPE) score of 0.1771 (about 17.7%).
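
A minimal sketch of how such a confidence interval can be computed from the test-set squared errors, assuming hypothetical names `final_model`, `X_test_prepared`, and `y_test` for the tuned model and the held-out test data:

```python
import numpy as np
from scipy import stats

# Assumes `final_model` is the tuned RandomForestRegressor and that
# X_test_prepared / y_test come from the stratified test set (names are assumptions).
final_predictions = final_model.predict(X_test_prepared)
squared_errors = (final_predictions - y_test) ** 2

# 95% confidence interval (5% significance level) for the RMSE,
# built from a t-interval on the mean squared error.
confidence = 0.95
interval = np.sqrt(stats.t.interval(confidence, len(squared_errors) - 1,
                                    loc=squared_errors.mean(),
                                    scale=stats.sem(squared_errors)))
print(interval)   # [lower, upper] bounds on the RMSE
```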

🛡️ Demonstration Video


Thank You So Much...🙏🙏