The goal of this project is to predict whether an existing health insurance customer will buy vehicle insurance from the same company in the coming year, based on features such as demographics and previous insurance details. This repository addresses the Health Insurance Cross-Sell Prediction problem using a dataset available on Kaggle.
The dataset provides a variety of features related to the insured person, such as:
- Age
- Gender
- Region Code
- Policy Sales Channel
- Driving License
- Vehicle Age
- Annual Premium
I'll aim to develop a machine learning model that predicts the target variable `Response` (1: will buy insurance, 0: will not buy insurance).
The dataset consists of one CSV file:
- `dataset.csv`: the full dataset, with features describing customers who hold health insurance.

The target variable in the dataset is `Response`, which indicates whether the customer will purchase insurance.
| Column Name | Description |
|---|---|
| `id` | Unique identifier for each customer |
| `Gender` | Gender of the customer |
| `Age` | Age of the customer |
| `Driving_License` | 0: Customer does not have DL, 1: Customer has DL |
| `Region_Code` | Unique code for the region of the customer |
| `Previously_Insured` | 0: Customer does not have insurance, 1: Customer has insurance |
| `Vehicle_Age` | Age of the customer’s vehicle |
| `Vehicle_Damage` | 1: Customer has damaged the vehicle, 0: Customer has not damaged the vehicle |
| `Annual_Premium` | The premium amount for insurance |
| `Policy_Sales_Channel` | Channel through which the policy was sold |
| `Vintage` | Number of days the customer has been associated with the company |
| `Response` | 1: Will buy insurance, 0: Will not buy insurance |
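
As a quick sanity check on the schema above, the sketch below verifies that a loaded frame contains the documented columns and reports the share of each `Response` class. The names `EXPECTED_COLUMNS`, `missing_columns`, and `response_share` are hypothetical helpers, the file path in the comment is an assumption, and the four rows are made-up values standing in for `dataset.csv`.

```python
import pandas as pd

# Column names taken from the table above.
EXPECTED_COLUMNS = [
    "id", "Gender", "Age", "Driving_License", "Region_Code",
    "Previously_Insured", "Vehicle_Age", "Vehicle_Damage",
    "Annual_Premium", "Policy_Sales_Channel", "Vintage", "Response",
]


def missing_columns(df: pd.DataFrame) -> list:
    """Return the documented columns that are absent from df."""
    return [col for col in EXPECTED_COLUMNS if col not in df.columns]


def response_share(df: pd.DataFrame) -> dict:
    """Return the fraction of rows in each Response class."""
    shares = df["Response"].value_counts(normalize=True)
    return {int(label): float(share) for label, share in shares.items()}


# Made-up rows for illustration only. With the real file this would be
# something like: df = pd.read_csv("dataset/dataset.csv")  (assumed path)
df = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "Gender": ["Male", "Female", "Female", "Male"],
    "Age": [44, 29, 51, 36],
    "Driving_License": [1, 1, 1, 0],
    "Region_Code": [28, 3, 11, 41],
    "Previously_Insured": [0, 1, 0, 1],
    "Vehicle_Age": ["> 2 Years", "< 1 Year", "1-2 Year", "1-2 Year"],
    "Vehicle_Damage": [1, 0, 1, 0],
    "Annual_Premium": [40454.0, 28619.0, 33536.0, 27496.0],
    "Policy_Sales_Channel": [26, 152, 26, 152],
    "Vintage": [217, 183, 27, 203],
    "Response": [1, 0, 0, 0],
})

print("Missing columns:", missing_columns(df))
print("Response share:", response_share(df))
```

A strongly skewed `Response` share is a signal to revisit the class-imbalance handling before training.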
- Handling Missing Data: Identify and handle any missing data.
- Feature Engineering: Analyze categorical and numerical features to create additional informative features.
- Normalization/Scaling: Normalize or scale the numerical features to prepare them for model training.
- Perform descriptive statistics and visualization to understand data distributions, correlations, and patterns.
- Analyze class imbalance in the target variable and address it if necessary.
- Model Selection: We will experiment with multiple models including:
- Logistic Regression
- Decision Trees
- Random Forests
- Gradient Boosting Machines (XGBoost, LightGBM)
- Neural Networks
- Hyperparameter Tuning: Optimize model parameters using cross-validation techniques.
- Evaluation Metrics: Accuracy, ROC-AUC, F1-Score, Precision, and Recall will be used to evaluate the model's performance.
- Finalize the best-performing model and save it for potential deployment.
- Explore model interpretability and SHAP values to understand the features driving predictions.
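
The steps listed above can be sketched end to end with scikit-learn. This is a minimal illustration under stated assumptions, not the notebook's actual code: the data is synthetic (randomly generated, not the Kaggle dataset), logistic regression is used only because it is the first candidate in the list, the `C` grid is arbitrary, and `class_weight="balanced"` is one possible way to address class imbalance.

```python
import pickle

import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score, roc_auc_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in data (NOT the real dataset); only four columns are mimicked.
rng = np.random.default_rng(0)
n = 400
X = pd.DataFrame({
    "Age": rng.integers(20, 80, size=n).astype(float),
    "Annual_Premium": rng.normal(30000, 8000, size=n),
    "Gender": rng.choice(["Male", "Female"], size=n),
    "Vehicle_Age": rng.choice(["< 1 Year", "1-2 Year", "> 2 Years"], size=n),
})
# Artificial minority-class target loosely tied to Age, purely for illustration.
y = ((X["Age"] + rng.normal(0, 10, size=n)) > 65).astype(int)
# Inject a few missing values so the imputation step has something to do.
X.loc[X.index % 50 == 0, "Annual_Premium"] = np.nan

numeric = ["Age", "Annual_Premium"]
categorical = ["Gender", "Vehicle_Age"]

# Handle missing data and scale numeric features; one-hot encode categoricals.
preprocess = ColumnTransformer([
    ("num", Pipeline([
        ("impute", SimpleImputer(strategy="median")),
        ("scale", StandardScaler()),
    ]), numeric),
    ("cat", OneHotEncoder(handle_unknown="ignore"), categorical),
])

model = Pipeline([
    ("preprocess", preprocess),
    ("clf", LogisticRegression(class_weight="balanced", max_iter=1000)),
])

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Hyperparameter tuning with 3-fold cross-validation, scored by ROC-AUC.
search = GridSearchCV(
    model, param_grid={"clf__C": [0.1, 1.0, 10.0]}, scoring="roc_auc", cv=3
)
search.fit(X_train, y_train)

# Evaluate the refitted best model on the held-out split.
proba = search.predict_proba(X_test)[:, 1]
pred = search.predict(X_test)
auc = roc_auc_score(y_test, proba)
f1 = f1_score(y_test, pred)
print(f"best C={search.best_params_['clf__C']}  ROC-AUC={auc:.3f}  F1={f1:.3f}")

# Serialize the fitted pipeline; an in-memory round trip stands in for a file.
restored = pickle.loads(pickle.dumps(search.best_estimator_))
```

Keeping preprocessing inside the `Pipeline` means the imputer and scaler are fitted only on the training folds during cross-validation, which avoids leaking test information into the tuning step. The same structure accepts any of the other listed models in place of `LogisticRegression`.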
The code is written in Python and requires the following libraries:
- Pandas
- Requests
- NumPy
- Scikit-learn
- Matplotlib
- Seaborn
- Clone the repository:

      git clone https://github.com/HugoTex98/health-insurance-cross-sell-prediction.git
      cd health-insurance-cross-sell-prediction

- Create a virtual environment and activate it:

      python3 -m venv venv
      source venv/bin/activate  # On Windows use `venv\Scripts\activate`

- Install the required dependencies:

      pip install -r requirements.txt

- Download the dataset from Kaggle and place it in the `/dataset` directory.
- Run the Notebook:
The main file of this project is a Jupyter Notebook that covers every step: data loading, exploratory data analysis (EDA), model training, and evaluation. To run the notebook:
```bash
jupyter notebook notebooks/Health_Insurance_Cross_Sell_Prediction.ipynb
```
- Follow Along in the Notebook:
Open the notebook in your browser and run each cell sequentially. The notebook will guide you through:
- Data loading and preprocessing
- Exploratory Data Analysis (EDA)
- Feature engineering
- Model training and evaluation
- Making predictions on new data
- Modifying the Notebook:
To experiment with the model, adjust parameters, or apply different techniques, modify the cells in the notebook and rerun the relevant sections.
- Save Results:
Outputs such as plots, metrics, and predictions are generated within the notebook. To save specific results (e.g., predictions), follow the instructions in the relevant notebook section.
- Thanks to Kaggle for providing the dataset.