statistical-models-car-insurance

This repo contains the source code for a project I carried out during my time at the University of Bologna and later updated to improve the results.

Goal

The primary goal of the project is to estimate the pure premium of an auto insurance portfolio and create tariff classes to identify policyholders' risk factors.

Data and Tools

The data come from the R package CASdatasets, a collection of datasets originally compiled for the book "Computational Actuarial Science with R", edited by Arthur Charpentier.

In particular, the datasets used are freMTPLfreq, which contains the risk characteristics and the number of claims per policy (413,169 policies), and freMTPLsev, which contains the claim amount and the corresponding policy ID.

Summary of the project

Data preparation and EDA:

Converted numerical features into categorical/ordinal (age of the driver, age of the car, population density) to create tariff risk classes
Removed policies with large claims (100th percentile of the distribution of the claim amount variable)
Exploratory analysis of features and outcomes
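
The binning step can be sketched as follows. This is a minimal Python illustration (the project itself is written in R); the helper `to_class` and the cut points are hypothetical and are not the ones used in the analysis.

```python
from bisect import bisect_right


def to_class(value, breaks, labels):
    """Map a numeric value to the label of the interval it falls in.

    Intervals are left-closed: a value equal to a break point goes
    into the class that starts at that break point.
    """
    if len(labels) != len(breaks) + 1:
        raise ValueError("need exactly one more label than break points")
    return labels[bisect_right(breaks, value)]


# Hypothetical cut points for driver age (assumption, for illustration only).
AGE_BREAKS = [26, 43, 75]
AGE_LABELS = ["18-25", "26-42", "43-74", "75+"]

driver_ages = [19, 26, 42, 43, 80]
age_classes = [to_class(a, AGE_BREAKS, AGE_LABELS) for a in driver_ages]
```

The same helper could be reused for car age and population density with their own break points.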

Modeling:

Following standard actuarial practice, the pure premium is obtained as the product of two components: the estimated claim frequency and the estimated average claim cost.

Therefore, two models are estimated separately, one for the claim frequency and one for the average claim amount (severity).
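
As a worked illustration of the frequency-severity decomposition, here is a small Python sketch (not the project's R code). All figures are made up, and measuring exposure in policy-years is an assumption.

```python
def pure_premium(expected_frequency, expected_severity):
    """Pure premium = expected claims per unit of exposure x expected cost per claim."""
    return expected_frequency * expected_severity


# Made-up figures for a single tariff class (assumption, for illustration only).
claims = 40           # number of claims observed
exposure = 1000.0     # policy-years at risk
total_cost = 60000.0  # total claim amount

frequency = claims / exposure   # claims per policy-year
severity = total_cost / claims  # average cost per claim
premium = pure_premium(frequency, severity)
```

With these numbers the product equals total cost divided by exposure, which is why the two components can be modelled separately and multiplied at the end.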

Because the premium and the new tariff classes must also apply to future policies, cross-validation techniques are used to select the most relevant features and the most accurate predictive models.

The analysis is structured in the following steps:

Splitting data into training and test set
Feature selection using the Random Forest algorithm
Comparison of different count data models (Poisson Regression, Negative Binomial, Zero-Inflated and Hurdle models) to predict claim frequency. The models are evaluated using the Dawid-Sebastiani scoring rule on the test sample (the lower the better)
Gamma Regression to predict claim severity
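
The Dawid-Sebastiani score used in the comparison step depends only on the predictive mean and variance of each model. Below is a minimal Python sketch of the score (not the project's R code); the toy predictions and the dispersion value `theta` are assumptions, and the negative binomial variance uses the common `mean + mean^2 / theta` parameterisation.

```python
import math


def dawid_sebastiani(y, mean, variance):
    """Dawid-Sebastiani score for one observation (lower is better)."""
    return (y - mean) ** 2 / variance + math.log(variance)


def mean_dss(observed, means, variances):
    """Average the score over a test sample."""
    scores = [
        dawid_sebastiani(y, m, v)
        for y, m, v in zip(observed, means, variances)
    ]
    return sum(scores) / len(scores)


# Toy test sample and predicted means (assumption, for illustration only).
observed = [0, 0, 1, 0, 2]
pred_means = [0.1, 0.2, 0.5, 0.1, 0.8]

# Poisson: the predictive variance equals the mean.
poisson_vars = list(pred_means)

# Negative binomial with hypothetical dispersion theta.
theta = 2.0
nb_vars = [m + m * m / theta for m in pred_means]

dss_poisson = mean_dss(observed, pred_means, poisson_vars)
dss_nb = mean_dss(observed, pred_means, nb_vars)
```

The model with the lower average score on the test sample would be preferred.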

Pricing and Relativities calculation:

Once the best models have been chosen on the test sample, they are refitted on the full dataset to calculate the pure premium. The accuracy of this prediction is then evaluated against the observed data using MAE and RMSE.
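
For reference, the two accuracy metrics can be computed as in this short Python sketch (not the project's R code); the observed and predicted values are made up.

```python
import math


def mae(observed, predicted):
    """Mean absolute error."""
    return sum(abs(o - p) for o, p in zip(observed, predicted)) / len(observed)


def rmse(observed, predicted):
    """Root mean squared error."""
    mse = sum((o - p) ** 2 for o, p in zip(observed, predicted)) / len(observed)
    return math.sqrt(mse)


# Made-up observed claim costs and predicted pure premiums (illustration only).
observed = [0.0, 120.0, 0.0, 300.0]
predicted = [50.0, 100.0, 40.0, 250.0]

mae_value = mae(observed, predicted)
rmse_value = rmse(observed, predicted)
```

RMSE penalises large errors more heavily than MAE, so the two are usually read together.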

The coefficients of the two chosen models, estimated on the full sample, are exponentiated, and the relativities of the risk factors are calculated from these values.
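
A minimal Python sketch of the relativity calculation follows (not the project's R code). It assumes both models use a log link, so that exponentiated coefficients act as multiplicative factors relative to the reference level. The coefficient values and level names are hypothetical, and multiplying the frequency and severity relativities to get a pure-premium relativity is an assumption about how the two are combined.

```python
import math

# Hypothetical coefficients for one risk factor; the reference level has beta = 0.
freq_coefs = {"age 18-25": 0.40, "age 26-42": 0.0, "age 75+": 0.10}
sev_coefs = {"age 18-25": 0.05, "age 26-42": 0.0, "age 75+": -0.02}


def relativities(coefs):
    """Exponentiate log-link coefficients into multiplicative relativities."""
    return {level: math.exp(beta) for level, beta in coefs.items()}


freq_rel = relativities(freq_coefs)
sev_rel = relativities(sev_coefs)

# Assumed combination: pure-premium relativity = frequency x severity relativity.
premium_rel = {level: freq_rel[level] * sev_rel[level] for level in freq_rel}
```

A relativity above 1 marks a level that is riskier than the reference class, and one below 1 marks a less risky level.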
