This repo contains the source code of a project I carried out during my time at the University of Bologna and later updated to improve the results.
The primary goal of the project is to estimate the pure premium of an auto insurance portfolio and create tariff classes to identify policyholders' risk factors.
Data comes from the R CASdatasets package, a collection of datasets originally assembled for the book "Computational Actuarial Science with R", edited by Arthur Charpentier.
In particular, the datasets used are freMTPLfreq, which contains the risk characteristics and the number of claims per policy (413,169 policies), and freMTPLsev, which contains the claim amount and the corresponding policy ID.
- Converted numerical features (age of the driver, age of the car, population density) into categorical/ordinal ones to create tariff risk classes
- Removed policies with large claims (100th percentile of the distribution of the claim amount variable)
- Exploratory analysis of features and outcomes
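The conversion of a numerical feature into tariff classes can be illustrated with a minimal Python sketch. It is not the project's code: the cut points, the labels and the `to_class` helper are hypothetical and only show the idea of mapping a value to the interval it falls in.

```python
from bisect import bisect_right

# Hypothetical cut points and labels; the class boundaries actually
# used in the project are not stated in this README.
DRIVER_AGE_CUTS = [25, 35, 50, 65]
DRIVER_AGE_LABELS = ["18-24", "25-34", "35-49", "50-64", "65+"]

def to_class(value, cuts, labels):
    """Return the label of the interval that contains `value`.

    Intervals are closed on the left: a value equal to a cut point
    belongs to the class that starts at that cut point.
    """
    return labels[bisect_right(cuts, value)]

ages = [19, 25, 49, 70]
print([to_class(a, DRIVER_AGE_CUTS, DRIVER_AGE_LABELS) for a in ages])
```

Ordered labels keep the ordinal nature of the feature, so neighbouring classes can later be merged if their estimated effects turn out to be similar.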
Following standard actuarial practice, the pure premium is obtained by multiplying two components: the estimated claim frequency and the estimated average claim cost (severity).
Therefore, two models are estimated separately, one for the claim frequency and one for the average claim amount (severity).
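As a small worked illustration of this decomposition, the Python sketch below combines the two predictions for a single policy. The function name and the numbers are made up for the example. The product is the expected total claim cost under the usual assumption that claim counts and claim sizes are independent.

```python
def pure_premium(expected_frequency, expected_severity):
    """Expected claim cost of a policy: claims per period times
    average cost per claim."""
    return expected_frequency * expected_severity

# Made-up predictions for one policy (not results from the project):
# 0.5 expected claims per year, 200.0 average cost per claim.
print(pure_premium(0.5, 200.0))
```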
As the premium and the new tariff classes will also have to be applied to future policies, cross-validation techniques are used to select the most relevant features and the most accurate predictive models.
The analysis is structured in the following steps:
- Splitting data into training and test sets
- Feature selection using the Random Forest algorithm
- Comparison of different count data models (Poisson Regression, Negative Binomial, Zero-Inflated and Hurdle models) to predict claim frequency. The models are evaluated using the Dawid-Sebastiani scoring rule on the test sample (lower is better)
- Gamma Regression to predict claim severity
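The Dawid-Sebastiani score used in the model comparison depends only on the predictive mean and variance of each model: for an observation y it is (y - mean)^2 / variance + log(variance). The sketch below is a plain Python illustration, not the project's implementation. The toy data, the dispersion value `theta` and the function names are assumptions made for the example.

```python
import math

def dawid_sebastiani(y, mean, variance):
    """Score of one observation: squared standardized error plus the
    log of the predictive variance (lower is better)."""
    return (y - mean) ** 2 / variance + math.log(variance)

def mean_dss(observed, means, variances):
    """Average score over a test sample."""
    scores = [dawid_sebastiani(y, m, v)
              for y, m, v in zip(observed, means, variances)]
    return sum(scores) / len(scores)

# Toy test sample (made-up numbers, not results from the project).
observed = [0, 0, 1, 0, 2]
mu = [0.1, 0.2, 0.5, 0.1, 0.8]   # predicted claim counts

poisson_var = mu                  # Poisson: variance equals the mean
theta = 2.0                       # hypothetical Negative Binomial dispersion
negbin_var = [m + m ** 2 / theta for m in mu]

print("Poisson DSS:", mean_dss(observed, mu, poisson_var))
print("NegBin DSS: ", mean_dss(observed, mu, negbin_var))
```

Because the score penalizes both large standardized errors and unnecessarily wide predictive distributions, two models with the same predicted means can still be ranked by how well they describe the dispersion of the claim counts.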
Once the best models have been chosen on the test sample, they are refitted on the full dataset to calculate the pure premium. The accuracy of this prediction is then evaluated against the observed data using MAE and RMSE.
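For reference, the two error measures can be written in a few lines of Python. This is a generic sketch with made-up numbers, not the evaluation code of the project.

```python
import math

def mae(observed, predicted):
    """Mean absolute error."""
    return sum(abs(o - p) for o, p in zip(observed, predicted)) / len(observed)

def rmse(observed, predicted):
    """Root mean squared error; penalizes large errors more than MAE."""
    squared = [(o - p) ** 2 for o, p in zip(observed, predicted)]
    return math.sqrt(sum(squared) / len(squared))

# Made-up observed claim costs and predicted pure premiums.
observed = [0.0, 0.0, 0.0, 4.0]
predicted = [0.0, 0.0, 0.0, 0.0]
print(mae(observed, predicted), rmse(observed, predicted))
```

Since most policies have no claims and a few have large ones, RMSE is typically dominated by the rare large claims, while MAE gives a more even picture across the portfolio.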
The coefficients of the two chosen models, estimated on the full sample, are exponentiated, and the relativities of the risk factors are calculated from these values.
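The following Python sketch shows how relativities could be derived from such coefficients. It assumes that both models use a log link, so that each exponentiated coefficient is a multiplicative factor relative to the base level of its risk factor. The coefficient values, level names and helper functions are hypothetical.

```python
import math

# Hypothetical log-scale coefficients (assumes a log link in both models).
# The base level of each risk factor has coefficient 0, i.e. relativity 1.
freq_coefs = {"driver_age_18-24": 0.40, "driver_age_65+": -0.10}
sev_coefs = {"driver_age_18-24": 0.05, "driver_age_65+": 0.02}

def relativities(coefs):
    """Exponentiate coefficients to get multiplicative factors."""
    return {level: math.exp(beta) for level, beta in coefs.items()}

def combine(freq_rel, sev_rel):
    """Pure premium relativity per level: frequency factor times
    severity factor. A level missing from one model counts as 1."""
    levels = sorted(set(freq_rel) | set(sev_rel))
    return {lvl: freq_rel.get(lvl, 1.0) * sev_rel.get(lvl, 1.0)
            for lvl in levels}

premium_rel = combine(relativities(freq_coefs), relativities(sev_coefs))
for level, rel in premium_rel.items():
    print(f"{level}: {rel:.3f}")
```

A relativity above 1 means the class is charged more than the base class, and a value below 1 means it is charged less.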