In this repo I trained baseline classifiers for several fraud detection tasks. The first task was credit card fraud detection on the PaySim dataset. In addition, I analyzed two popular auto insurance fraud detection datasets: the 1000-sample example used in a Databricks notebook and Oracle's outlier detection example, which contains 14700 samples. The latter dataset has also been used in scientific work (DeBarr, et al.; Nian, et al.), and its features are described in some detail by Phua, et al.; Phua shared the dataset on his website.
The examples in this repo use PySpark and are prepared for processing large CSV files in standalone mode. In particular, I favored the spark.ml module because the RDD-based MLlib API (spark.mllib) is in maintenance mode and is going to be deprecated.
The Databricks dataset focuses on classifying claims, while the Oracle dataset focuses on classifying policy holders with fraudulent behavior. Together they provide an interesting combination of claim-level and policy-holder-level features.