A fraud detection model built using a random forest and a rule-based classifier. It detects fraud in financial transactions based on predictive features such as transaction amount, payer type (customer/business), receiver type (customer/business), receiver account balance, and payer account balance. The model achieves a validation accuracy of about 100% on the given validation data.
Openly available on Kaggle: Fraudulent Transactions Data
size of the dataset: 6,362,620 rows and 10 columns
The dataset is heavily imbalanced with respect to the target variable isFraud, with a ratio of about 99.1% : 0.9% between the negative (non-fraud) and positive (fraud) classes respectively.
Data Dictionary:
- step - maps a unit of time in the real world. In this case 1 step is 1 hour of time. Total steps 744 (30 days simulation). data type: int64
- type - CASH-IN, CASH-OUT, DEBIT, PAYMENT and TRANSFER. data type: object
- amount - amount of the transaction in local currency. data type: float64
- nameOrig - customer who started the transaction. data type: object
- oldbalanceOrg - initial balance before the transaction. data type: float64
- newbalanceOrig - new balance after the transaction. data type: float64
- nameDest - customer who is the recipient of the transaction. data type: object
- oldbalanceDest - initial balance of the recipient before the transaction. Note that there is no information for customers whose names start with M (Merchants). data type: float64
- newbalanceDest - new balance of the recipient after the transaction. Note that there is no information for customers whose names start with M (Merchants). data type: float64
- isFraud - marks the transactions made by the fraudulent agents inside the simulation. In this specific dataset, the fraudulent agents aim to profit by taking control of customers' accounts and trying to empty the funds by transferring them to another account and then cashing out of the system. data type: int64
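The data dictionary can be explored with a few lines of pandas. The sketch below is illustrative only: it builds a tiny made-up sample with the column names listed above (it does not load the real Kaggle file), and the derived column names `payerType` and `receiverType` are assumptions, not necessarily the names used in the project. It derives the payer and receiver types from the `C`/`M` prefix of the account names and computes the class share of `isFraud`.

```python
import pandas as pd

# Tiny synthetic sample with column names from the data dictionary.
# The values are invented for illustration; this is not the real dataset.
df = pd.DataFrame({
    "step": [1, 1, 2, 2],
    "type": ["PAYMENT", "TRANSFER", "DEBIT", "PAYMENT"],
    "amount": [9839.64, 181.00, 181.00, 1864.28],
    "nameOrig": ["C0000000001", "C0000000002", "C0000000003", "C0000000004"],
    "nameDest": ["M0000000001", "C0000000005", "C0000000006", "M0000000002"],
    "isFraud": [0, 1, 1, 0],
})

# Hypothetical derived features: names starting with "C" are customers,
# names starting with "M" are merchants (businesses).
prefix_to_type = {"C": "customer", "M": "business"}
df["payerType"] = df["nameOrig"].str[0].map(prefix_to_type)
df["receiverType"] = df["nameDest"].str[0].map(prefix_to_type)

# Fraction of rows in each class of the target variable.
class_share = df["isFraud"].value_counts(normalize=True)
print(class_share)
```

On the full dataset the same `value_counts(normalize=True)` call is a quick way to confirm how imbalanced the target is before choosing a splitting strategy.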
- Scikit-learn: a machine learning library for Python that provides simple and efficient tools for data mining and data analysis, including classification, regression, clustering, and dimensionality reduction. Used in the project for splitting the data into training and validation sets, label encoding categorical features, building and training the random forest classifier, and evaluating it using the confusion matrix and the ROC curve (with AUC).
- Pandas: a powerful data manipulation and analysis library for Python, providing data structures such as DataFrame and Series to efficiently handle and analyze large datasets. Used in the project for data cleaning, preprocessing, feature engineering, and manipulation of the transaction data to prepare it for modeling.
- Matplotlib: a comprehensive library for creating static, animated, and interactive visualizations in Python. Used in the project for plotting and visualizing the distribution of variables, trends, and relationships in the data to gain insights and communicate findings.
- Seaborn: a statistical data visualization library based on Matplotlib that provides a high-level interface for drawing attractive and informative graphics. Used in the project to visualize the confusion matrix, and to study multicollinearity between variables using pair plots, heatmaps, and box plots.
The project went through the following stages over its lifecycle:
- Data retrieval: extracting the dataset and storing it in a DataFrame.
- Data visualization and comprehension: studying the features of the dataset and their effect on the final model.
- Data wrangling and feature engineering: deriving new features from existing ones, eliminating irrelevant features, detecting outliers, and detecting missing values.
- Data modeling: splitting the dataset into training and validation sets, defining the feature set and the target variable, tuning hyperparameters, and fitting the model to the training data.
- Model evaluation: plotting the confusion matrix and the ROC curve (with AUC), and printing the classification report.
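The modeling and evaluation stages above can be sketched with scikit-learn as follows. This is a minimal illustration under stated assumptions, not the project's actual code: the data is randomly generated, the target rule is invented, and the variable names (`type_encoded`, `X_val`, and so on) are placeholders. The split is stratified on the target, which is one common way to keep the rare fraud class represented in both sets.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import (classification_report, confusion_matrix,
                             roc_auc_score, roc_curve)
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder

# Synthetic stand-in for the transaction data (illustration only).
rng = np.random.default_rng(0)
n = 400
df = pd.DataFrame({
    "type": rng.choice(["PAYMENT", "TRANSFER", "DEBIT"], size=n),
    "amount": rng.uniform(10, 500_000, size=n),
})
# Invented target: large transfers are labelled as fraud.
df["isFraud"] = ((df["type"] == "TRANSFER") & (df["amount"] > 250_000)).astype(int)

# Label-encode the categorical feature, then define features and target.
df["type_encoded"] = LabelEncoder().fit_transform(df["type"])
X = df[["type_encoded", "amount"]]
y = df["isFraud"]

# Stratified split into training and validation sets.
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Fit a random forest; the hyperparameter values here are assumptions.
model = RandomForestClassifier(n_estimators=15, criterion="entropy", random_state=0)
model.fit(X_train, y_train)

# Evaluate on the validation set.
y_pred = model.predict(X_val)
y_score = model.predict_proba(X_val)[:, 1]  # estimated probability of fraud
cm = confusion_matrix(y_val, y_pred)
fpr, tpr, _ = roc_curve(y_val, y_score)     # points of the ROC curve
auc = roc_auc_score(y_val, y_score)
report = classification_report(y_val, y_pred)
print(cm)
print(report)
```

The arrays `fpr` and `tpr` can be passed to a plotting library to draw the ROC curve, and `cm` can be rendered as a heatmap.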
- Ensemble of multiple learners: It is a bagging ensemble of 15 decision trees
- Information gain: It uses entropy (information gain) as the criterion for choosing the split at each node of each decision tree
- Rule-based bias: The model is biased to classify as fraud every transaction whose amount is greater than 200,000
- Accuracy: The model achieves a validation accuracy of about 100% on the given validation data.
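One possible reading of the characteristics above is a random forest whose predictions are combined with the amount rule by a logical OR. The sketch below shows that reading only; how the project actually combines the two is not stated here. The helper `predict_with_rule`, the constant name `AMOUNT_THRESHOLD`, and the synthetic data and labels are all assumptions made for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

AMOUNT_THRESHOLD = 200_000  # threshold taken from the rule described above


def predict_with_rule(forest, X, amounts, threshold=AMOUNT_THRESHOLD):
    """Hypothetical helper: flag fraud if the forest OR the amount rule says so."""
    forest_pred = forest.predict(X)
    rule_pred = (np.asarray(amounts) > threshold).astype(int)
    return np.maximum(forest_pred, rule_pred)


# Synthetic data (illustration only): amount and payer balance.
rng = np.random.default_rng(0)
amounts = rng.uniform(10, 400_000, size=300)
old_balance = rng.uniform(0, 400_000, size=300)
X = np.column_stack([amounts, old_balance])
# Invented labels: fraud when the transaction takes most of the payer's balance.
y = (amounts > 0.9 * old_balance).astype(int)

# Bagging ensemble of 15 decision trees that split on entropy.
forest = RandomForestClassifier(
    n_estimators=15, criterion="entropy", bootstrap=True, random_state=0
).fit(X, y)

pred = predict_with_rule(forest, X, amounts)
print(pred[:10])
```

With this design the rule can only add fraud flags, never remove them, so it trades extra false positives on large legitimate transactions for fewer missed large frauds.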