Financial-Fraud-Detection

Overview

A fraud detection model built using a random forest and a rule-based classifier. It detects fraud in financial transactions based on predictive features such as transaction amount, payer type (customer/business), receiver type (customer/business), receiver account balance and payer account balance. The model achieves a validation accuracy of about 100% on the given validation data.

About the dataset

Openly available on Kaggle: Fraudulent Transactions Data
Size of the dataset: 6,362,620 rows and 10 columns
The dataset is significantly imbalanced with respect to the target variable isFraud, with a 99.1% : 0.9% ratio of the negative (non-fraud) class to the positive (fraud) class
Data Dictionary:

step - maps a unit of time in the real world. In this case 1 step is 1 hour of time. Total steps: 744 (a one-month simulation; 744 hours is 31 days). data type: int64
type - CASH-IN, CASH-OUT, DEBIT, PAYMENT and TRANSFER. data type: object
amount - amount of the transaction in local currency. data type: float64
nameOrig - customer who started the transaction. data type: object
oldbalanceOrg - initial balance before the transaction. data type: float64
newbalanceOrig - new balance after the transaction. data type: float64
nameDest - customer who is the recipient of the transaction. data type: object
oldbalanceDest - recipient's initial balance before the transaction. Note that there is no information for customers whose names start with M (Merchants). data type: float64
newbalanceDest - recipient's new balance after the transaction. Note that there is no information for customers whose names start with M (Merchants). data type: float64
isFraud - marks the transactions made by the fraudulent agents inside the simulation. In this specific dataset the fraudulent agents aim to profit by taking control of customers' accounts and trying to empty the funds by transferring them to another account and then cashing out of the system. data type: int64
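The data dictionary above can be explored with a few lines of pandas. The following is a minimal sketch, not the notebook's actual code: the `sample` rows are invented for illustration, the helper names `add_party_types` and `class_shares` are hypothetical, and treating every name that does not start with M as a customer is an assumption.

```python
import pandas as pd

# Tiny hand-made sample that mimics the column layout described above.
# The values are invented for illustration; they are not rows from the Kaggle file.
sample = pd.DataFrame(
    {
        "step": [1, 1, 1, 2],
        "type": ["PAYMENT", "TRANSFER", "CASH-OUT", "PAYMENT"],
        "amount": [9839.64, 181.00, 181.00, 1864.28],
        "nameOrig": ["C100", "C200", "C300", "C400"],
        "oldbalanceOrg": [170136.0, 181.0, 181.0, 21249.0],
        "newbalanceOrig": [160296.36, 0.0, 0.0, 19384.72],
        "nameDest": ["M500", "C600", "C700", "M800"],
        "oldbalanceDest": [0.0, 0.0, 21182.0, 0.0],
        "newbalanceDest": [0.0, 0.0, 0.0, 0.0],
        "isFraud": [0, 1, 1, 0],
    }
)


def add_party_types(df):
    """Derive payer/receiver type from the name prefix.

    Assumption: names starting with "M" are merchants, everything else
    is treated as a customer.
    """
    out = df.copy()
    labels = {True: "merchant", False: "customer"}
    out["origType"] = out["nameOrig"].str.startswith("M").map(labels)
    out["destType"] = out["nameDest"].str.startswith("M").map(labels)
    return out


def class_shares(df, target="isFraud"):
    """Return the share of each class in the target column."""
    return df[target].value_counts(normalize=True).sort_index()


typed = add_party_types(sample)
print(typed[["nameOrig", "origType", "nameDest", "destType"]])
print(class_shares(sample))
```

On the real dataset, `class_shares` is the kind of check that would reveal the class imbalance described above.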

Libraries used:

Scikit-Learn: A machine learning library for Python that provides simple and efficient tools for data mining and data analysis, including classification, regression, clustering, and dimensionality reduction. Used in the project for splitting the data into training and validation sets, label encoding categorical features, building and training the random forest classifier, and evaluating it using the confusion matrix and the ROC-AUC curve.
Pandas: A powerful data manipulation and analysis library for Python, providing data structures such as DataFrame and Series to efficiently handle and analyze large datasets. Used in the project for data cleaning, preprocessing, feature engineering, and manipulation of the transaction data to prepare it for modeling.
Matplotlib: A comprehensive library for creating static, animated, and interactive visualizations in Python. Used in the project for plotting and visualizing the distribution of variables, trends, and relationships in the data to gain insights and communicate findings.
Seaborn: A statistical data visualization library based on Matplotlib that provides a high-level interface for drawing attractive and informative graphics. Used in the project to visualize the confusion matrix, and study multicollinearity between variables using pair plots, heatmaps, and box plots.

Methodology:

The project went through the following stages:

Data retrieval: extracting the dataset and storing it in a dataframe
Data visualization and comprehension: includes study of various features of the data set and their effect on the final model.
Data Wrangling and Feature engineering: includes extracting features from existing features, eliminating irrelevant features, detecting outliers and detecting missing values
Data Modeling: involves splitting the dataset into training and validation sets, defining the feature set and the target variable, tuning hyperparameters and fitting the model to the training data.
Model Evaluation: involves plotting the confusion matrix and the ROC-AUC curve, and printing the classification report.
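The wrangling and splitting stages above might look roughly like the sketch below. It runs on synthetic data generated on the spot, and the choice of columns, the 80/20 split, the stratification and the `random_state` values are all assumptions for illustration rather than settings taken from the notebook.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder

# Synthetic stand-in for the transaction data (not the Kaggle dataset).
rng = np.random.default_rng(0)
n = 1000
is_fraud = np.zeros(n, dtype=int)
is_fraud[:50] = 1  # 5% positives, chosen only to keep the example small

frame = pd.DataFrame(
    {
        "type": rng.choice(
            ["CASH-IN", "CASH-OUT", "DEBIT", "PAYMENT", "TRANSFER"], size=n
        ),
        "amount": rng.uniform(1, 500_000, size=n),
        "oldbalanceOrg": rng.uniform(0, 1_000_000, size=n),
        "isFraud": is_fraud,
    }
)

# Label-encode the categorical transaction type into integer codes.
encoder = LabelEncoder()
frame["type_encoded"] = encoder.fit_transform(frame["type"])

# Define the feature set and the target variable.
X = frame[["type_encoded", "amount", "oldbalanceOrg"]]
y = frame["isFraud"]

# Stratify so the rare fraud class keeps the same share in both sets.
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

print(X_train.shape, X_val.shape)
print("fraud share (train):", y_train.mean())
print("fraud share (validation):", y_val.mean())
```

Stratifying on the target is one common way to keep a heavily imbalanced class represented in the validation set; whether the project does this is not stated.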

Features of the model:

Ensemble of multiple learners: It is a bagging ensemble of 15 decision trees
Information gain: It uses entropy/information gain as the heuristic for splitting at each node of each decision tree
Rule-based bias: The model is biased to classify as fraud every transaction whose amount is > 200,000
Accuracy: The model achieves a validation accuracy of 100% on the given validation data.
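The features listed above can be sketched with scikit-learn as follows. This is a hedged illustration on synthetic data: the source does not say how the amount rule is combined with the forest, so the override in the hypothetical helper `predict_with_amount_rule` (flag as fraud whenever the rule fires, otherwise use the forest's prediction) is an assumption, as are the constant `AMOUNT_THRESHOLD`, the two-column feature layout and the synthetic label.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

AMOUNT_THRESHOLD = 200_000  # threshold quoted in the feature list above


def predict_with_amount_rule(forest, X, amount_col=0):
    """Combine forest predictions with the amount rule.

    Assumption: a transaction is flagged as fraud if the rule fires,
    regardless of what the forest predicts.
    """
    X = np.asarray(X, dtype=float)
    forest_pred = forest.predict(X)
    rule_fires = X[:, amount_col] > AMOUNT_THRESHOLD
    return np.where(rule_fires, 1, forest_pred)


# Synthetic data: column 0 is the amount, column 1 a payer balance.
rng = np.random.default_rng(0)
n = 500
amount = rng.uniform(1, 400_000, size=n)
balance = rng.uniform(0, 1_000_000, size=n)
X = np.column_stack([amount, balance])
y = (amount > 300_000).astype(int)  # invented label, for illustration only

# Bagging ensemble of 15 trees, split on entropy (information gain).
forest = RandomForestClassifier(
    n_estimators=15, criterion="entropy", random_state=0
)
forest.fit(X, y)

preds = predict_with_amount_rule(forest, X)
print("trees in the ensemble:", len(forest.estimators_))
print("flagged as fraud:", int(preds.sum()), "of", n)
```

A rule of this kind trades precision for recall: every large transaction is flagged, including legitimate ones that the forest alone would have passed.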

Evaluating Model Performance:

Classification Report:
Confusion matrix:
ROC-AUC Curve:
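The three evaluation outputs named above can be produced with `sklearn.metrics`. The sketch below uses a small hand-made set of labels and scores, not the project's results; the 0.5 decision threshold and the class names are assumptions.

```python
import numpy as np
from sklearn.metrics import (
    classification_report,
    confusion_matrix,
    roc_auc_score,
    roc_curve,
)

# Invented ground truth and predicted fraud probabilities.
y_true = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])
y_score = np.array([0.05, 0.10, 0.20, 0.30, 0.60, 0.40, 0.35, 0.70, 0.80, 0.90])
y_pred = (y_score >= 0.5).astype(int)  # assumed decision threshold

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_true, y_pred)
print(cm)

# Per-class precision, recall and F1.
print(classification_report(y_true, y_pred, target_names=["not fraud", "fraud"]))

# Area under the ROC curve, computed from scores rather than hard labels.
auc = roc_auc_score(y_true, y_score)
print("ROC AUC:", round(auc, 3))

# Points of the ROC curve, ready to be plotted with Matplotlib.
fpr, tpr, thresholds = roc_curve(y_true, y_score)
```

On a dataset as imbalanced as this one, per-class precision and recall for the fraud class are usually more informative than overall accuracy.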

ISHOOO/Financial-Fraud-Detection
