An exploration of anomaly detection for insider risk implemented by KDE, Minhash, and K-Means, based on PySpark and Colab.

waittim/Insider-Risk-in-PySpark

Insider Risk Detection in PySpark

Introduction

This repo explores anomaly detection for insider risk using Kernel Density Estimation (KDE), MinHash, and K-Means. The implementation is based on PySpark 3.1.1 and Google Colab.

We use KDE for probability-based risk estimation on numerical features, and MinHash with K-Means to detect anomalous email content.
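To make the KDE idea concrete, here is a minimal sketch in plain Python rather than PySpark. It is not the code from the notebooks: the feature (after-hours logons per day), the bandwidth value, and the function names `gaussian_kde` and `risk_score` are all assumptions chosen for illustration. It fits a Gaussian-kernel density to a user's history and scores a new value by its negative log-density, so rarer values get higher scores.

```python
import math


def gaussian_kde(samples, bandwidth):
    """Return a density function estimated from 1-D samples with a Gaussian kernel."""
    if not samples or bandwidth <= 0:
        raise ValueError("need at least one sample and a positive bandwidth")
    n = len(samples)
    norm = 1.0 / (n * bandwidth * math.sqrt(2.0 * math.pi))

    def density(x):
        # Sum one Gaussian bump per historical observation.
        return norm * sum(
            math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in samples
        )

    return density


def risk_score(density, x, eps=1e-12):
    """Negative log-density: higher means less likely under the estimate."""
    return -math.log(density(x) + eps)


# Hypothetical feature: after-hours logons per day for one user.
history = [0, 1, 0, 2, 1, 0, 1, 1, 0, 2]
density = gaussian_kde(history, bandwidth=0.75)  # bandwidth is an arbitrary choice

typical = risk_score(density, 1)
unusual = risk_score(density, 9)
print(f"typical={typical:.2f} unusual={unusual:.2f}")
```

The bandwidth controls how smooth the estimate is, so in practice it would need tuning per feature.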

Dataset

The Insider Threat Test Dataset, provided by the CERT Division, is a collection of synthetic insider threat test datasets that include both background and malicious-actor data. It covers 1,000 users over 17 months.

For more background on this data, please see the paper, Bridging the Gap: A Pragmatic Approach to Generating Insider Threat Data.

Usage

  1. Download the dataset from CMU KiltHub and unzip it. Then put the CSV files into the folder ./data/.
  2. For the KDE-based method, open KDE_risk.ipynb and follow the instructions inside.
  3. For the MinHash and K-Means based method, open Kmeans_email.ipynb and follow the instructions inside.
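As a rough illustration of the MinHash part of the email method in step 3, the sketch below estimates the Jaccard similarity between the word sets of two emails from their MinHash signatures. It is plain Python under stated assumptions, not the notebook's PySpark code: the helper names, the seeded SHA-1 hashing scheme, the signature length of 128, and the example emails are all made up for illustration. The K-Means clustering step is not shown.

```python
import hashlib


def word_set(text):
    """Lower-cased set of whitespace-separated words."""
    return set(text.lower().split())


def seeded_hash(item, seed):
    """Deterministic 64-bit hash of an item under a given seed."""
    data = f"{seed}:{item}".encode("utf-8")
    return int.from_bytes(hashlib.sha1(data).digest()[:8], "big")


def minhash_signature(items, num_hashes=128):
    """One minimum per seeded hash function over all items."""
    if not items:
        raise ValueError("cannot build a signature for an empty set")
    return [min(seeded_hash(it, seed) for it in items) for seed in range(num_hashes)]


def estimated_jaccard(sig_a, sig_b):
    """Fraction of signature positions that agree."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


# Made-up email bodies, for illustration only.
email_a = "please review the quarterly budget report before friday"
email_b = "please review the quarterly budget report before monday"
email_c = "free tickets win a prize now"

sig_a = minhash_signature(word_set(email_a))
sig_b = minhash_signature(word_set(email_b))
sig_c = minhash_signature(word_set(email_c))

print("a vs b:", estimated_jaccard(sig_a, sig_b))
print("a vs c:", estimated_jaccard(sig_a, sig_c))
```

Signatures are short fixed-length vectors, so they are cheaper to compare than the raw word sets.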

Others

Because of Colab's limitations, we cannot call the customized Spark backend. As a result, the notebook email_IF.ipynb, which tries to apply the Isolation Forest algorithm, does not work yet.

If you have any ideas, please open an issue. Thank you!
