Merge pull request #9 from MaiRajborirug/master
Extension of Outlier Detection Comparison in 3-D dataset
bdpedigo authored Dec 19, 2019
2 parents 34dddd1 + 7f5a786 commit 9ce5e37
Showing 2 changed files with 841 additions and 0 deletions.
457 changes: 457 additions & 0 deletions Plot_anomaly_comparison_3D.ipynb

Large diffs are not rendered by default.

384 changes: 384 additions & 0 deletions plot_anomaly_comparison-3D.py
@@ -0,0 +1,384 @@
"""
============================================================================
Comparing anomaly detection algorithms for outlier detection on 3D toy datasets
============================================================================
Brief overview
---------------------------------
This example extends the scikit-learn comparison of anomaly detection
algorithms on 2D toy datasets. The experiment applies four algorithms
to five 3D toy datasets. All four algorithms come from ``sklearn``:
``sklearn.covariance.EllipticEnvelope``,
``sklearn.svm.OneClassSVM``, ``sklearn.ensemble.IsolationForest``,
and ``sklearn.neighbors.LocalOutlierFactor``.
The outlier detection performance is measured using
accuracy from ``sklearn.metrics.accuracy_score``
and the AUC score from ``sklearn.metrics.roc_auc_score``.

Data simulation
---------------------------------
Here are the default parameters of the datasets,
whose values the user can adjust:

* ``n_samples = 500`` - the number of samples
* ``outliers_fraction = 0.15`` - the ratio of outliers to all samples.
  In this experiment, the outliers are uniformly distributed.
* ``d_noise = 10`` - the number of Gaussian noise dimensions.
  Modern brain imaging techniques apply functional data analysis (FDA),
  which views curves and hypersurfaces as infinite-dimensional variables.
  However, most of the variables do not contain information,
  and we refer to them as ``noise dimensions``.
* ``line_sd = 0.06`` - the standard deviation (sd) of
  the linear and helix datasets
* ``cluster_sd = 0.2`` - the sd of the two Gaussian mixture datasets
* ``noise_sd = 2`` - the sd of each variable in the noise dimensions

The data have three informative dimensions, following Figure 1 in the paper:

* Madhyastha, Meghana, et al. "Geodesic Learning via Unsupervised
  Decision Forests." arXiv preprint arXiv:1907.02844 (2019).

There are five datasets:

* Linear
* Spiral helix
* Sphere shell
* Three aligned Gaussian mixtures
* Three misaligned Gaussian mixtures
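
Each dataset is later combined with uniformly distributed outliers and the
Gaussian noise dimensions. The following condensed sketch shows the
construction for the linear dataset (the script itself draws the noise with
``make_blobs``; the plain ``numpy`` sampling here is an equivalent shortcut)::

    rng = np.random.RandomState(42)
    X = fun_linear(samples=n_inliers, sd=line_sd)  # 3D inliers
    X = np.concatenate(
        [X, rng.uniform(low=-1.5, high=1.5, size=(n_outliers, 3))], axis=0
    )  # append uniform outliers
    X_noise = noise_sd * rng.randn(n_samples, d_noise)  # Gaussian noise dims
    X = np.append(X, X_noise, axis=1)  # shape: (n_samples, 3 + d_noise)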

Define algorithms
---------------------------------
There are four outlier detection algorithms in this experiment:

1. Robust covariance, ``sklearn.covariance.EllipticEnvelope``
2. One-Class SVM, ``sklearn.svm.OneClassSVM``; we use ``gamma='scale'``
   because it gives better predictions in this experiment
3. Isolation Forest, ``sklearn.ensemble.IsolationForest``
4. Local Outlier Factor, ``sklearn.neighbors.LocalOutlierFactor``
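
All four estimators are constructed so that their expected outlier ratio
matches ``outliers_fraction`` (via ``contamination`` for three of them and
``nu`` for the One-Class SVM), as in this abridged excerpt of the script
below::

    EllipticEnvelope(contamination=outliers_fraction)
    svm.OneClassSVM(nu=outliers_fraction, kernel="rbf", gamma="scale")
    IsolationForest(contamination=outliers_fraction, random_state=42)
    LocalOutlierFactor(n_neighbors=35, contamination=outliers_fraction,
                       novelty=True)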

Generate 3D plot
---------------------------------
Even though the data ``X`` have ``3 + d_noise`` dimensions, we plot only
the three informative dimensions in the 3D scatter plots.
The noise dimensions are included in the data passed to the algorithms.

Performance measurement
---------------------------------
Each algorithm predicts the outliers and inliers after fitting, via
``fit(X)`` followed by ``predict(X)``. There are two main performance
measurements in this example:

* ``sklearn.metrics.accuracy_score``: measures the agreement between
  ``y_true`` and ``y_pred``.
* ``sklearn.metrics.roc_auc_score``: measures the area under the
  receiver operating characteristic (ROC) curve.
* ``sklearn.metrics.roc_curve``: gives the shape of the ROC curve and the
  thresholds. Samples whose decision-function values lie above a threshold
  are classified as inliers, and those below it as outliers.
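
In code, the measurements reduce to the following calls (a minimal sketch
mirroring the loop below, where ``algorithm`` is any of the four
estimators)::

    algorithm.fit(X)
    y_pred = algorithm.predict(X)            # +1 = inlier, -1 = outlier
    acc = accuracy_score(y_true, y_pred)
    scores = algorithm.decision_function(X)  # larger = more inlier-like
    AUC = roc_auc_score(y_true, scores)
    fpr, tpr, thresholds = roc_curve(y_true, scores)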

Results and discussion
---------------------------------
The figures show the outlier detection performance and visualization.
Each row represents a different dataset. The first four columns compare
the algorithms, annotating each subplot with the computation time
(``.__s``) and the outlier prediction accuracy (``acc``).
The number and name of each outlier detection algorithm appear at
the top of each column. The last column plots the ROC curves of all four
algorithms and compares their AUC scores. The number labeling each AUC
score matches the number in front of the algorithm name. The ``x`` markers
on the ROC curves indicate the thresholds where the algorithms start to
classify data as outliers.
From the plots, ``sklearn.covariance.EllipticEnvelope`` shows the best
results under high-dimensional noise (``d_noise = 10``).
However, since robust covariance fits an elliptical envelope around the
inliers, more tests are needed on inlier data that are not elliptically
shaped.
"""

# Graphing and calculation packages
import time

import matplotlib.pyplot as plt
import numpy as np
from mpl_toolkits.mplot3d import Axes3D  # noqa: F401  # registers the "3d" projection
from sklearn.datasets import make_blobs

# Algorithm packages
from sklearn import svm
from sklearn.covariance import EllipticEnvelope
from sklearn.ensemble import IsolationForest
from sklearn.neighbors import LocalOutlierFactor

# Performance measurement packages
from sklearn.metrics import accuracy_score
from sklearn.metrics import roc_auc_score
from sklearn.metrics import roc_curve

# Data parameters
d_noise = 10  # number of Gaussian noise dimensions
n_samples = 500
outliers_fraction = 0.15
n_outliers = int(outliers_fraction * n_samples)  # 75 with the defaults
n_inliers = n_samples - n_outliers  # 425 with the defaults
line_sd = 0.06  # sd of the line in datasets 1 and 2
cluster_sd = 0.2  # sd of the clusters in datasets 4 and 5
noise_sd = 2  # sd of each variable in the noise dimensions

# Define datasets
## 1: Linear
def fun_linear(samples=1, sd=0.0):
    t_lin = np.transpose(np.linspace(-1, 1, samples))
    X_lin = np.c_[
        0.4 * t_lin + sd * np.random.randn(samples),
        0.6 * t_lin + sd * np.random.randn(samples),
        t_lin + sd * np.random.randn(samples),
    ]
    return X_lin


X_lin = fun_linear(samples=n_inliers, sd=line_sd)

## 2: Helix
def fun_helix(samples=1, sd=0.0):
    t_hex = np.transpose(np.linspace(2 * np.pi, 9 * np.pi, samples))
    xline = t_hex * np.cos(t_hex)  # before rescale
    xline = xline / (max(xline) - min(xline)) * 2 + sd * np.random.randn(samples)
    yline = t_hex * np.sin(t_hex)  # before rescale
    yline = yline / (max(yline) - min(yline)) * 2 + sd * np.random.randn(samples)
    zline = (t_hex - (max(t_hex) + min(t_hex)) / 2) / (
        max(t_hex) - min(t_hex)
    ) * 2 + sd * np.random.randn(samples)
    X_hex = np.c_[xline, yline, zline]
    return X_hex


X_hex = fun_helix(samples=n_inliers, sd=line_sd)

## 3: Sphere shell, evenly distributed points
def fibonacci_sphere(samples=1, randomize=True):
    # place points at golden-angle increments, which spreads them
    # nearly uniformly over the unit sphere
    rnd = 1.0
    if randomize:
        rnd = np.random.random() * samples
    points = []
    offset = 2.0 / samples
    increment = np.pi * (3.0 - np.sqrt(5.0))  # golden angle
    for i in range(samples):
        y = ((i * offset) - 1) + (offset / 2)
        r = np.sqrt(1 - pow(y, 2))
        phi = ((i + rnd) % samples) * increment
        x = np.cos(phi) * r
        z = np.sin(phi) * r
        points.append([x, y, z])
    return points


X_sph = np.array(fibonacci_sphere(samples=n_inliers))
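# Optional sanity check: by construction x**2 + z**2 = 1 - y**2 for every
# point, so all sampled points lie exactly on the unit sphere.
assert np.allclose(np.linalg.norm(X_sph, axis=1), 1.0)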

## 4: Three aligned Gaussian mixtures
def gaussian_blobs(samples=3, sd=0.0):
    # three Gaussian clusters centered along the main diagonal
    blobs_params = dict(random_state=0, n_samples=samples, n_features=3)
    X_gau = make_blobs(
        centers=[[-0.7, -0.7, -0.7], [0, 0, 0], [0.7, 0.7, 0.7]],
        cluster_std=[sd, sd, sd],
        **blobs_params
    )[0]
    return X_gau


X_gau = gaussian_blobs(samples=n_inliers, sd=cluster_sd)

## 5: Three misaligned Gaussian mixtures
def misaligned_blobs(samples=3, sd=0.0):
    # three Gaussian clusters placed off the main diagonal
    blobs_params = dict(random_state=0, n_samples=samples, n_features=3)
    X_misaligned = make_blobs(
        centers=[[-0.7, -0.7, -0.7], [0.7, 0.7, -0.7], [-0.7, 0.7, 0.7]],
        cluster_std=[sd, sd, sd],
        **blobs_params
    )[0]
    return X_misaligned


X_misaligned = misaligned_blobs(samples=n_inliers, sd=cluster_sd)

## 6: Collect the whole dataset list
datasets3D = [X_lin, X_hex, X_sph, X_gau, X_misaligned]
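# Optional shape check: at this stage each dataset holds n_inliers points
# in the three informative dimensions only.
for X_data in datasets3D:
    assert np.asarray(X_data).shape == (n_inliers, 3)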

## define the true labels y_true: 1 for inliers, -1 for outliers
y_true = np.concatenate([np.ones(n_inliers), -np.ones(n_outliers)], axis=0)

# Define the algorithms to be compared
anomaly_algorithms = [
    ("Robust covariance", EllipticEnvelope(contamination=outliers_fraction)),
    (
        "One-Class SVM",
        svm.OneClassSVM(nu=outliers_fraction, kernel="rbf", gamma="scale"),
    ),
    (
        "Isolation Forest",
        IsolationForest(
            n_estimators=500,
            behaviour="new",
            contamination=outliers_fraction,
            random_state=42,
        ),
    ),
    (
        "Local Outlier Factor",
        # novelty=True gives LOF the same predict/decision_function API
        # as the other three estimators
        LocalOutlierFactor(
            n_neighbors=35, contamination=outliers_fraction, novelty=True
        ),
    ),
]

# Plot
plt.figure(
    figsize=((len(anomaly_algorithms) + 1) * 2.5 + 1, len(datasets3D) * 2.5 + 1)
)
plt.subplots_adjust(
    left=0.02, right=0.98, bottom=0.001, top=0.98, wspace=0.05, hspace=0.01
)
plot_num = 1
rng = np.random.RandomState(42)
for i_dataset3D, X in enumerate(datasets3D):
    # add uniformly distributed outliers
    X = np.concatenate(
        [X, rng.uniform(low=-1.5, high=1.5, size=(n_outliers, 3))], axis=0
    )
    # add Gaussian noise dimensions, centered at the origin
    X_noise = make_blobs(
        n_samples=n_samples,
        centers=1,
        n_features=d_noise,
        random_state=0,
        cluster_std=noise_sd,
        center_box=(0.0, 0.0),
    )[0]
    X = np.append(X, X_noise, axis=1)

    # lists of AUC and ROC results, one entry per algorithm
    list_AUC = []
    list_fpr = []
    list_tpr = []
    list_thresh = []

    algo_index = 0
    for name, algorithm in anomaly_algorithms:
        t0 = time.time()

        # fit the data once, then tag outliers
        algorithm.fit(X)
        y_pred = algorithm.predict(X)

        # outlier detection performance
        acc = accuracy_score(y_true, y_pred)
        probas_ = algorithm.decision_function(X)
        AUC = roc_auc_score(y_true, probas_)
        fpr, tpr, thresholds = roc_curve(y_true, probas_)
        # index of the threshold closest to 0, the decision boundary of predict()
        thresh_index = np.argmin(np.abs(thresholds))

        # store the measurements for the plot
        list_AUC.append(AUC)
        list_fpr.append(fpr)
        list_tpr.append(tpr)
        list_thresh.append(thresh_index)

        t1 = time.time()

        # add data plot
        ax = plt.subplot(
            len(datasets3D), len(anomaly_algorithms) + 1, plot_num, projection="3d"
        )
        ax.axis("on")
        if i_dataset3D == 0:
            # use the algorithm's number and name as the column title
            plt.title(
                str(algo_index + 1) + ". " + name, size=15, color="black", weight="bold"
            )
        colors = np.array(["#377eb8", "#ff7f00"])
        # color the points (blue = outlier, orange = inlier)
        ax.scatter3D(X[:, 0], X[:, 1], X[:, 2], s=5, color=colors[((y_pred + 1) // 2)])
        ax.text2D(
            0.01,
            0.85,
            ("acc %.3f" % acc).lstrip("0"),
            transform=ax.transAxes,
            size=15,
            horizontalalignment="left",
        )
        ax.text2D(
            0.99,
            0.01,
            ("%.2fs" % (t1 - t0)).lstrip("0"),
            transform=ax.transAxes,
            size=15,
            horizontalalignment="right",
        )
        ax.set_xticks([])
        ax.set_yticks([])
        ax.zaxis.set_ticklabels([])
        algo_index += 1
        plot_num += 1
    # add ROC plot
    ax = plt.subplot(len(datasets3D), len(anomaly_algorithms) + 1, plot_num)
    if i_dataset3D == 0:
        plt.title("ROC", size=15, color="black", weight="bold")
    for algo_index in range(len(anomaly_algorithms)):
        # plot the ROC curve and threshold point;
        # the branching below keeps the legend entries clean
        if i_dataset3D == 0:
            ax.plot(
                list_fpr[algo_index],
                list_tpr[algo_index],
                label="algo "
                + str(algo_index + 1)
                + (" AUC %.2f" % list_AUC[algo_index]).lstrip("0"),
            )
            if algo_index == 0:
                ax.scatter(
                    list_fpr[algo_index][list_thresh[algo_index]],
                    list_tpr[algo_index][list_thresh[algo_index]],
                    s=40,
                    edgecolor="yellow",
                    marker="x",
                    color="black",
                    label="thresholds",
                )
            else:
                ax.scatter(
                    list_fpr[algo_index][list_thresh[algo_index]],
                    list_tpr[algo_index][list_thresh[algo_index]],
                    s=40,
                    edgecolor="yellow",
                    marker="x",
                    color="black",
                )
        else:
            ax.plot(
                list_fpr[algo_index],
                list_tpr[algo_index],
                label=str(algo_index + 1)
                + (" AUC %.2f" % list_AUC[algo_index]).lstrip("0"),
            )
            ax.scatter(
                list_fpr[algo_index][list_thresh[algo_index]],
                list_tpr[algo_index][list_thresh[algo_index]],
                s=40,
                edgecolor="yellow",
                marker="x",
                color="black",
            )
    # chance-level diagonal for reference
    ax.plot(np.array([0, 1]), np.array([0, 1]), linestyle="--", color="black")
    # show the legend
    ax.legend()
    ax.xaxis.set_ticklabels([])
    ax.yaxis.set_ticklabels([])
    plot_num += 1

print("d_noise =", d_noise)
plt.show()
