home/modeling/application/python/anomaly detection

Anomaly Detection in Python

Anomaly Detection helps in identifying outliers in a dataset. Various Anomaly Detection techniques have been explored in the theoretical blog Anomaly Detection. In this blog post, we will explore two ways of anomaly detection - One Class SVM and Isolation Forest.

One Class SVM

One Class SVM i.e. One-Class Support Vector Machine is an unsupervised algorithm that learns a decision function to identify outliers. We will be using the Iris dataset which we used for performing clustering.

Adding Libraries

We add some preliminary libraries that will be useful throughout.

python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

Dataset

Iris dataset is now loaded to detect anomalies in it.

python
from sklearn import datasets
iris = datasets.load_iris()
iris_data = pd.DataFrame(iris.data)
iris_data.columns = iris.feature_names
iris_data['Type']=iris.target
iris_data.head()
iris_data.head() output

The above output is the complete iris dataset. We will be only requiring the independent variables to identify the outliers.

python
iris_X = iris_data[['sepal length (cm)','sepal width (cm)','petal length (cm)','petal width (cm)']]

Importing OneClassSVM Library

We import OneClassSVM from sklearn.svm which will allow us to create a One-Class Support Vector Machine model.

Building and Fitting Model

In the following code, we first declare the value of nu which defines the upper bound of the fraction of outliers. We then initialize the model and fit it on the iris dataset.

python
from sklearn.svm import OneClassSVM
nu = 0.05
ocsvm = OneClassSVM(kernel='rbf', gamma=0.05, nu=nu)
ocsvm.fit(iris_X)
OneClassSVM model fit output

Defining Outliers

We now use the above-created model to identify the outliers in the dataset.

python
pred_ocsvm = ocsvm.predict(iris_X)
pred_ocsvm
pred_ocsvm array output showing 1 and -1 values

All values that are equal to -1 are outliers.

Outlier Observations

We can create a dataset only having the outlier observations.

python
X_outliers = iris_X[ocsvm.predict(iris_X) == -1]
X_outliers
Outlier observations identified by OneClassSVM

The following observations are considered as outliers by the One-Class SVM.

Isolation Forest

Isolation Forest is an effective and more efficient means of detecting anomalies in a dataset. It isolates observations by randomly selecting a feature and randomly selecting split values between maximum and minimum values of the selected feature. This repeated partitioning can be represented as trees, and hence comes the concept of random decision trees. We will fit the model on the iris dataset and predict the outliers.

Importing IsolationForest Library

We import IsolationForest from sklearn.ensemble which will allow us to create an Isolation Forest model.

Building and Fitting Model

We now build an Isolation Forest model and fit it on the Iris dataset.

python
from sklearn.ensemble import IsolationForest
iforest = IsolationForest(n_estimators=300, contamination=0.10)
iforest = iforest.fit(iris_X)

Defining Outliers

We now use the above-created model to identify the outliers in the dataset.

python
pred_isoF = iforest.predict(iris_X)
pred_isoF
pred_isoF array output showing 1 and -1 values

Here values equal to -1 are the outliers.

Outlier Observations

We create a dataset only having the outlier observations.

python
isoF_outliers = iris_X[iforest.predict(iris_X) == -1]
isoF_outliers
Outlier observations identified by Isolation Forest

The above observations are termed as outliers by our Isolation Forest model.

In this blog post, we used Python to create models that help us in identifying anomalies in the data in an unsupervised environment. We have created the same models using R and this has been shown on the blog - Anomaly Detection in R.

ESC
100 pages indexed · Esc to close