Anomaly Detection in R

Anomaly detection identifies outliers in a dataset. Various anomaly detection techniques have been explored in the theoretical blog Anomaly Detection. In this blog post, we explore two of them in R: Kernel Density Estimation and One-Class SVM.

Kernel Density Estimation

Kernel Density Estimation estimates the probability density of the data; values that fall in regions of low estimated density can be treated as outliers/anomalies. We compute the kernel density estimates and then decide on a benchmark for outliers. This has been discussed in detail in the theoretical blog Anomaly Detection.

Importing Dataset

To demonstrate Kernel Density Estimation (KDE) in R, we use the Iris dataset, whose dependent variable has three classes (three types of Iris flowers).

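library(MASS)      # loaded here as in the original workflow; iris and density() ship with base R
iris_data <- iris  # built-in dataset: 150 rows, 4 numeric columns plus Species
View(iris_data)    # opens the data viewer in an interactive session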

Converting Dataset

We now convert the four independent variables of the iris dataset to a matrix so that we can run the density function on them.

iris_mat <- as.matrix(iris_data[,1:4])

Running Density Function

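We use the density function to estimate the density of the values in the matrix. Note that density is a univariate estimator: it treats the matrix as one pooled vector of values and, by default, returns the estimate at 512 equally spaced grid points rather than one value per observation.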

kde <- density(iris_mat)

The densities are given by the y component of kde. We convert them to a data frame.

densities <- as.data.frame(kde$y)
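As a quick check on what density returns, the following minimal sketch inspects the fitted object. It assumes only base R and the built-in iris data; the names n_values, n_grid and bandwidth are illustrative and not part of the original code.

```r
# Inspect the object returned by density() on the pooled iris measurements.
iris_mat <- as.matrix(iris[, 1:4])
kde <- density(iris_mat)

n_values  <- kde$n          # number of values the estimate was built from
n_grid    <- length(kde$y)  # number of grid points where the density is evaluated
bandwidth <- kde$bw         # bandwidth chosen by the default rule (bw.nrd0)

c(values = n_values, grid_points = n_grid)
```

With the default settings the estimate is evaluated on a grid of 512 points, so kde$y does not line up one-to-one with the 150 rows of iris.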

Computing the Benchmark

We now compute the benchmark for outlier detection. Here the benchmark is the ratio of the minimum density to the mean density, and entries whose density falls below it are flagged.

min_density <- min(densities$`kde$y`)
mean_density <- mean(densities$`kde$y`)
bench <- min_density/mean_density

We mark the outliers using the above benchmark.

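# Assumed completion of a truncated line: 1 = density below the benchmark, 0 = otherwise
densities$outlier <- as.factor(ifelse(densities$`kde$y` < bench, 1, 0))
densities1 <- densities  # copy under the name used in the steps below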

levels(densities1$outlier) output - "0" "1"

Summary of the Outliers

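Using the summary function, we count the entries in category 1. These are the entries flagged as outliers. Note that the counts add up to 512, the number of grid points at which density evaluates the estimate, not to the 150 observations of the dataset.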

summary(densities1$outlier)

summary(densities1$outlier) output - 0:491, 1:21
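To obtain one density score per observation instead of per grid point, a different estimator is needed. The following is a minimal sketch of one alternative, not the method used above: a Gaussian product-kernel estimate written in base R, with one rule-of-thumb bandwidth per column. The 5% cutoff and the names obs_density, cutoff and obs_outlier are assumptions made for illustration.

```r
# Sketch: per-observation density scores with a Gaussian product kernel (base R only).
X <- as.matrix(iris[, 1:4])
n <- nrow(X)

# One bandwidth per column, using the rule of thumb that density() uses by default.
h <- apply(X, 2, bw.nrd0)

# Density at observation i: average, over all observations j, of the product
# across columns of univariate Gaussian kernels centred at observation j.
obs_density <- sapply(seq_len(n), function(i) {
  k <- sapply(seq_len(ncol(X)), function(d) dnorm(X[i, d], mean = X[, d], sd = h[d]))
  mean(apply(k, 1, prod))
})

# Hypothetical rule: flag the observations whose density is below the 5% quantile.
cutoff <- quantile(obs_density, 0.05)
obs_outlier <- obs_density < cutoff
table(obs_outlier)
```

Because obs_outlier has one entry per row, iris[obs_outlier, ] returns the flagged flowers directly.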

One Class SVM

One-Class SVM, i.e. One-Class Support Vector Machine, is an unsupervised algorithm that learns a decision function to identify outliers. We will again use the Iris dataset loaded above.

Adding e1071 Library

We load the e1071 library, which provides the svm function.

library(e1071)

Separating Datasets

We take the four independent variables of the iris dataset.

iris_X <- iris_data[,1:4]

Building and Fitting Model

In the following code, we set nu, which is an upper bound on the fraction of observations treated as outliers (and a lower bound on the fraction of support vectors). We then fit a one-class model with a radial kernel on the iris dataset.

model_oneclasssvm <- svm(iris_X,type='one-classification',kernel = "radial",
                         gamma = 0.05, nu = 0.05)
model_oneclasssvm

model_oneclasssvm output - radial kernel, gamma 0.05, nu 0.05, 9 support vectors
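To see how nu influences the result, the following sketch refits the same model for a few values of nu and counts the flagged rows. It assumes e1071 is installed; nu_grid and flagged are illustrative names, and the grid of values is arbitrary.

```r
library(e1071)

X <- as.matrix(iris[, 1:4])

# Hypothetical grid of nu values to compare.
nu_grid <- c(0.01, 0.05, 0.10, 0.20)

# For each nu, fit a one-class SVM and count the rows predicted as FALSE (outliers).
flagged <- sapply(nu_grid, function(nu) {
  fit <- svm(X, type = "one-classification", kernel = "radial",
             gamma = 0.05, nu = nu)
  sum(!predict(fit, X))
})

data.frame(nu = nu_grid, flagged = flagged)
```

Larger values of nu generally allow more rows to be flagged, so nu is best chosen from a prior expectation of how common anomalies are.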

Defining Outliers

We now use the above-created model to identify the outliers in the dataset.

pred_oneclasssvm <- predict(model_oneclasssvm,iris_X)
pred_oneclasssvm

pred_oneclasssvm output - TRUE/FALSE vector for 150 observations

Computing Summary

The summary function gives the number of TRUE and FALSE predictions.

summary(pred_oneclasssvm)

summary(pred_oneclasssvm) output - FALSE:8, TRUE:142

All observations with a value of FALSE are outliers.
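To look at the flagged observations themselves, the logical predictions can be used to subset the data. This is a minimal sketch that refits the model with the same settings as above; is_inlier and outlier_rows are illustrative names.

```r
library(e1071)

X <- as.matrix(iris[, 1:4])
fit <- svm(X, type = "one-classification", kernel = "radial",
           gamma = 0.05, nu = 0.05)

# TRUE = treated as normal by the model, FALSE = flagged as an outlier.
is_inlier <- predict(fit, X)

# Keep the flagged rows, including the Species column for context.
outlier_rows <- iris[!is_inlier, ]
table(outlier_rows$Species)
```

The table shows how the flagged rows are distributed across the three species, which helps judge whether the model is singling out one class.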

The blog Outlier Treatment discusses various statistical tools for identifying and treating outliers. In this blog, we explored more sophisticated, unsupervised methods of finding outliers. The outliers identified by these models can then be treated using the methods described in Outlier Treatment.