Anomaly Detection in R
Anomaly detection identifies outliers in a dataset. Various anomaly detection techniques are covered in the theoretical blog Anomaly Detection. In this post, we explore two of them in R: Kernel Density Estimation and One-Class SVM.
Kernel Density Estimation
Kernel Density Estimation can be used to detect outliers in a dataset: we estimate the density of the data and choose a benchmark, and points whose estimated density falls below it are treated as outliers. The approach is discussed in detail in the theoretical blog Anomaly Detection.
Importing Dataset
To demonstrate Kernel Density Estimation (KDE) in R, we use the Iris dataset, whose dependent variable has three classes (three types of Iris flower).
library(MASS)
iris_data <- iris
View(iris_data)
Converting Dataset
We now convert the four independent variables of the iris dataset to a matrix so that we can run the density function on them.
iris_mat <- as.matrix(iris_data[,1:4])
Running Density Function
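We use the density function to estimate the density. Note that density is univariate: given a matrix, it pools all 600 values (150 rows, 4 columns) into a single sample and evaluates the estimate on a grid of 512 points by default, rather than at each of the 150 observations.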
kde <- density(iris_mat)
The estimated densities are stored in the y component of kde. We convert them to a data frame, whose single column is named kde$y.
densities <- as.data.frame(kde$y)
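Because density pools the four columns and evaluates on a grid, getting one density value per observation needs a different computation. The following is a minimal sketch, not the method used in this blog: it assumes a Gaussian product kernel with one rule-of-thumb bandwidth per column (bw.nrd0), and row_density is a hypothetical helper name introduced here for illustration.

```r
# Hypothetical helper: Gaussian product-kernel density evaluated at each row of X.
# Assumption: one bandwidth per column, chosen with the bw.nrd0 rule of thumb.
row_density <- function(X) {
  X <- as.matrix(X)
  n <- nrow(X)
  h <- apply(X, 2, bw.nrd0)
  dens <- numeric(n)
  for (i in seq_len(n)) {
    # Differences between every row and row i, scaled by the column bandwidths
    z <- sweep(sweep(X, 2, X[i, ], "-"), 2, h, "/")
    # Product of 1-D Gaussian kernels across columns, averaged over all rows
    dens[i] <- mean(exp(rowSums(dnorm(z, log = TRUE)))) / prod(h)
  }
  dens
}

obs_density <- row_density(iris[, 1:4])
length(obs_density)   # one value per observation
```

Each row's own kernel is included in its estimate here; a leave-one-out variant would penalise isolated points more strongly.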
Computing the Benchmark
We now compute the benchmark for outlier detection. Here the benchmark is the ratio of the minimum density to the mean density, and entries whose density falls below it are flagged.
min_density <- min(densities$`kde$y`)
mean_density <- mean(densities$`kde$y`)
bench <- min_density/mean_density
Mark the outliers using the above benchmark: entries with a density below it get the value 1, all others 0.
densities$outlier <- ifelse(densities$`kde$y` < bench, 1, 0)

Summary of the Outliers
Using the summary function, we count the entries in category 1; these are the outliers of our dataset. The outlier column is converted to a factor so that summary reports counts rather than quantiles.
summary(as.factor(densities$outlier))
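To attach a density to each observation rather than to a grid point, the grid estimate can be interpolated back onto the data. Below is a minimal sketch for a single variable. It assumes Sepal.Width as the example column and a 5% quantile cutoff; both are illustrative choices, not the benchmark used above.

```r
x <- iris$Sepal.Width

# Univariate kernel density estimate on the default grid
kde_1d <- density(x)

# Linearly interpolate the grid estimate at the observed values
dens_at_x <- approx(kde_1d$x, kde_1d$y, xout = x)$y

# Assumed cutoff: flag observations in the lowest 5% of estimated density
cutoff <- quantile(dens_at_x, 0.05)
outlier <- as.integer(dens_at_x < cutoff)

table(outlier)
```

A quantile cutoff fixes the approximate share of flagged points in advance, which makes it easier to compare with the nu parameter of a One-Class SVM.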

One Class SVM
One-Class SVM (One-Class Support Vector Machine) is an unsupervised algorithm that learns a decision function to identify outliers. We will use the same Iris dataset that was used for performing clustering.
Adding e1071 Library
We load the e1071 library, which provides the svm function.
library(e1071)
Separating Datasets
We take the four independent variables of the iris dataset.
iris_X <- iris_data[,1:4]
Building and Fitting Model
In the following code, we set nu, which is an upper bound on the fraction of training points treated as outliers, and gamma, the parameter of the radial kernel. We then initialize the model and fit it on the iris dataset.
model_oneclasssvm <- svm(iris_X,type='one-classification',kernel = "radial",
gamma = 0.05, nu = 0.05)
model_oneclasssvm 
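To see how nu influences the result, the same model can be refit with a few different values and the share of flagged training rows compared. This is a minimal sketch: the nu values are illustrative rather than tuned, and frac_flagged is a name introduced here for the example.

```r
library(e1071)

iris_X <- iris[, 1:4]
nu_values <- c(0.01, 0.05, 0.10)   # illustrative values, not tuned

# For each nu, fit a one-class SVM and record the share of rows predicted FALSE
frac_flagged <- sapply(nu_values, function(nu) {
  fit <- svm(iris_X, type = "one-classification", kernel = "radial",
             gamma = 0.05, nu = nu)
  mean(!predict(fit, iris_X))
})

data.frame(nu = nu_values, frac_flagged = frac_flagged)
```

The flagged share usually tracks nu but need not match it exactly, so it is worth checking on the data at hand before fixing a value.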
Defining Outliers
We now use the above-created model to identify the outliers in the dataset.
pred_oneclasssvm <- predict(model_oneclasssvm,iris_X)
pred_oneclasssvm

Computing Summary
The summary function reports how many predictions are TRUE and how many are FALSE.
summary(pred_oneclasssvm)

All values that are equal to FALSE are outliers.
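Since the predictions form a logical vector, the flagged rows can be pulled out by indexing. This is a minimal sketch that refits the same model so it runs on its own; outlier_idx and outlier_rows are names introduced here for illustration.

```r
library(e1071)

iris_X <- iris[, 1:4]

# Same settings as the model above
model <- svm(iris_X, type = "one-classification", kernel = "radial",
             gamma = 0.05, nu = 0.05)
pred <- predict(model, iris_X)

# Positions of the rows predicted FALSE, i.e. the outliers
outlier_idx <- which(!pred)

# The corresponding observations, ready for inspection or treatment
outlier_rows <- iris_X[outlier_idx, ]
outlier_rows
```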
Various methods of outlier detection and treatment are discussed in the blog Outlier Treatment, where different statistical tools are used to identify and treat outliers. This blog explored more sophisticated methods of finding them: unsupervised models such as these can be used to identify the outliers, which can then be treated through the methods described in Outlier Treatment.