
Clustering Problems in Python

Clustering problems are among the most common problems in unsupervised learning. Clustering algorithms group observations so that the observations in the same group are more similar to each other than to those in other groups. This is an unsupervised modelling technique because we do not tell the algorithm which group each observation belongs to; what we control is the choice of clustering method and its parameters, such as the number of clusters. The various clustering techniques are explained under Clustering Problem in the Theory Section. In this blog, we will explore three clustering techniques using Python: K-Means, DBSCAN and Hierarchical Clustering.

Preparation

Adding Preliminary Libraries

We first import some libraries that will be useful throughout this post.

python

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

Importing Dataset

To demonstrate the clustering algorithms in Python we use the Iris dataset, whose target variable has three classes (three types of Iris flowers). The clusters will be formed from this dataset.

python

from sklearn import datasets
iris = datasets.load_iris()
iris_data = pd.DataFrame(iris.data)
iris_data.columns = iris.feature_names
iris_data['Type']=iris.target
iris_data.head()

iris_data.head() showing sepal length, sepal width, petal length, petal width and Type columns — Output: iris_data.head().

Preparing Data

Here the target variable is 'Type'. We need to remove it so that the dataset can be used in an unsupervised learning setting. The iloc indexer selects the four feature columns by position, and the .values attribute returns them as a NumPy array. (We convert the dataset to an array so that the clusters can be plotted with boolean indexing later.)

python

iris_X = iris_data.iloc[:, [0, 1, 2,3]].values

Now we separate the target variable from the original dataset and convert it to a NumPy array as well.

python

iris_Y = iris_data['Type']
iris_Y = np.array(iris_Y)
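
The same feature matrix can be obtained in more than one way. The sketch below rebuilds iris_data so that it runs on its own and compares the positional iloc selection used above with dropping the 'Type' column by name. The names X_by_position, X_by_name and y are illustrative and not part of the original code.

```python
import numpy as np
import pandas as pd
from sklearn import datasets

iris = datasets.load_iris()
iris_data = pd.DataFrame(iris.data, columns=iris.feature_names)
iris_data['Type'] = iris.target

# Positional selection of the four feature columns, as in the post
X_by_position = iris_data.iloc[:, [0, 1, 2, 3]].values

# Selection by name: drop the target column and keep everything else
X_by_name = iris_data.drop(columns='Type').to_numpy()

# Target vector as a NumPy array
y = iris_data['Type'].to_numpy()

print(X_by_position.shape, y.shape)
print(np.array_equal(X_by_position, X_by_name))
```

Dropping the target by name is a little more robust if the column order ever changes, while iloc keeps the selection explicit about positions.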

Visualise Classes

The Iris dataset has three classes. We visualise these classes in a 2-D graph of the first two features (sepal length and sepal width). This will help us compare the original classes with the clusters created by the different clustering algorithms. We use the following code to plot the original dataset and visualise the three types of flowers.

python

plt.scatter(iris_X[iris_Y == 0, 0], iris_X[iris_Y == 0, 1], s = 80, c = 'orange', label = 'Iris-setosa')
plt.scatter(iris_X[iris_Y == 1, 0], iris_X[iris_Y == 1, 1], s = 80, c = 'yellow', label = 'Iris-versicolour')
plt.scatter(iris_X[iris_Y == 2, 0], iris_X[iris_Y == 2, 1], s = 80, c = 'green', label = 'Iris-virginica')
plt.legend()

Scatter plot of original Iris classes showing three colour-coded groups — Output: original Iris classes.

We find that we have three classes, with two of the Iris types (versicolour and virginica) overlapping each other.

Clustering Algorithms

In this blog, we will discuss three ways to create clusters from the Iris dataset:

K-Means
DBSCAN
Agglomerative Clustering aka Hierarchical Clustering

K-Means Clustering

One of the most commonly used clustering methods is K-Means, which requires us to specify the number of clusters in advance. To learn how K-Means works, refer to the blog K-Means in the Theory Section.

Importing Library

We import KMeans from sklearn.cluster to run a K-Means model.

python

from sklearn.cluster import KMeans

Deciding Value of K

The most crucial aspect of K-Means clustering is deciding the value of K. We do this by performing elbow analysis. Refer to the K-Means Theory blog for more information on why and how this works. We first run K-Means for values of K from 1 to 10.

python

wcss = []
for i in range(1, 11):
    kmeans = KMeans(n_clusters = i, init = 'k-means++', max_iter = 300, n_init = 10, random_state = 0)
    kmeans.fit(iris_X)
    wcss.append(kmeans.inertia_)

We now plot the WCSS obtained from the above code. WCSS or within-cluster sum of squares is a measure of how internally coherent the clusters are. K-Means tries to minimize this criterion.

python

plt.plot(range(1, 11), wcss)
plt.title('The elbow method')
plt.xlabel('Number of clusters')
plt.ylabel('WCSS')

Elbow method plot showing WCSS vs number of clusters, elbow at k=3 — Output: elbow method plot.

In the elbow graph, we look for the point after which the drop in WCSS slows down and the line flattens out. In the above graph, this happens at k = 3. WCSS generally keeps decreasing as clusters are added, so a low value alone is not enough; what matters is that at 3 clusters the WCSS is already below 100 and each additional cluster reduces it only slightly. So we take k = 3.
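
To make the WCSS criterion concrete, the sketch below recomputes it by hand for k = 3 and compares it with the inertia_ attribute reported by scikit-learn. It loads the data directly so that it is self-contained; the names assigned_centres and manual_wcss are illustrative.

```python
import numpy as np
from sklearn import datasets
from sklearn.cluster import KMeans

X = datasets.load_iris().data

kmeans = KMeans(n_clusters=3, init='k-means++', n_init=10, random_state=0).fit(X)

# Look up, for every observation, the centre of the cluster it was assigned to
assigned_centres = kmeans.cluster_centers_[kmeans.labels_]

# WCSS: sum of squared Euclidean distances to the assigned centres
manual_wcss = ((X - assigned_centres) ** 2).sum()

print(manual_wcss, kmeans.inertia_)
```

The two printed values should agree up to floating-point error, which is why inertia_ can be used directly as the WCSS in the elbow plot.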

Running K-Means Model

We now run K-Means clustering to obtain a three-cluster solution.

python

cluster_Kmeans = KMeans(n_clusters=3)
model_kmeans = cluster_Kmeans.fit(iris_X)
pred_kmeans = model_kmeans.labels_
pred_kmeans

array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 0, 2, 2, 2, 2, 0, 2, 2, 2, 2, 2, 2, 0, 0, 2, 2, 2, 2, 0, 2, 0, 2, 0, 2, 2, 0, 0, 2, 2, 2, 2, 2, 0, 2, 2, 2, 2, 0, 2, 2, 2, 0, 2, 2, 2, 0, 2, 2, 0])

Visualizing Output

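In the above output we got the labels 0, 1 and 2. These are arbitrary cluster indices rather than species codes: in this run, for example, the first 50 observations (Iris-setosa) received the label 1. For a better understanding, we can visualise these clusters.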

python

# Cluster indices are arbitrary, so the legend names clusters rather than species
plt.scatter(iris_X[pred_kmeans == 0, 0], iris_X[pred_kmeans == 0, 1], s = 80, c = 'orange', label = 'Cluster 0')
plt.scatter(iris_X[pred_kmeans == 1, 0], iris_X[pred_kmeans == 1, 1], s = 80, c = 'yellow', label = 'Cluster 1')
plt.scatter(iris_X[pred_kmeans == 2, 0], iris_X[pred_kmeans == 2, 1], s = 80, c = 'green', label = 'Cluster 2')
plt.legend()

Scatter plot of K-Means clusters showing three colour-coded groups — Output: K-Means clusters.

When compared with the original classes, we find that the observations of Iris-setosa have been correctly grouped into a separate, well-defined cluster; for the other two classes, however, the clusters are less accurate. This is mainly because these two classes overlap in the original dataset, which makes them difficult for K-Means, as it works best on clearly separated groups. Still, the clusters have been formed more or less correctly.
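
The comparison with the original classes can also be made numerically with a contingency table. The sketch below is one possible way to do it, assuming a fixed random_state so the run is repeatable; the names species and table are illustrative.

```python
import pandas as pd
from sklearn import datasets
from sklearn.cluster import KMeans

iris = datasets.load_iris()
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(iris.data)

# Rows: true species, columns: cluster index assigned by K-Means
species = pd.Series(iris.target_names[iris.target], name='species')
table = pd.crosstab(species, pd.Series(labels, name='cluster'))

print(table)
```

A row with a single non-zero entry means that species landed entirely in one cluster, while rows spread over two columns show where the overlapping classes were mixed.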

DBSCAN

DBSCAN, short for Density-Based Spatial Clustering of Applications with Noise, is a clustering algorithm that forms clusters from regions of high density and marks observations in low-density regions as outliers.

Importing Library

We import DBSCAN from sklearn.cluster to run a DBSCAN model.

python

from sklearn.cluster import DBSCAN

Initialize Model

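We now initialise the DBSCAN model with eps (epsilon) = 0.42. eps is the maximum distance between two observations for one to be counted as a neighbour of the other; min_samples is left at its default value.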

python

clt_DB = DBSCAN(eps=0.42)
clt_DB

Fit Model

In this step, we fit the model on the Iris dataset and obtain the cluster labels.

python

model_dbscan = clt_DB.fit(iris_X)
pred_dbscan = model_dbscan.labels_
pred_dbscan

array([ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, -1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, -1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, -1, 1, 1, -1, 1, -1, 1, -1, 1, 1, 1, -1, 1, 2, 1, 2, 1, 1, 1, 1, 1, 1, -1, 1, 1, 1, 2, 1, -1, 1, -1, 1, 1, 1, 1, 1, -1, 1, 1, 1, 1, -1, 1, -1, 2, 2, 2, 2, -1, -1, -1, -1, -1, 2, 2, 2, 2, -1, 2, 2, -1, -1, -1, 2, 2, -1, 2, 2, -1, 2, 2, 2, -1, -1, -1, 2, 2, -1, -1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2], dtype=int64)

Visualizing Output

In the above output, we got the labels 0, 1, 2 and -1, with the -1 label indicating outliers. We visualise this output to see how the data has been grouped into clusters.

python

# Cluster indices are arbitrary, so the legend names clusters rather than species
plt.scatter(iris_X[pred_dbscan == 0, 0], iris_X[pred_dbscan == 0, 1], s = 80, c = 'orange', label = 'Cluster 0')
plt.scatter(iris_X[pred_dbscan == 1, 0], iris_X[pred_dbscan == 1, 1], s = 80, c = 'yellow', label = 'Cluster 1')
plt.scatter(iris_X[pred_dbscan == 2, 0], iris_X[pred_dbscan == 2, 1], s = 80, c = 'green', label = 'Cluster 2')
# Label -1 marks the observations DBSCAN treats as noise
plt.scatter(iris_X[pred_dbscan == -1, 0], iris_X[pred_dbscan == -1, 1], s = 80, c = 'red', label = 'outlier')
plt.legend()

Scatter plot of DBSCAN clusters with outliers shown in red — Output: DBSCAN clusters with outliers.

We find that, unlike K-Means, DBSCAN does not force every observation into a cluster: observations in low-density regions are marked as outliers.
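
How many observations end up as outliers depends strongly on eps. The sketch below tries a few values, chosen only for illustration apart from the 0.42 used above, and counts the clusters and outliers for each. min_samples=5 is written out explicitly and matches scikit-learn's default; noise_counts is an illustrative name.

```python
import numpy as np
from sklearn import datasets
from sklearn.cluster import DBSCAN

X = datasets.load_iris().data

noise_counts = []
for eps in (0.3, 0.42, 0.6, 0.9):
    labels = DBSCAN(eps=eps, min_samples=5).fit(X).labels_
    # Observations labelled -1 are the ones treated as noise
    n_noise = int(np.sum(labels == -1))
    # Number of real clusters, not counting the noise label
    n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
    noise_counts.append(n_noise)
    print(f"eps={eps}: {n_clusters} clusters, {n_noise} outliers")
```

With min_samples fixed, a larger eps can only turn outliers into cluster members, never the reverse, so the outlier count does not increase as eps grows. The number of clusters, however, may go up or down as neighbouring dense regions merge.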

Hierarchical Clustering

Hierarchical Clustering can be of two types: Agglomerative and Divisive. In this blog post, we will explore Agglomerative Clustering, which builds a hierarchy of clusters by repeatedly merging the closest pair of smaller clusters. To know more about Hierarchical Clustering, refer to the blog Hierarchical Clustering under the Theory Section.
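
The merging process is easiest to follow on a tiny example. The sketch below uses five made-up one-dimensional points and single linkage (both chosen only for illustration) and prints each merge recorded in the linkage matrix returned by SciPy.

```python
import numpy as np
import scipy.cluster.hierarchy as sch

# Five one-dimensional observations (hypothetical toy data)
points = np.array([[0.0], [1.0], [5.0], [7.0], [20.0]])

# Each row of Z records one merge: the ids of the two clusters joined,
# the distance between them, and the size of the newly formed cluster
Z = sch.linkage(points, method='single')

for left, right, dist, size in Z:
    print(f"merge {int(left)} and {int(right)} at distance {dist:.1f} -> size {int(size)}")
```

Ids 0 to 4 refer to the original observations, and each new cluster receives the next free id. The closest points are merged first and the isolated point at 20 joins last, which is the order a dendrogram draws from bottom to top.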

Import Library

We first import the AgglomerativeClustering class, with which we will create an agglomerative clustering model.

python

from sklearn.cluster import AgglomerativeClustering

Plotting of Dendrogram

We make use of a dendrogram to decide the number of clusters required for our dataset. A dendrogram is a tree diagram that illustrates the order in which clusters are merged.

We first start off by importing the library that will allow us to create dendrograms.

python

import scipy.cluster.hierarchy as sch

We now pick the features from the original dataset.

python

iris_X = iris_data[['sepal length (cm)','sepal width (cm)','petal length (cm)','petal width (cm)']]

We finally plot a dendrogram, which shows the distance at which clusters are merged and so helps us choose a threshold, and therefore the number of clusters, for the clustering algorithm.

python

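# Median linkage is used for the dendrogram; the AgglomerativeClustering
# model fitted below uses its default Ward linkage
Z = sch.linkage(iris_X, method = 'median')
plt.figure(figsize=(20,7))
den = sch.dendrogram(Z)
plt.title('Dendrogram for the clustering of the dataset iris')
# Each leaf on the x-axis is one observation; the y-axis is the merge distance
plt.xlabel('Observations')
plt.ylabel('Euclidean distance')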

Dendrogram showing hierarchical clustering of the Iris dataset — Output: dendrogram for Iris dataset.
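
Choosing a number of clusters from a dendrogram amounts to cutting the tree at some height. The sketch below shows one way to do this with SciPy's fcluster. It assumes Ward linkage rather than the median linkage plotted above, and flat_labels is an illustrative name.

```python
import numpy as np
import scipy.cluster.hierarchy as sch
from sklearn import datasets

X = datasets.load_iris().data

# Ward linkage is an assumption here, picked because it is the default
# linkage of scikit-learn's AgglomerativeClustering
Z = sch.linkage(X, method='ward')

# Cut the tree so that at most three flat clusters remain
flat_labels = sch.fcluster(Z, t=3, criterion='maxclust')

# fcluster numbers its clusters from 1, not from 0
print(np.unique(flat_labels, return_counts=True))
```

The printed counts give the size of each flat cluster, which can be compared with the 50 observations per species in the original data.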

Building an Agglomerative Clustering Model

Initialise Model

We analyse the above-created dendrogram and decide that we will be making 3 clusters for this dataset.

python

cluster_H = AgglomerativeClustering(n_clusters=3)

Fitting Model

After initialising the agglomerative clustering model, we fit it on the Iris dataset. (Note that only the independent variables from the Iris dataset are used for clustering.)

python

model_clt = cluster_H.fit(iris_X)
model_clt

AgglomerativeClustering(affinity='euclidean', compute_full_tree='auto', connectivity=None, linkage='ward', memory=None, n_clusters=3, pooling_func=<function mean at 0x0000000007256598>)

We now obtain the cluster labels.

python

pred1 = model_clt.labels_
pred1

array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 0, 2, 2, 2, 2, 0, 2, 2, 2, 2, 2, 2, 0, 0, 2, 2, 2, 2, 0, 2, 0, 2, 0, 2, 2, 0, 0, 2, 2, 2, 2, 2, 0, 0, 2, 2, 2, 0, 2, 2, 2, 0, 2, 2, 2, 0, 2, 2, 0], dtype=int64)

Visualizing Output

We use the above-found class labels and visualise how the clusters have been formed.

python

# Cluster indices are arbitrary, so the legend names clusters rather than species
plt.scatter(iris_X.values[pred1 == 0, 0], iris_X.values[pred1 == 0, 1], s = 80, c = 'orange', label = 'Cluster 0')
plt.scatter(iris_X.values[pred1 == 1, 0], iris_X.values[pred1 == 1, 1], s = 80, c = 'yellow', label = 'Cluster 1')
plt.scatter(iris_X.values[pred1 == 2, 0], iris_X.values[pred1 == 2, 1], s = 80, c = 'green', label = 'Cluster 2')
plt.legend()

Scatter plot of Agglomerative Clustering result on Iris dataset — Output: Agglomerative Clustering result.

Here too we find that Iris-setosa has been clearly formed into a separate cluster, while the other two clusters overlap each other.
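
Since the true species are known for this dataset, the three results can also be compared with a single score. The sketch below uses the adjusted Rand index from scikit-learn as one possible measure of agreement; this metric is an addition for illustration and was not part of the analysis above. The dictionary names predictions and scores are illustrative, and random_state is fixed only to make the run repeatable.

```python
from sklearn import datasets
from sklearn.cluster import KMeans, DBSCAN, AgglomerativeClustering
from sklearn.metrics import adjusted_rand_score

iris = datasets.load_iris()
X, y = iris.data, iris.target

# Same settings as in the post, apart from the fixed random_state for K-Means
predictions = {
    'K-Means': KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X),
    'DBSCAN': DBSCAN(eps=0.42).fit_predict(X),
    'Agglomerative': AgglomerativeClustering(n_clusters=3).fit_predict(X),
}

# The adjusted Rand index ignores how clusters are numbered; 1.0 means
# the clustering reproduces the species exactly
scores = {name: adjusted_rand_score(y, labels) for name, labels in predictions.items()}

for name, score in scores.items():
    print(f"{name}: adjusted Rand index = {score:.2f}")
```

Note that the DBSCAN outliers (label -1) are treated here as one more group, which tends to lower its score.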

In this blog post, we explored the application of three different clustering algorithms in Python. Each of these algorithms works differently: K-Means and Agglomerative Clustering need the number of clusters to be chosen, while DBSCAN derives the clusters from density and flags outliers, so the right choice depends on the problem at hand. To find the application of these algorithms in R, refer to the blog Clustering Problems in R.