Dimensionality Reduction in Python

Many modeling techniques that work in the unsupervised setup can be used to reduce the dimensionality of a dataset. Under the theory section of Dimensionality Reduction, two such models were explored: Principal Component Analysis and Factor Analysis. In this blog we will apply both methods to reduce the dimensions of a dataset.

Principal Component Analysis

Here we will explore the most important method of Feature Extraction, Principal Component Analysis (PCA). We will use it to reduce the number of features and then use the output in modeling.

Importing Packages

We first import the important libraries.

python

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
from sklearn.decomposition import PCA

Dataset

Here, too, we will be using the Boston Dataset. We import a preprocessed version of it, the same one used in Regression Problems using Python, where the preparation of this dataset is also explored.

python

BosData = pd.read_excel('C:/Users/user/Desktop/R - Basic Stats/BosData.xls')
BosData.head()

Removing Response Variable

Because PCA works in an unsupervised learning setup, we remove the dependent (response) variable from our dataset. PCA works only on numeric variables, which is why dummy variables are normally created for categorical variables. Here the only categorical variable, CHAS, is already binary, so no dummy variable is needed and we can use all the independent variables for performing PCA.

python

BosData2 = BosData[['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS', 'RAD', 'TAX','PTRATIO', 'B', 'LSTAT']]
BosData2.head()

BosData2.head() output showing independent variables

Scaling Features

Unlike R, there is no inbuilt option in the PCA command to scale the dataset. Therefore, we will have to first scale the dataset to perform PCA in Python.

python

from sklearn.preprocessing import StandardScaler
scale = StandardScaler()
scaled_data = scale.fit_transform(BosData2)
scaled_data

array([[-0.41771335, 0.28482986, -1.2879095 , ..., -1.45900038, 0.44105193, -1.0755623 ],
       [-0.41526932, -0.48772236, -0.59338101, ..., -0.30309415, 0.44105193, -0.49243937],
       [-0.41527165, -0.48772236, -0.59338101, ..., -0.30309415, 0.39642699, -1.2087274 ],
       ...,
       [-0.41137448, -0.48772236, 0.11573841, ..., 1.17646583, 0.44105193, -0.98304761],
       [-0.40568883, -0.48772236, 0.11573841, ..., 1.17646583, 0.4032249 , -0.86530163],
       [-0.41292893, -0.48772236, 0.11573841, ..., 1.17646583, 0.44105193, -0.66905833]])

We can convert the above array into a DataFrame that keeps the original column names.

python

scaled_data = pd.DataFrame(scaled_data,columns=BosData2.columns)
scaled_data.head()

scaled_data.head() showing standardized features
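
To make the scaling step concrete, here is a minimal sketch that standardizes a small made-up array by hand with NumPy. The numbers are illustrative and are not taken from the Boston data. It assumes the population standard deviation (ddof=0), which is the formula StandardScaler uses.

```python
import numpy as np

# Toy feature matrix: 4 observations, 2 features (made-up values).
X = np.array([[1.0, 10.0],
              [2.0, 20.0],
              [3.0, 30.0],
              [4.0, 60.0]])

# Standardize each column: subtract the column mean, then divide by the
# column standard deviation (ddof=0, i.e. the population formula).
col_mean = X.mean(axis=0)
col_std = X.std(axis=0, ddof=0)
X_scaled = (X - col_mean) / col_std

print(X_scaled.mean(axis=0))  # each column mean is approximately 0
print(X_scaled.std(axis=0))   # each column standard deviation is approximately 1
```

After this step every feature is on the same scale, so no single variable dominates the principal components just because of its units.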

Splitting the Dataset into Train and Test

It is important to note at this point that PCA should not be fitted on the entire dataset, as information from the test rows would then leak into the model and make the test results look better than they really are.

We should also not fit PCA on train and test separately, because the variance structure differs between the two datasets and the resulting component vectors would point in different directions. To get out of this apparent Catch-22, we first divide the dataset into train and test, fit PCA on the train dataset, and then transform the test dataset using that same PCA model (the one fitted on the train dataset). Below we use the sklearn package to split the data into train and test.

python

from sklearn.model_selection import train_test_split
Y = BosData['ln_Price']
X_train,X_test,Y_train,Y_test=train_test_split(scaled_data,Y,test_size=0.3,random_state=123)
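
Strictly speaking, the same leakage argument applies to the scaler: fitting StandardScaler on the full dataset before splitting lets the test rows influence the means and standard deviations used for training. One way to avoid this, sketched below on synthetic data, is to split first and wrap the scaler and PCA in a scikit-learn Pipeline that is fitted on the training rows only. The array X, the names X_tr, X_te and reducer, and the choice of 3 components are assumptions made for this illustration.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the feature matrix (100 rows, 5 features).
rng = np.random.default_rng(123)
X = rng.normal(size=(100, 5))

# Split before any fitting takes place.
X_tr, X_te = train_test_split(X, test_size=0.3, random_state=123)

# Scaler and PCA are both fitted on the training rows only.
reducer = Pipeline([
    ("scale", StandardScaler()),
    ("pca", PCA(n_components=3)),
])
train_scores = reducer.fit_transform(X_tr)

# The test rows are only transformed, using the training means,
# standard deviations and component directions.
test_scores = reducer.transform(X_te)

print(train_scores.shape, test_scores.shape)
```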

Initialize and Fit PCA

We first initialize PCA with 13 components (one for each of the 13 features in the dataset) and then fit this model on the scaled training features.

python

pca = PCA(n_components=13)
pca_model = pca.fit(X_train)

PCA(copy=True, iterated_power='auto', n_components=13, random_state=None, svd_solver='auto', tol=0.0, whiten=False)

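Generate Principal Component Scores

We use the transform command, which projects the scaled data onto the principal components and returns the component scores for each observation. (The loadings, which describe the variables rather than the observations, are generated in the next step.)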

python

pca_train = pca_model.transform(X_train)
pca_train

array([[-2.20617351, -1.57003519, 1.6751961 , ..., 0.4204781 , 0.13019199, 0.00494571],
       [-2.89532792, 0.63128307, -0.11299369, ..., -0.16710338, 0.27332735, 0.19837286],
       [-1.32718799, -0.86337639, -0.69065364, ..., 0.45812776, -0.26897647, 0.13780815],
       ...,
       [-1.14164432, 0.01875308, -1.03558749, ..., -0.12540989, 0.17480516, 0.0733798 ],
       [ 3.67532679, 0.24434832, -0.44655826, ..., 0.32995562, -0.10066117, 0.12565801],
       [ 3.34474543, 0.5677084 , -1.71967202, ..., -0.90706017, -0.58903258, 0.28774733]])
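
To see what transform does under the hood, the sketch below reproduces it by hand on synthetic data: the data is centered with the mean learned during fit and then projected onto the component directions. The name pca_demo and the random data are assumptions for this illustration, and the check relies on the default whiten=False.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 4))  # synthetic data: 50 rows, 4 features

pca_demo = PCA(n_components=4).fit(X)

# transform() centers the data with the fitted mean and projects it
# onto the component directions (the rows of components_).
scores_manual = (X - pca_demo.mean_) @ pca_demo.components_.T
scores_sklearn = pca_demo.transform(X)

print(np.allclose(scores_manual, scores_sklearn))
```

This is also why a PCA model fitted on the train dataset can be applied to the test dataset: the mean and the directions are fixed at fit time.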

Generate Loading Matrix

We now generate the principal components loading matrix from the components_ attribute of the fitted PCA model, which holds the weight of each variable on each component.

python

Variable_Names =['crim','zn','indus','chas','nox','rm','age','dis','rad','tax','ptratio','black','lstat']
Matrix = pd.DataFrame(pca_model.components_,columns=Variable_Names)
Matrix1 = np.transpose(Matrix)
Matrix1

PCA loading matrix showing component correlations per variable

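This Loading Matrix can be read somewhat like a correlation matrix: each column holds the weights that define one principal component, and the larger the absolute value in a row, the more strongly that variable is associated with the component. Note that a principal component is a linear combination of all the variables, not a single variable. For example, the variable indus has the highest value for PC1, so indus contributes the most to PC1, but PC1 is not indus itself. (The columns of the output correspond to PC1, PC2 and so on. We will be renaming the columns in the upcoming steps.)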
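
The link between component weights and correlations can be checked numerically. The sketch below uses a plain NumPy eigendecomposition on synthetic data rather than sklearn, so the eigenvectors here play the role of components_ (transposed) and the eigenvalues the role of explained_variance_. Under these assumptions, multiplying each eigenvector by the square root of its eigenvalue gives exactly the correlation between each standardized variable and each component score.

```python
import numpy as np

rng = np.random.default_rng(1)
# Synthetic correlated data: 200 rows, 3 features sharing a common signal.
base = rng.normal(size=(200, 1))
X = np.hstack([base + 0.3 * rng.normal(size=(200, 1)) for _ in range(3)])

# Standardize, then eigendecompose the correlation matrix.
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
corr = np.corrcoef(Z, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(corr)
order = np.argsort(eigvals)[::-1]          # largest eigenvalue first
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

scores = Z @ eigvecs                        # principal component scores
loadings = eigvecs * np.sqrt(eigvals)       # eigenvectors scaled by sqrt(eigenvalue)

# Correlation between each variable (rows) and each component (columns).
var_pc_corr = np.array([[np.corrcoef(Z[:, j], scores[:, k])[0, 1]
                         for k in range(3)] for j in range(3)])

print(np.allclose(loadings, var_pc_corr))
```

The unscaled eigenvectors have unit length, so they rank the variables within a component in the same order as the correlations, but their values are not correlations themselves.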

Variance Explained by Each Principal Component

As we saw above, we set the number of components for PCA equal to the number of variables in our dataset (13 in our case). With the following code we will work out how many of these components are actually worth keeping for the modeling algorithms, which in effect reduces the number of features. To decide the number of Principal Components, we analyze the proportion of variance explained by each component. We use the explained_variance_ attribute of the fitted model to get the variance explained by each Principal Component.

python

pca_model.explained_variance_

array([6.14884865, 1.38168311, 1.17666638, 0.82088745, 0.78048055, 0.67948959, 0.56762721, 0.38628867, 0.27780625, 0.2215816 , 0.19366454, 0.15748932, 0.06630668])

Ratio of Variance Explained by Each Component

We can now look at the proportion of variance explained by each PC.

python

var = pca_model.explained_variance_ratio_
var

array([0.47818141, 0.10745023, 0.09150656, 0.06383847, 0.06069613, 0.0528423 , 0.04414302, 0.03004076, 0.02160433, 0.01723188, 0.01506083, 0.01224757, 0.00515651])

From the output we find that PC1 explains about 48% of the variance, PC2 explains about 11% and so on. The first seven components together explain approximately 90% of the variance (0.47818141 + 0.10745023 + 0.09150656 + 0.06383847 + 0.06069613 + 0.0528423 + 0.04414302 = 0.89865812).
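
The same arithmetic can be done with a cumulative sum, which also makes it easy to try different cutoffs. This is a small sketch using the ratios printed above; components_needed is a hypothetical helper written for this illustration, not part of sklearn.

```python
import numpy as np

# Explained-variance ratios as printed in the output above.
var_ratio = np.array([0.47818141, 0.10745023, 0.09150656, 0.06383847,
                      0.06069613, 0.0528423, 0.04414302, 0.03004076,
                      0.02160433, 0.01723188, 0.01506083, 0.01224757,
                      0.00515651])

cumulative = np.cumsum(var_ratio)

def components_needed(threshold):
    """Smallest number of leading components whose cumulative ratio
    reaches the given threshold (hypothetical helper)."""
    return int(np.searchsorted(cumulative, threshold) + 1)

print(round(cumulative[6], 4))   # cumulative share of the first seven components
print(components_needed(0.85))
print(components_needed(0.90))
```

With these numbers a strict 0.90 cutoff selects eight components, because the first seven reach about 0.8987. Choosing seven, as we do here, treats that figure as approximately 90%.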

PCA Chart

In the above step, we got the proportion of variance explained by each component, which we need in order to decide the number of components. We calculated that the first seven components explain most of the variance. For a more visual approach, we plot the cumulative percentage of variance explained against the number of components on a line graph. This PCA chart helps us decide how many principal components to take forward to the modeling algorithm.

python

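cumulative_var = np.cumsum(np.round(var, decimals=4)*100)  # cumulative percentage of variance
# Plot against 1..13 so that the x-axis shows the component number rather than a 0-based index
plt.plot(range(1, len(cumulative_var) + 1),cumulative_var,'k-o',markerfacecolor='None',markeredgecolor='k')
plt.title('Principal Component Analysis',fontsize=12)
plt.xlabel("Principal Component",fontsize=12)
plt.ylabel("Cumulative Percentage of Variance Explained",fontsize=12)
plt.show()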

Renaming Columns

For our ease and convenience, we will rename the columns of the matrix of component scores that PCA generated for each observation. After renaming, we will select 7 principal components and make a data frame with the dependent variable and the 7 PCs.

python

pca_train = pd.DataFrame(pca_train,columns=['PC_' + str(i) for i in range(1, 14)])
pca_train.head()

pca_train.head() showing renamed PC columns

Concatenate Dependent Variable and Principal Components

We now concatenate the dependent variable, ln_Price, with the principal components and keep the first seven components for our analysis. Y_train still carries the row labels of the original dataset, so we first reset its index to line it up with pca_train; this moves the old labels into a column named index. We then drop that column and make a subset of the dataset having the 7 PCs and the dependent variable.

python

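Y_train1 = Y_train.reset_index()  # old row labels move into an 'index' column

pca_train1 = pd.concat([pca_train,Y_train1],axis=1)
pca_train2 = pca_train1.drop(columns='index')
# Subset from pca_train2 (the frame without 'index'): 7 PCs plus the target
pca_train3 = pca_train2[['PC_1', 'PC_2', 'PC_3', 'PC_4', 'PC_5', 'PC_6', 'PC_7','ln_Price']]
pca_train3.head()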

pca_train3.head() showing 7 PCs and ln_Price
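
The reason for resetting the index is that pd.concat with axis=1 aligns rows on their index labels, not on their position. The toy sketch below, with made-up values and labels, shows what happens without the reset and how reset_index(drop=True) lines the rows up without creating an extra index column.

```python
import pandas as pd

# Scores built from a NumPy array get a fresh 0..n-1 index ...
scores = pd.DataFrame({"PC_1": [0.5, -1.2, 0.7]})
# ... while the target keeps the shuffled row labels from the split.
target = pd.Series([3.1, 2.9, 3.4], index=[42, 7, 19], name="ln_Price")

# Concatenating directly aligns on index labels and produces NaN rows.
misaligned = pd.concat([scores, target], axis=1)

# drop=True discards the old labels, so there is no 'index' column
# to remove afterwards.
aligned = pd.concat([scores, target.reset_index(drop=True)], axis=1)

print(misaligned.shape)  # rows are the union of both indexes
print(aligned)
```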

Creating Dataset Having Principal Components

The above output forms the complete train dataset. As we will now be performing linear regression on this dataset, we create a separate dataset holding only the seven selected principal components, i.e. the independent features.

python

pca_train_X = pca_train3[['PC_1', 'PC_2', 'PC_3', 'PC_4', 'PC_5', 'PC_6', 'PC_7']]
pca_train_X.head()

Initializing and Fitting Linear Regression Model

We use linear_model from sklearn and initialise a Linear Regression Model.

python

from sklearn import linear_model
linreg_model = linear_model.LinearRegression()
linreg_model.fit(pca_train_X,Y_train)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)

Transform Features of Test Dataset into Principal Components

As mentioned earlier, we transform the features of the test dataset into Principal Components using the PCA model that was fitted on the train dataset.

python

pca_test = pca_model.transform(X_test)  # same fitted model that transformed X_train
pca_test

array([[ 4.81329234, 3.20633753, 2.51806951, ..., -1.43486703, -0.75765911, -0.05695684],
       [-1.77738185, -0.39344249, -0.37587683, ..., -0.08523522, -0.08728565, -0.00882654],
       [-2.79521396, -0.79359615, 1.69385961, ..., -0.1951745 , -0.23839034, 0.26179636],
       ...,
       [ 1.98913418, -1.97663068, 0.05119011, ..., -0.62503645, 1.03269592, -0.02378012],
       [ 3.2439388 , 1.98449416, 0.9706703 , ..., 0.09288781, -0.65191152, 0.07920085],
       [ 1.99889739, 0.42766807, 0.24314794, ..., 0.08591696, -0.40701423, 0.17921845]])

We now convert the above output into a dataset and add the dependent variable to it so that we can predict values using the above created linear regression.

python

pca_test = pd.DataFrame(pca_test,columns=['PC_' + str(i) for i in range(1, 14)])
Y_test1 = Y_test.reset_index()
pca_test1 = pd.concat([pca_test,Y_test1],axis=1)
pca_test1 = pca_test1.drop(columns='index')
pca_test2 = pca_test1[['PC_1', 'PC_2', 'PC_3', 'PC_4', 'PC_5', 'PC_6', 'PC_7','ln_Price']]
pca_test2.head()

pca_test2.head() showing test PCs and ln_Price

Above we got the complete dataset. We now separate the Principal Components in a separate dataset.

python

pca_test_X = pca_test2[['PC_1', 'PC_2', 'PC_3', 'PC_4', 'PC_5', 'PC_6', 'PC_7']]
pca_test_X.head()

pca_test_X.head() showing 7 PC columns for test set

Prediction

We now predict the dependent variable of the test dataset using the linear regression model created earlier.

python

predict1 = linreg_model.predict(pca_test_X)

Results

We calculate the R-squared on the test dataset to see how well our model fits.

python

from sklearn import metrics
metrics.r2_score(Y_test,predict1)

0.6335723851620103

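The R-squared comes out to be about 0.63, i.e. the model built on seven principal components explains roughly 63% of the variance in ln_Price on the test dataset. (R-squared measures goodness of fit rather than accuracy in the classification sense.)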
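
For reference, R-squared is one minus the ratio of the residual sum of squares to the total sum of squares. The sketch below computes it by hand with NumPy on made-up values; r_squared is a hypothetical helper written for this illustration.

```python
import numpy as np

def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)          # unexplained variation
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total variation
    return 1.0 - ss_res / ss_tot

# Made-up values, purely for illustration.
y_true = [3.0, -0.5, 2.0, 7.0]
y_pred = [2.5, 0.0, 2.0, 8.0]
print(round(r_squared(y_true, y_pred), 4))
```

A value of 1 means the predictions match the observed values exactly, and a value of 0 means the model does no better than always predicting the mean.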

Factor Analysis

Factor Analysis is a method that works in an unsupervised setup and forms groups of features based on the relationships between them. It is commonly used to reduce features and is explored in Factor Analysis under the theory section. We will now explore the application of Factor Analysis in Python. Factor analysis can only be used to reduce the continuous variables of a dataset, so we will be removing the categorical variables. Like principal component analysis, it is an unsupervised learning algorithm, so we will also be removing the dependent variable from our dataset.

Importing Preliminary Libraries

From factor_analyzer we will import FactorAnalyzer, which will come in handy for performing factor analysis.

python

from factor_analyzer import FactorAnalyzer

Dataset

We will be using the pre-processed Boston dataset, in which the dependent variable Price has already been log-transformed into a new variable, ln_Price, to bring it closer to a normal distribution.

python

Bos_train2 = pd.read_excel('C:/Users/user/Desktop/Data Sets/Linear Regression/Bos_train1.xls')

Removing the Dependent and Categorical Variables

As mentioned above, factor analysis works in an unsupervised setup and only on numerical variables. We therefore drop the categorical variable and the dependent variable.

python

Factor1 = Bos_train2[['CRIM', 'ZN', 'INDUS', 'NOX', 'RM', 'AGE', 'DIS', 'RAD', 'TAX', 'PTRATIO', 'B', 'LSTAT']]

Creating Correlation Matrix for the Above Dataset

To perform factor analysis we first create a correlation matrix using the above dataset. We can also manually analyse this matrix as this will give us an idea of the variables that are highly correlated with each other.

python

corrm = Factor1.corr()
corrm

Correlation matrix of Boston dataset features

Finding Eigen Values

We will now find the eigenvalues of the correlation matrix to decide how many factors will be sufficient for our modeling. Dividing each eigenvalue by the number of variables (12 here) gives the share of variance explained by the corresponding factor.

python

eigen_values = np.linalg.eigvals(corrm)
eigen_values_cumvar = (eigen_values/12).cumsum()
eigen_values_cumvar

array([0.51021016, 0.62190071, 0.72018953, 0.78975709, 0.84532888, 0.8904269 , 0.92350045, 0.92888802, 0.95201836, 0.96611667, 0.98164267, 1. ])

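The first four factors explain approximately 79% of the variance, so we take the number of factors to be 4. Note that np.linalg.eigvals does not return the eigenvalues in any guaranteed order. In the output above the first seven happen to be in descending order, so the 79% figure holds, but the later values are not sorted (the eighth increment is smaller than the ninth). Sorting the eigenvalues in descending order before taking the cumulative sum avoids this.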
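
A minimal sketch of the sorted version is shown below on synthetic data. It swaps in np.linalg.eigvalsh, which is intended for symmetric matrices such as a correlation matrix and returns real eigenvalues in ascending order. The names corr_demo, eig_sorted and cum_share and the random data are assumptions for this illustration.

```python
import numpy as np

rng = np.random.default_rng(7)
# Synthetic data with some shared structure: 300 rows, 6 features.
latent = rng.normal(size=(300, 2))
X = latent @ rng.normal(size=(2, 6)) + 0.5 * rng.normal(size=(300, 6))
corr_demo = np.corrcoef(X, rowvar=False)

# eigvalsh returns eigenvalues in ascending order; reverse them so
# that the largest comes first.
eig_sorted = np.linalg.eigvalsh(corr_demo)[::-1]

# Eigenvalues of a correlation matrix sum to the number of variables,
# so dividing by that count gives each factor's share of the variance.
cum_share = np.cumsum(eig_sorted) / corr_demo.shape[0]
print(np.round(cum_share, 3))
```

Dividing by corr_demo.shape[0] instead of a hard-coded number keeps the calculation correct if variables are added or removed.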

Using Factor Analyzer to Perform Factor Analysis

Now we will compute the factor loadings to group the variables based on their correlation values.

python

Factor_Analysis = FactorAnalyzer(n_factors=4, rotation='varimax', method='ml')
Factor_Analysis.fit(Factor1)

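Here, we have set rotation to varimax, an orthogonal rotation that maximizes the variance of the squared loadings within each factor so that each variable tends to load strongly on only a few factors, which makes the groups easier to read. The method used for extracting the factors is maximum likelihood. R and Python typically use maximum likelihood or minres, whereas other software such as SAS uses the principal component method to compute factor loadings.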

Compute Factor Loadings

We now compute factor loadings.

python

loadings = pd.DataFrame(Factor_Analysis.loadings_, index=Factor1.columns)
loadings

We now export the above output and sort it to find the 4 groups of features.

python

loadings.to_excel('C:/Users/user/Desktop/Data Sets/Linear Regression/python_FA.xls')

Sorted factor loadings showing 4 variable groups

We can then select representative variables from each of these groups, which reduces the number of features and lowers the chances of multicollinearity.
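
The grouping can also be done directly in pandas instead of sorting an exported file. The sketch below assigns each variable to the factor on which its absolute loading is largest. The loadings, variable names and factor names are invented for illustration and are not the Boston results.

```python
import pandas as pd

# Hypothetical loadings for five variables on two factors
# (made-up numbers, not the output of the model above).
demo_loadings = pd.DataFrame(
    [[0.85, 0.10],
     [0.78, -0.20],
     [-0.15, 0.90],
     [0.05, -0.72],
     [0.66, 0.30]],
    index=["v1", "v2", "v3", "v4", "v5"],
    columns=["Factor1", "Factor2"],
)

# Assign each variable to the factor with its largest absolute loading.
assigned = demo_loadings.abs().idxmax(axis=1)

# Collect the variable names that fall under each factor.
groups = {f: assigned.index[assigned == f].tolist()
          for f in demo_loadings.columns}
print(groups)
```

Using the absolute value matters because a strongly negative loading indicates as close a relationship with the factor as a strongly positive one.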

We can also compute the proportional variance and cumulative variance of the 4-factor solution.

python

Factor_Analysis.get_factor_variance()

Factor variance output showing proportional and cumulative variance
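
As far as we understand, these quantities follow from the loadings: the variance attributed to a factor is the sum of its squared loadings, and dividing by the number of variables gives the proportional variance, assuming standardized variables that each contribute one unit of variance. The sketch below works this out by hand on made-up loadings and does not call factor_analyzer.

```python
import numpy as np

# Hypothetical loadings (5 variables x 2 factors), invented for illustration.
L = np.array([[0.85, 0.10],
              [0.78, -0.20],
              [-0.15, 0.90],
              [0.05, -0.72],
              [0.66, 0.30]])

ss_loadings = (L ** 2).sum(axis=0)     # sum of squared loadings per factor
prop_var = ss_loadings / L.shape[0]    # share of total variance per factor
cum_var = np.cumsum(prop_var)          # running total across factors

print(ss_loadings)
print(prop_var)
print(cum_var)
```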

There are many more algorithms, such as decision trees, that work in a supervised setup and can be used to reduce the dimensionality of a dataset. In the unsupervised setup, PCA and factor analysis are the most commonly used models for this purpose. Both methods are also applied using R in the blog Dimensionality Reduction in R.