Classification Problems in R

In the Theory Section of Modeling, various classification algorithms are explored. In this blog post, all those algorithms will be put to use using R.

Importing Libraries and Preparing Dataset

We will be using the famous Titanic Dataset available on Kaggle for running the various Supervised-Classification modeling algorithms (only the train_data dataset is being used). The dependent variable here is a binary variable - 'Survived'. Our objective is to predict the survival rate of the passengers travelling on the Titanic.

Note that before using the dataset for creating classification models we need to perform some steps of pre-processing.

Importing Libraries

Library for reading excel files.

r
library(readxl)

Library for splitting the dataset.

r
library(caTools)

Library for tuning parameters and computing accuracy score of the models.

r
library(caret)

Importing Dataset

The Titanic dataset will be used for creating all the classification models. We use na.strings to tell the program to load the missing values (denoted by blank space) as NAs.

r
Titanic1 = read.csv('C:/Users/user/Desktop/Data Sets/Feature Reduction Datasets/Titanic1.csv',header = T,na.strings = c(""))
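
To see what na.strings changes, here is a minimal sketch on a made-up three-row extract (the values and the names csv_text, toy and raw are purely illustrative, not taken from the Kaggle file). Blank text fields only become NA when na.strings is supplied.

```r
# Hypothetical three-row extract; blank Cabin and Embarked fields stand for missing values
csv_text <- "PassengerId,Age,Cabin,Embarked
1,22,,S
2,38,C85,
3,,,Q"

# With na.strings, blank fields are loaded as NA
toy <- read.csv(text = csv_text, header = TRUE, na.strings = c(""))
na_counts <- sapply(toy, function(x) sum(is.na(x)))
print(na_counts)

# Without na.strings, blank text fields are kept as empty strings, not NA
raw <- read.csv(text = csv_text, header = TRUE)
cabin_na_raw <- sum(is.na(raw$Cabin))
print(cabin_na_raw)
```

Blank numeric fields such as Age are read as NA in both cases; the argument matters for text columns such as Cabin and Embarked.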

Missing Value Treatment

Checking missing values in the dataset.

r
sapply(Titanic1, function(x) sum(is.na(x)))
sapply missing values output

We find that we have missing values in the variables 'Age', 'Embarked' and 'Cabin'. However, we are mainly concerned with the variables 'Age' and 'Embarked'.

Checking distribution of the 'Age' variable.

r
hist(Titanic1$Age, breaks = 15, freq = F, xlab = 'Age', ylab = 'Density',
main = 'Age Density Plot',col = "orange")
lines(density(Titanic1$Age, na.rm = T),col="blue",xlim=c(-10,85),lwd=2)
Age density histogram

As the distribution of the variable is not quite normal and seems a bit skewed, we take median to impute the missing values rather than considering the mean of it.

We first calculate the mean of the variable so that we can compare its value with the median value.

r
mean(Titanic1$Age,na.rm = T)
mean(Titanic1$Age) output

We now calculate the Median Age.

r
median(Titanic1$Age,na.rm = T)
median(Titanic1$Age) output

We find that the mean is approximately 30 while the Median is 28.

We also have missing values in the Embarked variable and as it is a categorical variable we consider the mode of it. However, we first check the number of levels present in this variable along with their count.

r
summary(Titanic1$Embarked)
summary(Titanic1$Embarked) output

For Mode imputation for the variable Embarked, we will first calculate mode and then replace the missing values. As discussed before, there is no inbuilt function to compute mode, therefore we will define the function for mode.

r
getmode <- function(v) {
  uniqv <- unique(v)
  uniqv[which.max(tabulate(match(v,uniqv)))]
}
v <- Titanic1$Embarked
mode <- getmode(v)
mode
mode output

We perform Median and Mode Imputation.

r
Titanic1$Age[is.na(Titanic1$Age)]<- median((Titanic1$Age),na.rm=T)
Titanic1$Embarked[is.na(Titanic1$Embarked)]<- mode
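
The same two imputation steps can be tried on a small made-up data frame. This is only a sketch: the values, the name toy and the helper get_mode (a variant of the mode function that drops NAs first) are assumptions for illustration.

```r
# Hypothetical toy data with missing values in a numeric and a categorical column
toy <- data.frame(
  Age = c(22, NA, 35, 28, NA, 54),
  Embarked = c("S", "C", NA, "S", "Q", "S"),
  stringsAsFactors = FALSE
)

# Mode helper: most frequent non-missing value
get_mode <- function(v) {
  v <- v[!is.na(v)]
  uniq <- unique(v)
  uniq[which.max(tabulate(match(v, uniq)))]
}

# Median imputation for the numeric column, mode imputation for the categorical one
toy$Age[is.na(toy$Age)] <- median(toy$Age, na.rm = TRUE)
toy$Embarked[is.na(toy$Embarked)] <- get_mode(toy$Embarked)
print(toy)
```

Here the observed ages are 22, 28, 35 and 54, so the median used for the two missing ages is 31.5, and the missing port is filled with "S".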

Dropping other variables which may not be of much use to our analysis such as Passenger ID, Cabin, Name and Ticket.

r
TitanicD1 = subset(Titanic1,select = c(2,3,5,6,7,8,10,12))

Dummy Variable Creation

Creating Dummy variables for categorical variables - Pclass, Sex, Embarked.

r
install.packages("fastDummies")
library(fastDummies)
TitanicD1 = dummy_cols(TitanicD1,select_columns =
c("Pclass","Embarked","Sex"),remove_first_dummy = T)

In R we have to remove the base variables after creating n-1 dummy variables.

Removing base variables from the dataset.

r
TitanicD1 = subset(TitanicD1,select = c(1,4,5,6,7,9,10,11,12,13))
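
As a point of comparison, base R can produce the same kind of n-1 dummy coding with model.matrix. The sketch below uses made-up rows and is not the fastDummies route used above; the column names it produces (Pclass2, Sexmale and so on) follow base R's naming, not the names in TitanicD1.

```r
# Hypothetical toy data with two categorical columns and one numeric column
toy <- data.frame(
  Pclass = factor(c(1, 3, 2, 3)),
  Sex = factor(c("male", "female", "female", "male")),
  Age = c(22, 38, 26, 35)
)

# model.matrix creates n-1 dummies per factor; [, -1] drops the intercept column
dummies <- model.matrix(~ Pclass + Sex + Age, data = toy)[, -1]
print(dummies)
```

The first level of each factor (Pclass 1 and Sex female) acts as the baseline, so it gets no column of its own and the original categorical columns do not appear in the result.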

Train and Test Split

We will now be splitting our dataset into train and test using the sample.split function of the caTools package.

r
set.seed(123)
split <- sample.split(TitanicD1$Survived,SplitRatio = 0.70)
train_set<- subset(TitanicD1,split==T)
test_set<- subset(TitanicD1,split==F)
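
sample.split keeps the ratio of the classes roughly the same in both parts. A rough base R sketch of that idea is given below; the function stratified_split and the toy target y are hypothetical and this is not the caTools implementation.

```r
set.seed(123)

# Hypothetical binary target: 60 zeros and 40 ones
y <- rep(c(0, 1), times = c(60, 40))

# Sample the requested share of row indices separately within each class
stratified_split <- function(y, ratio) {
  in_train <- logical(length(y))
  for (cls in unique(y)) {
    idx <- which(y == cls)
    n_train <- round(ratio * length(idx))
    in_train[idx[sample.int(length(idx), n_train)]] <- TRUE
  }
  in_train
}

split_flag <- stratified_split(y, 0.70)
print(table(Class = y, InTrain = split_flag))
```

With a 0.70 ratio, 42 of the 60 zeros and 28 of the 40 ones land in the train part, so both parts keep the 60:40 class balance.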

The Process of Creating Classification Models

In a typical case, we follow the following steps for creating a classification model -

Step 1: Import packages required to run the particular model
Step 2: Fit the model on the Train dataset
Step 3: Predict the values on the Test dataset
Step 4: Compute the Accuracy score for the model

We also perform tuning of the hyperparameters, which is done to improve the accuracy of our model and save it from overfitting. There are mainly three ways to tune these parameters - Grid Search, Random Search, and Bayesian Optimization.

In this blog, we will tune our parameters using the first two methods and see how the accuracy score is affected. Grid Search/Random Search finds the best model, i.e. the model with the best parameter values, on the Train dataset, and we then use that model to predict the values on the Test dataset.

For Grid Search and Random Search, we use the train function of the caret package, which conducts the search together with cross-validation. In grid search cross-validation, all combinations of the supplied parameter values are evaluated to find the best model. The cross-validation setting in the code follows the k-fold cross-validation process: the Train dataset is repeatedly divided into training and validation folds, while the Test dataset is kept aside. After finding the best parameter values using Grid Search, we predict survival on the Test dataset, which the model has not seen. Cross-validation helps in detecting and limiting overfitting. Please refer to Model Validation Techniques under the Theory Section for a better understanding of the concept. The concept of Hyper-Parameter tuning with cross-validation is discussed in Model Validation in R under the Application Section.

In this blog, we will perform grid search with 3 fold cross-validation. Note that Random search has the same code as grid search and only the search parameter is set to grid/random as per the requirement. Also, the parameters are defined the same as we did in Python, i.e. values of hyperparameters are defined in the form of a range.
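
To make the mechanics concrete, here is a hand-written sketch of grid search with 3-fold cross-validation. It is an illustration only and not what caret does internally: it uses the built-in iris data instead of Titanic, a small home-made KNN (knn_predict is a hypothetical helper), and it scales the data once before the folds are formed, which is a simplification.

```r
set.seed(123)

# Two-class subset of the built-in iris data, used purely for illustration
dat <- iris[iris$Species != "setosa", ]
X <- scale(as.matrix(dat[, 1:4]))
y <- as.integer(dat$Species == "virginica")

# Plain KNN majority vote using Euclidean distance
knn_predict <- function(X_train, y_train, X_new, k) {
  apply(X_new, 1, function(row) {
    d <- sqrt(colSums((t(X_train) - row)^2))
    nearest <- order(d)[seq_len(k)]
    as.integer(mean(y_train[nearest]) > 0.5)
  })
}

folds <- sample(rep(1:3, length.out = nrow(X)))  # random 3-fold assignment
grid <- c(1, 3, 5, 7, 9)                         # candidate values of k (odd, so no tied votes)

# For every grid value, average the accuracy over the three held-out folds
cv_accuracy <- sapply(grid, function(k) {
  mean(sapply(1:3, function(f) {
    held_out <- folds == f
    pred <- knn_predict(X[!held_out, , drop = FALSE], y[!held_out],
                        X[held_out, , drop = FALSE], k)
    mean(pred == y[held_out])
  }))
})

best_k <- grid[which.max(cv_accuracy)]
print(data.frame(k = grid, cv_accuracy = cv_accuracy))
print(best_k)
```

The value with the highest average validation accuracy plays the role of bestTune; a final model would then be refitted on the whole train part with that value.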

Classification Algorithms

In the Theory Section of Classification Problems, we have explored a lot of classification algorithms and in this blog, we will create models using those algorithms to predict the survivability of a person present on the Titanic. We will be creating classification models using the following methods/algorithms - Logistic Regression, Regularized Logistic Regression, Decision Tree Classifier, KNN, SVM, ANN, Naive Bayes, Bagging Classifier (Ensemble), Random Forest Classifier (Ensemble), AdaBoosting Classifier (Ensemble), Gradient Boosting Classifier (Ensemble), Xgboost Classifier (Ensemble), and Stacking (Ensemble).

Logistic Regression

To understand how Logistic Regression works, refer to the blog on Logistic Regression in the Theory Section. In this blog post, we will use Logistic Regression algorithm to predict if a person will survive or not.

Initializing and Fitting Model

We first initialize and fit the model on the train dataset.

r
model_logistic <- glm(Survived~.,data = train_set,family = binomial(logit))
summary(model_logistic)
summary(model_logistic) output

Predicting Values on the Test Dataset

We will now predict the values on the test dataset using the model. Note that the resulting predictive values will be the probability of survival. In order to convert this to binary values, we will round off these values to 0 and 1. We will set a threshold of 0.5 where any value less than 0.5 will be assigned 0 and any value greater than 0.5 will be 1. We now predict the values using predict.

r
pred_mod_log = predict(model_logistic,newdata =subset(test_set,select =
c(2:10)),type="response")
pred_mod_log <- ifelse(pred_mod_log > 0.5,1,0)

Confusion Matrix

In order to calculate the accuracy of our model, we will be using caret package.

r
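# confusionMatrix expects factors with the same levels
confusionMatrix(factor(pred_mod_log,levels = c(0,1)),reference = factor(test_set$Survived,levels = c(0,1)))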
confusionMatrix output

Our model is 79% accurate. However, as discussed in the Evaluation of Classification Models under the Theory Section, accuracy alone is insufficient and we require more advanced methods of evaluating our model, whose application in R is discussed in Model Evaluation in R.
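
The thresholding and accuracy steps can also be reproduced by hand. The sketch below uses made-up probabilities and labels (actual and prob are hypothetical) and base R only, so the numbers are not those of the Titanic model.

```r
# Hypothetical true labels and predicted survival probabilities
actual <- c(0, 1, 1, 0, 1, 0, 0, 1)
prob <- c(0.20, 0.80, 0.40, 0.10, 0.90, 0.60, 0.30, 0.70)

# Apply the 0.5 cut-off
predicted <- ifelse(prob > 0.5, 1, 0)

# Confusion table: rows are predictions, columns are true labels
conf_table <- table(
  Predicted = factor(predicted, levels = c(0, 1)),
  Actual = factor(actual, levels = c(0, 1))
)
print(conf_table)

# Accuracy is the share of observations on the diagonal
accuracy_manual <- sum(diag(conf_table)) / sum(conf_table)
print(accuracy_manual)
```

Six of the eight toy predictions match the true label, so the accuracy is 0.75.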

Regularized Logistic Regression

Regularized Logistic Regression can be of two types - Ridge and Lasso. Refer to Regularized Regression Algorithms under the Theory Section to understand the difference between the two. A third type is Elastic Net Regularization which is a combination of both penalties l1 and l2 (Lasso and Ridge).
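
For reference, the objective minimized by glmnet for the binomial family, as described in the glmnet documentation, can be written as below. The symbols N, x_i, y_i, beta_0 and beta are notation introduced here for illustration. Setting alpha = 1 leaves only the l1 (Lasso) term, alpha = 0 leaves only the l2 (Ridge) term, and values in between give Elastic Net; lambda scales the whole penalty.

```latex
\min_{\beta_0,\,\beta}\;
-\frac{1}{N}\sum_{i=1}^{N}\Big[\, y_i\,(\beta_0 + x_i^{\top}\beta)
  - \log\!\big(1 + e^{\beta_0 + x_i^{\top}\beta}\big) \Big]
\;+\; \lambda\Big[\, \frac{1-\alpha}{2}\,\lVert\beta\rVert_2^{2}
  + \alpha\,\lVert\beta\rVert_1 \Big]
```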

Standardize the Dataset

We first have to scale the data: Regularized Logistic Regression penalizes the size of the coefficients, so the variables should not be on different scales of measurement. Several classification models require scaled data, such as Regularized Logistic Regression (Lasso and Ridge), KNN, SVM and ANN. (We will be using the same scaled dataset for KNN and SVM to predict the survival of the passengers.) We start from the dataset TitanicD1 created earlier, which has the dummy variables. From it we extract only the four numerical variables, namely Age, SibSp, Parch and Fare, and apply scaling to them.

r
scale_data = subset(TitanicD1,select = c(2,3,4,5))

We will now apply scaling on these four numerical features.

r
scaled_data = scale(scale_data)
scaled_final = as.data.frame(scaled_data)
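
By default, scale subtracts the column mean and divides by the column standard deviation. A small check on made-up numbers (toy is a hypothetical data frame) shows that the manual calculation gives the same result.

```r
# Hypothetical toy values for two numeric features
toy <- data.frame(Age = c(22, 38, 26, 35), Fare = c(7.25, 71.28, 7.92, 53.10))

# scale() centres each column and divides it by its standard deviation
scaled <- scale(toy)

# The same standardization written out by hand
manual <- sapply(toy, function(x) (x - mean(x)) / sd(x))

print(round(scaled, 3))
print(all.equal(as.numeric(scaled), as.numeric(manual)))
```

After scaling, every column has mean 0 and standard deviation 1, so no feature dominates the penalty or a distance calculation only because of its units.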

In this step, we concatenate the scaled variables with the leftover dataset (categorical variables).

r
Titanic_cat = subset(TitanicD1,select = c(1,6,7,8,9,10))
TitanicFinal = cbind(Titanic_cat,scaled_final)

Splitting Dataset

We now split the dataset into Train and Test.

r
set.seed(123)
split1 <- sample.split(TitanicFinal$Survived,SplitRatio = 0.70)
train_set1<- subset(TitanicFinal,split1==T)
test_set1<- subset(TitanicFinal,split1==F)

Note that the above datasets will be used again when we deal with KNN, SVM and ANN.

Lasso

Import Library

Here we use glmnet library to perform regularized logistic regression.

r
library(glmnet)

Separating Dataset

For glmnet function, we will have to separate the datasets based on their variables and convert them to matrices as glmnet accepts only matrices for modeling.

Separating and converting the train dataset with independent features.

r
X1 <- as.matrix(train_set1[,-1])

Separating and converting the train dataset with dependent/target variable.

r
Y1 <- as.matrix(train_set1[,1])

Initialize and Fit Model

Initializing and fitting model on the train dataset.

r
lasso_model<- glmnet(X1,Y1,family = "binomial",alpha = 1)

Selecting Lambda Value

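We define the value of lambda, i.e. we take the smallest value of lambda on the fitted path for prediction. A lower value of lambda means a weaker penalty, so the fit moves closer to ordinary logistic regression; this tends to raise accuracy on the Train dataset but does not guarantee better accuracy on the Test dataset. In practice lambda is usually chosen by cross-validation, as done under Tuning of Parameters later in this post.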

r
lambda_Lasso<- min(lasso_model$lambda)
lambda_Lasso
lambda_Lasso output

Separating and converting the test dataset with independent features.

r
X2 <- as.matrix(test_set1[,-1])

Predicting Values

Again, we will have to use the cut off to convert the probabilities into binary values like we did in logistic regression.

r
pred_lasso<- predict(lasso_model,newx = X2,type = "response",s=lambda_Lasso)
pred_lasso = ifelse(pred_lasso> 0.5,1,0)

Accuracy Score

We finally calculate the accuracy of this model using the Metrics package.

r
library(Metrics)
Accuracy_lasso <- accuracy(test_set1$Survived,pred_lasso)
Accuracy_lasso
Accuracy_lasso output

Ridge

Initialize and Fit Model

Initializing and fitting model on the train dataset.

r
ridge_model<- glmnet(X1,Y1,family = "binomial",alpha = 0)

Define Lambda Value

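We define the value of lambda, i.e. we take the smallest value of lambda on the fitted path for prediction. A lower value of lambda means a weaker penalty, which tends to raise accuracy on the Train dataset but does not guarantee better accuracy on the Test dataset.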

r
lambda_R <- min(ridge_model$lambda)
lambda_R

Predicting Values

Again, we will have to use the cut off to convert the probabilities into binary values like we did in logistic regression.

r
pred_ridge = predict(ridge_model,newx = X2,type = "response",s=lambda_R)
pred_ridge1 = ifelse(pred_ridge> 0.5,1,0)

Accuracy Score

We now check the accuracy score.

r
Accuracy <- accuracy(test_set1$Survived,pred_ridge1)
Accuracy
Accuracy (Ridge) output

Elastic Net

Initialize and Fit Model

Initializing and fitting model on the train dataset.

r
enet_model<- glmnet(X1,Y1,family = "binomial",alpha =0.02)

Define Lambda Value

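We define the value of lambda, i.e. we take the smallest value of lambda on the fitted path for prediction. A lower value of lambda means a weaker penalty, which tends to raise accuracy on the Train dataset but does not guarantee better accuracy on the Test dataset.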

r
lambda_E <- min(enet_model$lambda)
lambda_E

Predicting Values

We will have to use the cut off to convert the probabilities into binary values like we did in logistic regression.

r
pred_enet = predict(enet_model,newx = X2,type = "response",s=lambda_E)
pred_enet = ifelse(pred_enet> 0.5,1,0)

Accuracy Score

We now check the accuracy score.

r
Accuracy_enet <- accuracy(test_set1$Survived,pred_enet)
Accuracy_enet
Accuracy_enet output

Tuning of Parameters

We will now tune the parameters for Regularized Logistic Regression using Grid Search and Random Search. As discussed above, we will first find the model with the best parameters and fit it on the Train dataset. Then, we will predict the values on the Test dataset and calculate the accuracy score. Here we will find the best value of the parameter lambda, which controls the strength of regularization (a larger lambda means a stronger penalty).

The functions grid search and random search are available in the train function of caret package. Therefore, only those models can be trained with grid/random search that are available in the caret package. Please refer to the link below for the available training models - https://topepo.github.io/caret/train-models-by-tag.html

In this package models have sub-categories and each has its own tuning parameter. For example, in decision tree, there are more than 3 categories rpart, rpart2 etc. Each of these has a different set of hyperparameters. You may refer to the link mentioned above for the listing of different models and their hyperparameters.
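
One detail worth keeping in mind: caret's train treats a numeric outcome as a regression problem and a factor outcome as a classification problem. The post does not show how Survived is stored at this point, so the sketch below is only an assumption about a conversion that may be needed; survived_num and survived_fac are hypothetical names.

```r
# Hypothetical 0/1 outcome stored as numbers
survived_num <- c(0, 1, 1, 0, 1)

# Converting to a factor marks the outcome as a class label
survived_fac <- factor(survived_num, levels = c(0, 1))
print(class(survived_num))
print(levels(survived_fac))

# Going back to numbers: convert via character, not directly
back_to_num <- as.numeric(as.character(survived_fac))
print(back_to_num)
```

Converting back through as.character matters because as.numeric on a factor returns the internal level codes (1 and 2 here) rather than the original 0 and 1.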

Grid Search

Ridge

Import caret Library. We first start off by importing the caret package.

r
library(caret)

Defining Cross Validation. Defining the control to specify the type of cross validation and the type of search.

r
control = trainControl(method ="cv",number =3,search = "grid")

Defining Parameters. We first start off with defining the parameters.

r
params_ridge = expand.grid(alpha=0,lambda=c(10**-4, 10**-2, 10**0, 10**4))

Initializing and fitting the model. We now build the Regularized Logistic Regression model using the Grid Search and fit it on the Train dataset.

r
ridge_gridsearch <- train(Survived~.,data =
train_set1,method="glmnet",tuneGrid=params_ridge,trControl=control)

Best Parameters. bestTune attribute can be used to find the best parameters.

r
ridge_gridsearch$bestTune
ridge_gridsearch$bestTune output

In this case, it took lambda=0.01 as the best parameter value.

Prediction. We now predict the Survival on the Test dataset.

r
pred_ridge_GS = predict(ridge_gridsearch,newdata = subset(test_set1,select =
2:10))

Accuracy. We now calculate the accuracy of the model.

r
Accuracy_ridge_GS <- accuracy(test_set1$Survived,pred_ridge_GS)
Accuracy_ridge_GS
Accuracy_ridge_GS output

The accuracy comes out to be at 78.35%.

Note - Here we have used alpha = 0 for ridge. You can perform the same steps for lasso and use alpha = 1.

Elastic Net

Defining Parameters. Parameters have to be defined first. Here we will be tuning both alpha and lambda. Alpha is the mixing parameter between the two penalties: alpha = 0 gives ridge, alpha = 1 gives lasso, and values in between give elastic net. We therefore tune alpha over values between 0 and 1 together with lambda.

r
params_enet = expand.grid(alpha=c(0.01,0.1,0.2,0.5),lambda=c(10**-4, 10**-2,
10**0,10**4))

Building Model. We build and fit the model and search for the best parameters.

r
enet_gridsearch <- train(Survived~.,data =
train_set1,method="glmnet",tuneGrid=params_enet,trControl=control)

Best Parameters. bestTune attribute can be used to find the best parameters.

r
enet_gridsearch$bestTune

Predict and Check Accuracy. The above model is used to predict the values of the dependent variable in the Test dataset and also the accuracy is calculated.

r
pred_enet1 = predict(enet_gridsearch,newdata = subset(test_set1,select =
2:10))
Accuracy_enet_GS <- accuracy(test_set1$Survived,pred_enet1)
Accuracy_enet_GS
Accuracy_enet_GS output

Decision Tree Classifier

Decision Trees are one of the most commonly used classification algorithms. In very simple words, Decision Trees let us build tree-structured flowcharts that predict the value of the class variable. Their inner workings have been explained in Decision Trees under the Theory Section.

Importing Libraries

We first import rpart library.

r
library(rpart)

Initializing and Fitting

We now Initialize and Fit Decision Trees Model.

r
DTC <- rpart(Survived~.,data=train_set, method = "class")

Prediction and Calculating Accuracy

The Decision Tree model is used to predict the Y variable in the Test dataset. We also check the accuracy of this model on the Test dataset.

r
predictionDT<- predict(DTC,newdata =test_set[2:10],type = "class")
Accuracy_DTC <- accuracy(test_set$Survived,predictionDT)
Accuracy_DTC
Accuracy_DTC output

The accuracy of this Decision Tree model comes out to be approximately 79%.

Tree Visualization

We can visualize the above-created Decision Tree. This helps in further understanding how the decision tree algorithm is working.

Importing rattle, rpart.plot and RColorBrewer to plot Decision Tree.

r
library(rattle)
library(rpart.plot)
library(RColorBrewer)

Creating Decision Tree Visualization.

r
plot(DTC,uniform = TRUE,margin = 0.1)
text(DTC,use.n = TRUE,all=TRUE,cex=0.8)
fancyRpartPlot(DTC,sub = "",cex=0.8)
Decision Tree visualization using fancyRpartPlot

Tuning Hyperparameters

To show an example of how hyperparameters can be tuned, we will use the rpart model of the caret package and tune its parameter cp, i.e. the complexity parameter. (The rpart2 model is the variant that tunes maxdepth, i.e. maximum depth, instead.)

Defining Parameters. Here we define the plausible values for cp.

r
tune_grid = expand.grid(cp=c(0.1,0.001,0.01,0.02,0.03))

Initializing and fitting Decision Tree. We now initialize the Decision Tree model and fit it on the train dataset.

r
control = trainControl(method ="cv",number =3,search = "grid")
dtc_gridsearch <- train(Survived~.,data =
train_set,method="rpart",tuneGrid=tune_grid,trControl=control)

Best Parameters. bestTune attribute can be used to find the best parameters.

r
dtc_gridsearch$bestTune
dtc_gridsearch$bestTune output

Predict and Check Accuracy. The above model with the above-mentioned values of hyperparameters is used to predict the values of the dependent variable in the Test dataset and also the accuracy is calculated.

r
pred_dtc_gs <- predict(dtc_gridsearch,newdata = subset(test_set,select =
c(2:10)))
Accuracy_DTC_GS <- accuracy(test_set$Survived,pred_dtc_gs)
Accuracy_DTC_GS
Accuracy_DTC_GS output

The accuracy comes out to be 79%.

K Nearest Neighbour

KNN is a distance-based algorithm which predicts value based on the number of class observations found in its neighbourhood. For a detailed understanding of KNN refer to K Nearest Neighbour under the Theory Section. For KNN we will be using knn3 model from caret package.

Initializing and Fitting KNN Model

In this step, we first initialize the KNN model and then fit it on the Train dataset. Note that this is the same Train dataset that we used earlier for Regularized Logistic Regression. KNN needs a standardized dataset because it relies on the distance between observations. Therefore, for this model, we use the dataset in which all the numerical features are scaled and the target variable is left untouched.

r
knn_model2 <- knn3(Survived~.,data = train_set1,k=5)

Predict and Check Accuracy

The above model is used to predict the values of the dependent variable in the Test dataset and the accuracy is calculated.

r
knn.results<- predict(knn_model2,newdata=test_set1[2:10],type = "prob")

Results

In knn3, the results are produced with probabilities of 0 and 1 component of the Survival variable. The results of first six rows are as follows -

r
knn.results
knn.results first 6 rows output

Accuracy

We will use the probability of class 1 and convert it into a binary variable, as we did for logistic regression, in order to calculate the accuracy score. For that, we first convert knn.results into a data frame.

r
knn_results <- as.data.frame(knn.results)
knn_results['result'] <- ifelse(knn_results$`1`>0.5,1,0)
Accuracy_knn <- accuracy(test_set1$Survived,knn_results$result)
Accuracy_knn
Accuracy_knn output

Tuning Hyperparameters

In this blog post, we will tune the number of neighbours for knn model in caret package.

Grid Search

Defining Parameters. We define the values for the parameter.

r
params_knn <-expand.grid(k=c(5,6,7,8,9,10))
control = trainControl(method ="cv",number =3,search = "grid")

Building and Fitting Model. We now build a KNN model using train function of caret and fit it on the Train dataset.

r
knn_gridsearch <- train(Survived~.,data =
train_set1,method="knn",tuneGrid=params_knn,trControl=control)

Best Parameters. bestTune attribute can be used to find the best parameters.

r
knn_gridsearch$bestTune
knn_gridsearch$bestTune output

Predict and Check Accuracy. The above model with the above-mentioned values of hyperparameters is used to predict the values of the dependent variable in the Test dataset and also the accuracy is calculated.

r
pred_knn <- predict(knn_gridsearch,newdata = subset(test_set1,select = 2:10))
Accuracy_knn_GS <- accuracy(test_set1$Survived,pred_knn)
Accuracy_knn_GS
Accuracy_knn_GS output

The accuracy obtained from this model is 78%.

Support Vector Machine

SVM is a technique commonly used for solving classification problems. It has been explained in the Support Vector Machines under the Theory Section. Here we will put it to use using R.

Importing Package

We first start with importing the e1071 package.

r
library(e1071)

Initializing and Fitting Model

In this step, we initialize the SVM model and fit it on the Train dataset. Note that SVM requires standardized dataset to fit the model and make predictions and we will use the dataset that we have used earlier where the numerical features have been scaled and dummy variables have been created for categorical features.

r
modelSVM<- svm(Survived~.,data =train_set1,kernel="linear")

Predict and Check Accuracy

The above model is used to predict the values of the dependent variable in the Test dataset and the accuracy is calculated.

r
SVM.results<- predict(modelSVM,newdata=test_set1[2:10],type='response')
Accuracy_svm <- accuracy(test_set1$Survived,SVM.results)
Accuracy_svm
Accuracy_svm output

The accuracy obtained from the SVM model is 76%.

Tuning Hyperparameters

In this blog post, we will tune one parameter of SVM, i.e. C. For tuning we use caret's svmLinear method, which is backed by the kernlab package rather than by the svm function of e1071 used above. Since svmLinear fixes the kernel to 'linear', C is the only parameter we tune.

Grid Search

Defining Parameters. We define the plausible values for our parameter.

r
params_svm <- expand.grid(C=c(0.01,0.1,1))
control = trainControl(method ="cv",number =3,search = "grid")

Building and Fitting Model. We now build an SVM model using Grid Search and fit it on the Train dataset.

r
SVM_gridsearch <- train(Survived~.,data =
train_set1,method="svmLinear",tuneGrid=params_svm,trControl=control)

Best Parameters. We now check the selected best parameters.

r
SVM_gridsearch$bestTune

Predict and Check Accuracy. The above model and hyperparameters are used to predict the values of the dependent variable in the Test dataset and calculate the accuracy.

r
pred_svm <- predict(SVM_gridsearch,newdata = subset(test_set1,select = 2:10))
Accuracy_svm_GS <- accuracy(test_set1$Survived,pred_svm)
Accuracy_svm_GS
Accuracy_svm_GS output

The accuracy remains the same after tuning hyperparameters in this case.

Artificial Neural Networks

Running ANN Model.

r
library(RSNNS)
X1 <- as.matrix(train_set1[,-1])
Y1 <- as.matrix(train_set1[,1])
Y1 <- as.numeric(Y1)
ann <- mlp(X1,Y1)

Calculating Accuracy of the ANN Model.

r
ann_results <- predict(ann,newdata=test_set1[2:10],type='response')
ann_results <- ifelse(ann_results>0.5,1,0)
Accuracy_ann1 <- accuracy(test_set1$Survived,ann_results)
Accuracy_ann1
Accuracy_ann1 output

The accuracy comes out to be 80%.

Tuning Hyperparameters

Tuning hyperparameters using Grid Search.

r
params_ann <- expand.grid(size=c(5,10,20,30,40,45,60))
ann_gridsearch <- train(Survived~.,data = train_set1,method="mlp",
tuneGrid=params_ann,trControl=control)

Finding the best parameter.

r
ann_gridsearch$bestTune
ann_gridsearch$bestTune output

Calculating Accuracy.

r
pred_ann1 = predict(ann_gridsearch,newdata = subset(test_set1,select = 2:10))
Accuracy_ann_GS <- accuracy(test_set1$Survived,pred_ann1)
Accuracy_ann_GS
Accuracy_ann_GS output

The accuracy comes out to be approximately 81%.

Naive Bayes

Naive Bayes is a probabilistic model based on Bayes' Theorem. It has been explored under the Theory Section in the blog Naive Bayes. In this blog, we will use Naive Bayes to solve this classification problem of who survives and who doesn't.

Importing e1071 Package

To run a Naive Bayes model in R, we require e1071 which we import using the library command.

r
library(e1071)

Initializing and Fitting Model

In this step, we initialise the Naive Bayes model and fit it on the Train dataset.

r
nb_model<- naiveBayes(Survived~.,data = train_set)

Predict and Check Accuracy

The above model is used to predict the values of the dependent variable in the Test dataset and calculate the accuracy.

r
pred_nb<- predict(nb_model,newdata = test_set[2:10])
Accuracy_nb <- accuracy(test_set$Survived,pred_nb)
Accuracy_nb
Accuracy_nb output

The accuracy obtained from the Naive Bayes model is 78%.

Ensemble Models

In the Theory section, under Ensemble Methods, various kinds of ensemble techniques have been explored. Here we will explore all those ensemble techniques using R.

Random Forest Classifier

Random Forest Classifier is a variant of the Bagging Classifier; more about it can be found in the blog Bagging available in the Theory Section.

Importing randomForest Library

We have to import randomForest to run a Random Forest Classification model.

r
install.packages("randomForest")
library(randomForest)

Initializing and Fitting Model

We initialize the Random Forest model and fit it on the Train dataset.

r
rfc<- randomForest(Survived~.,data = train_set)

Predict and Check Accuracy

The above model is used to predict the values of the dependent variable in the Test dataset. We also check the model's performance.

r
pred_rfc<- predict(rfc,newdata = test_set[2:10])
Accuracy_rf <- accuracy(test_set$Survived,pred_rfc)
Accuracy_rf
Accuracy_rf output

The accuracy obtained from this Random Forest model is 82%.

Tuning Hyperparameters

Grid Search

Defining Parameters. We will be tuning this random forest model based on the parameter mtry which is the number of variables sampled for splitting.

r
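control = trainControl(method ="cv",number =3,search = "grid")
# Candidate mtry values; ncol(train_set) counts all 10 columns, including Survived
mtry_sqrt = sqrt(ncol(train_set))
mtry_log2 = log2(ncol(train_set))
n_features = 9
tune_grid_RF = expand.grid(mtry = c(mtry_sqrt,mtry_log2, n_features))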

Initializing, Building and Fitting Model. In this step, we initialize and build the Random Forest Classification model using Grid Search and fit it on the Train dataset.

r
RF_gridsearch <- train(Survived~.,data =
train_set,method="rf",tuneGrid=tune_grid_RF,trControl=control)

Best Parameters. We now check for the best parameters.

r
RF_gridsearch$bestTune
RF_gridsearch$bestTune output

Predict and Check Accuracy. We use this model to predict the dependent variable in the test dataset and check its accuracy.

r
pred_RF_GS = predict(RF_gridsearch,newdata = subset(test_set,select =
c(2:10)))
Accuracy_RF_GS <- accuracy(test_set$Survived,pred_RF_GS)
Accuracy_RF_GS
Accuracy_RF_GS output

The accuracy comes out to be 82%.

AdaBoost Classifier

The AdaBoost classifier builds a classifier (decision tree) and if a training data point is misclassified then the weight of that training data point is increased i.e. it is boosted.
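
The reweighting idea can be worked through for a single round. The sketch below follows the classic discrete AdaBoost update on five made-up points; it is an illustration only and not necessarily the exact variant that the ada package runs with type = "real". The names actual, predicted, w and alpha are hypothetical.

```r
# Hypothetical labels (+1/-1) and the predictions of one weak classifier
actual    <- c(1, -1,  1, 1, -1)
predicted <- c(1, -1, -1, 1, -1)   # the third point is misclassified

# Start with equal weights
w <- rep(1 / length(actual), length(actual))

# Weighted error of the weak classifier and its vote (alpha)
miss <- predicted != actual
err <- sum(w[miss])
alpha <- 0.5 * log((1 - err) / err)

# Increase the weight of misclassified points, decrease the rest, then renormalize
w_new <- w * exp(ifelse(miss, alpha, -alpha))
w_new <- w_new / sum(w_new)

print(err)
print(alpha)
print(w_new)
```

With one mistake out of five, the error is 0.2 and alpha is log(2). After renormalizing, the misclassified point carries a weight of 0.5 while each correctly classified point drops to 0.125, so the next tree concentrates on the point that was missed.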

Installing and Loading Libraries

We will be importing ada package for AdaBoost Modeling.

r
install.packages("ada")
library(ada)

Initializing and Fitting AdaBoost Model

In this step, we initialize the AdaBoost model and fit this model on the Train Dataset.

r
mod_ada2 <- ada(Survived~.,data = train_set,type = "real")

Predict and Check Accuracy

The above model is used to predict the values of the dependent variable in the Test dataset and the accuracy is calculated.

r
pred_ada<- predict(mod_ada2,newdata =test_set[2:10])
Accuracy_ada <- accuracy(test_set$Survived,pred_ada)
Accuracy_ada
Accuracy_ada output

The accuracy comes out to be at approximately 81%.

Tuning Hyperparameters

In this blog post, we will tune the number of estimators and the learning rate.

Grid Search

Defining Parameters. We first define the values for our parameters.

r
params_ada =
expand.grid(iter=c(10,30,50,60,70,75),maxdepth=c(3,5,6,7),nu=c(0.05,0.1,0.2,1
))

Building and Fitting Model. We now build an AdaBoost model using Grid Search and fit it on the Train dataset.

r
adab_gridsearch <- train(Survived~.,data =
train_set,method="ada",tuneGrid=params_ada,trControl=control)

Best Parameters. We use the bestTune attribute to check for the best parameters.

r
adab_gridsearch$bestTune
adab_gridsearch$bestTune output

Nu is the learning rate and iter is the number of estimators.

Predict and Check Accuracy. The above model with the above-mentioned values of hyperparameters is used to predict the values of the dependent variable in the Test dataset and also the accuracy is calculated.

r
pred_ada_GS = predict(adab_gridsearch,newdata =subset(test_set,select =
c(2:10)))
Accuracy_ada_GS <- accuracy(test_set$Survived,pred_ada_GS)
Accuracy_ada_GS
Accuracy_ada_GS output

The accuracy obtained from this model is 79%.

Gradient Boosting Classifier

Gradient Boosting Classifier is another type of a Boosting Model. Refer to the blog Boosting under Ensemble Methods in the Theory Section to know more about it.

Initializing and Fitting Model

We initialise the model and fit it on the Train dataset.

r
library(gbm)
mod_gbm2 <- gbm(Survived~.,data = train_set,distribution = "gaussian")

Predict and Check Accuracy

The above model is used to predict the values of the dependent variable in the Test dataset and check its accuracy.

r
pred_gbm<- predict(mod_gbm2,newdata = test_set[2:10],n.trees=100)
normPreds<-(pred_gbm-min(pred_gbm))/(max(pred_gbm)-min(pred_gbm))
# Convert the normalized scores to 0/1 classes with a 0.5 cut-off
normPreds<- ifelse(normPreds > 0.5,1,0)
Accuracy_gbm <- accuracy(test_set$Survived,normPreds)
Accuracy_gbm
Accuracy_gbm output

The accuracy obtained from this Gradient Boosting model is approximately 77%.

Tuning Hyperparameters

Here we tune four parameters namely - n.trees, interaction.depth, shrinkage and n.minobsinnode. n.trees is the number of trees, interaction.depth is maximum depth, shrinkage is the learning rate and n.minobsinnode is the number of minimum observations in the node.

Grid Search

Defining Parameters. Here we define our four parameters.

r
params_gradb =
expand.grid(n.trees=c(10,30,50,70),interaction.depth=c(3,5,6,7),
shrinkage=c(0.05,0.1,0.2),n.minobsinnode=c(2, 3, 10))
control = trainControl(method ="cv",number =3,search = "grid")

Initializing, Building and Fitting Model. In this step, we initialize and build the Gradient Boosting Classification model using Grid Search and fit it on the Train dataset.

r
gradb_gridsearch <- train(Survived~.,data =
train_set,method="gbm",tuneGrid=params_gradb,trControl=control)

Best Parameters. We now check the best combination of parameters.

r
gradb_gridsearch$bestTune
gradb_gridsearch$bestTune output

Predict and Check Accuracy. We use this model to predict the dependent variable in the test dataset and check its accuracy.

r
pred_gradb = predict(gradb_gridsearch,newdata = subset(test_set,select =
c(2:10)))
Accuracy_gbm_GS <- accuracy(test_set$Survived,pred_gradb)
Accuracy_gbm_GS
Accuracy_gbm_GS output

The accuracy comes out to be 80.6%.

XgBoost Classifier

XGBoost stands for Extreme Gradient Boosting, which is an advanced implementation of Gradient Boosting.

Installing Libraries

Installing and loading libraries required for XGBoost.

r
install.packages("xgboost")
library(xgboost)

Transforming Datasets

In this step, we transform datasets with features and target variable to matrices for modeling.

r
x <- as.matrix(train_set[,-1])
y <- as.matrix(train_set[,1])
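
Since the model is fitted on a matrix rather than a data frame, it is worth checking what `as.matrix()` actually returns. The sketch below uses a small made-up data frame (the column names only mimic the Titanic data) to show that an all-numeric data frame becomes a numeric matrix, whereas a single character column turns the whole matrix into character.

```r
# Made-up data frame; the column names only mimic the Titanic data
toy <- data.frame(Survived = c(0, 1, 1),
                  Pclass   = c(3, 1, 2),
                  Age      = c(22, 38, 26))

x_toy <- as.matrix(toy[, -1])  # features: every column except the first
y_toy <- toy[, 1]              # target: the first column as a plain vector
is.numeric(x_toy)              # all-numeric columns give a numeric matrix

# Adding one non-numeric column coerces the whole matrix to character
toy$Sex <- c("male", "female", "female")
x_bad <- as.matrix(toy[, -1])
is.character(x_bad)
```

If the second check applies to your own data, the categorical columns need to be encoded as numbers before the conversion.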

Initializing and Fitting Model

We initialise the model and fit it on the Train dataset.

r
mod_xgb<- xgboost(data = x,label = y,nrounds = 2,objective =
"binary:logistic")
XGBoost training log output

Prediction and Accuracy

The above model is used to predict the values of the dependent variable in the Test dataset, and its accuracy is then computed on that dataset.

r
# binary:logistic already returns probabilities, so no type argument is needed
pred_xgb <- predict(mod_xgb,newdata = as.matrix(test_set[2:10]))
pred_xgb <- ifelse(pred_xgb > 0.5,1,0)
Accuracy_xgb <- accuracy(test_set$Survived,pred_xgb)
Accuracy_xgb
Accuracy_xgb output

The accuracy obtained from this XGBoost Classification model is 78.36%.

Tuning Hyperparameters

Here we tune for 7 parameters: max_depth, min_child_weight, eta, gamma, subsample, colsample_bytree and nrounds.

Grid Search

Defining Parameters. First, we define our seven parameters.

r
params_xgb = expand.grid(nrounds=c(10,30,50,70),
                         gamma=c(0,2,3,5),
                         max_depth=c(3,5,6,7,8),
                         eta=c(0.05,0.1,0.2),
                         min_child_weight=c(5,6,7,8),
                         subsample=c(0.5,1),
                         colsample_bytree=c(0.5,0.7,0.9))
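
Because `expand.grid()` builds every combination of the supplied values, the grid grows multiplicatively. The quick sketch below counts the candidate combinations for the values listed above; the names `grid_xgb` and `n_combos` are used only for illustration.

```r
grid_xgb <- expand.grid(nrounds = c(10, 30, 50, 70),
                        gamma = c(0, 2, 3, 5),
                        max_depth = c(3, 5, 6, 7, 8),
                        eta = c(0.05, 0.1, 0.2),
                        min_child_weight = c(5, 6, 7, 8),
                        subsample = c(0.5, 1),
                        colsample_bytree = c(0.5, 0.7, 0.9))

# Number of rows = 4 * 4 * 5 * 3 * 4 * 2 * 3 candidate combinations
n_combos <- nrow(grid_xgb)
n_combos

# First few candidate combinations
head(grid_xgb, 3)
```

Every combination is scored with cross-validation, so trimming the value lists is the simplest way to shorten the search.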

Initializing, Building and Fitting Model. In this step, we initialize and build the XGBoost Classification model using Grid Search and fit it on the Train dataset.

r
xgb_gridsearch <- train(Survived~.,data =
train_set,method="xgbTree",tuneGrid=params_xgb,trControl=control)

Best Parameters. We now check for the best combination of parameters.

r
xgb_gridsearch$bestTune
xgb_gridsearch$bestTune output

Predict and Check Accuracy. We use this model to predict the dependent variable in the test dataset and check its accuracy.

r
pred_xgb_GS = predict(xgb_gridsearch,newdata =subset(test_set,select =
c(2:10)))
Accuracy_xgb_GS <- accuracy(test_set$Survived,pred_xgb_GS)
Accuracy_xgb_GS
Accuracy_xgb_GS output

The accuracy comes out to be 80.97%.

Stacking Classifier

Stacking is a method where we use multiple learning algorithms and get a result by combining the results of all these separate algorithms. In this blog, we will perform a Level-One stacking. To know more about it refer to the blog - Stacking under the Theory section.
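
To make the idea concrete, here is a hand-rolled sketch of level-one stacking on simulated data, using only base R `glm`. It isn't the caretEnsemble implementation used in the rest of this section; the data, the two base learners and all object names are assumptions made for illustration.

```r
set.seed(42)

# Simulated binary-classification data (not the Titanic data)
n  <- 300
df <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
df$y <- rbinom(n, 1, plogis(1.5 * df$x1 - df$x2))

train <- df[1:200, ]
test  <- df[201:300, ]

# Level 0: two simple base learners, each seeing a different feature
fit_a <- function(d) glm(y ~ x1, data = d, family = binomial)
fit_b <- function(d) glm(y ~ x2, data = d, family = binomial)

# Out-of-fold predictions, so the meta-learner is trained on predictions
# made for rows the base learners did not see
folds <- sample(rep(1:3, length.out = nrow(train)))
oof   <- data.frame(a = numeric(nrow(train)), b = numeric(nrow(train)))
for (k in 1:3) {
  held_out <- folds == k
  oof$a[held_out] <- predict(fit_a(train[!held_out, ]), train[held_out, ],
                             type = "response")
  oof$b[held_out] <- predict(fit_b(train[!held_out, ]), train[held_out, ],
                             type = "response")
}
oof$y <- train$y

# Level 1: logistic regression combines the base-learner probabilities
meta <- glm(y ~ a + b, data = oof, family = binomial)

# Test-time: base learners are refit on all training rows, then stacked
test_level0 <- data.frame(
  a = predict(fit_a(train), test, type = "response"),
  b = predict(fit_b(train), test, type = "response")
)
stack_prob  <- predict(meta, test_level0, type = "response")
stack_class <- ifelse(stack_prob > 0.5, 1, 0)

# Accuracy of the stacked model on the simulated test rows
mean(stack_class == test$y)
```

The meta-layer here plays the same role as the Logistic Regression meta-layer used with `caretStack` later in this section.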

Importing Library

We start by loading the caretEnsemble and caret libraries.

r
library(caretEnsemble)
library(caret)

Creating a Stacked Model

Here we use the scaled dataset, as some of the models used for stacking require scaling. train_set1 is the scaled train dataset that was used earlier for models such as KNN and ANN.

We change the names for the dependent class variable 'Survived'. This is done to avoid the error while computing class probabilities.

r
train_set1$Survived <- as.factor(ifelse(train_set1$Survived == 0, "N", "Y") )
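
caret builds one probability column per class level when `classProbs=TRUE`, so the levels have to be valid R variable names. The sketch below, on a made-up vector, uses base R's `make.names()` to show that `0` and `1` are not valid names as they stand, while `N` and `Y` are.

```r
# Made-up 0/1 outcome for illustration
survived <- c(0, 1, 1, 0)

# make.names() leaves its input unchanged only when it is already a valid name
make.names(c("0", "1"))   # "X0" "X1"
make.names(c("N", "Y"))   # "N" "Y"

# Recode to valid level names, mirroring the recoding of train_set1 above
survived_f <- as.factor(ifelse(survived == 0, "N", "Y"))
levels(survived_f)
```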

Defining trControl parameter to tell the program to compute classProbs i.e. Class Probabilities.

r
control <- trainControl(method="cv", number=3,savePredictions=TRUE,
classProbs=TRUE)

Defining the algorithm list for Stacking.

r
algorithmList <- c('rf', 'glm', 'knn','svmLinear')

Initiating and fitting the above-mentioned algorithms using caretList function on the dataset.

r
models <- caretList(Survived~., data=train_set1, trControl=control,
methodList=algorithmList)

Stacking all the models through meta-layer of Logistic Regression.

r
control_S <- trainControl(method="cv", number=3, savePredictions=TRUE,
classProbs=TRUE)
stack_glm <- caretStack(models, method="glm", metric="Accuracy",
trControl=control_S)
stack_glm
stack_glm output

Predicting and Checking Accuracy

We now predict the dependent variable in the Test dataset and on the basis of these predictions check for the accuracy of this stacked model.

r
pred_test <- ifelse(predict(stack_glm, newdata =test_set1[2:10]) == "Y",1,0)
Accuracy_stack <- accuracy(test_set1$Survived,pred_test)
Accuracy_stack
Accuracy_stack output

In this blog, we applied the various classification algorithms discussed in the Theory section of Modeling using R. The same algorithms have been put to use using Python in the blog - Classification Problems in Python.
