Model Validation in Python

Various model validation techniques can be used to find the best model and the best set of hyperparameters. We validate the performance of the model not only on the training dataset but also on the test (unseen) dataset. In this blog, we will discuss a range of cross-validation methods that can be used to validate supervised learning models in Python. Many of these methods have been explored under the theory section in Model Validation. Here we will use a pre-processed Boston dataset to demonstrate regression-based cross-validation techniques and a pre-processed Titanic dataset for classification-based cross-validation.

Preparation

Import of Preliminary Libraries

We first start off by adding some basic libraries.

python

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
from sklearn.model_selection import train_test_split, StratifiedKFold
from sklearn import linear_model
from sklearn.linear_model import LogisticRegression

Dataset

As mentioned above, we will apply every cross-validation technique except stratified cross-validation to the pre-processed Boston dataset, which we used earlier in the application part of the modeling section.

python

BosData = pd.read_excel('C:/Users/user/Desktop/Data Sets/Linear Regression/Bos_train1.xls')
BosData.head()

BosData.head() - first 5 rows of the Boston dataset with columns CRIM, ZN, INDUS, CHAS, NOX, RM, AGE, DIS, RAD, TAX, PTRATIO, B, LSTAT, Price, ln_Price — Output: the Boston dataset.

Now we separate the features from the target. X holds all the independent features, while Y holds the dependent (target) variable, which in this case is 'ln_Price'.

python

X = BosData[['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS', 'RAD', 'TAX','PTRATIO', 'B', 'LSTAT']]
Y = BosData['ln_Price']

Validation for Finding Best Model

We will first use the different validation techniques to find the best model. Here we do not perform any hyperparameter tuning; we simply see how the model performs on the test dataset(s) and pick the best model based on the evaluation scores.

Holdout Cross-Validation

Holdout cross-validation is the simplest form of validation and is often called 'simple validation' rather than cross-validation proper. We divide the dataset into two parts: a Train dataset, on which the model is fitted and learns the function, and a Test dataset, on which the fitted model is evaluated. Here we will run a Linear Regression algorithm on the Boston dataset and use the holdout cross-validation technique.

Splitting Dataset

We first split the dataset into train and test. We take a 70:30 ratio keeping 70% of the data for training and 30% for testing.

python

X_train,X_test,Y_train,Y_test=train_test_split(X,Y,test_size=0.3,random_state=123)

Initializing Linear Regression Model

Here we initialize the Linear Regression model. Right now we are using no hyperparameters and simply use LinearRegression() to initialize.

python

linreg_model = linear_model.LinearRegression()

Fitting Model

We fit the Linear Regression model on the Train Dataset.

python

linreg_model.fit(X_train,Y_train)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)

Prediction

The Linear Regression model is now used to predict the Y variable in the Test dataset.

python

pred_linmodel = linreg_model.predict(X_test)

Calculating Accuracy

We can now evaluate the model. To do so, we import metrics from sklearn and calculate R², which tells us how well the model performs on the Test dataset.

python

from sklearn import metrics
metrics.r2_score(Y_test,pred_linmodel)

0.7194695009566641

This model achieves an R² of about 0.72 on the Test dataset. However, as discussed in the theory section, a score from a single holdout split depends on which rows happen to land in the Test dataset and can hide overfitting, so more robust methods such as k-fold cross-validation should be used.
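To illustrate how much a single holdout score can move with the split, here is a minimal sketch. It does not use the Boston file; `make_regression` generates a synthetic stand-in (an assumption made only so the snippet is self-contained), and the names `X_demo`, `y_demo` and `holdout_scores` are illustrative.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the Boston data (assumption: 13 features, noisy target).
X_demo, y_demo = make_regression(n_samples=300, n_features=13, noise=25.0, random_state=0)

holdout_scores = []
for seed in range(10):
    # Same 70:30 ratio as above; only the random seed of the split changes.
    X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.3, random_state=seed)
    model = LinearRegression().fit(X_tr, y_tr)
    holdout_scores.append(r2_score(y_te, model.predict(X_te)))

holdout_scores = np.array(holdout_scores)
print("min R2: %0.3f, max R2: %0.3f" % (holdout_scores.min(), holdout_scores.max()))
```

The gap between the smallest and largest score gives a rough feel for how far one holdout estimate can be trusted on data of this size.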

K-Fold Cross Validation

In this method, we divide the dataset into K folds. In each iteration one fold serves as the test set and the remaining folds as the train set; we fit the model on train, evaluate it on test and record the score. This way every data point is used for training as well as for testing. In the end, we average all the scores and the average becomes the evaluation score of our model. In Python, we can perform K-Fold Cross-Validation using two tools from sklearn.model_selection: the function cross_val_score and the class KFold.

cross_val_score

Importing cross_val_score: We first import the function cross_val_score from sklearn.model_selection to perform K-Fold Cross-Validation.

python

from sklearn.model_selection import cross_val_score

Initializing Linear Regression Model: We then initialise a simple Linear Regression model.

python

from sklearn import linear_model
lin_reg = linear_model.LinearRegression()

Running cross-validation: We now run K-Fold Cross Validation on the dataset using the above created Linear Regression model. Here we use 5 as the value of K.

python

lin_model_cv = cross_val_score(lin_reg,X,Y,cv=5)

Cross-Validation Scores: We display the R² scores obtained from each of the 5 iterations performed during the 5-Fold Cross-Validation.

python

lin_model_cv

array([0.69921496, 0.67542148, 0.61784791, 0.48789074, 0.51567533])

Final Score: As mentioned earlier, we average out all the cross-validation scores and come up with a single, final score which serves as the accuracy score of our model.

python

print("RSquare: %0.2f (+/- %0.2f)" % (lin_model_cv.mean(), lin_model_cv.std() * 2))

RSquare: 0.60 (+/- 0.17)

We get a mean R² of 0.60 when we use K-Fold Cross-Validation. Note that all five iterations, and consequently their average, give a lower R² than the holdout split did, even though the modeling algorithm is exactly the same in both methods. This suggests that the single holdout score was optimistic and that K-Fold Cross-Validation gives a more realistic, less split-dependent evaluation of our model. Part of the gap may also arise because cv=5 builds the folds from consecutive rows without shuffling, whereas train_test_split shuffles the rows first. Also note that, when cv is given as an integer, cross_val_score runs a plain K-Fold Cross-Validation for a regression model and a Stratified K-Fold Cross-Validation for a classification model.
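The choice between plain and stratified folds can be inspected with `check_cv` from sklearn.model_selection, the utility that turns a `cv` argument into a splitter object. The sketch below uses two small hypothetical targets (`y_continuous`, `y_binary`) purely for illustration.

```python
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold, check_cv

y_continuous = np.linspace(0.0, 1.0, 20)   # regression-style target
y_binary = np.array([0, 1] * 10)           # classification-style target

# classifier=False mirrors what happens when the estimator is a regressor.
splitter_reg = check_cv(5, y_continuous, classifier=False)
# classifier=True with a binary or multiclass target selects stratification.
splitter_clf = check_cv(5, y_binary, classifier=True)

print(type(splitter_reg).__name__)
print(type(splitter_clf).__name__)
```

Passing an explicit splitter object as `cv` overrides this default, which is how the later sections control the splitting strategy.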

KFold

Importing KFold: Another way of performing K-Fold Cross-Validation is to use the class KFold found in sklearn.model_selection.

python

from sklearn.model_selection import KFold

Running KFold: We now set up K-Fold Cross-Validation for the Linear Regression model created earlier. We again use 5 as the value of K, which is declared through n_splits.

python

lin_model_kfold = KFold(n_splits=5)

Cross-Validation Scores: We compute the accuracy scores obtained from the 5 iterations.

python

cross_val_score(lin_reg, X, Y, cv=lin_model_kfold)

array([0.69921496, 0.67542148, 0.61784791, 0.48789074, 0.51567533])

We get the exact same scores which we got when we used cross_val_score to run K-Fold Cross-Validation.
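What cross_val_score does with a KFold splitter can be written out as an explicit loop. This is a minimal sketch on synthetic data from `make_regression` (an assumption, used in place of the Boston file); `manual_scores` and `helper_scores` are illustrative names.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

X_demo, y_demo = make_regression(n_samples=200, n_features=13, noise=20.0, random_state=0)

kf = KFold(n_splits=5)  # no shuffling: folds are consecutive blocks of rows
manual_scores = []
for train_idx, test_idx in kf.split(X_demo):
    fold_model = LinearRegression().fit(X_demo[train_idx], y_demo[train_idx])
    # .score() returns R2 for regressors, the same default metric cross_val_score uses.
    manual_scores.append(fold_model.score(X_demo[test_idx], y_demo[test_idx]))

manual_scores = np.array(manual_scores)
helper_scores = cross_val_score(LinearRegression(), X_demo, y_demo, cv=kf)
print(np.allclose(manual_scores, helper_scores))
```

If the rows of a dataset are ordered in some meaningful way, `KFold(n_splits=5, shuffle=True, random_state=...)` shuffles them before the folds are formed.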

Shuffle Split K-Fold Cross-Validation

It is a random-permutation variant of cross-validation: in each iteration the data is shuffled and split afresh into train and test sets. Unlike K-Fold, the test sets of different iterations may overlap and some observations may never be used for testing. Here test_size specifies the proportion of the data held out for testing and n_splits specifies the number of times the splitting takes place.

Running Shuffle Split and Obtaining Scores: We use the Linear Regression model created earlier and draw 5 random splits, each holding out 30% of the dataset as the test set. We then calculate the R² score for each of the iterations.

python

from sklearn.model_selection import ShuffleSplit
lin_shuffle_split = ShuffleSplit(n_splits=5,test_size=0.3)
cross_val_score(lin_reg, X, Y, cv=lin_shuffle_split)

array([0.70236081, 0.79318408, 0.80679834, 0.76008397, 0.7542301 ])
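The difference in how KFold and ShuffleSplit cover the rows can be checked by counting how often each row appears in a test set. The sketch below uses a hypothetical 50-row placeholder array; `count_test_appearances` is a helper written for this illustration, not part of scikit-learn.

```python
import numpy as np
from sklearn.model_selection import KFold, ShuffleSplit

n_rows = 50
X_demo = np.zeros((n_rows, 1))  # placeholder features; only the row count matters

def count_test_appearances(splitter):
    """Count how many times each row lands in a test set."""
    counts = np.zeros(n_rows, dtype=int)
    for _, test_idx in splitter.split(X_demo):
        counts[test_idx] += 1
    return counts

kfold_counts = count_test_appearances(KFold(n_splits=5))
shuffle_counts = count_test_appearances(ShuffleSplit(n_splits=5, test_size=0.3, random_state=0))

print("KFold: every row tested exactly once ->", bool((kfold_counts == 1).all()))
print("ShuffleSplit: rows never tested =", int((shuffle_counts == 0).sum()),
      "| rows tested more than once =", int((shuffle_counts > 1).sum()))
```

Setting `random_state` on ShuffleSplit, as done here, makes the random splits reproducible from run to run.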

Repeated K-Fold Cross-Validation

It runs K-Fold multiple times, producing a different random split of the data in each repetition.

Running Repeated K-Fold and Obtaining Scores: We use the Linear Regression model and perform a 5-Fold Cross-Validation repeated 5 times, and then calculate the R² scores for all the iterations.

python

from sklearn.model_selection import RepeatedKFold
lin_rkfold = RepeatedKFold(n_splits=5,n_repeats=5)
cross_val_score(lin_reg, X, Y, cv=lin_rkfold)

array([0.73759656, 0.77775281, 0.81230852, 0.77165145, 0.73759371, 0.78772348, 0.74373336, 0.82300168, 0.7563859 , 0.74394437, 0.79656632, 0.75234766, 0.72712025, 0.76030937, 0.81009001, 0.78984165, 0.80368789, 0.74387172, 0.7472918 , 0.75967043, 0.75320543, 0.79755522, 0.80178623, 0.77372088, 0.72041396])

We notice from the output that the 25 iterations produced different R-Square values. There are 25 of them because we set the number of splits to 5 and the number of repetitions to 5, and 5×5 gives 25 iterations.
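The 25 scores can be summarised per repetition as well as overall. This is a minimal sketch on synthetic `make_regression` data (an assumption, standing in for the Boston file), with a fixed `random_state` so the splits are reproducible.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import RepeatedKFold, cross_val_score

X_demo, y_demo = make_regression(n_samples=200, n_features=13, noise=20.0, random_state=0)

rkf = RepeatedKFold(n_splits=5, n_repeats=5, random_state=42)
print("total fits:", rkf.get_n_splits())

scores = cross_val_score(LinearRegression(), X_demo, y_demo, cv=rkf)
# Scores come back repeat by repeat, so each row of the reshaped array is one full 5-fold pass.
per_repeat_mean = scores.reshape(5, 5).mean(axis=1)
print("mean R2 per repeat:", np.round(per_repeat_mean, 3))
print("overall: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))
```

Comparing the per-repeat means shows how much of the spread comes from the particular way the data was partitioned.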

Leave-One-Out Cross-Validation (LOOCV)

In Leave-One-Out Cross-Validation, every observation in the dataset gets a turn as the test set. For each iteration, the model is trained on all remaining observations and tested on the single held-out one, which means the number of folds equals the number of rows in the dataset. In sklearn, LeaveOneOut from sklearn.model_selection handles this directly.

python

from sklearn.model_selection import LeaveOneOut
loo = LeaveOneOut()
loo_scores = cross_val_score(lin_reg, X, Y, cv=loo)
print("RSquare: %0.2f (+/- %0.2f)" % (loo_scores.mean(), loo_scores.std() * 2))

RSquare: 0.68 (+/- 1.18)

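LOOCV uses every data point for testing exactly once, which gives a thorough estimate of model performance. Note, however, that R² is not defined for a test set containing a single observation, so with LeaveOneOut it is generally safer to pass an error-based metric such as scoring='neg_mean_squared_error' to cross_val_score, and the R² figure above should be read with caution. The other trade-off is computation: the number of model fits equals the number of rows, which gets expensive quickly with a large dataset.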
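Since each LOOCV test fold holds a single row, a per-fold metric has to make sense for one observation. The sketch below shows two possible ways of scoring LOOCV, on synthetic `make_regression` data (an assumption); neither is claimed to be how the figures above were produced.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import LeaveOneOut, cross_val_predict, cross_val_score

X_demo, y_demo = make_regression(n_samples=60, n_features=5, noise=10.0, random_state=0)
loo = LeaveOneOut()

# Option 1: a per-fold error metric that is defined for a single observation.
neg_mse = cross_val_score(LinearRegression(), X_demo, y_demo, cv=loo,
                          scoring="neg_mean_squared_error")
loo_mse = -neg_mse.mean()

# Option 2: collect the one held-out prediction per row, then compute a single R2.
loo_pred = cross_val_predict(LinearRegression(), X_demo, y_demo, cv=loo)
loo_r2 = r2_score(y_demo, loo_pred)

print("fits: %d, LOO MSE: %0.2f, pooled R2: %0.3f" % (len(neg_mse), loo_mse, loo_r2))
```

scikit-learn reports error metrics as negated values so that larger is always better, hence the sign flip when computing `loo_mse`.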

Leave-P-Out Cross-Validation (LpOCV)

Leave-P-Out Cross-Validation is a generalisation of LOOCV where instead of holding out one observation at a time, P observations are held out as the test set in each iteration. All possible combinations of P observations from the dataset are used, making this even more exhaustive - and more computationally expensive - than LOOCV. In sklearn, LeavePOut from sklearn.model_selection is used.

python

from sklearn.model_selection import LeavePOut
lpo = LeavePOut(p=2)
lpo_scores = cross_val_score(lin_reg, X, Y, cv=lpo)
print("RSquare: %0.2f (+/- %0.2f)" % (lpo_scores.mean(), lpo_scores.std() * 2))

RSquare: 0.68 (+/- 1.20)

With P=2, the number of splits is n(n−1)/2 for n observations, so it grows quadratically with dataset size, and it grows combinatorially for larger P. For most practical datasets, LpOCV is only feasible with small P values and small datasets.
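The growth in the number of fits can be checked without training anything, because the splitters report their split count through `get_n_splits`. The row counts below are hypothetical and the placeholder array carries no real data.

```python
from math import comb

import numpy as np
from sklearn.model_selection import LeaveOneOut, LeavePOut

for n_rows in (20, 100, 500):
    X_demo = np.zeros((n_rows, 1))  # placeholder; only the number of rows matters
    n_loo = LeaveOneOut().get_n_splits(X_demo)
    n_lpo = LeavePOut(p=2).get_n_splits(X_demo)
    # LeavePOut enumerates every subset of size p, i.e. "n choose p" splits.
    assert n_lpo == comb(n_rows, 2)
    print("rows=%d  LOOCV fits=%d  LpOCV(p=2) fits=%d" % (n_rows, n_loo, n_lpo))
```

Multiplying the split count by the time of a single fit gives a quick estimate of whether an exhaustive scheme is affordable.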

Stratified K-Fold Cross-Validation

StratifiedKFold is only used for Classification models. This method of validation preserves the proportion of each class label in every fold, so that the mean response value is almost the same in all the folds. To perform Stratified K-Fold Cross-Validation we will use the Titanic dataset with logistic regression as the learning algorithm.

Dataset: As this method only works for classification problems, we will import the pre-processed Titanic Dataset. Note that this dataset has been used in the previous section and its preprocessing can be found in the blog Classification Problems in Python.

python

TitanicD1 = pd.read_excel('C:/Users/user/Desktop/Data Sets/Logistic Regression/TitanicD1.xls')
Y_Titanic = TitanicD1['Survived']
scaled_final = pd.read_excel('C:/Users/user/Desktop/Data Sets/Logistic Regression/scaled_final.xls')
scaled_final.head()

scaled_final.head() - first 5 rows of the scaled Titanic dataset with features Age, SibSp, Parch, Fare, Pclass_2, Pclass_3, Embarked_Q, Embarked_S, Sex_male — Output: the scaled Titanic dataset.

Initializing Logistic Regression Model: We now initialise the logistic regression model.

python

log_reg = LogisticRegression()

Performing Stratified K-Fold Cross-Validation: We perform a Stratified K-Fold Cross-Validation with the value of K being 5.

python

log_stratified = StratifiedKFold(n_splits=5)
log_stratified_score = cross_val_score(log_reg,scaled_final,Y_Titanic,cv=log_stratified)

Calculating Validation Scores: We now display the accuracy scores obtained from each cross-validation fold.

python

log_stratified_score

array([0.7877095 , 0.81005587, 0.78651685, 0.76404494, 0.83050847])

Final Score: We average out all the above scores to come up with a final evaluation score.

python

print("Accuracy Score: %0.2f (+/- %0.2f)" % (log_stratified_score.mean(), log_stratified_score.std() * 2))

Accuracy Score: 0.80 (+/- 0.05)
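The effect of stratification is easiest to see on a deliberately imbalanced, sorted label vector. The labels below are hypothetical (80 negatives followed by 20 positives) and `positive_rate_per_fold` is a helper written for this sketch.

```python
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

# Hypothetical imbalanced labels, sorted so that all positives sit at the end.
y_demo = np.array([0] * 80 + [1] * 20)
X_demo = np.zeros((len(y_demo), 1))  # placeholder features

def positive_rate_per_fold(splitter):
    """Share of class 1 in each test fold."""
    return [float(y_demo[test_idx].mean()) for _, test_idx in splitter.split(X_demo, y_demo)]

plain_rates = positive_rate_per_fold(KFold(n_splits=5))
strat_rates = positive_rate_per_fold(StratifiedKFold(n_splits=5))

print("KFold:          ", plain_rates)
print("StratifiedKFold:", strat_rates)
```

With plain KFold some test folds contain only one class, while every stratified fold keeps the overall 20% share of positives.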

Validation for Finding Best Model and HyperParameters

K-Fold Cross-Validation can be used not only to find the best model but also to find the best set of hyperparameters. Here we will tune the hyperparameters while we run K-Fold Cross-Validation.

K-Fold Cross-Validation with Grid Search

Here we will perform parameter estimation using grid search with cross-validation. To demonstrate Cross-Validation with Grid Search we will use Random Forest Regressor as the learning algorithm and take the entire pre-processed Boston dataset. The data is split into train and test folds using k-fold cross-validation; each combination of hyperparameters is fitted on the train folds and scored on the corresponding test fold.

Import GridSearchCV: We import GridSearchCV from sklearn.model_selection which allows us to tune hyperparameters.

python

from sklearn.model_selection import GridSearchCV

Importing RandomForestRegressor Library: We have to import RandomForestRegressor from sklearn.ensemble to run a Random Forest Regression model.

python

from sklearn.ensemble import RandomForestRegressor

Listing the parameters for GridSearchCV: Here we tune for 4 parameters: max_depth, max_features, min_samples_split and min_samples_leaf.

python

params_RF = {"max_depth": [3,5,6,7,8,9],
              "max_features":['auto', 'sqrt', 'log2'],
              "min_samples_split": [2, 3,5,7],
              "min_samples_leaf": [1, 3,5,6]}

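On scikit-learn ≥ 1.3, max_features='auto' has been removed for RandomForest. For RandomForestRegressor, 'auto' meant using all features, so replace it with 1.0 (or None) to keep the old behaviour; 'sqrt', 'log2' and float fractions remain valid. The grids and best-parameter outputs are shown as originally run.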

Initializing and Fitting Model: We initialise the Random Forest model and then fit it on the dataset.

python

model_RF_GS = GridSearchCV(RandomForestRegressor(), param_grid=params_RF,cv=5)
model_RF_GS.fit(X,Y)

GridSearchCV(cv=5, error_score='raise', estimator=RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=None, max_features='auto', max_leaf_nodes=None, min_impurity_decrease=0.0, min_impurity_split=None, min_samples_leaf=1, min_samples_split=2, min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=1, oob_score=False, random_state=None, verbose=0, warm_start=False), fit_params=None, iid=True, n_jobs=1, param_grid={'max_depth': [3, 5, 6, 7, 8, 9], 'max_features': ['auto', 'sqrt', 'log2'], 'min_samples_split': [2, 3, 5, 7], 'min_samples_leaf': [1, 3, 5, 6]}, pre_dispatch='2*n_jobs', refit=True, return_train_score='warn', scoring=None, verbose=0)

Best Parameter values for the model: We now use best_params_ to find the values of parameters selected during the above process.

python

model_RF_GS.best_params_

{'max_depth': 9, 'max_features': 'log2', 'min_samples_leaf': 3, 'min_samples_split': 5}

Predicting the values: We predict the values of the dependent variable. This will allow us to compute the accuracy score which here will be R-Square.

python

pred_rf_grid = model_RF_GS.predict(X)

Computing R-Square: We now compute the R-Square which helps us in evaluating the model.

python

from sklearn import metrics
metrics.r2_score(Y,pred_rf_grid)

0.92491576809803

We get an R² of 0.92 from the Random Forest Regressor, which is very high. However, this score is computed on the same data that was used to tune and fit the model, so the entire dataset has leaked into the modeling process and the score is likely inflated by overfitting. To avoid this problem we can use one of two methods: a combination of Holdout and K-Fold Cross-Validation, or Nested Cross-Validation.
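The gap between a cross-validated score and a score on the fitting data can be read off a fitted GridSearchCV object. The sketch below runs on synthetic `make_regression` data with a deliberately tiny grid and forest (both assumptions, chosen only to keep the run short), so its numbers are not comparable to the Boston results.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

X_demo, y_demo = make_regression(n_samples=150, n_features=13, noise=30.0, random_state=0)

# Deliberately tiny grid and forest so the sketch runs in seconds.
small_grid = {"max_depth": [3, 6], "min_samples_leaf": [1, 5]}
search = GridSearchCV(RandomForestRegressor(n_estimators=20, random_state=0),
                      param_grid=small_grid, cv=5)
search.fit(X_demo, y_demo)

cv_r2 = search.best_score_               # mean R2 over the 5 held-out folds
train_r2 = search.score(X_demo, y_demo)  # R2 on the rows the model was refit on

print("best params:", search.best_params_)
print("cross-validated R2: %0.3f, R2 on the fitting data: %0.3f" % (cv_r2, train_r2))
```

`best_score_` is the fairer of the two figures, although it is still slightly optimistic because the same folds were used to choose the parameters.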

K-Fold and Holdout Cross-Validation

A common way of keeping the model from overfitting is to combine K-Fold and Holdout Cross-Validation. So far we have used the entire dataset for K-Fold Cross-Validation, and this is indeed the biggest advantage of K-Fold: it allows the whole dataset to be used both for learning the function and for testing it. However, this can lead to overfitting, and to avoid it we give up part of that advantage.

We first use the Holdout method to split the dataset into Train and Test. The Test dataset acts as unseen data and is used only to evaluate the model. We then run K-Fold Cross-Validation on the Train dataset, which is repeatedly split into train and validation folds so that the model is trained and tested on all of it. The model obtained this way is used to predict the values of the unseen Test dataset from the Holdout split, and the resulting score gives a better, less biased picture of the performance of our model. This approach can be used with any type of K-Fold Cross-Validation, whether for selecting the best model, for finding the best parameters, or both.

Here we will perform K-Fold Cross-Validation with Grid Search using Random Forest as the learning algorithm, as done above. This time, however, we will fit the model on the Train dataset obtained from the Holdout split and evaluate its performance on the Test dataset from the same split.

Splitting Data using Holdout Cross-Validation: We start from scratch with the preprocessed Boston dataset and create two datasets, with X holding all the independent features and Y holding the dependent variable. We then divide the dataset into Train and Test using the Holdout method. Here the Test dataset acts as unseen data and is 30% of the whole dataset.

python

X = BosData[['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS', 'RAD', 'TAX','PTRATIO', 'B', 'LSTAT']]
Y = BosData['ln_Price']
X_train,X_test,Y_train,Y_test=train_test_split(X,Y,test_size=0.3,random_state=123)

Declaring Parameters: We now declare the hyperparameters that we need to tune.

python

params_RF = {"max_depth": [3,5,6,7,8,9],
              "max_features":['auto', 'sqrt', 'log2'],
              "min_samples_split": [2, 3,5,7],
              "min_samples_leaf": [1, 3,5,6]}

Initializing and Fitting Model: In this step, we initialize and build the Random Forest Regression model using GridSearchCV and fit it on the Train dataset rather than the whole dataset as done earlier.

python

rfr = RandomForestRegressor()
model_RF_GS = GridSearchCV(rfr, param_grid=params_RF)
model_RF_GS.fit(X_train,Y_train)

GridSearchCV(cv=None, error_score='raise', estimator=RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=None, max_features='auto', max_leaf_nodes=None, min_impurity_decrease=0.0, min_impurity_split=None, min_samples_leaf=1, min_samples_split=2, min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=1, oob_score=False, random_state=None, verbose=0, warm_start=False), fit_params=None, iid=True, n_jobs=1, param_grid={'max_depth': [3, 5, 6, 7, 8, 9], 'max_features': ['auto', 'sqrt', 'log2'], 'min_samples_split': [2, 3, 5, 7], 'min_samples_leaf': [1, 3, 5, 6]}, pre_dispatch='2*n_jobs', refit=True, return_train_score='warn', scoring=None, verbose=0)

Best Parameters: We now check for the best combination of parameters.

python

model_RF_GS.best_params_

{'max_depth': 9, 'max_features': 'auto', 'min_samples_leaf': 1, 'min_samples_split': 3}

Predict and Check Accuracy: We use this model to predict the dependent variable in the Test dataset obtained from the holdout split and check its R².

python

pred_RF_GS = model_RF_GS.predict(X_test)
metrics.r2_score(Y_test,pred_RF_GS)

0.7972859748762823

The R² comes out to about 0.80, which is lower than the 0.92 we obtained earlier on the whole dataset. This is most likely because the score is now computed on data the model has not seen, so it is no longer inflated by overfitting.
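The tuning results behind `best_params_` can be inspected through `cv_results_` before the final check on the held-out rows. This is a minimal sketch on synthetic `make_regression` data with a reduced grid (both assumptions made to keep the example short and self-contained).

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV, train_test_split

X_demo, y_demo = make_regression(n_samples=200, n_features=13, noise=30.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.3, random_state=123)

search = GridSearchCV(RandomForestRegressor(n_estimators=20, random_state=0),
                      param_grid={"max_depth": [3, 6], "min_samples_leaf": [1, 5]},
                      cv=5)
search.fit(X_tr, y_tr)  # tuning only ever sees the training portion

# One entry per parameter combination, with its mean cross-validated R2.
for params, mean_r2 in zip(search.cv_results_["params"],
                           search.cv_results_["mean_test_score"]):
    print(params, "%0.3f" % mean_r2)

test_r2 = search.score(X_te, y_te)  # evaluated once, on rows never used for tuning
print("held-out R2: %0.3f" % test_r2)
```

Passing `cv=5` explicitly avoids depending on the default number of folds, which has differed between scikit-learn versions.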

Nested Cross-Validation

Nested Cross-Validation has been explored in the blog K-Fold Cross-Validation under the theory section. It prevents data from leaking during the modeling process. The dataset is split into Train and Test for each iteration, and this forms the outer fold. Within each iteration, the Train dataset is further split into Train and Validation, where the hyperparameters are fitted on Train and evaluated on Validation; this is the inner fold. The tuned learning algorithm is then tested on the Test dataset of the outer fold.

In Python, Nested Cross-Validation is performed by running two K-Fold Cross-Validations on the dataset: an inner and an outer one. The cross-validation performed by GridSearchCV is the inner cross-validation, while the cross-validation that evaluates the tuned model is the outer cross-validation. As mentioned, the inner cv divides the data into train and validation sets, and the outer cv divides the dataset into train and test, where the test part acts as unseen data. This results in a less biased evaluation score for our model.

Here we will follow steps similar to those in K-Fold Cross-Validation with Grid Search and use the model obtained from it to perform the outer cross-validation.

Inner Fold: We first perform K-Fold for tuning Hyperparameters.

python

params_RF = {"max_depth": [3,5,6,7,8,9],
              "max_features":['auto', 'sqrt', 'log2'],
              "min_samples_split": [2, 3,5,7],
              "min_samples_leaf": [1, 3,5,6]}

model_RF_GS = GridSearchCV(RandomForestRegressor(), param_grid=params_RF,cv=5)
model_RF_GS.fit(X,Y)

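Outer Fold: We now perform K-Fold Cross-Validation on the above-created model. cross_val_score clones the GridSearchCV object and re-runs the grid search on the training part of each outer fold, so the fit on the whole dataset above does not influence these scores.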

python

nested_RF_score = cross_val_score(model_RF_GS, X=X, y=Y, cv=5)

Cross-Validation Scores: We display the R² scores obtained from each of the 5 outer folds.

python

nested_RF_score

array([0.63536656, 0.7823985 , 0.64599239, 0.6138917 , 0.42870927])

Final Score: We now average out all the cross-validation scores and come up with a single, final score.

python

print("RSquare: %0.2f (+/- %0.2f)" % (nested_RF_score.mean(), nested_RF_score.std() * 2))

RSquare: 0.62 (+/- 0.23)

We get a mean R² of 0.62 when we use Nested K-Fold Cross-Validation. The score is lower most likely because it is no longer inflated by overfitting, and it reflects how the model can be expected to perform when it is used on an unseen dataset.
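The inner and outer loops can also be made explicit by giving each its own splitter. The sketch below is one possible arrangement on synthetic `make_regression` data with a reduced grid; `inner_cv`, `outer_cv` and the fold counts are illustrative choices, not the settings used above.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV, KFold, cross_val_score

X_demo, y_demo = make_regression(n_samples=150, n_features=13, noise=30.0, random_state=0)

inner_cv = KFold(n_splits=3, shuffle=True, random_state=1)  # tunes hyperparameters
outer_cv = KFold(n_splits=5, shuffle=True, random_state=2)  # estimates performance

# The search object is left unfitted: cross_val_score clones it and reruns the
# whole grid search inside each outer training fold.
search = GridSearchCV(RandomForestRegressor(n_estimators=20, random_state=0),
                      param_grid={"max_depth": [3, 6], "min_samples_leaf": [1, 5]},
                      cv=inner_cv)
nested_scores = cross_val_score(search, X_demo, y_demo, cv=outer_cv)

print("RSquare: %0.2f (+/- %0.2f)" % (nested_scores.mean(), nested_scores.std() * 2))
```

Each outer fold may end up with different best parameters, so the result estimates the performance of the whole tuning procedure rather than of one fixed parameter set.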

In this blog post, we saw how the evaluation score can change dramatically when the cross-validation technique changes. This indicates that overfitting is ubiquitous and can be highly detrimental if it is not properly addressed. To guard against overfitting and to get an unbiased picture of a model's performance, various cross-validation techniques can be used. While the k-fold cross-validation methods are good at addressing overfitting, they are computationally expensive, which becomes a problem when the dataset is large. The holdout method, on the other hand, is simple and is useful when dealing with large datasets, but its single split can hide overfitting. Thus it becomes important to choose the right model validation technique. All such cross-validation methods are also explored in R in Model Validation in R.