Regression Problems in Python

In the Theory Section of Regression Problems, a number of regression algorithms have been explored. In this blog, we will create models using those algorithms to predict the price of houses: Linear Regression, Regularized Linear Regression (Lasso, Ridge and Elastic Net), Decision Tree, K Nearest Neighbour, and ensemble models.

We also perform tuning of the hyperparameters, which is done to improve the accuracy of our model and save it from overfitting. There are mainly three ways to tune these parameters: Grid Search, Random Search and Bayesian Optimization. In this blog, we will tune our parameters using the first two methods and see how the accuracy score is affected.

It is important to understand the difference between Grid Search and Random Search. In Random Search, when dealing with continuous parameters, it is important to specify a continuous distribution of plausible parameters to take full advantage of the randomization - increasing n_iter will always lead to a finer search. For each parameter, we give a range of plausible hyperparameter values; unlike GridSearchCV, not all parameter values are tried out, but rather a fixed number of parameter settings is sampled from the specified distributions. The number of parameter settings to be tried is declared through n_iter.
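
The difference in how candidates are generated can be sketched with two helpers from sklearn.model_selection, ParameterGrid and ParameterSampler. The parameter values below are illustrative assumptions, not the ones tuned later in this blog.

```python
from scipy.stats import uniform
from sklearn.model_selection import ParameterGrid, ParameterSampler

# Grid Search: every combination of the listed values is tried.
grid = {'alpha': [1, 0.1, 0.01], 'l1_ratio': [0.1, 0.5]}
grid_candidates = list(ParameterGrid(grid))
print(len(grid_candidates))  # 3 * 2 combinations

# Random Search: n_iter settings are drawn from the distributions.
# uniform(loc, scale) covers the interval [loc, loc + scale].
distributions = {'alpha': uniform(0, 1), 'l1_ratio': uniform(0, 1)}
random_candidates = list(ParameterSampler(distributions, n_iter=4, random_state=0))
print(len(random_candidates))  # n_iter candidates, however wide the range
```

The grid grows multiplicatively with every parameter added, whereas the cost of Random Search is fixed by n_iter.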

GridSearchCV stands for Grid Search Cross-Validation: the program runs a grid search and scores each candidate with cross-validation. All combinations of the listed parameter values are tried to find the best model. The cross-validation follows a k-fold process: the Train dataset is divided into k folds, and each fold in turn serves as the validation set while the model is fitted on the remaining folds. The Test dataset is kept out of this process entirely. After finding the best parameter values using Grid Search, we predict on the Test dataset, i.e. data the model has not seen. Cross-validation reduces the risk of choosing hyperparameters that overfit a single train/validation split. Refer to Model Validation Techniques under the Theory Section for a better understanding of the concept; hyperparameter tuning with cross-validation is discussed in Model Validation in Python under the Application Section.
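
The fold mechanics can be sketched with KFold on a toy array. The six-row array and the choice of three folds are assumptions made purely for illustration.

```python
import numpy as np
from sklearn.model_selection import KFold

X_demo = np.arange(12).reshape(6, 2)  # six toy observations (illustrative only)

kfold = KFold(n_splits=3)
validation_rows = []
for fold_train_idx, fold_val_idx in kfold.split(X_demo):
    # In each round the model is fitted on fold_train_idx and scored on fold_val_idx.
    print("train:", fold_train_idx, "validate:", fold_val_idx)
    validation_rows.extend(fold_val_idx.tolist())
```

Every row is used for validation exactly once, which is what makes the averaged score less dependent on one particular split.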

In this blog, we will perform Grid Search and Random Search without explicitly specifying the number of folds for cross-validation (the cv argument). The outputs below show cv=None, which in the scikit-learn version used for this blog means three-fold cross-validation; from scikit-learn 0.22 onward, the default is five-fold.
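
If you want the number of folds to be independent of the installed scikit-learn version, it can be passed explicitly. The following is a minimal sketch on synthetic data (make_regression); the alpha values are assumptions chosen only for illustration.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

# Synthetic data, used only so that the sketch runs on its own.
X_demo, y_demo = make_regression(n_samples=120, n_features=5, noise=10.0, random_state=0)

# cv=3 fixes the number of folds instead of relying on the default.
search = GridSearchCV(Ridge(), param_grid={'alpha': [0.01, 0.1, 1.0]}, cv=3)
search.fit(X_demo, y_demo)
print(search.n_splits_)  # number of folds actually used
```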

Importing Libraries and Preparing Dataset

Before using the dataset to create regression models, we need to import the necessary libraries and prepare the dataset.

Importing Libraries

We begin by importing the libraries that will be used throughout this blog: numpy and pandas for data handling, matplotlib for plotting, and a number of utilities from scikit-learn and scipy that we will need for model building, prediction, and hyperparameter tuning.

python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
from sklearn.model_selection import train_test_split
from sklearn import metrics

We also import the libraries required for Grid Search and Random Search hyperparameter tuning.

python
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import uniform
from scipy.stats import randint as sp_randint

Importing Dataset

We will be using the Boston housing dataset, which is available within scikit-learn. We load it and assemble it into a DataFrame, adding the target variable (median house value) as a Price column.

python
from sklearn.datasets import load_boston
BostonData = load_boston()
BosData = pd.DataFrame(BostonData.data)
BosData.columns = BostonData.feature_names
BosData['Price']=BostonData.target
BosData.head()
Output: the BosData dataset, showing the CRIM, ZN, INDUS, CHAS, NOX, RM, AGE, DIS, RAD, TAX, PTRATIO, B, LSTAT and Price columns.

Note: load_boston has been removed. scikit-learn deprecated and then removed the Boston housing dataset in version 1.2 (over ethical concerns), so from sklearn.datasets import load_boston fails on current scikit-learn. To follow along, load an equivalent copy of the data - for example fetch_openml(name='boston', version=1, as_frame=True) or a CSV - and assign the target to a Price column. The code and outputs below are kept as originally run.

Checking for Skewness

Before modeling, it is useful to check the distribution of the dependent variable. We plot a histogram of Price.

python
BosData.hist(column='Price',figsize=(8,8))
Output: histogram of Price, showing a right-skewed distribution.

The distribution seems to be skewed. For more certainty, we use the .skew() command to measure the exact skewness.

python
BosData['Price'].skew()
1.1080984082549072

As the data is positively skewed, we will perform a transformation to reduce this skewness. We perform a log transformation on the dependent variable.

python
BosData['ln_Price'] = np.log(BosData['Price'])
BosData.head()
Output: BosData with the added ln_Price column.

A histogram can be created to check the distribution of the log-transformed dependent variable.

python
BosData.hist(column='ln_Price',figsize=(8,8))
Output: histogram of ln_Price, showing a less skewed, more symmetric distribution.

We now check for the measure of skewness in the ln_Price variable.

python
BosData['ln_Price'].skew()
-0.33032129530987864

We decide to proceed with the log-transformed variable, as it reduces the skewness of the dependent variable.
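
The effect of the log transformation can be reproduced on synthetic data. The sketch below assumes log-normally distributed "prices" generated with NumPy, not the Boston data, and also shows that np.exp maps log values back to the original scale, which is needed to read predictions of ln_Price as prices.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Synthetic right-skewed "prices" (log-normal), not the Boston data.
demo = pd.DataFrame({'Price': rng.lognormal(mean=3.0, sigma=0.5, size=1000)})
demo['ln_Price'] = np.log(demo['Price'])

skew_before = demo['Price'].skew()
skew_after = demo['ln_Price'].skew()
print(skew_before, skew_after)

# A model trained on ln_Price predicts log prices; np.exp maps them back.
recovered = np.exp(demo['ln_Price'])
```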

Train and Test Split

We split the dataset into a Train and a Test set, using ln_Price as the target variable.

python
X = BosData[['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS', 'RAD', 'TAX','PTRATIO', 'B', 'LSTAT']]
Y = BosData['ln_Price']
X_train,X_test,Y_train,Y_test=train_test_split(X,Y,test_size=0.3,random_state=123)

The Process of Creating Regression Models

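The general process we follow for creating each regression model in this blog is as follows: import the required library, initialize the model, fit it on the Train dataset, predict on the Test dataset, calculate the accuracy and, where applicable, tune the hyperparameters using Grid Search and Random Search.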

Linear Regression

To understand how Linear Regression works, refer to the blog on Linear Regression in the Theory Section. Here, we will use the Linear Regression algorithm to predict the price of houses.

Import Libraries

We begin by importing the linear_model module from scikit-learn.

python
from sklearn import linear_model

Initializing Linear Regression Model

Here we initialize the Linear Regression model. We pass no hyperparameters, so LinearRegression() is initialized with its default settings.

python
linreg_model = linear_model.LinearRegression()

Fitting Model

We now fit the above-created model on the Train dataset.

python
linreg_model.fit(X_train,Y_train)
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)

Prediction

The Linear Regression model is used to predict the Y variable in the Test dataset.

python
pred_linmodel = linreg_model.predict(X_test)

Calculating Accuracy

We also calculate the accuracy of the model. To do so, we use metrics from scikit-learn to calculate the R² score, the proportion of variance in the target that the model explains on the Test dataset. This blog refers to the R² score loosely as the model's accuracy. Note that this procedure will be followed for checking the accuracy of all the upcoming regression models.

python
from sklearn import metrics
metrics.r2_score(Y_test,pred_linmodel)
0.7194695009566641

This model gives an R² of about 0.72 on the Test dataset, i.e. it explains roughly 72% of the variance in ln_Price. Note that a single R² value is still not a very reliable measure, and we need to compute more metrics to evaluate the model's performance, which has been explored in Model Validation in Python.
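
To make the score concrete, R² can be computed by hand as 1 minus the ratio of residual to total sum of squares and compared with metrics.r2_score. The five actual and predicted values below are made-up numbers for illustration; RMSE is added as one example of a complementary metric.

```python
import numpy as np
from sklearn import metrics

# Toy actual and predicted values (illustrative only).
y_true = np.array([3.0, 2.5, 4.0, 3.5, 5.0])
y_pred = np.array([2.8, 2.7, 3.6, 3.9, 4.6])

ss_res = np.sum((y_true - y_pred) ** 2)          # variation left unexplained
ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total variation around the mean
r2_manual = 1 - ss_res / ss_tot

r2_sklearn = metrics.r2_score(y_true, y_pred)
rmse = np.sqrt(metrics.mean_squared_error(y_true, y_pred))
print(r2_manual, r2_sklearn, rmse)
```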

Regularized Linear Regression

Regularized Regression is mainly of two types: Ridge and Lasso. Refer to Regularized Regression Algorithms under the Theory Section to understand the difference between the two. The linear_model module has separate estimators for Lasso and Ridge, unlike regularized logistic regression, where we just have to declare the penalty (penalty='l1' for Lasso and penalty='l2' for Ridge classification). A third type is Elastic Net Regularization, which combines both penalties, l1 and l2 (Lasso and Ridge).

Import StandardScaler to Standardize the Dataset

We first have to scale the data: Regularized Regression penalizes the size of the coefficients, and the size of a coefficient depends on the scale of its variable, so we cannot have variables on different scales of measurement. Several models require scaled data, such as regularized regression (Lasso and Ridge), KNN, SVM and ANN. (We will be using the same scaled dataset for KNN to predict house prices as well.) As only continuous independent variables are to be considered for scaling, we first isolate them.

python
from sklearn.preprocessing import StandardScaler
BosData_Scaled = BosData[['CRIM', 'ZN', 'INDUS', 'NOX', 'RM', 'AGE', 'DIS', 'RAD', 'TAX','PTRATIO', 'B', 'LSTAT']]
BosData_Scaled.head()
Output: the BosData_Scaled dataset, holding the continuous independent variables before scaling.

We now apply scaling on these numerical features.

python
scaler = StandardScaler()
scaler.fit(BosData_Scaled)
BosData_scaled = scaler.transform(BosData_Scaled)
BosData_scaled
Output: BosData_scaled as a NumPy array of standardized values.

We convert the above output, which is in array format, into a DataFrame.

python
Scaled_BosData = pd.DataFrame(BosData_scaled,columns=BosData_Scaled.columns)
Scaled_BosData.head()
Output: the Scaled_BosData DataFrame with standardized values.

In this step, we concatenate the scaled variables with the leftover categorical variable.

python
BosData_chas = BosData[['CHAS']]
BosData_Final = pd.concat([BosData_chas,Scaled_BosData],axis=1)

Splitting Dataset into Train and Test

Here we split the dataset into Train and Test. Note that these datasets will be used again when we deal with KNN.

python
X1_train,X1_test,Y1_train,Y1_test=train_test_split(BosData_Final,BosData['ln_Price'],test_size=0.3,random_state=123)
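
One caveat: in the steps above the scaler is fitted on the full dataset before the split, so the Test rows influence the scaling statistics. A possible alternative, sketched here on synthetic data, is to split first and let a Pipeline fit the scaler on the Train rows only. The toy DataFrame, its coefficients and the names demo, Xd_train and pipe are assumptions for illustration; only the column names are borrowed from the dataset.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(123)
# Synthetic stand-in: two continuous columns and one 0/1 dummy.
demo = pd.DataFrame({
    'RM': rng.normal(6, 1, 200),
    'LSTAT': rng.normal(12, 5, 200),
    'CHAS': rng.integers(0, 2, 200),
})
target = 0.1 * demo['RM'] - 0.03 * demo['LSTAT'] + rng.normal(0, 0.1, 200)

Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    demo, target, test_size=0.3, random_state=123)

continuous = ['RM', 'LSTAT']
prep = ColumnTransformer([('scale', StandardScaler(), continuous)],
                         remainder='passthrough')  # CHAS is passed through unscaled
pipe = Pipeline([('prep', prep), ('ridge', Ridge(alpha=1.0))])
pipe.fit(Xd_train, yd_train)   # scaler statistics come from the Train rows only
print(pipe.score(Xd_test, yd_test))

fitted_scaler = pipe.named_steps['prep'].named_transformers_['scale']
```

A pipeline like this can also be passed to GridSearchCV, in which case the scaler is refitted inside every cross-validation fold.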

Lasso

Initialize and Fit Model: We build a Lasso Regression model, which uses an l1 penalty, and fit it on the Train dataset.

python
Lasso = linear_model.Lasso(alpha=0.01)
Lasso.fit(X1_train,Y1_train)

Prediction and Calculate Accuracy: In this step, we predict the dependent variable on the Test dataset and calculate its R².

python
pred1 = Lasso.predict(X1_test)
metrics.r2_score(Y1_test,pred1)
0.6731866018864561

The accuracy of this model comes out to be 67%.

Ridge

Building and Fitting Model: We build the Ridge Regression model and fit it on the Train dataset.

python
Ridge = linear_model.Ridge()
Ridge.fit(X1_train,Y1_train)
Ridge(alpha=1.0, copy_X=True, fit_intercept=True, max_iter=None, normalize=False, random_state=None, solver='auto', tol=0.001)

Find Coefficients: We can use the coef_ attribute of the fitted model to see the coefficients derived from this Ridge Regression model.

python
Ridge.coef_
array([ 0.03509249, -0.08676156, 0.01775052, 0.02962073, -0.07658582, 0.09160703, 0.0005345 , -0.08790869, 0.11657605, -0.11281724, -0.08067278, 0.02294539, -0.19790002])

Prediction: We predict the dependent variable on the Test dataset.

python
ridge_test_pred = pd.DataFrame({'actual':Y1_test,'predicted':Ridge.predict(X1_test)})

Calculate Accuracy: We now calculate the accuracy, to compare results with the simple Linear and Lasso Regression models.

python
score_Ridge = metrics.r2_score(ridge_test_pred['actual'],ridge_test_pred['predicted'])
score_Ridge
0.7190344443426893

Ridge gives an R² of about 0.72, which is higher than Lasso's 0.67 and essentially the same as that of the simple Linear Regression model.
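
The practical difference between the l1 and l2 penalties shows up in the coefficients: Lasso can set some of them exactly to zero, whereas Ridge only shrinks them. The sketch below assumes synthetic data from make_regression in which only 3 of 10 features are informative, and an alpha of 1.0 chosen for illustration.

```python
import numpy as np
from sklearn import linear_model
from sklearn.datasets import make_regression

# Synthetic data: only 3 of the 10 features actually drive the target.
X_demo, y_demo = make_regression(n_samples=200, n_features=10, n_informative=3,
                                 noise=5.0, random_state=0)

lasso_demo = linear_model.Lasso(alpha=1.0).fit(X_demo, y_demo)
ridge_demo = linear_model.Ridge(alpha=1.0).fit(X_demo, y_demo)

# Count coefficients that are exactly zero under each penalty.
n_zero_lasso = int(np.sum(lasso_demo.coef_ == 0))
n_zero_ridge = int(np.sum(ridge_demo.coef_ == 0))
print(n_zero_lasso, n_zero_ridge)
```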

Elastic Net

For Regression, we use the ElasticNet model, whereas for Classification we use the Stochastic Gradient Descent Classifier (SGDClassifier) with an Elastic Net penalty.

Import Library: Here we import linear_model from scikit-learn to create an Elastic Net Regression model.

python
from sklearn import linear_model

Initialize Model: In this step we use linear_model.ElasticNet and initialize the model, setting the alpha value at 0.01.

python
EN = linear_model.ElasticNet(alpha=0.01)

Fit Model: We now fit the model on the Train dataset.

python
EN.fit(X1_train,Y1_train)
ElasticNet(alpha=0.01, copy_X=True, fit_intercept=True, l1_ratio=0.5, max_iter=1000, normalize=False, positive=False, precompute=False, random_state=None, selection='cyclic', tol=0.0001, warm_start=False)

Prediction: We predict the dependent variable on the Test dataset.

python
pred_EN = EN.predict(X1_test)

Calculate Accuracy: We now calculate the accuracy of the Elastic Net Regression model.

python
metrics.r2_score(Y1_test,pred_EN)
0.6937931498104792

This model provides us with 69% accuracy.

Tuning of Parameters

We will now tune the parameters for Regularized Regression using Grid Search and Random Search. As discussed above, these methods will run the model with various parameters and provide us with the best parameter values. Here we look for the best value of alpha, and upon finding it we fit the model on the Train dataset, predict the values on the Test dataset, and calculate the accuracy score using the metrics package. For Elastic Net, we additionally tune a parameter called l1_ratio, the ElasticNet mixing parameter, whose value is taken between 0 and 1.

Grid Search - Ridge

Import GridSearchCV Library: We import GridSearchCV from sklearn.model_selection, which we will use to tune hyperparameters.

python
from sklearn.model_selection import GridSearchCV

Defining Parameters: Parameters have to be defined first before they can be used in Grid Search. The range of parameters is defined differently in Grid Search and Random Search - Grid Search requires discrete values, whereas Random Search uses a range of values for parameters.

python
params_Ridge = {'alpha': np.array([1,0.1,0.01,0.001,0.0001,0])}

Building and Fitting Model: We now build the Regularized Linear Regression model using Grid Search and fit it on the Train dataset.

python
Ridge_GS = GridSearchCV(Ridge, param_grid=params_Ridge)
Ridge_GS.fit(X1_train,Y1_train)
GridSearchCV(cv=None, error_score='raise', estimator=Ridge(alpha=1.0, copy_X=True, fit_intercept=True, max_iter=None, normalize=False, random_state=None, solver='auto', tol=0.001), fit_params=None, iid=True, n_jobs=1, param_grid={'alpha': array([1.e+00, 1.e-01, 1.e-02, 1.e-03, 1.e-04, 0.e+00])}, pre_dispatch='2*n_jobs', refit=True, return_train_score='warn', scoring=None, verbose=0)

Best Parameters: The best_params_ command can be used to find the best parameters.

python
Ridge_GS.best_params_
{'alpha': 1.0}

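Grid Search selects alpha = 1. This is the largest value in the grid and also the default used by Ridge(), so the tuned model is the same as the untuned one. As the best value lies on the edge of the grid, it may be worth trying larger values of alpha as well.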

Prediction: We predict the house prices on the Test dataset.

python
pred_Ridge_GS = Ridge_GS.predict(X1_test)

Calculate Accuracy: We now compute the accuracy of this model.

python
metrics.r2_score(Y1_test,pred_Ridge_GS)
0.7190344443426891

The R² comes out to be about 0.72, the same as for the untuned Ridge model.

Note - here we have used Ridge Regression. You can perform the same steps mentioned above for hyperparameter tuning of a Lasso Regression model.
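
Besides best_params_, a fitted search also exposes best_score_ and cv_results_, which show how each candidate performed in cross-validation. The following is a minimal sketch on synthetic data; the alpha values and cv=3 are assumptions for illustration.

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

# Synthetic data, used only so that the sketch runs on its own.
X_demo, y_demo = make_regression(n_samples=150, n_features=8, noise=15.0, random_state=1)

alphas = [0.001, 0.01, 0.1, 1.0, 10.0]
search = GridSearchCV(Ridge(), param_grid={'alpha': alphas}, cv=3)
search.fit(X_demo, y_demo)

# One row per candidate, with its mean cross-validated score and rank.
results = pd.DataFrame(search.cv_results_)[['param_alpha', 'mean_test_score', 'rank_test_score']]
print(results)
print(search.best_params_, search.best_score_)

# If the winner sits on the boundary of the grid, a wider grid may be worth a try.
best_on_edge = search.best_params_['alpha'] in (min(alphas), max(alphas))
```

Note that best_score_ is the mean cross-validated score on the Train dataset, so it will generally differ from the R² computed on the Test dataset.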

Grid Search - Elastic Net

Defining Parameters: For the Elastic Net Regression model, we will tune two parameters: alpha and l1_ratio.

python
params_EN = {'alpha' :np.array([1,0.1,0.01,0.001,0.0001,0]),
         'l1_ratio' :np.array([0.1,0.01,0.001,0.0001,1])}

Building and Fitting Model: We now build the model using Grid Search and fit it on the Train dataset.

python
EN_GS = GridSearchCV(linear_model.ElasticNet(), param_grid=params_EN)
EN_GS.fit(X1_train,Y1_train)
GridSearchCV(cv=None, error_score='raise', estimator=ElasticNet(alpha=1.0, copy_X=True, fit_intercept=True, l1_ratio=0.5, max_iter=1000, normalize=False, positive=False, precompute=False, random_state=None, selection='cyclic', tol=0.0001, warm_start=False), fit_params=None, iid=True, n_jobs=1, param_grid={'alpha': array([1.e+00, 1.e-01, 1.e-02, 1.e-03, 1.e-04, 0.e+00]), 'l1_ratio': array([1.e-01, 1.e-02, 1.e-03, 1.e-04, 1.e+00])}, pre_dispatch='2*n_jobs', refit=True, return_train_score='warn', scoring=None, verbose=0)

Best Parameters: The best_params_ command can be used to find the best parameters.

python
EN_GS.best_params_
{'alpha': 0.01, 'l1_ratio': 0.01}

Prediction: We now predict using this model on the Test dataset.

python
pred_EN_GS = EN_GS.predict(X1_test)

Calculate Accuracy: We compute the accuracy of this Elastic Net Regression model.

python
metrics.r2_score(Y1_test,pred_EN_GS)
0.717387302093702

The R² comes out to be about 0.72.

Random Search - Ridge

Import RandomizedSearchCV Library: We import RandomizedSearchCV from sklearn.model_selection, which will allow us to tune hyperparameters.

python
from sklearn.model_selection import RandomizedSearchCV

Declaring Parameters: As alpha can take decimal values, we use a continuous uniform distribution to define its range, importing uniform from scipy.stats. Note that uniform(loc, scale) covers the interval from loc to loc + scale, so uniform(0.0001,1) samples alpha between 0.0001 and 1.0001. For integer-valued parameters we can use sp_randint, which draws random integers from a range.

python
from scipy.stats import uniform
params_Ridge_RS = {'alpha': uniform(0.0001,1)}
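
The interval covered by a scipy.stats uniform distribution can be checked directly. This sketch uses ppf to read off the endpoints; the second distribution, bounded, is a hypothetical variant that stops exactly at 1.

```python
from scipy.stats import uniform

alpha_dist = uniform(0.0001, 1)      # loc=0.0001, scale=1
low, high = alpha_dist.ppf(0), alpha_dist.ppf(1)
print(low, high)                     # interval endpoints: loc and loc + scale

# To cover 0.0001 up to exactly 1, the scale has to be 1 - 0.0001.
bounded = uniform(0.0001, 0.9999)
draws = bounded.rvs(size=1000, random_state=0)
```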

Building Model: We build the model and search for the best parameters. Here we run 100 iterations to come up with the best value of the parameter.

python
Ridge_RS = RandomizedSearchCV(Ridge, param_distributions=params_Ridge_RS,n_iter=100)

Fitting Model: We now fit the model on the Train dataset.

python
Ridge_RS.fit(X1_train,Y1_train)
RandomizedSearchCV(cv=None, error_score='raise', estimator=Ridge(alpha=1.0, copy_X=True, fit_intercept=True, max_iter=None, normalize=False, random_state=None, solver='auto', tol=0.001), fit_params=None, iid=True, n_iter=100, n_jobs=1, param_distributions={'alpha': <scipy.stats._distn_infrastructure.rv_frozen object at 0x000000000B789748>}, pre_dispatch='2*n_jobs', random_state=None, refit=True, return_train_score='warn', scoring=None, verbose=0)

Best Parameters: The best_params_ command can be used to find the best parameters.

python
Ridge_RS.best_params_
{'alpha': 0.9937662069031356}

Random Search selects alpha = 0.9937662069031356, i.e. approximately 0.994.

Predict and Check Accuracy: The above model is used to predict the values of the dependent variable on the Test dataset, and the accuracy is also calculated.

python
pred_Ridge_RS = Ridge_RS.predict(X1_test)
metrics.r2_score(Y1_test,pred_Ridge_RS)
0.7193195584902956

The R² comes out to be about 0.72, essentially the same as with alpha=1, the value selected by Grid Search.

Note - here we have used Ridge Regression. You can perform the same steps mentioned above for hyperparameter tuning of a Lasso Regression model.
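
As Random Search samples its candidates, repeated runs can return different best parameters. Passing random_state makes the sampling repeatable. The sketch below assumes synthetic data and a hypothetical helper, run_search, written only for this illustration.

```python
from scipy.stats import uniform
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import RandomizedSearchCV

# Synthetic data, used only so that the sketch runs on its own.
X_demo, y_demo = make_regression(n_samples=150, n_features=8, noise=15.0, random_state=2)

def run_search(seed):
    """Run a small Random Search over alpha and return the best parameters."""
    search = RandomizedSearchCV(Ridge(), param_distributions={'alpha': uniform(0.0001, 1)},
                                n_iter=20, cv=3, random_state=seed)
    return search.fit(X_demo, y_demo).best_params_

first = run_search(42)
second = run_search(42)
print(first, second)  # identical, because the same candidates were sampled
```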

Random Search - Elastic Net

Defining Parameters: We provide a distribution of values for the two hyperparameters.

python
params_EN_RS = {'alpha': uniform(0.0001,1),
               # uniform(loc, scale) spans [loc, loc + scale]; l1_ratio must not exceed 1
               'l1_ratio':uniform(0.0001,0.9999) }

Building and Fitting Model: We now build a model using RandomizedSearchCV and fit it on the Train dataset.

python
EN_RS = RandomizedSearchCV(linear_model.ElasticNet(), param_distributions=params_EN_RS,n_iter=100)
EN_RS.fit(X1_train,Y1_train)
RandomizedSearchCV(cv=None, error_score='raise', estimator=ElasticNet(alpha=1.0, copy_X=True, fit_intercept=True, l1_ratio=0.5, max_iter=1000, normalize=False, positive=False, precompute=False, random_state=None, selection='cyclic', tol=0.0001, warm_start=False), fit_params=None, iid=True, n_iter=100, n_jobs=1, param_distributions={'alpha': <scipy.stats._distn_infrastructure.rv_frozen object at 0x000000000BCAC9E8>, 'l1_ratio': <scipy.stats._distn_infrastructure.rv_frozen object at 0x000000000BF48438>}, pre_dispatch='2*n_jobs', random_state=None, refit=True, return_train_score='warn', scoring=None, verbose=0)

Best Parameters: We use the best_params_ command and check for the best parameters obtained from RandomizedSearchCV.

python
EN_RS.best_params_
{'alpha': 0.01702207747427843, 'l1_ratio': 0.13810137653130106}

Predict and Check Accuracy: The above model is used to predict the values of the dependent variable on the Test dataset. In this step, we also check the accuracy obtained by the model.

python
pred_EN_RS = EN_RS.predict(X1_test)
metrics.r2_score(Y1_test,pred_EN_RS)
0.704539839099354

We get 70% accuracy from the above model.

Decision Tree Regressor

Decision Trees allow us to come up with flowcharts structured as trees, which allow us to predict the value of the dependent variable. Their inner workings have been explained in Decision Trees under the Theory Section.

This algorithm does not require scaled data, so we use the same Train and Test dataset components as used in the Linear Regression model: X_train, Y_train, X_test and Y_test.

Importing Libraries

We import DecisionTreeRegressor from sklearn.tree, which allows us to create a Decision Tree Regression model.

python
from sklearn.tree import DecisionTreeRegressor

Initializing and Fitting Decision Tree Model

Here we initialize the Decision Tree model. We are using no hyperparameters and simply use DecisionTreeRegressor() to initialize, and then fit this model on the Train dataset.

python
DTR = DecisionTreeRegressor()
DTR.fit(X_train,Y_train)
DecisionTreeRegressor(criterion='mse', max_depth=None, max_features=None, max_leaf_nodes=None, min_impurity_decrease=0.0, min_impurity_split=None, min_samples_leaf=1, min_samples_split=2, min_weight_fraction_leaf=0.0, presort=False, random_state=None, splitter='best')

Prediction and Calculating Accuracy

The Decision Tree model is used to predict the Y variable on the Test dataset. We also check the accuracy of this model on the Test dataset.

python
Pred_DTR = DTR.predict(X_test)
metrics.r2_score(Y_test,Pred_DTR)
0.6630413767880653

The accuracy of this Decision Tree model comes out to be 66%.
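
A Decision Tree grown without any limits tends to fit the Train dataset almost perfectly, which is one reason tuning parameters such as max_depth can help. A minimal sketch on synthetic data (an assumption; the numbers will differ from the housing data) compares the R² on Train and Test:

```python
from sklearn import metrics
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic noisy data, used only so that the sketch runs on its own.
X_demo, y_demo = make_regression(n_samples=300, n_features=6, noise=20.0, random_state=0)
Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=123)

# No depth or leaf-size limits: the tree keeps splitting until leaves are pure.
full_tree = DecisionTreeRegressor(random_state=0).fit(Xd_train, yd_train)
r2_train = metrics.r2_score(yd_train, full_tree.predict(Xd_train))
r2_test = metrics.r2_score(yd_test, full_tree.predict(Xd_test))
print(r2_train, r2_test)
```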

Tree Visualization

We can visualize the above-created Decision Tree. This helps in further understanding how the decision tree algorithm is working.

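Install and Import graphviz: We install the graphviz Python package by typing pip install graphviz in the command window. The package is only a wrapper, so rendering also requires the Graphviz software itself (the dot executable) to be installed and available on the system PATH. Once done, we can import the graphviz library in Python.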

python
import graphviz

Creating Decision Tree Visualization: We now import tree from scikit-learn and then use graphviz to create the Decision Tree flowchart.

python
from sklearn import tree
dot_data = tree.export_graphviz(DTR, out_file=None)
graph = graphviz.Source(dot_data)
graph.render("train_set1")
'train_set1.pdf'

Decision Tree with Labels: The above command produces a decision tree flowchart without variable names, so we use the following code to produce the decision tree flowchart with variable names.

python
dot_data = tree.export_graphviz(DTR, out_file=None, max_depth=3,
                                feature_names=X.columns,
                                filled=True, rounded=True,
                                special_characters=True)
graph2 = graphviz.Source(dot_data)
graph2.render('train_setlinear')

Opening Saved Image: graph2.render saves the chart as train_setlinear.pdf. We import os and use os.startfile, which is available on Windows only, to open the PDF file of the tree chart.

python
import os
os.startfile('train_setlinear.pdf')
Output: the labeled Decision Tree flowchart, splitting first on LSTAT.
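
If Graphviz is not available, the rules of a fitted tree can also be printed as plain text with export_text from sklearn.tree. This is a sketch on synthetic data; the feature names feat_a, feat_b and feat_c are hypothetical.

```python
from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor, export_text

# Synthetic data, used only so that the sketch runs on its own.
X_demo, y_demo = make_regression(n_samples=200, n_features=3, noise=5.0, random_state=0)
names = ['feat_a', 'feat_b', 'feat_c']   # hypothetical feature names

# A shallow tree keeps the printed rules short.
small_tree = DecisionTreeRegressor(max_depth=2, random_state=0).fit(X_demo, y_demo)
rules = export_text(small_tree, feature_names=names)
print(rules)
```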

Tuning Hyperparameters

To show an example of how hyperparameters can be tuned, we take the following four parameters: max_features, the number of features to be considered when looking for the best split; min_samples_split, the minimum number of samples required to split an internal node; min_samples_leaf, the minimum number of samples required at a leaf node; and max_depth, the maximum depth of the tree. All such hyperparameters have been explained in the Theory Section. You may select other parameters if you want more control over the process and desire more accuracy.

Grid Search

Defining Parameters: Here we define the plausible values for our four parameters.

python
params = {'max_features': ['auto', 'sqrt', 'log2'],
          'min_samples_split': [2,3,4,5,6,7,8,9,10,11,12,13,14,15],
          'min_samples_leaf':[1,2,3,4,5,6,7,8,9,10,11],
         'max_depth':[2,3,4,5,6,7,8]}

Note: max_features='auto' is deprecated. For tree, random-forest and gradient-boosting estimators, 'auto' was deprecated in scikit-learn 1.1 and removed in 1.3. For the regressors used in this blog, 'auto' meant that all features are considered, so use None (or 1.0) to keep that behaviour, or choose 'sqrt', 'log2', or a float. The grids and results shown here (used for several models below) are as originally run.

Initializing Decision Tree: We now initialize the Decision Tree Regression model.

python
DTR = DecisionTreeRegressor()

Building and Fitting Model: We now build a model using GridSearchCV and fit it on the Train dataset.

python
DTR_GS = GridSearchCV(DTR, param_grid=params)
DTR_GS.fit(X_train,Y_train)
GridSearchCV(cv=None, error_score='raise', estimator=DecisionTreeRegressor(criterion='mse', max_depth=None, max_features=None, max_leaf_nodes=None, min_impurity_decrease=0.0, min_impurity_split=None, min_samples_leaf=1, min_samples_split=2, min_weight_fraction_leaf=0.0, presort=False, random_state=None, splitter='best'), fit_params=None, iid=True, n_jobs=1, param_grid={'max_features': ['auto', 'sqrt', 'log2'], 'min_samples_split': [2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15], 'min_samples_leaf': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11], 'max_depth': [2, 3, 4, 5, 6, 7, 8]}, pre_dispatch='2*n_jobs', refit=True, return_train_score='warn', scoring=None, verbose=0)

Best Parameters: The best_params_ command can be used to find the best parameters.

python
DTR_GS.best_params_
{'max_depth': 4, 'max_features': 'auto', 'min_samples_leaf': 2, 'min_samples_split': 4}

Predict and Check Accuracy: The above model, with the values of hyperparameters found above, is used to predict the values of the dependent variable on the Test dataset, and the accuracy is also calculated.

python
pred_DTR_GS = DTR_GS.predict(X_test)
metrics.r2_score(Y_test,pred_DTR_GS)
0.6853082002078099

The R² comes out to be about 0.69.

Random Search

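Defining Parameters: As mentioned earlier, for parameters that are integers we use sp_randint, which we import from scipy.stats. Its upper bound is exclusive, so sp_randint(2,15) draws integers from 2 to 14; the ranges below are therefore one value narrower at the top than the lists used for Grid Search.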

python
from scipy.stats import randint as sp_randint

param_grid2 = {'max_features': ['auto', 'sqrt', 'log2'],
          'min_samples_split': sp_randint(2,15),
          'min_samples_leaf': sp_randint(1,11),
          'max_depth':sp_randint(2,8)}
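
The range actually covered by sp_randint can be checked by drawing a large sample. The sample size and random_state below are arbitrary choices for illustration.

```python
from scipy.stats import randint as sp_randint

# The upper bound is exclusive: only 2, 3, ..., 14 can be drawn.
draws = sp_randint(2, 15).rvs(size=10000, random_state=0)
print(draws.min(), draws.max())   # smallest and largest values drawn

# To include 15, as the Grid Search list does, the upper bound must be 16.
inclusive_draws = sp_randint(2, 16).rvs(size=10000, random_state=0)
```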

Building and Fitting Model: We now build a model using RandomizedSearchCV and fit it on the Train dataset.

python
DTR_RS = RandomizedSearchCV(DTR, param_distributions=param_grid2,n_iter=100)
DTR_RS.fit(X_train,Y_train)
RandomizedSearchCV(cv=None, error_score='raise', estimator=DecisionTreeRegressor(criterion='mse', max_depth=None, max_features=None, max_leaf_nodes=None, min_impurity_decrease=0.0, min_impurity_split=None, min_samples_leaf=1, min_samples_split=2, min_weight_fraction_leaf=0.0, presort=False, random_state=None, splitter='best'), fit_params=None, iid=True, n_iter=100, n_jobs=1, param_distributions={'max_features': ['auto', 'sqrt', 'log2'], 'min_samples_split': <scipy.stats._distn_infrastructure.rv_frozen object at 0x000000000B442C18>, 'min_samples_leaf': <scipy.stats._distn_infrastructure.rv_frozen object at 0x000000000B442D30>, 'max_depth': <scipy.stats._distn_infrastructure.rv_frozen object at 0x000000000B442EF0>}, pre_dispatch='2*n_jobs', random_state=None, refit=True, return_train_score='warn', scoring=None, verbose=0)

Best Parameters: We use the best_params_ command and check for the best parameters obtained from RandomizedSearchCV.

python
DTR_RS.best_params_
{'max_depth': 4, 'max_features': 'auto', 'min_samples_leaf': 2, 'min_samples_split': 5}

Predict and Check Accuracy: The above model with the best parameters is used to predict the values of the dependent variable on the Test dataset, and the accuracy is also calculated.

python
pred_DTR_RS = DTR_RS.predict(X_test)
metrics.r2_score(Y_test,pred_DTR_RS)
0.6853082002078101

The R² comes out to be about 0.69.

K Nearest Neighbour

KNN is a distance-based algorithm. For regression, it predicts the value for an observation by averaging the target values of the k observations closest to it. For a detailed understanding of KNN, refer to K Nearest Neighbour under the Theory Section.

Importing KNeighborsRegressor Package

To run KNN in Python, we require KNeighborsRegressor, which we import from sklearn.neighbors.

python
from sklearn.neighbors import KNeighborsRegressor

Initializing and Fitting KNN Model

In this step, we first initialize the KNN model and then fit it on the Train dataset. Note that this is the Train dataset we used earlier for Regularized Regression: KNN relies on distances between observations, so features measured on larger scales would otherwise dominate. We therefore use X1_train and X1_test, in which all continuous features are standardized, to predict ln_Price on the Test dataset.

python
knnr = KNeighborsRegressor()
knnr.fit(X1_train,Y1_train)
KNeighborsRegressor(algorithm='auto', leaf_size=30, metric='minkowski', metric_params=None, n_jobs=1, n_neighbors=5, p=2, weights='uniform')

Predict and Check Accuracy

The above model is used to predict the values of the dependent variable on the Test dataset, and the accuracy is calculated.

python
pred_knnr = knnr.predict(X1_test)
metrics.r2_score(Y1_test,pred_knnr)
0.7688008263271461

The R² comes out to be about 0.77.

Tuning Hyperparameters

In this blog post, we will tune four hyperparameters for KNN: n_neighbors, the number of neighbours to be considered; algorithm, the method used to find the nearest neighbours, where tree-based methods such as K-D trees help speed up the search; leaf_size, the number of points at which a tree-based search switches to the brute force method; and weights, where an alternative to giving every neighbour an equal say, such as weighting neighbours by distance, can be chosen. (All such hyperparameters have been explained in the theoretical explanation of KNN, so one can refer to it for more information. There are also a number of other hyperparameters to choose from, which you may select if you want more control over the process and desire more accuracy.)
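
The effect of the weights parameter can be worked through on a tiny example. The five points below are made up for illustration: the query at 1.4 has its two nearest neighbours at 1 and 2, so uniform weights give their plain mean, while distance weights pull the prediction towards the closer neighbour.

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

# Five toy points on a line; the target equals the feature value.
X_toy = np.array([[0.0], [1.0], [2.0], [3.0], [10.0]])
y_toy = np.array([0.0, 1.0, 2.0, 3.0, 10.0])
query = np.array([[1.4]])   # nearest neighbours are x=1 (0.4 away) and x=2 (0.6 away)

uniform_knn = KNeighborsRegressor(n_neighbors=2, weights='uniform').fit(X_toy, y_toy)
distance_knn = KNeighborsRegressor(n_neighbors=2, weights='distance').fit(X_toy, y_toy)

pred_uniform = uniform_knn.predict(query)[0]    # plain mean: (1 + 2) / 2
pred_distance = distance_knn.predict(query)[0]  # weights 1/0.4 and 1/0.6
print(pred_uniform, pred_distance)
```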

Grid Search

Defining Parameters: We define the values for our four parameters.

python
params_knn = {'n_neighbors':[5,6,7,8,9,10],
          'leaf_size':[1,2,3,5],
          'weights':['uniform', 'distance'],
          'algorithm':['auto', 'ball_tree','kd_tree','brute']}

Building and Fitting Model: We now build a KNN model using GridSearchCV and fit it on the Train dataset.

python
model_knn1 = GridSearchCV(knnr, param_grid=params_knn)
model_knn1.fit(X1_train,Y1_train)
GridSearchCV(cv=None, error_score='raise', estimator=KNeighborsRegressor(algorithm='auto', leaf_size=30, metric='minkowski', metric_params=None, n_jobs=1, n_neighbors=5, p=2, weights='uniform'), fit_params=None, iid=True, n_jobs=1, param_grid={'n_neighbors': [5, 6, 7, 8, 9, 10], 'leaf_size': [1, 2, 3, 5], 'weights': ['uniform', 'distance'], 'algorithm': ['auto', 'ball_tree', 'kd_tree', 'brute']}, pre_dispatch='2*n_jobs', refit=True, return_train_score='warn', scoring=None, verbose=0)

Best Parameters: We use best_params_ to check for the best parameters.

python
model_knn1.best_params_
{'algorithm': 'auto', 'leaf_size': 1, 'n_neighbors': 5, 'weights': 'distance'}

Predict and Check Accuracy: The above model, with the above-mentioned values of hyperparameters, is used to predict the values of the dependent variable on the Test dataset, and the accuracy is also calculated.

python
pred_knnr_GS = model_knn1.predict(X1_test)
metrics.r2_score(Y1_test,pred_knnr_GS)
0.783204473268023

The accuracy obtained from this model is 78%.

Random Search

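Defining Parameters: We define a range of values for our four parameters. As the upper bound of sp_randint is exclusive, n_neighbors is drawn from 5 to 9 and leaf_size from 1 to 4.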

python
params_knn_rs = {'n_neighbors':sp_randint(5,10),
          'leaf_size':sp_randint(1,5),
          'weights':['uniform', 'distance'],
          'algorithm':['auto', 'ball_tree','kd_tree','brute']}

Building and Fitting Model: We now build a KNN model using RandomizedSearchCV and fit it on the Train dataset.

python
KNN_RS1 = RandomizedSearchCV(knnr, param_distributions=params_knn_rs,n_iter=100)
KNN_RS1.fit(X1_train,Y1_train)
RandomizedSearchCV(cv=None, error_score='raise', estimator=KNeighborsRegressor(algorithm='auto', leaf_size=30, metric='minkowski', metric_params=None, n_jobs=1, n_neighbors=5, p=2, weights='uniform'), fit_params=None, iid=True, n_iter=100, n_jobs=1, param_distributions={'n_neighbors': <scipy.stats._distn_infrastructure.rv_frozen object at 0x000000000BCC5BE0>, 'leaf_size': <scipy.stats._distn_infrastructure.rv_frozen object at 0x000000000BCC5E80>, 'weights': ['uniform', 'distance'], 'algorithm': ['auto', 'ball_tree', 'kd_tree', 'brute']}, pre_dispatch='2*n_jobs', random_state=None, refit=True, return_train_score='warn', scoring=None, verbose=0)

Best Parameters: We now check for the best parameters obtained using Random Search.

python
KNN_RS1.best_params_
{'algorithm': 'ball_tree', 'leaf_size': 2, 'n_neighbors': 5, 'weights': 'distance'}

Predict and Check Accuracy: We use the above model and the best combination of hyperparameters to predict the values of the dependent variable on the Test dataset, and the accuracy is also calculated.

python
pred_knn_RS = KNN_RS1.predict(X1_test)
metrics.r2_score(Y1_test,pred_knn_RS)
0.783204473268023

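The accuracy again comes out to be 78%, identical to the Grid Search result. Both searches selected n_neighbors=5 with weights='distance'; the algorithm and leaf_size settings influence how quickly neighbors are found rather than which neighbors are used, so the predictions coincide.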

Ensemble Models

In the Theory Section, under Ensemble Methods, various kinds of ensemble techniques have been explored. Here we will explore all those ensemble techniques using Python.

Bagging Regressor

Bagging Regressor is an ensemble estimator that fits a base estimator on random subsets of the Train dataset and then aggregates the individual predictions, by averaging in the regression case, to form the final prediction. Here the base estimator is a decision tree.
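
The idea can be sketched in a few lines of plain Python. The code below is a toy illustration rather than scikit-learn's implementation: the base learner is a simple least-squares line instead of a decision tree, and the helper names (fit_line, bagging_fit, bagging_predict) are made up for this example. Each base model is fitted on a bootstrap sample and the predictions are averaged.

```python
import random

def fit_line(xs, ys):
    """Least-squares line y = a*x + b; falls back to a constant if all x are equal."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    if sxx == 0:
        return 0.0, my
    a = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    return a, my - a * mx

def bagging_fit(xs, ys, n_estimators=10, seed=0):
    """Fit one base model per bootstrap sample (drawn with replacement)."""
    rng = random.Random(seed)
    models = []
    for _ in range(n_estimators):
        idx = [rng.randrange(len(xs)) for _ in xs]
        models.append(fit_line([xs[i] for i in idx], [ys[i] for i in idx]))
    return models

def bagging_predict(models, x):
    """Average the predictions of all base models."""
    return sum(a * x + b for a, b in models) / len(models)

xs = list(range(20))
ys = [2 * x + 1 for x in xs]          # noiseless toy data
models = bagging_fit(xs, ys)
print(bagging_predict(models, 25))
```

With noisy data the individual lines would differ from sample to sample, and averaging them is what reduces the variance of the final prediction.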

Importing BaggingRegressor Library

To create a Bagging Regressor model in Python, we import BaggingRegressor from sklearn.ensemble.

python
from sklearn.ensemble import BaggingRegressor

Initializing and Fitting Model

We initialize the Bagging model and fit it on the Train dataset.

python
baggingR = BaggingRegressor()
baggingR.fit(X_train,Y_train)
BaggingRegressor(base_estimator=None, bootstrap=True, bootstrap_features=False, max_features=1.0, max_samples=1.0, n_estimators=10, n_jobs=1, oob_score=False, random_state=None, verbose=0, warm_start=False)

Note that if base_estimator is None, the base estimator is a decision tree. bootstrap controls whether samples are drawn with replacement; bootstrap_features controls whether features are drawn with replacement.

Predict and Check Accuracy

The above model is used to predict the values of the dependent variable on the Test dataset and check its accuracy.

python
bag_test_pred = baggingR.predict(X_test)
metrics.r2_score(Y_test,bag_test_pred)
0.8215941785595104

The accuracy obtained from this Bagging Regression model is 82%.

Tuning Hyperparameters

Here we tune five parameters: n_estimators, max_features, max_samples, bootstrap and bootstrap_features.

Grid Search

Defining Parameters: Here we define our five parameters.

python
params_bag_GS = {"n_estimators": [50, 100, 200],
                 "max_features": [1, 2, 4, 6, 8],
                 "max_samples": [0.5, 0.1],
                 "bootstrap": [True, False],
                 "bootstrap_features": [True, False]}

Initializing, Building and Fitting Model: In this step, we initialize and build the Bagging Regression model using GridSearchCV and fit it on the Train dataset.

python
Bag_model_GS = GridSearchCV(baggingR, param_grid=params_bag_GS)
Bag_model_GS.fit(X_train,Y_train)
GridSearchCV(cv=None, error_score='raise', estimator=BaggingRegressor(base_estimator=None, bootstrap=True, bootstrap_features=False, max_features=1.0, max_samples=1.0, n_estimators=10, n_jobs=1, oob_score=False, random_state=None, verbose=0, warm_start=False), fit_params=None, iid=True, n_jobs=1, param_grid={'n_estimators': [50, 100, 200], 'max_features': [1, 2, 4, 6, 8], 'max_samples': [0.5, 0.1], 'bootstrap': [True, False], 'bootstrap_features': [True, False]}, pre_dispatch='2*n_jobs', refit=True, return_train_score='warn', scoring=None, verbose=0)

Best Parameters: We now check the best combination of parameters.

python
Bag_model_GS.best_params_
{'bootstrap': False, 'bootstrap_features': False, 'max_features': 8, 'max_samples': 0.5, 'n_estimators': 100}

Predict and Check Accuracy: We use this model to predict the dependent variable on the Test dataset and check its accuracy.

python
pred_bag_GS = Bag_model_GS.predict(X_test)
metrics.r2_score(Y_test,pred_bag_GS)
0.8260755379106075

The accuracy comes out to be 82%.

Random Search

Defining Parameters: We provide a distribution of values for the five hyperparameters.

python
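# Note: sp_randint(low, high) excludes high, and uniform(loc, scale) spans
# [loc, loc + scale], so max_samples is drawn from 0.5 to 0.6 here.
params_bagR_RS = {"n_estimators": sp_randint(50, 200),
                  "max_features": sp_randint(1, 8),
                  "max_samples": uniform(0.5, 0.1),
                  "bootstrap": [True, False],
                  "bootstrap_features": [True, False]}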

Building and Fitting Model: We now build a model using RandomizedSearchCV and fit it on the Train dataset.

python
BagR_RS = RandomizedSearchCV(baggingR, param_distributions=params_bagR_RS,n_iter=120)
BagR_RS.fit(X_train,Y_train)
RandomizedSearchCV(cv=None, error_score='raise', estimator=BaggingRegressor(base_estimator=None, bootstrap=True, bootstrap_features=False, max_features=1.0, max_samples=1.0, n_estimators=10, n_jobs=1, oob_score=False, random_state=None, verbose=0, warm_start=False), fit_params=None, iid=True, n_iter=120, n_jobs=1, param_distributions={'n_estimators': <scipy.stats._distn_infrastructure.rv_frozen object at 0x000000000C0ABC18>, 'max_features': <scipy.stats._distn_infrastructure.rv_frozen object at 0x000000000C0ABB70>, 'max_samples': <scipy.stats._distn_infrastructure.rv_frozen object at 0x000000000C2A8CF8>, 'bootstrap': [True, False], 'bootstrap_features': [True, False]}, pre_dispatch='2*n_jobs', random_state=None, refit=True, return_train_score='warn', scoring=None, verbose=0)

Best Parameters: We use the best_params_ attribute to check the best parameters obtained from RandomizedSearchCV.

python
BagR_RS.best_params_
{'bootstrap': False, 'bootstrap_features': False, 'max_features': 7, 'max_samples': 0.5219575872007658, 'n_estimators': 167}
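
The max_samples value of about 0.52 reported above fits the way scipy.stats.uniform is parameterized: its two arguments are loc and scale, not a lower and an upper bound, so the distribution covers [loc, loc + scale]. A quick check, assuming uniform in the code above is scipy.stats.uniform, shows which range max_samples is actually drawn from and how a range of 0.1 to 0.5 (the values used in the grid) could be expressed instead.

```python
from scipy.stats import uniform  # assumed to be the uniform used above

as_written = uniform(0.5, 0.1)   # loc=0.5, scale=0.1
print(as_written.ppf(0.0), as_written.ppf(1.0))   # lower and upper end of the range

# One way to cover 0.1 to 0.5: loc = lower bound, scale = upper - lower.
intended = uniform(loc=0.1, scale=0.4)
print(intended.ppf(0.0), intended.ppf(1.0))
```

Whether the narrower range was intended is not clear from the code alone, so the second distribution is only a suggestion.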

Predict and Check Accuracy: The above model is used to predict the values of the dependent variable on the Test dataset. In this step we also check the accuracy obtained by the model.

python
pred_bagR_RS = BagR_RS.predict(X_test)
metrics.r2_score(Y_test,pred_bagR_RS)
0.8357556872487533

The accuracy comes out to be 83%.

Random Forest Regressor

Random Forest Regressor is a variant of Bagging Regressor, and more about it can be found in the blog Bagging available in the Theory Section.

Importing RandomForestRegressor Library

We have to import RandomForestRegressor from sklearn.ensemble to run a Random Forest Regression model.

python
from sklearn.ensemble import RandomForestRegressor

Initializing and Fitting Model

We initialize the Random Forest model and then fit it on the Train dataset.

python
rfr = RandomForestRegressor()
rfr.fit(X_train,Y_train)
RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=None, max_features='auto', max_leaf_nodes=None, min_impurity_decrease=0.0, min_impurity_split=None, min_samples_leaf=1, min_samples_split=2, min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=1, oob_score=False, random_state=None, verbose=0, warm_start=False)

Predict and Check Accuracy

The above model is used to predict the values of the dependent variable on the Test dataset. We also check the model's performance.

python
rfr_test_pred = rfr.predict(X_test)
metrics.r2_score(Y_test, rfr_test_pred)
0.793422122568467

The accuracy obtained from this Random Forest Regression model is 79%.

Tuning Hyperparameters

Here we tune four parameters: max_depth, max_features, min_samples_split and min_samples_leaf.
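
Of these, max_features accepts string options such as 'auto', 'sqrt' and 'log2' (used in the grid that follows), which translate into a number of features considered at each split. The helper below (features_per_split, a name invented for this sketch) mirrors our understanding of how scikit-learn resolves them for regressors in the version used here, where 'auto' means all features. We believe newer releases have deprecated 'auto', so treat this as illustrative.

```python
import math

def features_per_split(max_features, n_features):
    """Number of features examined at each split for a regression forest."""
    if max_features == 'auto':      # all features (regressors, older scikit-learn)
        return n_features
    if max_features == 'sqrt':
        return max(1, int(math.sqrt(n_features)))
    if max_features == 'log2':
        return max(1, int(math.log2(n_features)))
    raise ValueError("unsupported option: %r" % (max_features,))

n_features = 16  # hypothetical feature count, not taken from the housing data
for option in ['auto', 'sqrt', 'log2']:
    print(option, features_per_split(option, n_features))
```

Smaller values make the individual trees more different from one another, at the cost of each tree seeing less information per split.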

Grid Search

Defining Parameters: First, we define our four parameters.

python
params_RF = {"max_depth": [3, 5, 6, 7, 8, 9],
             "max_features": ['auto', 'sqrt', 'log2'],
             "min_samples_split": [2, 3, 5, 7],
             "min_samples_leaf": [1, 3, 5, 6]}

Initializing, Building and Fitting Model: In this step, we initialize and build the Random Forest Regression model using GridSearchCV and fit it on the Train dataset.

python
model_RF_GS = GridSearchCV(rfr, param_grid=params_RF)
model_RF_GS.fit(X_train,Y_train)
GridSearchCV(cv=None, error_score='raise', estimator=RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=None, max_features='auto', max_leaf_nodes=None, min_impurity_decrease=0.0, min_impurity_split=None, min_samples_leaf=1, min_samples_split=2, min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=1, oob_score=False, random_state=None, verbose=0, warm_start=False), fit_params=None, iid=True, n_jobs=1, param_grid={'max_depth': [3, 5, 6, 7, 8, 9], 'max_features': ['auto', 'sqrt', 'log2'], 'min_samples_split': [2, 3, 5, 7], 'min_samples_leaf': [1, 3, 5, 6]}, pre_dispatch='2*n_jobs', refit=True, return_train_score='warn', scoring=None, verbose=0)

Best Parameters: We now check for the best combination of parameters.

python
model_RF_GS.best_params_
{'max_depth': 9, 'max_features': 'auto', 'min_samples_leaf': 1, 'min_samples_split': 3}

Predict and Check Accuracy: We use this model to predict the dependent variable on the Test dataset and check its accuracy.

python
pred_RF_GS = model_RF_GS.predict(X_test)
metrics.r2_score(Y_test,pred_RF_GS)
0.7972859748762823

The accuracy comes out to be 79%.

Random Search

Defining Parameters: We provide a distribution of values for the four hyperparameters.

python
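# Note: sp_randint(low, high) excludes high, so max_depth is drawn from 3 to 8
# and the value 9 selected by Grid Search is never sampled here.
params_RF_RS = {"max_depth": sp_randint(3, 9),
                "max_features": ['auto', 'sqrt', 'log2'],
                "min_samples_split": sp_randint(2, 7),
                "min_samples_leaf": sp_randint(1, 6)}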

Building and Fitting Model: We now build a model using RandomizedSearchCV and fit it on the Train dataset.

python
rfr_RS = RandomizedSearchCV(rfr, param_distributions=params_RF_RS,n_iter=100)
rfr_RS.fit(X_train,Y_train)
RandomizedSearchCV(cv=None, error_score='raise', estimator=RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=None, max_features='auto', max_leaf_nodes=None, min_impurity_decrease=0.0, min_impurity_split=None, min_samples_leaf=1, min_samples_split=2, min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=1, oob_score=False, random_state=None, verbose=0, warm_start=False), fit_params=None, iid=True, n_iter=100, n_jobs=1, param_distributions={'max_depth': <scipy.stats._distn_infrastructure.rv_frozen object at 0x000000000B302CF8>, 'max_features': ['auto', 'sqrt', 'log2'], 'min_samples_split': <scipy.stats._distn_infrastructure.rv_frozen object at 0x000000000B302940>, 'min_samples_leaf': <scipy.stats._distn_infrastructure.rv_frozen object at 0x000000000B302A58>}, pre_dispatch='2*n_jobs', random_state=None, refit=True, return_train_score='warn', scoring=None, verbose=0)

Best Parameters: We use the best_params_ attribute to check the best parameters obtained from RandomizedSearchCV.

python
rfr_RS.best_params_
{'max_depth': 7, 'max_features': 'auto', 'min_samples_leaf': 2, 'min_samples_split': 4}

Predict and Check Accuracy: The above model is used to predict the values of the dependent variable on the Test dataset. In this step, we also check the accuracy obtained by the model.

python
pred_RF_RS = rfr_RS.predict(X_test)
metrics.r2_score(Y_test,pred_RF_RS)
0.8014520848917908

The accuracy comes out to be 80%.

AdaBoost Regressor

The AdaBoost Regressor builds a sequence of regressors (decision trees). After each round, the training data points that were predicted poorly have their weights increased, i.e. they are boosted, so that the next regressor focuses on them. Here again, a Decision Tree Regressor is used as the base estimator.
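
To make the reweighting concrete, here is a minimal sketch of one round of weight updates in the style of AdaBoost.R2 with the linear loss, which is, to our understanding, the variant behind AdaBoostRegressor. The function name update_weights and the toy numbers are assumptions for illustration; the real estimator also handles resampling, early stopping and the final combination of the regressors.

```python
def update_weights(weights, y_true, y_pred, learning_rate=1.0):
    """One AdaBoost.R2-style reweighting step with the linear loss."""
    abs_err = [abs(t - p) for t, p in zip(y_true, y_pred)]
    max_err = max(abs_err)
    if max_err == 0:                       # perfect fit: nothing to reweight
        return list(weights)
    loss = [e / max_err for e in abs_err]  # per-sample loss scaled to [0, 1]
    avg_loss = sum(w * l for w, l in zip(weights, loss))
    if avg_loss >= 0.5:
        # the algorithm would stop (or discard this learner) at this point
        raise ValueError("average loss is 0.5 or more; boosting stops here")
    beta = avg_loss / (1.0 - avg_loss)     # beta < 1 when the learner is useful
    # Well-predicted points (loss near 0) are shrunk by beta;
    # badly predicted points (loss near 1) keep their weight.
    new = [w * beta ** ((1.0 - l) * learning_rate) for w, l in zip(weights, loss)]
    total = sum(new)
    return [w / total for w in new]

weights = [0.25, 0.25, 0.25, 0.25]
y_true = [1.0, 2.0, 3.0, 4.0]
y_pred = [1.0, 2.0, 3.0, 6.0]   # only the last point is mispredicted
new_weights = update_weights(weights, y_true, y_pred)
print(new_weights)
```

After normalization the mispredicted point carries the largest share of the total weight, which is the "boost" described above.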

Importing AdaBoostRegressor Package

We require AdaBoostRegressor, which we import from sklearn.ensemble, to create an AdaBoost Regression model.

python
from sklearn.ensemble import AdaBoostRegressor

Initializing and Fitting AdaBoost Model

In this step, we first initialize the AdaBoost model, and then fit this model on the Train dataset.

python
AdaBoost = AdaBoostRegressor()
AdaBoost.fit(X_train,Y_train)
AdaBoostRegressor(base_estimator=None, learning_rate=1.0, loss='linear', n_estimators=50, random_state=None)

Predict and Check Accuracy

The above model is used to predict the values of the dependent variable on the Test dataset, and the accuracy is calculated.

python
AdaBoost_test_pred = AdaBoost.predict(X_test)
metrics.r2_score(Y_test,AdaBoost_test_pred)
0.7838558210260586

The accuracy comes out to be 78%.

Tuning Hyperparameters

In this blog post, we will tune three parameters: the number of estimators, the learning rate and the loss function.

Grid Search

Defining Parameters: We first define the values for our parameters.

python
params_AdbR_GS = {'learning_rate': [0.05, 0.1, 0.2, 0.6, 0.8, 1],
                  'n_estimators': [50, 60, 100],
                  'loss': ['linear', 'square', 'exponential']}

Building and Fitting Model: We now build an AdaBoost model using GridSearchCV and fit it on the Train dataset.

python
model_AdaR_GS = GridSearchCV(AdaBoostRegressor(), param_grid=params_AdbR_GS)
model_AdaR_GS.fit(X_train,Y_train)
GridSearchCV(cv=None, error_score='raise', estimator=AdaBoostRegressor(base_estimator=None, learning_rate=1.0, loss='linear', n_estimators=50, random_state=None), fit_params=None, iid=True, n_jobs=1, param_grid={'learning_rate': [0.05, 0.1, 0.2, 0.6, 0.8, 1], 'n_estimators': [50, 60, 100], 'loss': ['linear', 'square', 'exponential']}, pre_dispatch='2*n_jobs', refit=True, return_train_score='warn', scoring=None, verbose=0)

Best Parameters: We use the best_params_ attribute to check for the best parameters.

python
model_AdaR_GS.best_params_
{'learning_rate': 0.6, 'loss': 'square', 'n_estimators': 100}

Predict and Check Accuracy: The above model, with the above-mentioned values of hyperparameters, is used to predict the values of the dependent variable on the Test dataset, and the accuracy is also calculated.

python
pred_AdaR_GS = model_AdaR_GS.predict(X_test)
metrics.r2_score(Y_test,pred_AdaR_GS)
0.7856380310255039

The accuracy obtained from this model is 78%.

Random Search

Defining Parameters: We define distributions for learning_rate and n_estimators, and a list of options for loss.

python
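# Note: uniform(loc, scale) spans [loc, loc + scale], i.e. 0.05 to 1.05 here;
# sp_randint(50, 100) draws integers from 50 to 99.
params_AdbR_RS = {'learning_rate': uniform(0.05, 1),
                  'n_estimators': sp_randint(50, 100),
                  'loss': ['linear', 'square', 'exponential']}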

Building and Fitting Model: We now build an AdaBoost model using RandomizedSearchCV and fit it on the Train dataset.

python
AdaR_RS = RandomizedSearchCV(AdaBoostRegressor(), param_distributions=params_AdbR_RS,n_iter=100)
AdaR_RS.fit(X_train,Y_train)
RandomizedSearchCV(cv=None, error_score='raise', estimator=AdaBoostRegressor(base_estimator=None, learning_rate=1.0, loss='linear', n_estimators=50, random_state=None), fit_params=None, iid=True, n_iter=100, n_jobs=1, param_distributions={'learning_rate': <scipy.stats._distn_infrastructure.rv_frozen object at 0x000000000B3BD748>, 'n_estimators': <scipy.stats._distn_infrastructure.rv_frozen object at 0x000000000B3BD1D0>, 'loss': ['linear', 'square', 'exponential']}, pre_dispatch='2*n_jobs', random_state=None, refit=True, return_train_score='warn', scoring=None, verbose=0)

Best Parameters: We now check for the best parameters obtained using Random Search.

python
AdaR_RS.best_params_
{'learning_rate': 0.8093273594487468, 'loss': 'square', 'n_estimators': 56}

Predict and Check Accuracy: We use the above model and the best combination of hyperparameters to predict the values of the dependent variable on the Test dataset, and the accuracy is also calculated.

python
pred_AdaR_RS = AdaR_RS.predict(X_test)
metrics.r2_score(Y_test,pred_AdaR_RS)
0.7848616580545453

The accuracy comes out to be 78%.

Gradient Boosting Regressor

Gradient Boosting Regressor is another type of Boosting Model. Refer to the blog Boosting under Ensemble Methods in the Theory Section to know more about it.
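
As a reminder of the mechanics, the sketch below implements a bare-bones version of gradient boosting for squared-error loss in plain Python. It is a simplified illustration under our own assumptions, not the GradientBoostingRegressor implementation: the inputs are one-dimensional, the base learner is a single-split stump, and the helper names (fit_stump, gradient_boost, mse) are invented for this example. Each round fits a stump to the current residuals and adds a shrunken copy of it to the ensemble.

```python
def fit_stump(xs, residuals):
    """Regression stump on 1-D inputs: best single threshold by squared error."""
    best = None
    for t in sorted(set(xs))[:-1]:   # skip the largest x so both sides are non-empty
        left = [r for x, r in zip(xs, residuals) if x <= t]
        right = [r for x, r in zip(xs, residuals) if x > t]
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        sse = sum((r - lm) ** 2 for r in left) + sum((r - rm) ** 2 for r in right)
        if best is None or sse < best[0]:
            best = (sse, t, lm, rm)
    if best is None:                 # all x identical: predict the mean residual
        mean = sum(residuals) / len(residuals)
        return lambda x: mean
    _, t, lm, rm = best
    return lambda x: lm if x <= t else rm

def gradient_boost(xs, ys, n_estimators=50, learning_rate=0.1):
    """Squared-error boosting: each stump is fitted to the current residuals."""
    base = sum(ys) / len(ys)                       # initial prediction: the mean
    preds = [base] * len(ys)
    stumps = []
    for _ in range(n_estimators):
        # For squared loss, the negative gradient is simply the residual.
        residuals = [y - p for y, p in zip(ys, preds)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        preds = [p + learning_rate * stump(x) for x, p in zip(xs, preds)]
    return lambda x: base + learning_rate * sum(s(x) for s in stumps)

def mse(model, xs, ys):
    """Mean squared error of a fitted model on (xs, ys)."""
    return sum((y - model(x)) ** 2 for x, y in zip(xs, ys)) / len(xs)

xs = list(range(10))
ys = [x ** 2 for x in xs]            # toy data
few = gradient_boost(xs, ys, n_estimators=5)
many = gradient_boost(xs, ys, n_estimators=50)
print(mse(few, xs, ys), mse(many, xs, ys))
```

The learning_rate and n_estimators arguments play the same roles as the parameters of the same name tuned later in this section: a smaller learning rate generally needs more estimators to reach the same training error.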

Importing GradientBoostingRegressor Library

To create a Gradient Boosting Regression model in Python, we import GradientBoostingRegressor from sklearn.ensemble.

python
from sklearn.ensemble import GradientBoostingRegressor

Initializing and Fitting Model

We initialize the model and fit it on the Train dataset.

python
GBR = GradientBoostingRegressor()
GBR.fit(X_train,Y_train)
GradientBoostingRegressor(alpha=0.9, criterion='friedman_mse', init=None, learning_rate=0.1, loss='ls', max_depth=3, max_features=None, max_leaf_nodes=None, min_impurity_decrease=0.0, min_impurity_split=None, min_samples_leaf=1, min_samples_split=2, min_weight_fraction_leaf=0.0, n_estimators=100, presort='auto', random_state=None, subsample=1.0, verbose=0, warm_start=False)

Predict and Check Accuracy

The above model is used to predict the values of the dependent variable on the Test dataset and check its accuracy.

python
GBR_test_pred = GBR.predict(X_test)
metrics.r2_score(Y_test, GBR_test_pred)
0.8444921275673007

The accuracy obtained from this Gradient Boosting Regression model is 84%.

Tuning Hyperparameters

Here we tune six parameters: max_depth, max_features, min_samples_split, min_samples_leaf, learning_rate and n_estimators.

Grid Search

Defining Parameters: Here we define our six parameters.

python
params_GBR_GS = {"max_depth": [3, 5, 6, 7],
                 "max_features": ['auto', 'sqrt', 'log2'],
                 "min_samples_split": [2, 3, 10],
                 "min_samples_leaf": [1, 3, 10],
                 "learning_rate": [0.05, 0.1, 0.2],
                 "n_estimators": [10, 30, 50, 70]}

Initializing, Building and Fitting Model: In this step, we initialize and build the Gradient Boosting Regression model using GridSearchCV and fit it on the Train dataset.

python
model_GradR2_GS = GridSearchCV(GradientBoostingRegressor(), param_grid=params_GBR_GS)
model_GradR2_GS.fit(X_train,Y_train)
GridSearchCV(cv=None, error_score='raise', estimator=GradientBoostingRegressor(alpha=0.9, criterion='friedman_mse', init=None, learning_rate=0.1, loss='ls', max_depth=3, max_features=None, max_leaf_nodes=None, min_impurity_decrease=0.0, min_impurity_split=None, min_samples_leaf=1, min_samples_split=2, min_weight_fraction_leaf=0.0, n_estimators=100, presort='auto', random_state=None, subsample=1.0, verbose=0, warm_start=False), fit_params=None, iid=True, n_jobs=1, param_grid={'max_depth': [3, 5, 6, 7], 'max_features': ['auto', 'sqrt', 'log2'], 'min_samples_split': [2, 3, 10], 'min_samples_leaf': [1, 3, 10], 'learning_rate': [0.05, 0.1, 0.2], 'n_estimators': [10, 30, 50, 70]}, pre_dispatch='2*n_jobs', refit=True, return_train_score='warn', scoring=None, verbose=0)

Best Parameters: We now check the best combination of parameters.

python
model_GradR2_GS.best_params_
{'learning_rate': 0.1, 'max_depth': 6, 'max_features': 'log2', 'min_samples_leaf': 1, 'min_samples_split': 3, 'n_estimators': 70}

Predict and Check Accuracy: We use this model to predict the dependent variable on the Test dataset and check its accuracy.

python
pred_GradR_GS = model_GradR2_GS.predict(X_test)
metrics.r2_score(Y_test,pred_GradR_GS)
0.8685388497033237

The accuracy comes out to be 86%.

Random Search

Defining Parameters: We provide a distribution of values for the six hyperparameters.

python
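# Note: sp_randint(low, high) excludes high (e.g. max_depth is 3 to 6), and
# uniform(loc, scale) spans [loc, loc + scale], i.e. 0.05 to 0.25 here.
params_GBR_RS = {"max_depth": sp_randint(3, 7),
                 "max_features": ['auto', 'sqrt', 'log2'],
                 "min_samples_split": sp_randint(2, 10),
                 "min_samples_leaf": sp_randint(1, 10),
                 "learning_rate": uniform(0.05, 0.2),
                 "n_estimators": sp_randint(10, 70)}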

Building and Fitting Model: We now build a model using RandomizedSearchCV and fit it on the Train dataset.

python
GradR_RS = RandomizedSearchCV(GradientBoostingRegressor(), param_distributions=params_GBR_RS,n_iter=100)
GradR_RS.fit(X_train,Y_train)
RandomizedSearchCV(cv=None, error_score='raise', estimator=GradientBoostingRegressor(alpha=0.9, criterion='friedman_mse', init=None, learning_rate=0.1, loss='ls', max_depth=3, max_features=None, max_leaf_nodes=None, min_impurity_decrease=0.0, min_impurity_split=None, min_samples_leaf=1, min_samples_split=2, min_weight_fraction_leaf=0.0, n_estimators=100, presort='auto', random_state=None, subsample=1.0, verbose=0, warm_start=False), fit_params=None, iid=True, n_iter=100, n_jobs=1, param_distributions={'max_depth': <scipy.stats._distn_infrastructure.rv_frozen object at 0x000000000BF805F8>, 'max_features': ['auto', 'sqrt', 'log2'], 'min_samples_split': <scipy.stats._distn_infrastructure.rv_frozen object at 0x000000000BF80DD8>, 'min_samples_leaf': <scipy.stats._distn_infrastruct...F804E0>, 'n_estimators': <scipy.stats._distn_infrastructure.rv_frozen object at 0x000000000BF80550>}, pre_dispatch='2*n_jobs', random_state=None, refit=True, return_train_score='warn', scoring=None, verbose=0)

Best Parameters: We use the best_params_ attribute to check the best parameters obtained from RandomizedSearchCV.

python
GradR_RS.best_params_
{'learning_rate': 0.14410563141822577, 'max_depth': 4, 'max_features': 'sqrt', 'min_samples_leaf': 1, 'min_samples_split': 3, 'n_estimators': 52}

Predict and Check Accuracy: The above model is used to predict the values of the dependent variable on the Test dataset. In this step we also check for the accuracy obtained by the model.

python
pred_GradR_RS = GradR_RS.predict(X_test)
metrics.r2_score(Y_test,pred_GradR_RS)
0.8552362957595668

The accuracy comes out to be 85%.

XGBoost Regressor

XGBoost stands for Extreme Gradient Boosting. It is an optimized, regularized implementation of Gradient Boosting.

Installing and Importing Library

We first download the file xgboost-0.7-cp36-cp36m-win_amd64.whl from https://www.lfd.uci.edu/~gohlke/pythonlibs/#xgboost and install the file using the command prompt, as shown below.

Figure: Command prompt output of installing the xgboost-0.7 wheel file with pip, ending with 'Successfully installed xgboost-0.7'.

We now import XGBRegressor from xgboost to run an Extreme Gradient Boosting model.

python
from xgboost import XGBRegressor

Initializing and Fitting Model

We initialize the model and fit it on the Train dataset.

python
xgbr = XGBRegressor()
xgbr.fit(X_train,Y_train)
XGBRegressor(base_score=0.5, booster='gbtree', colsample_bylevel=1, colsample_bytree=1, gamma=0, learning_rate=0.1, max_delta_step=0, max_depth=3, min_child_weight=1, missing=None, n_estimators=100, n_jobs=1, nthread=None, objective='reg:linear', random_state=0, reg_alpha=0, reg_lambda=1, scale_pos_weight=1, seed=None, silent=True, subsample=1)

Prediction and Accuracy

The above model is used to predict the values of the dependent variable on the Test dataset. We also check the model's performance.

python
pred_xgbr = xgbr.predict(X_test)
metrics.r2_score(Y_test,pred_xgbr)
0.8534468157859386

The accuracy obtained from this XGBoost Regression model is 85%.

Tuning Hyperparameters

Here we tune four parameters: max_depth, min_child_weight, learning_rate and n_estimators.
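
Of these, min_child_weight is perhaps the least intuitive. As we read the XGBoost documentation, a leaf's value is -G / (H + lambda), where G and H are the sums of the first and second derivatives of the loss over the samples in the leaf, and min_child_weight is a lower bound on H in each child. For squared-error loss each sample contributes a second derivative of 1, so H is simply the sample count and min_child_weight acts like a minimum leaf size. The sketch below works this out for the squared-error case; leaf_weight and split_allowed are hypothetical helpers, not XGBoost functions.

```python
def leaf_weight(residuals, reg_lambda=1.0):
    """Optimal leaf value for squared-error loss: sum(residuals) / (n + lambda)."""
    grad_sum = -sum(residuals)        # G: gradient of 0.5*(pred - y)^2 is pred - y
    hess_sum = float(len(residuals))  # H: second derivative is 1 per sample
    return -grad_sum / (hess_sum + reg_lambda)

def split_allowed(n_left, n_right, min_child_weight):
    """With squared-error loss, the hessian sum in a child equals its sample count."""
    return n_left >= min_child_weight and n_right >= min_child_weight

residuals = [1.0, 2.0, 3.0]                    # toy residuals in one leaf
print(leaf_weight(residuals, reg_lambda=1.0))  # shrunk towards 0 relative to the mean residual
print(split_allowed(5, 3, min_child_weight=4))
```

Larger values of min_child_weight therefore make the trees more conservative, which is why it is often tuned together with max_depth.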

Grid Search

Defining Parameters: First, we define our four parameters.

python
params_xgbR_GS = {"max_depth": [3, 4, 5, 6, 7, 8],
                  "min_child_weight": [4, 5, 6, 7, 8],
                  "learning_rate": [0.05, 0.1, 0.2, 0.25, 0.8, 1],
                  "n_estimators": [10, 30, 50, 70, 80, 100]}

Initializing, Building and Fitting Model: In this step, we initialize and build the XGBoost Regression model using GridSearchCV and fit it on the Train dataset. We also import warnings and call filterwarnings so that the warnings raised during the search are suppressed.

python
import warnings
warnings.filterwarnings("ignore")

model_xgbR_GS = GridSearchCV(XGBRegressor(), param_grid=params_xgbR_GS)
model_xgbR_GS.fit(X_train,Y_train)
GridSearchCV(cv=None, error_score='raise', estimator=XGBRegressor(base_score=0.5, booster='gbtree', colsample_bylevel=1, colsample_bytree=1, gamma=0, learning_rate=0.1, max_delta_step=0, max_depth=3, min_child_weight=1, missing=None, n_estimators=100, n_jobs=1, nthread=None, objective='reg:linear', random_state=0, reg_alpha=0, reg_lambda=1, scale_pos_weight=1, seed=None, silent=True, subsample=1), fit_params=None, iid=True, n_jobs=1, param_grid={'max_depth': [3, 4, 5, 6, 7, 8], 'min_child_weight': [4, 5, 6, 7, 8], 'learning_rate': [0.05, 0.1, 0.2, 0.25, 0.8, 1], 'n_estimators': [10, 30, 50, 70, 80, 100]}, pre_dispatch='2*n_jobs', refit=True, return_train_score='warn', scoring=None, verbose=0)
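
Note that filterwarnings("ignore") silences every warning for the rest of the session, including ones that may matter later. A narrower alternative, sketched below with a stand-in function (noisy_fit is hypothetical and simply emits a warning), is to suppress warnings only inside a with block.

```python
import warnings

def noisy_fit():
    """Stand-in for a fit call that emits a deprecation warning."""
    warnings.warn("this option is deprecated", DeprecationWarning)
    return "fitted"

# Warnings are ignored only inside this block; the previous filters
# are restored automatically on exit.
with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    result = noisy_fit()

print(result)
```

In the search above, the fit call would take the place of noisy_fit() inside the block.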

Best Parameters: We now check for the best combination of parameters.

python
model_xgbR_GS.best_params_
{'learning_rate': 0.1, 'max_depth': 4, 'min_child_weight': 4, 'n_estimators': 100}

Predict and Check Accuracy: We use this model to predict the dependent variable on the Test dataset and check its accuracy.

python
pred_xgbR_GS = model_xgbR_GS.predict(X_test)
metrics.r2_score(Y_test,pred_xgbR_GS)
0.8437033612232108

The accuracy comes out to be 84%.

Random Search

Defining Parameters: We provide a distribution of values for the four hyperparameters.

python
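# Note: uniform(loc, scale) spans [loc, loc + scale], so learning_rate can be
# drawn from 0.05 up to 1.05; sp_randint(low, high) excludes high.
params_xgbR_RS = {"max_depth": sp_randint(3, 8),
                  "min_child_weight": sp_randint(4, 8),
                  "learning_rate": uniform(0.05, 1),
                  "n_estimators": sp_randint(10, 100)}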

Building and Fitting Model: We now build a model using RandomizedSearchCV and fit it on the Train dataset.

python
XGB_RS = RandomizedSearchCV(XGBRegressor(), param_distributions=params_xgbR_RS,n_iter=150)
XGB_RS.fit(X_train,Y_train)
RandomizedSearchCV(cv=None, error_score='raise', estimator=XGBRegressor(base_score=0.5, booster='gbtree', colsample_bylevel=1, colsample_bytree=1, gamma=0, learning_rate=0.1, max_delta_step=0, max_depth=3, min_child_weight=1, missing=None, n_estimators=100, n_jobs=1, nthread=None, objective='reg:linear', random_state=0, reg_alpha=0, reg_lambda=1, scale_pos_weight=1, seed=None, silent=True, subsample=1), fit_params=None, iid=True, n_iter=150, n_jobs=1, param_distributions={'max_depth': <scipy.stats._distn_infrastructure.rv_frozen object at 0x000000000C7F8EB8>, 'min_child_weight': <scipy.stats._distn_infrastructure.rv_frozen object at 0x000000000C7F8EF0>, 'learning_rate': <scipy.stats._distn_infrastructure.rv_frozen object at 0x000000000C7F8278>, 'n_estimators': <scipy.stats._distn_infrastructure.rv_frozen object at 0x000000000C7F87F0>}, pre_dispatch='2*n_jobs', random_state=None, refit=True, return_train_score='warn', scoring=None, verbose=0)

Best Parameters: We use the best_params_ attribute to check the best parameters obtained from RandomizedSearchCV.

python
XGB_RS.best_params_
{'learning_rate': 0.18687344945434453, 'max_depth': 7, 'min_child_weight': 4, 'n_estimators': 73}

Predict and Check Accuracy: The above model is used to predict the values of the dependent variable on the Test dataset. In this step, we also check for the accuracy obtained by the model.

python
pred_xgb_RS = XGB_RS.predict(X_test)
metrics.r2_score(Y_test,pred_xgb_RS)
0.8563089495625015

We get 85% accuracy from this model.

Stacking Regressor

Stacking is a method where we train multiple learning algorithms and combine their predictions through a second-stage model, the meta-regressor. In this blog, we will perform a Level-One stacking. To know more about it, refer to the blog Stacking under the Theory Section.

Importing Library

We import StackingRegressor, which will allow us to create a stacked regression model.

python
from mlxtend.regressor import StackingRegressor

Initializing Individual Models

We now initialize all the models that we require at Level-0 and in the meta-layer of the stacked model.

python
mod1 = KNeighborsRegressor()
mod2 = RandomForestRegressor()
mod3 = Ridge()
lr = LinearRegression()

Initializing the Stacked Model

We finally initialize the stacked model, with KNN, Random Forest, Ridge Regression and Linear Regression at Level-0 and Linear Regression in the meta-layer.

python
sr = StackingRegressor(regressors=[mod1, mod2, mod3, lr],
                       meta_regressor=lr)

Fitting Model

We fit the stacked regression model on the Train dataset.

python
sr.fit(X_train,Y_train)
StackingRegressor(meta_regressor=LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False), regressors=[KNeighborsRegressor(algorithm='auto', leaf_size=30, metric='minkowski', metric_params=None, n_jobs=1, n_neighbors=5, p=2, weights='uniform'), RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=None, max_features='auto', max_leaf_nodes=None, ...er='auto', tol=0.001), LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)], store_train_meta_features=False, verbose=0)
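
mlxtend's StackingRegressor, as we understand it, trains the meta-regressor on predictions that the Level-0 models make for the same rows they were fitted on. scikit-learn (0.22 and later) ships its own StackingRegressor, which uses cross-validated predictions for the meta-layer instead. A hedged sketch on synthetic data follows; the dataset, seeds and n_estimators value are placeholders chosen for this example, not the housing data used above.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor, StackingRegressor
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor

# Synthetic stand-in data (assumption: not the housing dataset).
X, y = make_regression(n_samples=200, n_features=5, noise=5.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

# Level-0 models are given as (name, estimator) pairs.
level0 = [('knn', KNeighborsRegressor()),
          ('rf', RandomForestRegressor(n_estimators=20, random_state=0)),
          ('ridge', Ridge())]

# The meta-regressor is trained on cross-validated Level-0 predictions.
stack = StackingRegressor(estimators=level0, final_estimator=LinearRegression(), cv=5)
stack.fit(X_tr, y_tr)

score = r2_score(y_te, stack.predict(X_te))
print(round(score, 3))
```

Using out-of-fold predictions for the meta-layer is one common way to keep the meta-regressor from simply trusting whichever Level-0 model overfits the Train dataset the most.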

Predicting and Checking Accuracy

We now predict the dependent variable on the Test dataset and, on the basis of these predictions, check the accuracy of this stacked model.

python
sr_pred = sr.predict(X_test)
metrics.r2_score(Y_test,sr_pred)
0.7910053569359522

We get a 79% accuracy from this model.
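
For a side-by-side view, the R² scores reported in this section can be collected and ranked. The numbers below are copied from the outputs above and rounded to four decimals; the labels are our own shorthand.

```python
# R² scores on the Test dataset, copied from the outputs above.
scores = {
    'KNN (grid search)': 0.7832,
    'KNN (random search)': 0.7832,
    'Bagging': 0.8216,
    'Bagging (grid search)': 0.8261,
    'Bagging (random search)': 0.8358,
    'Random Forest': 0.7934,
    'Random Forest (grid search)': 0.7973,
    'Random Forest (random search)': 0.8015,
    'AdaBoost': 0.7839,
    'AdaBoost (grid search)': 0.7856,
    'AdaBoost (random search)': 0.7849,
    'Gradient Boosting': 0.8445,
    'Gradient Boosting (grid search)': 0.8685,
    'Gradient Boosting (random search)': 0.8552,
    'XGBoost': 0.8534,
    'XGBoost (grid search)': 0.8437,
    'XGBoost (random search)': 0.8563,
    'Stacking': 0.7910,
}

# Rank from best to worst.
ranking = sorted(scores.items(), key=lambda item: item[1], reverse=True)
for name, score in ranking:
    print(f"{name:35s} {score:.4f}")
```

Since the searches were run without a fixed random_state, small differences between neighbouring entries should not be over-interpreted.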

In this blog, we put into practice the various regression algorithms covered in the Theory Section of Modeling. The same algorithms are implemented in R in the blog Regression Problems in R.
