// application · python

Miscellaneous Methods in Python

Before we perform modeling on our datasets, we need to prepare the dataset first. This pre-processing of data includes various steps which were explored in Miscellaneous Methods under the section Data Exploration and Preparation. In this blog, we will explore the application of various data preparation methods in Python. These methods include the following:

Consolidation of Data Sets
Missing Value Treatment
Outlier Treatment

Note that univariate and bivariate analysis, which are methods of data exploration (and were discussed in the theory section under Data Exploration and Preparation), have already been discussed in the blogs related to the application of descriptive and inferential statistics, as these methods of analysis use various descriptive and inferential statistics.

Consolidation of Data Sets

There are mainly two ways of consolidating datasets: appending and merging.

Importing preliminary libraries: We first start off by adding some important libraries that will help us in working.

python

import pandas as pd
import numpy as np
import matplotlib
%matplotlib inline

Appending

Appending of datasets is concatenating or stacking datasets either row-wise or column-wise (whereas merging of datasets is combining datasets based on a common variable).

Import Dataset: Here we have taken two datasets of cars sold by a dealer. (Please note that all datasets used in the sections of coding are hypothetical. If any dataset has been taken from an outside source, the link of the source will be mentioned for that particular dataset.) Note that the variable names should be the same while appending datasets.

We begin with importing the first car dataset.

python

CarData1 = pd.read_excel("C:/Users/user/Desktop/Data Sets/Car_Data1.xls")
CarData1

CarData1 dataset with Car Name and Sales in 1000s columns — Output: the CarData1 dataset.

We now upload the second car dataset.

python

CarData2 = pd.read_excel("C:/Users/user/Desktop/Data Sets/Car_Data2.xls")
CarData2

CarData2 dataset with Car Name and Sales in 1000s columns — Output: the CarData2 dataset.

Appending Datasets Row-wise

Here we append the datasets by concatenating the two datasets row-wise. We use the following command to append the two datasets - CarData1 and CarData2.

python

row_concat = pd.concat([CarData1,CarData2])
row_concat

row_concat dataset, CarData1 and CarData2 stacked row-wise — Output: row_concat.

Exporting output of the appended dataset: We can use the to_excel command to export the results to an excel file.

python

row_concat.to_excel("C:/Users/user/Desktop/Data Sets/row_concat.xls")

Appending Datasets Column-wise

Here we append datasets by concatenating the two datasets column-wise. Here we use the function concat from the pandas library where we set the axis to 1, as by mentioning 'axis=1' we tell the program to concatenate datasets column-wise. (Note that the R equivalent of pandas concat is cbind.)

python

col_concat = pd.concat([CarData1,CarData2],axis=1)
col_concat

col_concat dataset, CarData1 and CarData2 joined column-wise — Output: col_concat.

Exporting output of the appended dataset: The way we did before, we use to_excel to export the output to excel.

python

col_concat.to_excel("C:/Users/user/Desktop/Data Sets/col_concat.xls")

Renaming Columns: We rename the columns so as to easily differentiate between the two datasets.

python

col_concat.columns=['Car_Name1','Sales in 1000s','Car_Name2','Sales in 1000s']
col_concat

col_concat dataset with renamed columns — Output: col_concat with renamed columns.

Merging

Merging of datasets is mainly used to combine datasets with common values. We use joins to merge datasets based on our requirement. First, we will do the normal merging of datasets. This is equivalent to appending of datasets column-wise. In merging, at least one variable has to be common in both datasets, on the basis of which merging will take place.

Importing Dataset: We import a dataset where, unlike CarData1 and CarData2 (where we had the sales information of the car), here we will have a Price variable.

python

CarData3 = pd.read_excel("C:/Users/user/Desktop/Data Sets/Car_Data3.xls")

Simple Merge: We use the merge function from pandas. The following command is equivalent to appending the datasets column-wise.

python

Merge1 = pd.merge(CarData1,CarData3,on='Car Name',sort=False)
Merge1

Merge1 dataset, CarData1 merged with CarData3 on Car Name — Output: Merge1.

Ways of Joining Tables

As mentioned in Consolidation of Datasets, there are various ways of joining two datasets and we will be exploring some of them here.

Import Datasets: In order to explain different joins, we will be using two datasets of a store. One has information on the product bought by the customer, and the other dataset has the city of residence of that customer.

Importing the first Store dataset.

python

StoreData1 = pd.read_excel("C:/Users/user/Desktop/Data Sets/Store_Data1.xls")
StoreData1

StoreData1 dataset with Cust_ID and Product columns — Output: the StoreData1 dataset.

Importing the second Store dataset.

python

StoreData2 = pd.read_excel("C:/Users/user/Desktop/Data Sets/Store_Data2.xls")
StoreData2

StoreData2 dataset with Cust_ID and City columns — Output: the StoreData2 dataset.

We can now perform the various methods of merging these two datasets. In this blog, we will perform Inner Join, Outer Join, Left Join and Right Join.

Inner Join

An inner join is used to find the common information between the two datasets, i.e. it is the intersection of two datasets. Here it takes the common entries of the customers from both the datasets and makes a table with their corresponding entries.

python

Inner1 = pd.merge(StoreData1,StoreData2,how='inner',on='Cust_ID')
Inner1

Inner1 dataset, the inner join of StoreData1 and StoreData2 — Output: Inner1.

Outer Join

Outer Join takes all the entries of both the datasets even if the entry for a variable is missing for that customer in either of the datasets. Basically, it is the union of both the datasets.

python

Outer1 = pd.merge(StoreData1,StoreData2,how='outer',on='Cust_ID')
Outer1

Outer1 dataset, the outer join of StoreData1 and StoreData2 — Output: Outer1.

Left Join

Left join gives you the details of customers in dataset 1 and their corresponding entries from dataset 2. It does not take any entry of the customer that is present in dataset 2 and not in dataset 1.

python

Left1 = pd.merge(StoreData1,StoreData2,how='left',on='Cust_ID')
Left1

Left1 dataset, the left join of StoreData1 and StoreData2 — Output: Left1.

Right Join

Right Join is similar to the left join. While the Right Join takes all entries of dataset 2 and corresponding entries of dataset 1. In the left join, it's the other way around.

python

Right1 = pd.merge(StoreData1,StoreData2,how='right',on='Cust_ID')
Right1

Right1 dataset, the right join of StoreData1 and StoreData2 — Output: Right1.

Missing Value Treatment

The next main step of data preparation is missing value treatment. The data at hand may or may not have missing values. But if the dataset has missing values, then we will have to treat them first, either by removing them or replacing them with the mean, median or mode, depending on the type of the variable.

Import Dataset: We use a car sales dataset having multiple variables and perform missing value treatment on it.

python

carsales = pd.read_csv("C:/Users/user/Desktop/Data Sets/car_sales.csv")

Checking Missing Values: Before treating the missing values, we need to check if the data has missing values or not. In order to check for the missing values, we use the isnull function in Python. Here we count the missing values and take a sum of them so that for each variable we know the number of missing values present in them.

python

carsales.isnull().sum()

Manufacturer 0 Model 0 Sales_in_thousands 0 __year_resale_value 36 Vehicle_type 0 Price_in_thousands 2 Engine_size 1 Horsepower 1 Wheelbase 1 Width 1 Length 1 Curb_weight 2 Fuel_capacity 1 Fuel_efficiency 3 Latest_Launch 0 Power_perf_factor 2 dtype: int64

Methods of Treating Missing Values

Now the missing values can be either in the continuous or categorical variables. There are mainly two methods to deal with missing values, which are discussed below.

Remove the missing values: Here we use the dropna command to drop the observations where we have missing values. As discussed in Missing Value Treatment (under the Theory section), this method causes loss of valuable data. However, sometimes the datasets have very few missing values, therefore we can remove those observations by using this method.

python

carsales1 = carsales.dropna()

We can now again use the isnull function to see if there are any missing values in the dataset or not.

python

carsales1.isnull().sum()

Manufacturer 0 Model 0 Sales_in_thousands 0 __year_resale_value 0 Vehicle_type 0 Price_in_thousands 0 Engine_size 0 Horsepower 0 Wheelbase 0 Width 0 Length 0 Curb_weight 0 Fuel_capacity 0 Fuel_efficiency 0 Latest_Launch 0 Power_perf_factor 0 dtype: int64

Mean, Median and Mode Imputation: In this method, we impute the missing values with the mean, median or mode of their respective variable.

Mean Imputation

We impute the missing values with the mean of their variable.

Import Dataset: We import a dataset having missing values in a continuous variable.

python

CAR1 = pd.read_excel("C:/Users/user/Desktop/Data Sets/CAR1.xls")

CAR1 dataset with Car Name and X__year_resale_value columns, including missing values — Output: the CAR1 dataset.

Performing Mean Imputation: We now perform mean imputation using the fillna function, which helps us replace the missing values with the mean.

python

CAR1 = CAR1['X__year_resale_value'].fillna(CAR1['X__year_resale_value'].mean())

0 16.360000 1 193875.000000 2 18.225000 3 29.725000 4 22.255000 5 23.555000 6 39.000000 7 10791.750556 8 28.675000 9 36.125000 10 12.475000 11 13.740000 12 20.190000 13 13.360000 14 22.525000 15 27.100000 16 25.725000 17 18.225000 18 10791.750556 19 9.250000 Name: X__year_resale_value, dtype: float64

Median Imputation

We impute the missing values with the median value of their variable.

Import Dataset: We import a dataset having missing values in a continuous variable.

python

CAR2 = pd.read_excel("C:/Users/user/Desktop/Data Sets/CAR1.xls")

Performing Median Imputation: We perform median imputation using the fillna function.

python

CAR2 = CAR2['X__year_resale_value'].fillna(CAR2['X__year_resale_value'].median())
CAR2

0 16.360 1 193875.000 2 18.225 3 29.725 4 22.255 5 23.555 6 39.000 7 22.390 8 28.675 9 36.125 10 12.475 11 13.740 12 20.190 13 13.360 14 22.525 15 27.100 16 25.725 17 18.225 18 22.390 19 9.250 Name: X__year_resale_value, dtype: float64

Mode Imputation

We impute the missing values of a categorical variable with its mode.

Import Dataset: We import a dataset having missing values in a categorical variable.

python

CAR3 = pd.read_excel("C:/Users/user/Desktop/Data Sets/CarTypeData.xls")

CAR3 dataset with Car Name, Body Style and Sales in 1000s columns, including missing Body Style values — Output: the CAR3 dataset.

Import Library: We import the statistics library for performing mode imputation.

python

import statistics

Calculating Mode: We use the statistics library to calculate the mode of variables. For example, we calculate the mode of the variable 'Body Style'.

python

mode = statistics.mode(CAR3['Body Style'])
mode

'Sedan'

We find that the mode comes out to be 'Sedan', which means that we should replace all the missing values of the 'Body Style' variable with 'Sedan'.

Performing Mode Imputation: We perform mode imputation using the fillna function.

python

CAR3 = CAR3['Body Style'].fillna(CAR3['Body Style'].mode()[0])

0 Sedan 1 SUV 2 Hatch-Back 3 SUV 4 SUV 5 Hatch-Back 6 Sedan 7 Sedan 8 Sedan 9 Sedan 10 Hatch-Back 11 SUV 12 SUV 13 Sedan 14 Sedan 15 Sedan 16 Sedan 17 Sedan 18 Sedan 19 SUV Name: Body Style, dtype: object

There are other sophisticated methods, such as using KNN, Linear and Logistic Models, to predict the missing values and replace the missing values with them. All such models have been discussed in the Application part of the Modeling section.

Outlier Treatment

Apart from Missing Value Treatment, the other most important and crucial preprocessing/data preparation step is Outlier Treatment. We have found under the Application of Basic Statistics how outliers can affect the dataset, therefore it is very important to do outlier treatment.

Import Dataset: We import a dataset that contains an outlier.

python

Outlier = pd.read_excel("C:/Users/user/Desktop/Data Sets/Outlier 1.xls")
Outlier

Outlier dataset with x and y columns — Output: the Outlier dataset.

Method of Treating Outliers: In order to do outlier treatment, we calculate a benchmark which will replace the outliers. For this, we take the help of the following:

Boxplot

We can create boxplots and identify the outliers. We can also use them to decide the value at which we might want to cap the outliers.

python

Outlier.boxplot(column="y",vert=False)

Boxplot of y showing two outlier points — Output: boxplot of y before treatment.

Quartiles

We can use the quartiles method to compute the inter-quartile range.

Calculating q75 and q25.

python

q75, q25 = np.percentile(Outlier.y, [75 ,25])

Output of q75.

python

q75

14.5

Output of q25.

python

q25

5.5

Inter-Quartile Range.

python

iqr = q75-q25
iqr

We now create a benchmark using the InterQuartile Range for Outlier Treatment.

python

bench = q75 + (iqr*1.5)
bench

28.0

We use loc to identify the position of the outlier cell and replace it with our capping value.

python

Outlier.loc[Outlier.y > 28, 'y'] = bench
Outlier

Outlier dataset after capping the outlier value to 28 — Output: the Outlier dataset after capping.

We can use a boxplot to check visually if there are any outliers in the dataset or not.

python

Outlier.boxplot(column="y",vert=False)

Boxplot of y after capping, showing no outlier points — Output: boxplot of y after treatment.

We can see that there are no outliers present in the dataset.

Standard Deviation

Another method for outlier treatment is to use 'mean+3std' to replace outliers.

python

bench2 = Outlier['y'].mean() + 3*(Outlier['y'].std())
bench2

50.36484816687335

Here we can use the value 50.36484816687335 to cap the outliers.

There are various "miscellaneous" methods that come under preprocessing of data or data preparation. In this blog, we explored various such methods. Missing Value and Outlier Treatment act as the most important steps towards making the data clean and usable for various modeling algorithms. Various methods of consolidating datasets were also explored here, which are highly useful to make the analysis more meaningful. Data Preprocessing requires Feature Engineering, and there are multiple methods of performing feature engineering; the application (in Python) of all such methods has been explored in the next blog.