home/data exploration & preparation/application/python

// application · python

Miscellaneous Methods in Python

Before we perform modeling on our datasets, we need to prepare the dataset first. This pre-processing of data includes various steps which were explored in Miscellaneous Methods under the section Data Exploration and Preparation. In this blog, we will explore the application of various data preparation methods in Python. These methods include the following:

Note that univariate and bivariate analysis, which are methods of data exploration (and were discussed in the theory section under Data Exploration and Preparation), have already been discussed in the blogs related to the application of descriptive and inferential statistics, as these methods of analysis use various descriptive and inferential statistics.

Consolidation of Data Sets

There are mainly two ways of consolidating datasets: appending and merging.

Importing preliminary libraries: We first start off by adding some important libraries that will help us in working.

python
import pandas as pd
import numpy as np
import matplotlib
%matplotlib inline

Appending

Appending of datasets is concatenating or stacking datasets either row-wise or column-wise (whereas merging of datasets is combining datasets based on a common variable).

Import Dataset: Here we have taken two datasets of cars sold by a dealer. (Please note that all datasets used in the sections of coding are hypothetical. If any dataset has been taken from an outside source, the link of the source will be mentioned for that particular dataset.) Note that the variable names should be the same while appending datasets.

We begin with importing the first car dataset.

python
CarData1 = pd.read_excel("C:/Users/user/Desktop/Data Sets/Car_Data1.xls")
CarData1
CarData1 dataset with Car Name and Sales in 1000s columns
Output: the CarData1 dataset.

We now upload the second car dataset.

python
CarData2 = pd.read_excel("C:/Users/user/Desktop/Data Sets/Car_Data2.xls")
CarData2
CarData2 dataset with Car Name and Sales in 1000s columns
Output: the CarData2 dataset.

Appending Datasets Row-wise

Here we append the datasets by concatenating the two datasets row-wise. We use the following command to append the two datasets - CarData1 and CarData2.

python
row_concat = pd.concat([CarData1,CarData2])
row_concat
row_concat dataset, CarData1 and CarData2 stacked row-wise
Output: row_concat.

Exporting output of the appended dataset: We can use the to_excel command to export the results to an excel file.

python
row_concat.to_excel("C:/Users/user/Desktop/Data Sets/row_concat.xls")

Appending Datasets Column-wise

Here we append datasets by concatenating the two datasets column-wise. Here we use the function concat from the pandas library where we set the axis to 1, as by mentioning 'axis=1' we tell the program to concatenate datasets column-wise. (Note that the R equivalent of pandas concat is cbind.)

python
col_concat = pd.concat([CarData1,CarData2],axis=1)
col_concat
col_concat dataset, CarData1 and CarData2 joined column-wise
Output: col_concat.

Exporting output of the appended dataset: The way we did before, we use to_excel to export the output to excel.

python
col_concat.to_excel("C:/Users/user/Desktop/Data Sets/col_concat.xls")

Renaming Columns: We rename the columns so as to easily differentiate between the two datasets.

python
col_concat.columns=['Car_Name1','Sales in 1000s','Car_Name2','Sales in 1000s']
col_concat
col_concat dataset with renamed columns
Output: col_concat with renamed columns.

Merging

Merging of datasets is mainly used to combine datasets with common values. We use joins to merge datasets based on our requirement. First, we will do the normal merging of datasets. This is equivalent to appending of datasets column-wise. In merging, at least one variable has to be common in both datasets, on the basis of which merging will take place.

Importing Dataset: We import a dataset where, unlike CarData1 and CarData2 (where we had the sales information of the car), here we will have a Price variable.

python
CarData3 = pd.read_excel("C:/Users/user/Desktop/Data Sets/Car_Data3.xls")

Simple Merge: We use the merge function from pandas. The following command is equivalent to appending the datasets column-wise.

python
Merge1 = pd.merge(CarData1,CarData3,on='Car Name',sort=False)
Merge1
Merge1 dataset, CarData1 merged with CarData3 on Car Name
Output: Merge1.

Ways of Joining Tables

As mentioned in Consolidation of Datasets, there are various ways of joining two datasets and we will be exploring some of them here.

Import Datasets: In order to explain different joins, we will be using two datasets of a store. One has information on the product bought by the customer, and the other dataset has the city of residence of that customer.

Importing the first Store dataset.

python
StoreData1 = pd.read_excel("C:/Users/user/Desktop/Data Sets/Store_Data1.xls")
StoreData1
StoreData1 dataset with Cust_ID and Product columns
Output: the StoreData1 dataset.

Importing the second Store dataset.

python
StoreData2 = pd.read_excel("C:/Users/user/Desktop/Data Sets/Store_Data2.xls")
StoreData2
StoreData2 dataset with Cust_ID and City columns
Output: the StoreData2 dataset.

We can now perform the various methods of merging these two datasets. In this blog, we will perform Inner Join, Outer Join, Left Join and Right Join.

Inner Join

An inner join is used to find the common information between the two datasets, i.e. it is the intersection of two datasets. Here it takes the common entries of the customers from both the datasets and makes a table with their corresponding entries.

python
Inner1 = pd.merge(StoreData1,StoreData2,how='inner',on='Cust_ID')
Inner1
Inner1 dataset, the inner join of StoreData1 and StoreData2
Output: Inner1.

Outer Join

Outer Join takes all the entries of both the datasets even if the entry for a variable is missing for that customer in either of the datasets. Basically, it is the union of both the datasets.

python
Outer1 = pd.merge(StoreData1,StoreData2,how='outer',on='Cust_ID')
Outer1
Outer1 dataset, the outer join of StoreData1 and StoreData2
Output: Outer1.

Left Join

Left join gives you the details of customers in dataset 1 and their corresponding entries from dataset 2. It does not take any entry of the customer that is present in dataset 2 and not in dataset 1.

python
Left1 = pd.merge(StoreData1,StoreData2,how='left',on='Cust_ID')
Left1
Left1 dataset, the left join of StoreData1 and StoreData2
Output: Left1.

Right Join

Right Join is similar to the left join. While the Right Join takes all entries of dataset 2 and corresponding entries of dataset 1. In the left join, it's the other way around.

python
Right1 = pd.merge(StoreData1,StoreData2,how='right',on='Cust_ID')
Right1
Right1 dataset, the right join of StoreData1 and StoreData2
Output: Right1.

Missing Value Treatment

The next main step of data preparation is missing value treatment. The data at hand may or may not have missing values. But if the dataset has missing values, then we will have to treat them first, either by removing them or replacing them with the mean, median or mode, depending on the type of the variable.

Import Dataset: We use a car sales dataset having multiple variables and perform missing value treatment on it.

python
carsales = pd.read_csv("C:/Users/user/Desktop/Data Sets/car_sales.csv")

Checking Missing Values: Before treating the missing values, we need to check if the data has missing values or not. In order to check for the missing values, we use the isnull function in Python. Here we count the missing values and take a sum of them so that for each variable we know the number of missing values present in them.

python
carsales.isnull().sum()
Manufacturer 0 Model 0 Sales_in_thousands 0 __year_resale_value 36 Vehicle_type 0 Price_in_thousands 2 Engine_size 1 Horsepower 1 Wheelbase 1 Width 1 Length 1 Curb_weight 2 Fuel_capacity 1 Fuel_efficiency 3 Latest_Launch 0 Power_perf_factor 2 dtype: int64

Methods of Treating Missing Values

Now the missing values can be either in the continuous or categorical variables. There are mainly two methods to deal with missing values, which are discussed below.

Remove the missing values: Here we use the dropna command to drop the observations where we have missing values. As discussed in Missing Value Treatment (under the Theory section), this method causes loss of valuable data. However, sometimes the datasets have very few missing values, therefore we can remove those observations by using this method.

python
carsales1 = carsales.dropna()

We can now again use the isnull function to see if there are any missing values in the dataset or not.

python
carsales1.isnull().sum()
Manufacturer 0 Model 0 Sales_in_thousands 0 __year_resale_value 0 Vehicle_type 0 Price_in_thousands 0 Engine_size 0 Horsepower 0 Wheelbase 0 Width 0 Length 0 Curb_weight 0 Fuel_capacity 0 Fuel_efficiency 0 Latest_Launch 0 Power_perf_factor 0 dtype: int64

Mean, Median and Mode Imputation: In this method, we impute the missing values with the mean, median or mode of their respective variable.

Mean Imputation

We impute the missing values with the mean of their variable.

Import Dataset: We import a dataset having missing values in a continuous variable.

python
CAR1 = pd.read_excel("C:/Users/user/Desktop/Data Sets/CAR1.xls")
CAR1 dataset with Car Name and X__year_resale_value columns, including missing values
Output: the CAR1 dataset.

Performing Mean Imputation: We now perform mean imputation using the fillna function, which helps us replace the missing values with the mean.

python
CAR1 = CAR1['X__year_resale_value'].fillna(CAR1['X__year_resale_value'].mean())
0 16.360000 1 193875.000000 2 18.225000 3 29.725000 4 22.255000 5 23.555000 6 39.000000 7 10791.750556 8 28.675000 9 36.125000 10 12.475000 11 13.740000 12 20.190000 13 13.360000 14 22.525000 15 27.100000 16 25.725000 17 18.225000 18 10791.750556 19 9.250000 Name: X__year_resale_value, dtype: float64

Median Imputation

We impute the missing values with the median value of their variable.

Import Dataset: We import a dataset having missing values in a continuous variable.

python
CAR2 = pd.read_excel("C:/Users/user/Desktop/Data Sets/CAR1.xls")

Performing Median Imputation: We perform median imputation using the fillna function.

python
CAR2 = CAR2['X__year_resale_value'].fillna(CAR2['X__year_resale_value'].median())
CAR2
0 16.360 1 193875.000 2 18.225 3 29.725 4 22.255 5 23.555 6 39.000 7 22.390 8 28.675 9 36.125 10 12.475 11 13.740 12 20.190 13 13.360 14 22.525 15 27.100 16 25.725 17 18.225 18 22.390 19 9.250 Name: X__year_resale_value, dtype: float64

Mode Imputation

We impute the missing values of a categorical variable with its mode.

Import Dataset: We import a dataset having missing values in a categorical variable.

python
CAR3 = pd.read_excel("C:/Users/user/Desktop/Data Sets/CarTypeData.xls")
CAR3 dataset with Car Name, Body Style and Sales in 1000s columns, including missing Body Style values
Output: the CAR3 dataset.

Import Library: We import the statistics library for performing mode imputation.

python
import statistics

Calculating Mode: We use the statistics library to calculate the mode of variables. For example, we calculate the mode of the variable 'Body Style'.

python
mode = statistics.mode(CAR3['Body Style'])
mode
'Sedan'

We find that the mode comes out to be 'Sedan', which means that we should replace all the missing values of the 'Body Style' variable with 'Sedan'.

Performing Mode Imputation: We perform mode imputation using the fillna function.

python
CAR3 = CAR3['Body Style'].fillna(CAR3['Body Style'].mode()[0])
0 Sedan 1 SUV 2 Hatch-Back 3 SUV 4 SUV 5 Hatch-Back 6 Sedan 7 Sedan 8 Sedan 9 Sedan 10 Hatch-Back 11 SUV 12 SUV 13 Sedan 14 Sedan 15 Sedan 16 Sedan 17 Sedan 18 Sedan 19 SUV Name: Body Style, dtype: object

There are other sophisticated methods, such as using KNN, Linear and Logistic Models, to predict the missing values and replace the missing values with them. All such models have been discussed in the Application part of the Modeling section.

Outlier Treatment

Apart from Missing Value Treatment, the other most important and crucial preprocessing/data preparation step is Outlier Treatment. We have found under the Application of Basic Statistics how outliers can affect the dataset, therefore it is very important to do outlier treatment.

Import Dataset: We import a dataset that contains an outlier.

python
Outlier = pd.read_excel("C:/Users/user/Desktop/Data Sets/Outlier 1.xls")
Outlier
Outlier dataset with x and y columns
Output: the Outlier dataset.

Method of Treating Outliers: In order to do outlier treatment, we calculate a benchmark which will replace the outliers. For this, we take the help of the following:

Boxplot

We can create boxplots and identify the outliers. We can also use them to decide the value at which we might want to cap the outliers.

python
Outlier.boxplot(column="y",vert=False)
Boxplot of y showing two outlier points
Output: boxplot of y before treatment.

Quartiles

We can use the quartiles method to compute the inter-quartile range.

Calculating q75 and q25.

python
q75, q25 = np.percentile(Outlier.y, [75 ,25])

Output of q75.

python
q75
14.5

Output of q25.

python
q25
5.5

Inter-Quartile Range.

python
iqr = q75-q25
iqr
9

We now create a benchmark using the InterQuartile Range for Outlier Treatment.

python
bench = q75 + (iqr*1.5)
bench
28.0

We use loc to identify the position of the outlier cell and replace it with our capping value.

python
Outlier.loc[Outlier.y > 28, 'y'] = bench
Outlier
Outlier dataset after capping the outlier value to 28
Output: the Outlier dataset after capping.

We can use a boxplot to check visually if there are any outliers in the dataset or not.

python
Outlier.boxplot(column="y",vert=False)
Boxplot of y after capping, showing no outlier points
Output: boxplot of y after treatment.

We can see that there are no outliers present in the dataset.

Standard Deviation

Another method for outlier treatment is to use 'mean+3std' to replace outliers.

python
bench2 = Outlier['y'].mean() + 3*(Outlier['y'].std())
bench2
50.36484816687335

Here we can use the value 50.36484816687335 to cap the outliers.

There are various "miscellaneous" methods that come under preprocessing of data or data preparation. In this blog, we explored various such methods. Missing Value and Outlier Treatment act as the most important steps towards making the data clean and usable for various modeling algorithms. Various methods of consolidating datasets were also explored here, which are highly useful to make the analysis more meaningful. Data Preprocessing requires Feature Engineering, and there are multiple methods of performing feature engineering; the application (in Python) of all such methods has been explored in the next blog.

ESC
100 pages indexed · Esc to close