// application · python
Miscellaneous Methods in Python
Before we perform modeling on our datasets, we need to prepare the dataset first. This pre-processing of data includes various steps which were explored in Miscellaneous Methods under the section Data Exploration and Preparation. In this blog, we will explore the application of various data preparation methods in Python. These methods include the following:
- Consolidation of Data Sets
- Missing Value Treatment
- Outlier Treatment
Note that univariate and bivariate analysis, which are methods of data exploration (and were discussed in the theory section under Data Exploration and Preparation), have already been discussed in the blogs related to the application of descriptive and inferential statistics, as these methods of analysis use various descriptive and inferential statistics.
Consolidation of Data Sets
There are mainly two ways of consolidating datasets: appending and merging.
Importing preliminary libraries: We first start off by adding some important libraries that will help us in working.
import pandas as pd import numpy as np import matplotlib %matplotlib inline
Appending
Appending of datasets is concatenating or stacking datasets either row-wise or column-wise (whereas merging of datasets is combining datasets based on a common variable).
Import Dataset: Here we have taken two datasets of cars sold by a dealer. (Please note that all datasets used in the sections of coding are hypothetical. If any dataset has been taken from an outside source, the link of the source will be mentioned for that particular dataset.) Note that the variable names should be the same while appending datasets.
We begin with importing the first car dataset.
CarData1 = pd.read_excel("C:/Users/user/Desktop/Data Sets/Car_Data1.xls")
CarData1
We now upload the second car dataset.
CarData2 = pd.read_excel("C:/Users/user/Desktop/Data Sets/Car_Data2.xls")
CarData2
Appending Datasets Row-wise
Here we append the datasets by concatenating the two datasets row-wise. We use the following command to append the two datasets - CarData1 and CarData2.
row_concat = pd.concat([CarData1,CarData2]) row_concat

Exporting output of the appended dataset: We can use the to_excel command to export the results to an excel file.
row_concat.to_excel("C:/Users/user/Desktop/Data Sets/row_concat.xls")Appending Datasets Column-wise
Here we append datasets by concatenating the two datasets column-wise. Here we use the function concat from the pandas library where we set the axis to 1, as by mentioning 'axis=1' we tell the program to concatenate datasets column-wise. (Note that the R equivalent of pandas concat is cbind.)
col_concat = pd.concat([CarData1,CarData2],axis=1) col_concat

Exporting output of the appended dataset: The way we did before, we use to_excel to export the output to excel.
col_concat.to_excel("C:/Users/user/Desktop/Data Sets/col_concat.xls")Renaming Columns: We rename the columns so as to easily differentiate between the two datasets.
col_concat.columns=['Car_Name1','Sales in 1000s','Car_Name2','Sales in 1000s'] col_concat

Merging
Merging of datasets is mainly used to combine datasets with common values. We use joins to merge datasets based on our requirement. First, we will do the normal merging of datasets. This is equivalent to appending of datasets column-wise. In merging, at least one variable has to be common in both datasets, on the basis of which merging will take place.
Importing Dataset: We import a dataset where, unlike CarData1 and CarData2 (where we had the sales information of the car), here we will have a Price variable.
CarData3 = pd.read_excel("C:/Users/user/Desktop/Data Sets/Car_Data3.xls")Simple Merge: We use the merge function from pandas. The following command is equivalent to appending the datasets column-wise.
Merge1 = pd.merge(CarData1,CarData3,on='Car Name',sort=False) Merge1

Ways of Joining Tables
As mentioned in Consolidation of Datasets, there are various ways of joining two datasets and we will be exploring some of them here.
Import Datasets: In order to explain different joins, we will be using two datasets of a store. One has information on the product bought by the customer, and the other dataset has the city of residence of that customer.
Importing the first Store dataset.
StoreData1 = pd.read_excel("C:/Users/user/Desktop/Data Sets/Store_Data1.xls")
StoreData1
Importing the second Store dataset.
StoreData2 = pd.read_excel("C:/Users/user/Desktop/Data Sets/Store_Data2.xls")
StoreData2
We can now perform the various methods of merging these two datasets. In this blog, we will perform Inner Join, Outer Join, Left Join and Right Join.
Inner Join
An inner join is used to find the common information between the two datasets, i.e. it is the intersection of two datasets. Here it takes the common entries of the customers from both the datasets and makes a table with their corresponding entries.
Inner1 = pd.merge(StoreData1,StoreData2,how='inner',on='Cust_ID') Inner1

Outer Join
Outer Join takes all the entries of both the datasets even if the entry for a variable is missing for that customer in either of the datasets. Basically, it is the union of both the datasets.
Outer1 = pd.merge(StoreData1,StoreData2,how='outer',on='Cust_ID') Outer1

Left Join
Left join gives you the details of customers in dataset 1 and their corresponding entries from dataset 2. It does not take any entry of the customer that is present in dataset 2 and not in dataset 1.
Left1 = pd.merge(StoreData1,StoreData2,how='left',on='Cust_ID') Left1

Right Join
Right Join is similar to the left join. While the Right Join takes all entries of dataset 2 and corresponding entries of dataset 1. In the left join, it's the other way around.
Right1 = pd.merge(StoreData1,StoreData2,how='right',on='Cust_ID') Right1

Missing Value Treatment
The next main step of data preparation is missing value treatment. The data at hand may or may not have missing values. But if the dataset has missing values, then we will have to treat them first, either by removing them or replacing them with the mean, median or mode, depending on the type of the variable.
Import Dataset: We use a car sales dataset having multiple variables and perform missing value treatment on it.
carsales = pd.read_csv("C:/Users/user/Desktop/Data Sets/car_sales.csv")Checking Missing Values: Before treating the missing values, we need to check if the data has missing values or not. In order to check for the missing values, we use the isnull function in Python. Here we count the missing values and take a sum of them so that for each variable we know the number of missing values present in them.
carsales.isnull().sum()
Methods of Treating Missing Values
Now the missing values can be either in the continuous or categorical variables. There are mainly two methods to deal with missing values, which are discussed below.
Remove the missing values: Here we use the dropna command to drop the observations where we have missing values. As discussed in Missing Value Treatment (under the Theory section), this method causes loss of valuable data. However, sometimes the datasets have very few missing values, therefore we can remove those observations by using this method.
carsales1 = carsales.dropna()
We can now again use the isnull function to see if there are any missing values in the dataset or not.
carsales1.isnull().sum()
Mean, Median and Mode Imputation: In this method, we impute the missing values with the mean, median or mode of their respective variable.
Mean Imputation
We impute the missing values with the mean of their variable.
Import Dataset: We import a dataset having missing values in a continuous variable.
CAR1 = pd.read_excel("C:/Users/user/Desktop/Data Sets/CAR1.xls")
Performing Mean Imputation: We now perform mean imputation using the fillna function, which helps us replace the missing values with the mean.
CAR1 = CAR1['X__year_resale_value'].fillna(CAR1['X__year_resale_value'].mean())
Median Imputation
We impute the missing values with the median value of their variable.
Import Dataset: We import a dataset having missing values in a continuous variable.
CAR2 = pd.read_excel("C:/Users/user/Desktop/Data Sets/CAR1.xls")Performing Median Imputation: We perform median imputation using the fillna function.
CAR2 = CAR2['X__year_resale_value'].fillna(CAR2['X__year_resale_value'].median()) CAR2
Mode Imputation
We impute the missing values of a categorical variable with its mode.
Import Dataset: We import a dataset having missing values in a categorical variable.
CAR3 = pd.read_excel("C:/Users/user/Desktop/Data Sets/CarTypeData.xls")
Import Library: We import the statistics library for performing mode imputation.
import statistics
Calculating Mode: We use the statistics library to calculate the mode of variables. For example, we calculate the mode of the variable 'Body Style'.
mode = statistics.mode(CAR3['Body Style']) mode
We find that the mode comes out to be 'Sedan', which means that we should replace all the missing values of the 'Body Style' variable with 'Sedan'.
Performing Mode Imputation: We perform mode imputation using the fillna function.
CAR3 = CAR3['Body Style'].fillna(CAR3['Body Style'].mode()[0])
There are other sophisticated methods, such as using KNN, Linear and Logistic Models, to predict the missing values and replace the missing values with them. All such models have been discussed in the Application part of the Modeling section.
Outlier Treatment
Apart from Missing Value Treatment, the other most important and crucial preprocessing/data preparation step is Outlier Treatment. We have found under the Application of Basic Statistics how outliers can affect the dataset, therefore it is very important to do outlier treatment.
Import Dataset: We import a dataset that contains an outlier.
Outlier = pd.read_excel("C:/Users/user/Desktop/Data Sets/Outlier 1.xls")
Outlier
Method of Treating Outliers: In order to do outlier treatment, we calculate a benchmark which will replace the outliers. For this, we take the help of the following:
Boxplot
We can create boxplots and identify the outliers. We can also use them to decide the value at which we might want to cap the outliers.
Outlier.boxplot(column="y",vert=False)

Quartiles
We can use the quartiles method to compute the inter-quartile range.
Calculating q75 and q25.
q75, q25 = np.percentile(Outlier.y, [75 ,25])
Output of q75.
q75
Output of q25.
q25
Inter-Quartile Range.
iqr = q75-q25 iqr
We now create a benchmark using the InterQuartile Range for Outlier Treatment.
bench = q75 + (iqr*1.5) bench
We use loc to identify the position of the outlier cell and replace it with our capping value.
Outlier.loc[Outlier.y > 28, 'y'] = bench Outlier

We can use a boxplot to check visually if there are any outliers in the dataset or not.
Outlier.boxplot(column="y",vert=False)

We can see that there are no outliers present in the dataset.
Standard Deviation
Another method for outlier treatment is to use 'mean+3std' to replace outliers.
bench2 = Outlier['y'].mean() + 3*(Outlier['y'].std()) bench2
Here we can use the value 50.36484816687335 to cap the outliers.
There are various "miscellaneous" methods that come under preprocessing of data or data preparation. In this blog, we explored various such methods. Missing Value and Outlier Treatment act as the most important steps towards making the data clean and usable for various modeling algorithms. Various methods of consolidating datasets were also explored here, which are highly useful to make the analysis more meaningful. Data Preprocessing requires Feature Engineering, and there are multiple methods of performing feature engineering; the application (in Python) of all such methods has been explored in the next blog.
TM