// miscellaneous methods

Missing Value Treatment

Missing value treatment is one of the most important steps in data pre-processing. It involves identifying missing values and treating them so that as little information as possible is lost. The data must be treated before it is used for modelling, because missing values can reduce the power of a model and bias the inferences drawn from it, often leading to wrong predictions and classifications.

Missing values arise for various reasons during the Data Extraction and Data Collection processes. Human error, such as a mistake by the person collecting or entering the data, can cause missing values. In data collected through a survey, values can be missing because a respondent skips a question, either because none of the response options is appropriate or because the question asks for information the respondent does not want to share.

Three Types of Missing Values

Missing values can be broadly categorized into three types: MCAR, MAR and NMAR.

MCAR: Missing Completely At Random is when the probability that a value is missing is the same for every observation. It depends neither on the observed values of other variables nor on the missing value itself.

MAR: Missing At Random is when the probability that a value is missing depends on the observed values of other variables, but not on the missing value itself. The missingness is therefore not purely random, but it can be explained by data that we do have.

NMAR: Not Missing At Random (also written MNAR, Missing Not At Random) is when the missingness has a pattern that depends on the missing value itself (the unobserved value), so that it cannot be fully explained by the other observed variables.
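
The three mechanisms can be illustrated with a small simulation. The sketch below is only an illustration under invented assumptions: the variables age and income, the relationship between them, and the missingness probabilities are all hypothetical. It hides income values under each mechanism and compares the mean of the remaining values with the true mean.

```python
import random
from statistics import mean

rng = random.Random(0)  # fixed seed so the sketch is repeatable

# Hypothetical data: income rises with age, plus random noise.
n = 20000
age = [rng.randint(20, 60) for _ in range(n)]
income = [1000 * a + rng.gauss(0, 5000) for a in age]

def observed_mean(values, missing):
    """Mean of the values that are not flagged as missing."""
    return mean(v for v, m in zip(values, missing) if not m)

# MCAR: every income has the same 30% chance of being missing.
mcar = [rng.random() < 0.3 for _ in range(n)]

# MAR: missingness depends only on age, which is always observed.
mar = [rng.random() < (0.6 if a >= 40 else 0.1) for a in age]

# NMAR: missingness depends on the income value that goes missing.
nmar = [rng.random() < (0.6 if v >= 40000 else 0.1) for v in income]

true_mean = mean(income)
bias = {
    name: observed_mean(income, mask) - true_mean
    for name, mask in [("MCAR", mcar), ("MAR", mar), ("NMAR", nmar)]
}
for name, value in bias.items():
    print(name, round(value))
```

In this setup the MCAR bias should be close to zero, while the MAR and NMAR cases understate the mean because higher incomes are hidden more often. The difference between the last two is that under MAR the cause (age) is available in the data and can be used to correct the estimate, whereas under NMAR it is not.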

Methods of Treating Missing Values

Ignoring and Discarding Data

There are two ways of discarding or deleting the missing values: Listwise and Pairwise.

Listwise Deletion: This is the simplest way of treating missing values: every record (row) that has a missing value in any variable is removed from the dataset. The biggest disadvantage of this method is the loss of data, which can be severe if a significant share of the records contain missing values.

Pairwise Deletion: Pairwise Deletion is also known as available-case analysis, because each analysis uses every record for which the variables involved in that analysis are present. For example, the correlation coefficient for each pair of variables is computed from all records in which both variables are observed, even if other variables in those records are missing. The advantage of this method is that less data is lost than with Listwise Deletion. One of its disadvantages is that different statistics are computed on different sample sizes, which makes them harder to compare.
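
The difference between the two deletion strategies can be sketched in a few lines. The records, the column names and the helper functions listwise, pairwise_cases and pearson below are hypothetical and written only for this illustration, with None standing for a missing value.

```python
from math import sqrt

# Hypothetical records; None marks a missing value.
rows = [
    {"age": 25, "income": 30, "cars": 1},
    {"age": 32, "income": None, "cars": 1},
    {"age": 41, "income": 52, "cars": None},
    {"age": 50, "income": 61, "cars": 2},
    {"age": None, "income": 45, "cars": 2},
    {"age": 38, "income": 48, "cars": 1},
    {"age": 29, "income": 35, "cars": None},
]

def listwise(rows):
    """Keep only rows with no missing value in any column."""
    return [r for r in rows if all(v is not None for v in r.values())]

def pairwise_cases(rows, x, y):
    """Keep (x, y) pairs where both are observed, ignoring other columns."""
    return [(r[x], r[y]) for r in rows if r[x] is not None and r[y] is not None]

def pearson(pairs):
    """Pearson correlation coefficient of a list of (x, y) pairs."""
    xs, ys = zip(*pairs)
    x_bar, y_bar = sum(xs) / len(xs), sum(ys) / len(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in pairs)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

print("listwise rows kept:", len(listwise(rows)))
for x, y in [("age", "income"), ("age", "cars"), ("income", "cars")]:
    pairs = pairwise_cases(rows, x, y)
    print(x, y, "n =", len(pairs), "r =", round(pearson(pairs), 3))
```

With this data, Listwise Deletion keeps 3 of the 7 records, while the pairwise correlations are based on 5, 4 and 4 records respectively, which shows both the smaller loss of data and the differing sample sizes.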

Mean / Median / Mode Imputation

In these methods of imputation, missing values are replaced by a summary statistic such as the mean, median or mode. The mean or median can be used for numerical variables, while the mode can be used for categorical variables. For example, suppose a dataset has the variables Name (name of the student), Marks (marks of the student) and Gender (gender of the student), and values are missing in the variable Marks. If the mean of all observed marks is used to replace every missing value, the method is called Generalized Imputation. If instead a separate mean is computed for the marks of males and the marks of females, and each missing value is replaced by the mean for that student's gender, the method is known as Similar Case Imputation.
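
The student example can be written out as a minimal sketch. The records and the function names generalized_impute and similar_case_impute are made up for illustration, and None stands for a missing mark.

```python
from statistics import mean

# Hypothetical student records; None marks a missing value.
students = [
    {"name": "A", "gender": "F", "marks": 80},
    {"name": "B", "gender": "F", "marks": 90},
    {"name": "C", "gender": "F", "marks": None},
    {"name": "D", "gender": "M", "marks": 60},
    {"name": "E", "gender": "M", "marks": 70},
    {"name": "F", "gender": "M", "marks": None},
]

def generalized_impute(rows):
    """Replace every missing mark with the mean of all observed marks."""
    overall = mean(r["marks"] for r in rows if r["marks"] is not None)
    return [dict(r, marks=overall if r["marks"] is None else r["marks"])
            for r in rows]

def similar_case_impute(rows, group="gender"):
    """Replace each missing mark with the mean of its own group.

    Assumes every group has at least one observed mark.
    """
    observed = {}
    for r in rows:
        if r["marks"] is not None:
            observed.setdefault(r[group], []).append(r["marks"])
    group_means = {g: mean(v) for g, v in observed.items()}
    return [dict(r, marks=group_means[r[group]] if r["marks"] is None else r["marks"])
            for r in rows]

print([r["marks"] for r in generalized_impute(students)])
print([r["marks"] for r in similar_case_impute(students)])
```

Here Generalized Imputation fills both gaps with the overall mean of 75, while Similar Case Imputation fills student C with the female mean of 85 and student F with the male mean of 65.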

Use of Prediction Models

Here a supervised model is used to estimate the values with which the missing values are imputed. The variable with missing values is treated as the target variable, and the dataset is divided into two parts: a training set made up of the records where the target is observed, and a test set made up of the records where the target is missing. A model is trained on the training set to predict the target from the other variables, and is then used to predict the missing values in the test set. Modelling techniques such as Linear Regression (for missing values in continuous variables) and Logistic Regression (for missing values in categorical variables) can be used. One major drawback of this method is that if there is little to no relationship between the target variable and the predictor variables, the missing values will not be predicted accurately; the method assumes that the attributes are related (correlated) to one another. Also, the imputed values follow the fitted model exactly, so they show less randomness / noise than genuinely observed values would.
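
As a rough sketch of this idea for a continuous variable, the example below fits a simple linear regression by ordinary least squares on the complete records and uses it to fill in the rest. The data (hours studied and marks) and the helper fit_line are hypothetical, and the data is deliberately noise-free so that the arithmetic is easy to follow. A real dataset would normally have several predictors and would use a modelling library.

```python
from statistics import mean

# Hypothetical data: (hours studied, marks); None marks a missing target.
data = [(1, 52), (2, 54), (3, None), (4, 58), (5, 60), (6, None)]

def fit_line(pairs):
    """Ordinary least squares for y = intercept + slope * x."""
    xs = [x for x, _ in pairs]
    ys = [y for _, y in pairs]
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in pairs)
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope

# "Train" set: records where the target is observed.
train = [(x, y) for x, y in data if y is not None]
intercept, slope = fit_line(train)

# "Test" set: records where the target is missing get the prediction.
imputed = [(x, y if y is not None else intercept + slope * x) for x, y in data]
print(imputed)
```

Every imputed value lies exactly on the fitted line, which is the loss of randomness / noise mentioned above.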

K-Nearest Neighbour as an Imputation Method

KNN is discussed in detail in the section on Supervised Modelling. The K-Nearest Neighbour algorithm can be used to estimate and substitute missing data. Here a missing value is estimated from the observations that are closest to it, based on the other features.

For example, suppose we have a dataset with three variables: Income, Age, and Number of Cars Owned, with missing values in the last variable (Number of Cars Owned). For an observation where the value is missing, KNN takes its Income and Age, finds the k observations closest to it on those two variables, and imputes the most common number of cars among those neighbours. Here we presume that people with similar income and age will own a similar number of cars. Thus the assumption behind using KNN for missing values is that a point's value can be approximated by the values of the points that are closest to it, based on other variables. The most frequent value among the k nearest neighbours can be used to impute missing values of a categorical variable, while the mean of the values of the k nearest neighbours can be used to impute missing values of continuous features.

However, this method is very time consuming, especially if the dataset is large, because distances to the other observations must be computed for every missing value. The choice of k is also critical to the imputed value. Cross-validation or experimental analysis may be needed to choose k and to evaluate the imputation, which further increases the time taken to complete this process.
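
The Income, Age and Number of Cars Owned example can be sketched as follows. The records and the function knn_impute are hypothetical. The sketch uses plain Euclidean distance on unscaled features, which is an assumption made for brevity. In practice, features measured on very different scales are usually rescaled first so that one of them does not dominate the distance.

```python
from collections import Counter
from math import dist
from statistics import mean

# Hypothetical records; None marks the missing value to impute.
people = [
    {"income": 30, "age": 25, "cars": 0},
    {"income": 32, "age": 27, "cars": 0},
    {"income": 35, "age": 30, "cars": 1},
    {"income": 80, "age": 45, "cars": 2},
    {"income": 85, "age": 50, "cars": 2},
    {"income": 90, "age": 48, "cars": 3},
    {"income": 83, "age": 47, "cars": None},
]

def knn_impute(rows, query, features, target, k, categorical):
    """Estimate query[target] from its k nearest complete rows."""
    complete = [r for r in rows if r[target] is not None]
    point = [query[f] for f in features]
    nearest = sorted(
        complete, key=lambda r: dist(point, [r[f] for f in features])
    )[:k]
    values = [r[target] for r in nearest]
    if categorical:
        # Most frequent value among the k neighbours.
        return Counter(values).most_common(1)[0][0]
    # Mean of the k neighbours for a continuous feature.
    return mean(values)

query = people[-1]
as_category = knn_impute(people, query, ["income", "age"], "cars", 3, True)
as_number = knn_impute(people, query, ["income", "age"], "cars", 3, False)
print(as_category, as_number)
```

With k = 3, the three nearest neighbours own 2, 2 and 3 cars, so the value imputed is 2 when Number of Cars Owned is treated as categorical and about 2.33 when it is treated as continuous. Changing k changes the neighbours and therefore the result, which is why the choice of k matters.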

Missing values are ubiquitous, especially when working with large datasets. As they can adversely affect the performance of the model, it is important to treat the data for missing values. The methods above range from the simple approach of dropping records to sophisticated methods employing predictive models, and the choice among them depends on how much accuracy is required. Before missing value imputation is performed, the data should also be treated for outliers, which has been discussed in the previous article.