Data Exploration & Preparation: Theory

This section explores techniques that improve the quality of data, which in turn lets us build more stable and reliable models.

A typical set of steps between obtaining a dataset and starting to build a model includes the following:

Exploration of Data: As soon as we get our hands on the dataset(s), we explore them. This includes finding out the number and types of features, computing descriptive statistics for each feature, and so on.
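As a rough illustration of this first look at a dataset, the sketch below reports the type of each feature and a few descriptive statistics using only the Python standard library. The toy records and the helper name describe are hypothetical, made up for this example; in practice a data analysis library would usually do this work.

```python
import statistics

# Hypothetical toy dataset: each dict is one observation.
rows = [
    {"age": 25, "income": 40000.0, "city": "Pune"},
    {"age": 32, "income": 52000.0, "city": "Delhi"},
    {"age": 47, "income": 61000.0, "city": "Pune"},
    {"age": 51, "income": 58000.0, "city": "Mumbai"},
]


def describe(rows):
    """Return the inferred type and summary statistics of every feature."""
    summary = {}
    for name in rows[0]:
        values = [row[name] for row in rows]
        if all(isinstance(v, (int, float)) for v in values):
            # Numeric feature: range and central tendency.
            summary[name] = {
                "type": "numeric",
                "min": min(values),
                "max": max(values),
                "mean": statistics.mean(values),
                "median": statistics.median(values),
            }
        else:
            # Anything else is treated as categorical here (an assumption).
            summary[name] = {
                "type": "categorical",
                "unique": len(set(values)),
                "mode": statistics.mode(values),
            }
    return summary


summary = describe(rows)
print(f"{len(rows)} rows, {len(rows[0])} features")
for name, stats in summary.items():
    print(name, stats)
```

Treating every non-numeric feature as categorical is a simplification; dates and free text would need their own handling.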

Consolidation of Data: We often need to build a model on data that does not yet 'look' the way it should. The required data is frequently not in one place but scattered across several datasets. For example, suppose we need to predict the sales of a shopping outlet, but the data is stored in spreadsheets that are limited to about a million rows and a few thousand columns. Because data is produced so rapidly today, the records may well be split across several spreadsheets, and these need to be appended into one dataset. In other cases, different information about the same subject is spread across different datasets, and these need to be merged. For example, to analyse customer behaviour we may have the customers' demographic details in one dataset and their transaction details in another. This is where the various methods of data merging come into play.
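The two situations above, appending rows that were split across files and merging datasets that describe the same customers, can be sketched as follows. All data and the helper name left_join are hypothetical, and the join assumes that each customer_id appears only once in the demographic dataset.

```python
# Hypothetical transaction records split across two spreadsheets.
sales_part1 = [
    {"customer_id": 1, "amount": 250.0},
    {"customer_id": 2, "amount": 90.0},
]
sales_part2 = [
    {"customer_id": 1, "amount": 40.0},
    {"customer_id": 3, "amount": 120.0},
]

# Hypothetical demographic details, one row per customer.
demographics = [
    {"customer_id": 1, "age": 34, "city": "Pune"},
    {"customer_id": 2, "age": 51, "city": "Delhi"},
]

# Appending: same columns, more rows.
transactions = sales_part1 + sales_part2


def left_join(left, right, key):
    """Add the columns of `right` to each row of `left` that shares `key`.

    Rows of `left` without a match keep None in the added columns.
    Assumes `key` is unique in `right`.
    """
    lookup = {row[key]: row for row in right}
    right_columns = [c for c in right[0] if c != key]
    merged = []
    for row in left:
        match = lookup.get(row[key])
        extra = {c: (match[c] if match else None) for c in right_columns}
        merged.append({**row, **extra})
    return merged


# Merging: same customers, more columns.
merged = left_join(transactions, demographics, "customer_id")
for row in merged:
    print(row)
```

A left join is only one choice; keeping only matched rows (an inner join) or all rows from both sides (an outer join) may suit other analyses better.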

Outlier and Missing Value Treatment: Once the dataset has the shape and size we want, we proceed to treat its outliers and missing values. Outliers often cause an algorithm to perform poorly, and for certain algorithms getting rid of them becomes important. However, outliers are not always erroneous observations and can be a genuine part of the original data, so they must be treated with caution: discarding observations without proper analysis can do more harm than good. Missing values, in turn, make it difficult to use all the features of the dataset for modeling. The methods for handling them range from very simple to considerably sophisticated.
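A minimal sketch of one simple option for each problem: median imputation for missing values, and the 1.5 × IQR rule for flagging outliers. Both techniques are chosen here only for illustration and are not the only or necessarily the best choice; the values are made up.

```python
import statistics

# Hypothetical feature with two missing entries (None) and one extreme value.
values = [12.0, 14.0, None, 13.0, 15.0, 120.0, 11.0, None, 14.0]

# Missing value treatment: replace None with the median of the observed values.
observed = [v for v in values if v is not None]
median = statistics.median(observed)
imputed = [median if v is None else v for v in values]

# Outlier detection: flag points more than 1.5 * IQR outside the quartiles.
q1, _, q3 = statistics.quantiles(imputed, n=4)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = [v for v in imputed if v < lower or v > upper]

print("imputed:", imputed)
print("flagged as outliers:", outliers)
```

The sketch only flags the extreme value rather than deleting it, in keeping with the caution above: whether to drop, cap, or keep a flagged observation depends on whether it is an error or a genuine data point.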

Feature Engineering: This is a blanket term for the various operations performed on the features of a dataset. Distance-based algorithms such as KNN require the data to be scaled for their results to be meaningful. Algorithms such as Linear Regression (fitted with the OLS method) can produce artificially good results when more features are used than necessary, and to counter this we apply feature reduction techniques. Some features may not make sense on their own or may not suit an algorithm, and need to be decomposed or combined to become more useful; for this we use feature construction techniques. In short, the features of a dataset usually need several modifications before they are suitable for an algorithm.
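To see why distance-based algorithms need scaled features, the sketch below compares Euclidean distances before and after min-max scaling. The three customers and the helper name min_max_scale are hypothetical, and min-max scaling is just one of several common scaling methods.

```python
import math

# Hypothetical customers as (age in years, annual income).
a = (25, 40000.0)
b = (30, 41000.0)  # similar age to a, income differs by 1000
c = (60, 40500.0)  # 35 years older than a, income differs by 500
points = [a, b, c]


def min_max_scale(points):
    """Rescale every feature (column) to the range [0, 1]."""
    columns = list(zip(*points))
    lows = [min(col) for col in columns]
    highs = [max(col) for col in columns]
    return [
        tuple(
            (v - lo) / (hi - lo) if hi > lo else 0.0
            for v, lo, hi in zip(point, lows, highs)
        )
        for point in points
    ]


scaled = min_max_scale(points)

# On the raw data, income dominates the distance because of its larger units.
raw_d_ab = math.dist(points[0], points[1])
raw_d_ac = math.dist(points[0], points[2])

# After scaling, both features contribute on a comparable footing.
scaled_d_ab = math.dist(scaled[0], scaled[1])
scaled_d_ac = math.dist(scaled[0], scaled[2])

print("raw:    d(a,b) =", raw_d_ab, " d(a,c) =", raw_d_ac)
print("scaled: d(a,b) =", scaled_d_ab, " d(a,c) =", scaled_d_ac)
```

On the raw values the nearest neighbour of a is c, purely because income is measured in larger numbers than age; after scaling it is b. A KNN model would therefore behave differently depending on whether the features were scaled.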

It is important to note that these steps can occur in a different order, be applied differently, or be supplemented by additional steps. In this section, we will explore the methods of data exploration and preparation mentioned above.

Data preparation tools: consolidation, exploration through univariate and bivariate analysis, and treatment of missing values and outliers


Miscellaneous Methods

This section contains blog posts addressing numerous aspects of data preparation and exploration. It covers the methods of consolidating datasets, the univariate and bivariate analyses that form the backbone of data exploration, and the methods for treating missing values and outliers. These methods serve different purposes: consolidation, missing value treatment, and outlier treatment belong to data preparation, while univariate and bivariate analysis belong to data exploration.

Consolidation of Datasets, Univariate & Bivariate Analysis, Outlier Treatment, Missing Value Treatment
