Data Exploration & Preparation

A model is only as good as the data it is trained on. This section covers the consolidation, cleaning, exploration and transformation that turn raw data into data a modeling algorithm can actually use.

stage 01 / 03: concept introduction

A model's accuracy depends heavily on the data it is trained on. If the data is not fit for the model, the model can fail when it meets unseen real-world data. The data therefore has to be modified until it is compatible with the modeling algorithm. This modification, or rather preparation, includes consolidation, cleaning, exploration and transformation of the data.

Before using a dataset to build a model, we first explore it to understand what it contains, which leads to better decisions during modeling. Some modeling algorithms also require data of a particular type or shape, and meeting that requirement means transforming various features of the data. Finally, data is almost always riddled with outliers and missing values, and leaving them untreated can make the model's output badly wrong.
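As a rough illustration of the two problems named above, the sketch below counts missing values, flags outliers with the common 1.5 * IQR rule, and fills the gaps with the median. The `ages` column and the helper names `summarize` and `impute_median` are hypothetical, and both the IQR rule and median imputation are just one reasonable choice among several. It uses only the Python standard library to stay self-contained.

```python
import statistics

# Hypothetical toy column: None marks a missing value, 250.0 is a likely outlier.
ages = [23.0, 25.0, None, 31.0, 29.0, 250.0, 27.0, None, 30.0]


def summarize(values):
    """Count missing entries and flag outliers with the 1.5 * IQR rule."""
    present = [v for v in values if v is not None]
    q1, _, q3 = statistics.quantiles(present, n=4)  # quartiles of observed values
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return {
        "n_missing": len(values) - len(present),
        "median": statistics.median(present),
        "outliers": [v for v in present if v < low or v > high],
    }


def impute_median(values):
    """Replace missing entries with the median of the observed values."""
    present = [v for v in values if v is not None]
    fill = statistics.median(present)
    return [fill if v is None else v for v in values]


summary = summarize(ages)
cleaned = impute_median(ages)
print(summary)
print(cleaned)
```

The median is used for filling because, unlike the mean, it is barely affected by the extreme value that is still present in the column.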

Together, these steps form a very important aspect of data analysis, Data Exploration and Preparation, which is the subject of this section.

This section, like the other three sections, is divided into two parts: Theory and Application. Theory explores why data needs this kind of preparation and how different kinds of modeling algorithms require the data to be prepared differently, along with many related aspects.

In Application, different datasets will be explored and prepared using Python and R.


stage 02 / 03: theory

Understand what the data needs first.

Data exploration and preparation is a collection of separate actions performed on the data. Here we explore the various methods of consolidating and treating the data to make it more usable for the algorithms. Another important aspect explored here is feature engineering, whose techniques are grouped into Transformation, Scaling, Construction and Reduction of features.
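To make the four groups of feature engineering concrete, here is a minimal sketch with one example per group. The records, the column names (`income`, `debt`, `country_code`) and the specific techniques chosen (log transform, z-score scaling, a ratio feature, dropping zero-variance columns) are assumptions for illustration; each group contains many other techniques.

```python
import math
import statistics

# Hypothetical toy records; the column names are illustrative only.
rows = [
    {"income": 30000.0, "debt": 6000.0, "country_code": 1.0},
    {"income": 45000.0, "debt": 4500.0, "country_code": 1.0},
    {"income": 120000.0, "debt": 60000.0, "country_code": 1.0},
    {"income": 52000.0, "debt": 13000.0, "country_code": 1.0},
]


def column(name):
    """Collect one feature across all records."""
    return [r[name] for r in rows]


def standardize(values):
    """Z-score scaling: subtract the mean, divide by the population std dev."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return [(v - mean) / sd for v in values]


# Transformation: a log compresses a right-skewed feature.
log_income = [math.log(v) for v in column("income")]

# Scaling: bring the transformed feature to mean 0 and unit variance.
scaled_income = standardize(log_income)

# Construction: derive a new feature from two existing ones.
debt_ratio = [r["debt"] / r["income"] for r in rows]

# Reduction: drop features that never vary and so carry no information.
kept = [name for name in rows[0] if statistics.pvariance(column(name)) > 0.0]

print(kept)
print(debt_ratio)
```

Dropping constant columns is the simplest form of reduction; projection methods such as principal component analysis belong to the same group but need a numerical library.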

Consolidation, Cleaning, Exploration, Transformation, Outliers, Missing Values, Scaling, Reduction



stage 03 / 03: application

Then prepare it in code.

In Application, Python and R are put to use to prepare a dataset from scratch. Here we explore a sample dataset using the various packages each language provides. The aspects covered in the Theory section are covered here as well, but the focus is on methods of application, with less stress on interpretation. The code provided here is quasi-universal and can be reused when exploring other datasets.
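As one possible picture of what reusable preparation code looks like, the sketch below profiles any CSV text column by column without knowing the column names in advance. The function name `profile`, the report fields and the sample data are hypothetical, and the sketch assumes well-formed rows (no row has more cells than the header).

```python
import csv
import io


def profile(csv_text):
    """Report row count, missing count and numeric-ness for every column."""
    reader = csv.DictReader(io.StringIO(csv_text))
    report = {
        name: {"rows": 0, "missing": 0, "numeric": True}
        for name in reader.fieldnames
    }
    for row in reader:
        for name, raw in row.items():
            cell = (raw or "").strip()
            report[name]["rows"] += 1
            if cell == "":
                # Empty cells count as missing and say nothing about the type.
                report[name]["missing"] += 1
                continue
            try:
                float(cell)
            except ValueError:
                report[name]["numeric"] = False
    return report


# Hypothetical sample; any CSV with a header row works the same way.
sample = "city,temp\nOslo,4.5\nCairo,\nLima,19\n"
result = profile(sample)
for name, stats in result.items():
    print(name, stats)
```

Because nothing in the function refers to a specific column, the same call works unchanged on a different dataset.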

Python, R, Packages, Cleaning, Transformation, Outliers, Missing Values, Datasets
