Miscellaneous Methods

Data exploration and preparation covers a range of actions performed on a dataset. These include consolidating datasets, univariate and bivariate analysis, and the treatment of missing values and outliers.

Two datasets merging into one, with an outlier point being swept aside

Univariate and bivariate analyses help us explore a dataset, either one feature at a time or two features together.

Consolidation is another important activity, in which separate datasets are combined by appending, merging, and similar operations.

Outlier treatment is another important step, in which outliers are identified and then treated. More sophisticated methods of identifying outliers are discussed under Anomaly Detection (Section 3, Modeling).

Missing values, as briefly mentioned in the introduction of the theory section, can be very harmful, and both simple and sophisticated methods exist to treat them. Any of these methods may need to be put into practice when preparing data for modeling; they are explored in the following blogs.

Three overlapping circles representing the consolidation of datasets
blog 01 / 04 · consolidation of datasets

Consolidation of Datasets

The two main methods of consolidating datasets are appending and merging. Appending combines datasets vertically (stacking rows), while merging combines them horizontally (adding columns by matching on a key). This blog also covers the types of relationship two datasets can share: One-to-One, One-to-Many, and Many-to-Many. Once the relationships are understood, the different ways of merging can be explored, including Inner Join, Left Join, Right Join, and Full Join, among others.
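
As a rough illustration of appending versus merging, here is a minimal sketch using pandas. The tables, column names, and values are hypothetical and chosen only to show how the row counts differ between join types; this is not taken from the linked blog.

```python
import pandas as pd

# Hypothetical toy tables; names and values are illustrative assumptions.
sales_q1 = pd.DataFrame({"customer_id": [1, 2], "amount": [100, 250]})
sales_q2 = pd.DataFrame({"customer_id": [2, 3], "amount": [75, 300]})
customers = pd.DataFrame(
    {"customer_id": [1, 2, 4], "city": ["Pune", "Delhi", "Agra"]}
)

# Appending: stack datasets that share the same columns vertically.
sales = pd.concat([sales_q1, sales_q2], ignore_index=True)

# Merging: combine datasets horizontally on a shared key.
# Many sales rows can point to one customer row (a many-to-one relationship);
# validate= asks pandas to raise an error if that assumption does not hold.
inner = sales.merge(customers, on="customer_id", how="inner",
                    validate="many_to_one")
left = sales.merge(customers, on="customer_id", how="left")
right = sales.merge(customers, on="customer_id", how="right")
full = sales.merge(customers, on="customer_id", how="outer")

# Row counts show what each join keeps.
for name, frame in [("inner", inner), ("left", left),
                    ("right", right), ("full", full)]:
    print(name, len(frame))
```

The inner join keeps only keys present in both tables, the left and right joins keep every row from one side and fill the other side with missing values, and the full join keeps every key from both.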

Appending · Merging · Inner / Left / Right / Full Join
A bar and pie chart representing univariate and bivariate analysis
blog 02 / 04 · univariate & bivariate analysis

Univariate & Bivariate Analysis

Descriptive and inferential statistics are used to explore the data and understand it better. In univariate analysis, each feature of the dataset is analyzed individually, with different descriptive statistics used for categorical and numerical features. Bivariate analysis deals with two features at a time, exploring the combinations numerical-numerical, categorical-categorical, and numerical-categorical.
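
The distinction can be sketched with pandas as follows. The dataset and its column names are made up for illustration, and the statistics chosen (summary statistics, frequency counts, Pearson correlation, a cross-tabulation, and group means) are only one reasonable option for each feature combination.

```python
import pandas as pd

# Hypothetical dataset; columns and values are illustrative assumptions.
df = pd.DataFrame({
    "age": [23, 35, 31, 52, 46, 29],
    "income": [28, 52, 45, 80, 70, 38],  # in thousands
    "gender": ["F", "M", "F", "M", "F", "M"],
    "purchased": ["no", "yes", "no", "yes", "yes", "no"],
})

# Univariate: one feature at a time.
age_summary = df["age"].describe()            # numerical: count, mean, std, quartiles
gender_counts = df["gender"].value_counts()   # categorical: frequency of each level

# Bivariate: two features at a time.
num_num = df["age"].corr(df["income"])                 # numerical-numerical: Pearson correlation
cat_cat = pd.crosstab(df["gender"], df["purchased"])   # categorical-categorical: contingency table
num_cat = df.groupby("purchased")["income"].mean()     # numerical-categorical: mean per group

print(age_summary)
print(gender_counts)
print(round(num_num, 3))
print(cat_cat)
print(num_cat)
```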

Univariate Analysis · Bivariate Analysis
A scatter plot on a chalkboard with an outlier point called out
blog 03 / 04 · outlier treatment

Outlier Treatment

Outlier treatment is among the most important and trickiest aspects of data pre-processing, as it can greatly affect the outputs of learning algorithms. Treatment methods include deleting observations that contain outliers, or identifying and replacing outliers using box plots, quartile ranges, quantiles/percentiles, standard deviation, etc. More sophisticated identification methods include clustering and the anomaly detection methods explored in the Modeling section.
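
A minimal sketch of two of these approaches, assuming the widely used 1.5 × IQR fences (the rule behind box-plot whiskers) and a standard-deviation cut-off. The data are invented, and the cut-off of 2 standard deviations is an assumption made for this tiny sample; 3 is a common choice for larger ones.

```python
import pandas as pd

# Hypothetical measurements with one obviously extreme value.
values = pd.Series([12, 14, 15, 13, 16, 14, 15, 13, 95])

# Quartile-range rule: flag points beyond 1.5 * IQR from the quartiles.
q1, q3 = values.quantile(0.25), values.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
is_outlier = (values < lower) | (values > upper)

# Treatment option 1: delete the observations flagged as outliers.
dropped = values[~is_outlier]

# Treatment option 2: replace outliers by capping them at the fences.
capped = values.clip(lower=lower, upper=upper)

# Standard-deviation rule: flag points far from the mean in units of std.
# The threshold of 2 is an assumption for this small sample.
z_scores = (values - values.mean()) / values.std()
z_outlier = z_scores.abs() > 2

print(values[is_outlier].tolist())
print(capped.max())
```

Note that the extreme value itself inflates the mean and standard deviation, which is one reason quartile-based rules are often preferred on small samples.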

Box Plots · Quartile Ranges · Standard Deviation
Puzzle pieces filling in gaps, representing missing value treatment
blog 04 / 04 · missing value treatment

Missing Value Treatment

The different types of missing values are explored here. Once missing values are identified, different treatment methods can be used to minimize their adverse effect on the model's performance. The most common methods are discarding the affected observations and mean/median/mode imputation. More sophisticated methods use prediction models such as Linear/Logistic Regression or, most commonly, KNN.
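
The simple treatments can be sketched with pandas as below. The dataset is hypothetical, and the pairing of each column with mean, median, or mode is an assumption made only to show all three in one place.

```python
import numpy as np
import pandas as pd

# Hypothetical dataset with gaps in numerical and categorical columns.
df = pd.DataFrame({
    "age": [25, np.nan, 40, 35, np.nan],
    "income": [30.0, 42.0, np.nan, 58.0, 50.0],
    "city": ["Pune", "Delhi", None, "Delhi", "Agra"],
})

# Identify: count missing values per column.
missing_per_column = df.isna().sum()

# Treatment option 1: discard observations that have any missing value.
dropped = df.dropna()

# Treatment option 2: impute with a summary statistic of each column.
imputed = df.copy()
imputed["age"] = imputed["age"].fillna(imputed["age"].mean())           # mean
imputed["income"] = imputed["income"].fillna(imputed["income"].median())  # median
imputed["city"] = imputed["city"].fillna(imputed["city"].mode()[0])     # mode

print(missing_per_column)
print(len(dropped))
print(imputed)
```

Discarding rows is simple but here leaves only two of the five observations, which is why imputation is often preferred. For the KNN-based approach, scikit-learn's `KNNImputer` is one available implementation.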

Mean / Median / Mode Imputation · Prediction Models · KNN