// miscellaneous methods

Consolidation of Datasets

Among the first steps of preprocessing is to consolidate the datasets so that modeling can be done, as often the required data can be scattered in various datasets and needs to be consolidated. However before any such consolidation can be done, the variables need to be identified. The variable identification process first requires identifying the Predictor and the Target variable. Here the predictor variable is the Input and the target is the Output. For example, we have a dataset where the variables are the subject’s income, amount of pending debt, whether owns a house or not, whether is married or not and if the subject has been unable to repay the loan or not. Here the Target variable is whether the subject has been unable to repay the loan or not while others can be used as input or the predictor variables. The identification of the target variable is crucial in consolidating the datasets.

The datasets can be consolidated in many ways, here two methods will be discussed: Appending and Merging.

Appending Datasets

Appending is a method of consolidating datasets where the data from various datasets can be combined vertically. Appending is required generally when the same data is spread across various datasets. A typical scenario is when the data is stored in spreadsheets and as it has a very limited capacity to store, multiple spreadsheets may be there and their data is required to be appended or ‘concatenated’ so that any analysis can be done.

Dataset A and Dataset B stacked vertically to form a combined dataset A over B — Appending stacks datasets vertically, one on top of the other.

Appending highly relies on the name of the variables where the data from each variable is “stacked” on each other to create an appended dataset. One must be vigilant that in certain software, the appending process is case sensitive and if you have two datasets wherein dataset A there are three variables - ‘NAME’, ‘INCOME’, and ‘DOB’ - while in dataset B the three variables are - ‘Name’, ‘INCOME’, and ‘DOB’. In reality the name variable has the same information, however as the process is case sensitive, the data under the variable ‘Name’ in dataset B will not get stacked under the data of variable ‘NAME’ in dataset A.

Also, the appending process requires the columns to be same and if there is an extra variable in either of the datasets, the appending process may need to be ‘forced’.

Dataset A with Name, Income, DOB columns and Dataset B with the same columns plus an extra Age column — Dataset A and Dataset B: B has an extra column (Age) not present in A.

Consolidated dataset showing rows from A with a blank Age column, stacked with rows from B and their Age values — The consolidated (appended) dataset, with a blank Age column for the rows that came from Dataset A.

Merging Datasets

Merging is a way of consolidating datasets where the datasets are combined horizontally. Here the variables are added to the existing observations, i.e. transferring columns from one dataset to another (unlike appending where rows are transferred), where the datasets are added to the side of one another.

Dataset A and Dataset B placed side by side to form a combined dataset A next to B — Merging places datasets side by side, combining columns.

It is important to note that at least one common variable should be there in both the datasets and may be required to have the same name in both the datasets (this precondition depends on the software you are using); also, multiple common variables can be used during the merging process.

Typical scenarios where merging is required are, for example, when we have customers where the demographic details of the patients are available in one dataset while the transaction details are available in another. Here the common variable can be the customer name or the ‘Customer ID’, and by using this common variable, the two datasets can be joined / combined / merged into one, and this dataset can be used for further analysis and modelling.

To understand the merging process and outcomes, it is crucial to understand the different types of relationship that the two datasets can share. There are mainly three types of such relationships: One to One, One to Many, and Many to Many.

One to One Relationship

One to One Relationship between two datasets is formed when for each value of a common variable in the first dataset there is only one matching value for that common variable in the second dataset.

Diagram showing values 1, 2, 3, 4 in one table mapping to exactly one matching value each in a second table — A one-to-one relationship: each value in the first dataset matches exactly one value in the second.

For example, if we have two datasets wherein dataset A - customer’s demographic detail is given - and in dataset B - there is transaction details of the customers - and we require to merge the datasets to have a complete dataset where the demographic and transaction details of each customer are provided. In this example, we can see that the two datasets share a one to one relationship.

Demographic table with Cust_ID and Income, and Transaction table with Cust_ID and Amount Paid — A Demographic dataset and a Transaction dataset sharing the common variable Cust_ID.

Merged table combining Cust_ID, Income and Amount Paid, with dashes where a Cust_ID is missing from one of the source tables — The merged table: each Cust_ID's Income and Amount Paid combined into one row.

One to Many Relationship

One to Many Relationship between two datasets is formed when for each value of a common variable in the first dataset there is more than one matching value for that common variable in the second dataset.

Diagram showing value 2 in the first table mapping to two separate matching values in the second table — A one-to-many relationship: a single value in the first dataset matches more than one value in the second.

For example, if we have two datasets wherein dataset A - customer’s demographic detail is given where for each unique customer ID their income is available - while in dataset B - transaction details are given which include multiple amounts paid and number of items purchased for each unique customer ID - and we require to merge the datasets to have a complete dataset where the demographic and transaction details of each customer are provided. In this example, we can see that the two datasets share a one to many relationship.

Demographic table with Cust_ID and Income, and Transaction table where some Cust_IDs repeat across multiple transaction rows — A Demographic dataset and a Transaction dataset where each customer can have multiple transactions.

Resulting merged demographic table with repeated rows for customers who had multiple transactions — The merged result: a customer's demographic data is repeated for each of their transactions.

Many to Many Relationship

In a many to many relationship, observations in each field (which are used to create a relationship between datasets) are included multiple times in each dataset, and we require to combine datasets horizontally where the output has all the combinations of the observations, i.e. the output is a cartesian product of rows. For example, if we have two tables with 3 and 2 records respectively, then using the Cartesian product, we have a table with 3 X 2 = 6 records.

Dataset A and Dataset B combined via a cartesian product to form a wider combined table — A cartesian product: every row of A is combined with every row of B.

In the below example we can see what happens when two datasets are used to form a Cartesian product.

Table A with two customer rows and Table B with two handler rows, combined into a four-row cartesian product table — Table A (2 customers) and Table B (2 handlers) combined into a 4-row cartesian product.

The problem with many-to-many relationship is that we get the cartesian product of rows, so we end up counting the same row more than once, causing duplications in the output dataset, which can result in incorrect results and may cost us computing resources such as space and processing speed etc.

Ways of Joining Tables

There are multiple ways to join a table, such as Inner Join, Left Join, Right Join, Full Join etc. We can understand all these different ways with the help of an example where we have two datasets: Dataset-A and Dataset-B. Now there are various ways of merging these two datasets.

Inner Join

Here only the common records from both the datasets are considered.

Left Join

All the observations that are present in Table-A, along with the observations that are common with Table-B, are considered. (A = 1, B = 0)

Right Join

All the observations that are present in Table-B, along with the observations that are common with Table-A, are considered. (A = 0, B = 1)

There are also outer joins that are sometimes used. There are four outer joins: Left Outer Join, Right Outer Join, Outer Join if Null, and Full Outer Join.

Left Outer Join

Only the observations unique to Dataset-A are considered. (A = 1)

Right Outer Join

Only the observations unique to Dataset-B are considered. (B = 1)

Outer Join (If Null)

Sometimes referred to simply as Outer Join, here only the unique observations are considered. (A = 0, B = 0)

Full Outer Join

Here all the observations from both datasets are considered. (A = 1, B = 1)

Eight Venn diagrams illustrating Two Data Sets, Inner Join, Left Join, Right Join, Left Outer Join, Right Outer Join, Outer Join If Null, and Full Outer Join — The different ways of joining two tables, illustrated as Venn diagrams.

Thus in merging, on the basis of one variable, information is looked for in another dataset, and by doing so information is added - here an addition of variables takes place.

It is important to remember that observations can be overwritten; for example, during a Left Join, observations will be overwritten to table A if there is more than one common variable in both the datasets. Therefore no common variables should be there in both the datasets except the variable through which data is being merged.

Consolidation of datasets is a common step taken during the preparation phase. Here we explored two ways of consolidating datasets - Appending, where observations are added to an existing dataset (or a new dataset is created altogether), and Merging, where variables are added to the dataset. Both methods are equally important and are put to use depending on the situation at hand. Once the datasets are consolidated, other pre-processing steps such as Missing Value Treatment, Outlier Treatment, and various Feature Engineering steps can be taken to make the data cleaner and compatible for a learning algorithm.