
// pillar 01 · foundation

Basic Statistics

The statistical concepts you need to understand before you can responsibly play with data: Descriptive and Inferential Statistics, the building blocks for how any data analysis gets done.

Summary statistics and chart fragments representing the descriptive and inferential foundations of data analysis
stage 01 / 03 · concept introduction

In this section, I'll cover the basics of Descriptive and Inferential Statistics, which are the building blocks for understanding how data analysis is performed.

Let me start from the beginning and explain what data is.

Data, in very simple terms, is information that has been recorded in a form that can be processed. When a computer stores and processes this information, the resulting "Computer Data" is what Data Analysis works with. At the most rudimentary level, this data is made up of binary digits: 0 and 1. In today's world, it takes many forms, such as text documents, images, videos, software, and more.

With the revolution in computing storage and processing power, the amount of data being generated is humongous, and often there is a need to find some method in the madness. That's where Data Analysis kicks in.

Before knowing the type of analysis that can be done on the data, it's imperative to understand the types of data that exist. Broadly, there are two types of data: Qualitative (Categorical) and Quantitative (Numerical).

Figure. Two broad types of data: qualitative (categorical) and quantitative (numerical)

Qualitative Data, also known as Categorical Data, is generally non-numeric in nature. This kind of data describes qualities or labels, usually as words, and so cannot be measured on a numeric scale. Examples of Qualitative Data include Location, Gender, Colour, and Shape.

Qualitative Data can be of three types: Binary, Nominal, and Ordinal.

  • Binary Data contains only two mutually exclusive categories. An example is the result of a coin toss, which is either Heads or Tails.
  • Nominal Data has no inherent order (just like Binary Data) but contains more than two categories. The categories are mutually exclusive, and none is inherently better than another. An example is Colours, where no colour is better than any other. Note that Nominal Data may be coded as numbers, such as 1 for Red and 2 for Blue, but these numbers are merely labels and carry no weight: 1 is not "better" than 2, and the difference or "distance" between the two numbers has no meaning.
  • Ordinal Data is where the categories can be placed in a systematic, logical order, but the values still carry no weight. Examples are the Top 5 Poorest Countries, or clothing sizes (Small, Medium, Large, and so on). The order tells us which value ranks higher, but not how much "distance" there is between the values.
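
As a minimal sketch of the distinction above, the Python snippet below treats colour codes as pure labels and gives clothing sizes an explicit rank. The names `colour_codes`, `size_order`, and `size_rank`, and all the values, are illustrative assumptions rather than part of any library.

```python
from collections import Counter

# Nominal: the numeric codes are labels only (hypothetical coding).
colour_codes = {"Red": 1, "Blue": 2, "Green": 3}

# For nominal data, the meaningful operations are equality and counting.
observed_colours = ["Red", "Blue", "Red"]
colour_counts = Counter(observed_colours)

# Ordinal: the categories carry a rank, but no measurable distance.
size_order = ["Small", "Medium", "Large"]  # assumed ordering
size_rank = {size: rank for rank, size in enumerate(size_order)}

shirts = ["Large", "Small", "Medium", "Small"]

# Sorting and comparing by rank is meaningful for ordinal data.
sorted_shirts = sorted(shirts, key=lambda size: size_rank[size])
large_outranks_small = size_rank["Large"] > size_rank["Small"]

print(colour_counts)
print(sorted_shirts)
print(large_outranks_small)
```

Subtracting ranks (for example, Large minus Small) would run without error, but the result would not be a real "distance", which is exactly the limitation of ordinal data.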

Figure. Quantitative data, classified two independent ways: by precision (interval/ratio) and by divisibility (continuous/discrete)

Quantitative Data is numerical in nature and, as the name suggests, is the kind of data that can be quantified. We can further divide Quantitative Data into two sub-categories: Interval and Ratio. Both these classes of data have weight and contain information about their relative value.

  • Interval Data is, in a way, like Ordinal Data, with the difference that the intervals between the values are equally spaced, but there is no true zero. An example is temperature in degrees Celsius, where the gap between values is equal and quantifiable, but 0°C does not mean an absence of temperature.
  • Ratio Data includes an absolute (true) zero, where zero means a complete absence of the quantity. An example is the height or weight of people, where 0 really means none of the quantity and ratios between values are meaningful.
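
One way to see why a true zero matters is to check whether a ratio survives a change of units. The sketch below uses made-up temperatures and heights and the standard Celsius-to-Fahrenheit conversion; the function and variable names are assumptions chosen for illustration.

```python
def celsius_to_fahrenheit(celsius):
    """Standard conversion between two interval scales."""
    return celsius * 9 / 5 + 32


# Interval data: hypothetical temperatures in degrees Celsius.
cool_c, warm_c = 10.0, 20.0
ratio_in_celsius = warm_c / cool_c
ratio_in_fahrenheit = celsius_to_fahrenheit(warm_c) / celsius_to_fahrenheit(cool_c)

# Ratio data: hypothetical heights in centimetres, then in metres.
child_cm, adult_cm = 90.0, 180.0
ratio_in_cm = adult_cm / child_cm
ratio_in_m = (adult_cm / 100) / (child_cm / 100)

print(ratio_in_celsius, ratio_in_fahrenheit)  # the two ratios disagree
print(ratio_in_cm, ratio_in_m)                # the two ratios agree
```

The temperature "ratio" changes with the unit, so saying 20°C is "twice as hot" as 10°C is not meaningful, while the adult really is twice as tall as the child in any unit of length.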

Quantitative Data can also be categorized into two types: Continuous and Discrete.

  • Continuous Data can take any value within its range, including fractional values, such as height or temperature.
  • Discrete Data can take only separate, fixed values within its range, typically counts, for example, the number of students in a class.

Figure. A population, with a smaller sample drawn from it for analysis

To understand the type of analysis that can be done on the various kinds of data mentioned above, the prerequisite is to have a brief understanding of what we mean by Statistics, Population, and Sample.

Generally, the first thought that comes to mind when someone says Population is the number of individuals in a country, while Sample means a small part of this population used to represent the entire population. If you have this understanding of Population and Sample, you're not very far from the actual meaning of these terms in Statistics.

Population means every member of the particular group under study, taken as a whole. It covers everyone and everything that is the subject of a statistical observation. It's important to note that a population doesn't have to be large; it can be as small as 2 individuals if they make up the whole of the category in question. For example, imagine a manufacturer produced only 3 units of a special limited-edition car. If we want to find the average distance covered by that model, our population is just those 3 cars and no others.

While Population represents the whole, a Sample is a subset of this Population. Analysis is usually carried out on the sample, and the values computed from it are known as Statistics. For example, a mean computed from the whole Population is a Parameter, while a mean computed from a Sample is a Statistic.
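
The limited-edition car example can be sketched in a few lines of Python. The distances below are invented for illustration, and the variable names are assumptions; only `statistics.mean` and `random.Random.sample` come from the standard library.

```python
import random
from statistics import mean

# Hypothetical distances (km) covered by the only 3 cars of the model.
population_km = [12000, 15000, 9000]

# Mean over the whole population: a Parameter.
parameter_mean = mean(population_km)

# Mean over a random subset of the population: a Statistic.
rng = random.Random(0)  # fixed seed so the draw is repeatable
sample_km = rng.sample(population_km, 2)
statistic_mean = mean(sample_km)

print("parameter:", parameter_mean)
print("sample:", sample_km, "statistic:", statistic_mean)
```

The statistic generally differs from the parameter and changes from one sample to the next, which is why the choice of sampling method matters.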

As it's often very difficult to measure every individual in a population, various methods for selecting samples are used, such as:

  • Simple Random Sampling: here, random means unbiased, so that every member of the population has an equal opportunity of being selected in the sample. It's often used in the analysis of customer satisfaction.
  • Representative (Stratified) Sampling: this is also random, but it follows the same proportions and patterns present in the actual population, so that the sample can represent the larger population in terms of specific characteristics. An example would be generating a representative sample of the population of Mumbai, where 100 people are randomly selected, making sure that of these 100, 54 are male and 46 are female, with members in each category selected randomly. This way, gender gets represented the way it exists in the population. (As per Mumbai City District 2011 Census Data)
  • Convenience Sampling: this kind of sampling is done keeping ease of access and people's willingness to participate in mind. It's the sort of sampling we often encounter day to day, with company representatives handing out surveys to fill out. Convenience Sampling is not inherently wrong, but it is prone to selection bias, so its results can be generalized only to the extent that the sample actually represents the population of interest.
  • Cluster Sampling: here, the population is divided into groups (clusters) that are broadly similar to one another, and a random selection of clusters is sampled instead of drawing individuals from the whole population. It is generally used in marketing research and for exit polls. An example would be predicting Delhi's election results, where we divide Delhi into 6 zones, divide each zone into 3 localities, and then sample randomly from any 2 blocks in every locality.
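
Three of the methods above can be sketched with Python's standard `random` module. Everything here is an assumption made for illustration: the synthetic population of 1,000 people, its 54/46 gender split, the 10 equally sized blocks, and the helper names. The cluster helper is a single-stage version, simpler than the multi-stage Delhi example.

```python
import random

rng = random.Random(42)  # fixed seed so the draws are repeatable

# Hypothetical population: 1,000 people, 54% male, spread over 10 blocks.
population = [
    {"id": i, "gender": "M" if i < 540 else "F", "block": i % 10}
    for i in range(1000)
]


def simple_random_sample(people, n):
    """Every person has the same chance of being selected."""
    return rng.sample(people, n)


def stratified_sample(people, n, key):
    """Sample within each stratum in proportion to its population share.

    Rounding can make the total differ slightly from n in general.
    """
    strata = {}
    for person in people:
        strata.setdefault(person[key], []).append(person)
    sample = []
    for members in strata.values():
        k = round(n * len(members) / len(people))
        sample.extend(rng.sample(members, k))
    return sample


def cluster_sample(people, n_clusters, key):
    """Pick whole clusters at random and keep everyone inside them."""
    clusters = sorted({person[key] for person in people})
    chosen = set(rng.sample(clusters, n_clusters))
    return [person for person in people if person[key] in chosen]


srs = simple_random_sample(population, 100)
strat = stratified_sample(population, 100, "gender")
clus = cluster_sample(population, 2, "block")

males_in_strat = sum(1 for person in strat if person["gender"] == "M")
print(len(srs), males_in_strat, len(strat) - males_in_strat, len(clus))
```

The stratified sample reproduces the 54/46 split by construction, while the simple random sample only approximates it on average.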

With this understanding in place, we're good to go and dive into the world of statistics, where I'll be discussing the basic statistics that are required to perform any advanced analysis on data.

Chalkboard covered in handwritten statistical formulas and distributions
stage 02 / 03 · theory

Understand the equations first.

This section discusses the equations and distributions involved in calculating the statistics we use to analyze data. From calculating a simple arithmetic mean to comparing the means of two data sets, it covers how the calculations are performed, what inferences we can draw from them, and how those inferences feed into more sophisticated kinds of data analysis. We can always write one line of code and memorize shortcuts for what a given value of a statistic indicates, but to understand deeply what we are doing and, most importantly, why we are doing it, we need to understand the theory behind it.

Mean · Median · Mode · Variance · Outliers · Skewed Data · Parameter · Statistic
Explore Theory
Code running on a laptop screen, representing applied statistics in Python and R
stage 03 / 03 · application

Then put it to work.

In this age of computers, it makes sense to apply this formal knowledge through machines, which produce results faster and with fewer manual errors. With the theory understood, we can use computers to their full potential: basic statistics can be applied to very large datasets whose calculations would be complex and time-consuming if performed by hand. In today's world, it is of paramount importance to strike the right balance between behind-the-scenes knowledge and knowledge of the application. Various software can be used for this; in this section, I discuss code in languages like R and Python, through which the basic statistics from the theory section can be applied to very large datasets, often with a single line of code.
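
To give a flavour of what "a single line of code" looks like, here is a minimal sketch using Python's built-in `statistics` module rather than NumPy or Pandas. The heights are invented, and the variable names are assumptions for illustration.

```python
import statistics

# Hypothetical heights in centimetres.
heights_cm = [150, 160, 160, 170, 185]

# Each summary statistic is one line.
mean_height = statistics.mean(heights_cm)
median_height = statistics.median(heights_cm)
mode_height = statistics.mode(heights_cm)

# Variance treating the values as a sample (divides by n - 1) ...
sample_variance = statistics.variance(heights_cm)
# ... and treating them as the whole population (divides by n).
population_variance = statistics.pvariance(heights_cm)

print(mean_height, median_height, mode_height)
print(sample_variance, population_variance)
```

Knowing why the two variance functions give different answers is the kind of behind-the-scenes knowledge that the theory stage is meant to supply.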

Python · R · NumPy · Pandas · read_excel · groupby · Aggregation · Bar Chart
Explore Applications