// application · r

Inferential Statistics in R

Inferential statistics are used to draw inferences from the sample of a huge data set. Random samples of data are taken from a population, which are then used to describe and make inferences and predictions about the population. In the Theory section, various Inferential Statistics were explored and in this blog, all those inferential statistics will be put to use using R.

In this section, the following topics will be explored:

Z Scores & Z-Test
t-Tests
F-test
Correlation Coefficients
Chi-Square

Z Scores, Z-Test

Z-Scores are used to calculate the probability of a score occurring within our normal distribution. This helps us to compare scores of two or more different normal distributions.

Z Value

Import Dataset: We import a hypothetical data set having exam scores of students.

Z_ScoresData <- read_excel("C:/Users/user/Desktop/Data Sets/Marks_Scored.xls")
View(Z_ScoresData)

Z_ScoresData dataset with Student and Score columns — Output: the Z_ScoresData dataset.

Calculating Z Score

The following codes can be used to compute Z scores. As there is no inbuilt function to calculate z scores, therefore we will first calculate the parameters such as population standard deviation, population mean etc.

Population Standard Deviation.

Scores <- Z_ScoresData$Score
pop_sd <- sd(Scores)*sqrt((length(Scores)-1)/(length(Scores)))
pop_sd

[1] 8.06807

Population Mean.

pop_mean <- mean(Scores)
pop_mean

[1] 72.5

Z Score.

Z_Score <- (Scores-pop_mean)/pop_sd

Z score in Dataset: We now combine the calculated z scores column with our dataset.

New_ZScoreData <- cbind(Z_ScoresData,Z_Score)

New_ZScoreData dataset with an added Z_Score column — Output: New_ZScoreData with the Z_Score column.

Exporting the Data.

write.csv(New_ZScoreData,"C:/Users/user/Desktop/Data Sets/New_ZScore.csv")

Finding Percentage / Area Under the Curve

We can find out the percentage of people who scored above 70. We take the mean and standard deviation to calculate the area under the curve. Here we use the following code for finding the area under the curve, to find the percentage of students who scored more than 70 marks.

Plessthan70 <- pnorm(70,pop_mean,pop_sd)
Plessthan70

[1] 0.3783324

Finding the area of students score>70.

Pmorethan70 <- 1-Plessthan70
Pmorethan70

[1] 0.6216676

62.16% people scored more than 70 marks.

Z Test

Z-test is used to test whether the two datasets are similar or not.

Import Dataset: We take a population dataset and a random sample of the population dataset. We first import the population dataset.

Height1 <- read.csv("C:/Users/user/Desktop/Data Sets/Heightof200ppl.csv")
mean(Height1$Height..in.cm.)

[1] 164.075

We now import a sample dataset.

HeightSample <- read_excel("C:/Users/user/Desktop/Data Sets/HeightDataSample.xls")
var(HeightSample$`Height(in cm)`)

[1] 93.60324

Performing Z Test: Again, there is no inbuilt function in R for Z test, therefore we will create the function by running the following code in the editor window. This function will be saved in the global environments and can be used for other datasets also. Here we will make use of the mean of the population and variance of the sample dataset to see if the two datasets are significantly similar or not.

z.test = function(a,mu,var){
  zeta=(mean(a)-mu)/(sqrt(var/length(a)))
  return(zeta)
}

After creating the function, we will calculate the Z statistic for the sample data.

a<- HeightSample$`Height(in cm)`
z <- z.test(a,164.075,93.60324)

[1] -1.190425

We will now calculate the p-value from the z statistic to decide if our null hypothesis is true or not.

p_value <- 2*pnorm(-abs(z))

[1] 0.2338793

In the output, we get the Z Statistic to be at -1.190425 while the p-value comes out to be 0.2338793. As discussed in Z scores, Z test and Probability Distribution, our null hypothesis in this scenario will be that both the datasets are significantly similar. If we consider the significance level to be at 5%, then to accept the null hypothesis, our p-value should be more than the chosen significance level. In our example, the p-value is well above 0.05 (5%) thus the Z-Test correctly indicates that the means of both the datasets are the same and are not statistically significantly different from each other.

t-Test

t-Test is used to see whether two groups are similar or not. Z-test is also used for the same purpose, however, the difference between these tests is that the Z-test is used when the sample size is greater than 30, whereas t-Test is used when the sample size is less than 30. The difference has been explored in Brief Intro to T Test. There are different types of t-Tests for different scenarios and we put them to use below.

Two-sided One-Sample t-Test

Import Dataset: We will be working on a hypothetical Diamond dataset which has the information of the diamonds that were sold in a shop.

DiamondData <- read_excel('C:/Users/user/Desktop/Data Sets/DiamondData.xls')
View(DiamondData)

DiamondData dataset with IDNO, WEIGHT, COLOR, CLARITY, RATER and PRICE columns — Output: the Diamond dataset.

Perform Two-Tailed t-Test: We perform a One-Sample t-Test where we try to find that whether the mean of the variable 'WEIGHT' is statistically greater than 0.5 or not. Since this is a two-sided one-sample t-test, our Null Hypothesis is that the mean of 'WEIGHT' equals 0.5, and the Alternative Hypothesis is that it is not equal to 0.5. We now run a One-Sample t-Test to test this hypothesis.

t.test(DiamondData$WEIGHT,mu=0.5,alternative="two.sided")

One Sample t-test output for DiamondData WEIGHT against mu=0.5 — Output: One Sample t-test.

Unlike Python, R gives the value of mean in the output of t-test itself and we don't have to calculate it separately.

In our case the p-value comes out to be 0.193 which is greater than 0.05 (5% significance level) therefore, we fail to reject the null hypothesis that the mean equals 0.5; the difference from 0.5 is not statistically significant.

Independent t-Test

Import Dataset: Here we import an arbitrary, hypothetical dataset that has ID, age and their gender as variables.

AgeData = read.csv("C:/Users/user/Desktop/Data Sets/data1.csv")
View(AgeData)

AgeData dataset with patid, age and gender columns — Output: the AgeData dataset.

Our Null hypothesis is stated as below:

H0: Mean of Age for Males and Females are significantly similar.

Running Independent t-Test: We run an Independent t-Test using the function t.test. For independent t-test we mention var.eq = False(F) to specify for independent test. Var.eq is used to tell the program whether the variables have equal variance or not.

t.test(AgeData$age~AgeData$gender,mu=0,alt="two.sided",var.eq=F)

Welch Two Sample t-test output for AgeData age by gender — Output: Independent (Welch Two Sample) t-test.

Our null hypothesis is that both groups are statistically significantly similar. Here, the p-value is greater than 0.05, therefore, we accept the null hypothesis that these two groups are significantly similar. Even though the sample means differ, the difference between them is not statistically significant and can be attributed to sampling error.

Paired t-Test

Import Dataset: We import a hypothetical dataset that has marks of students in two different tests. Here we presume that they are the same tests undertaken over a period of time.

ScoresData =read_excel("C:/Users/user/Desktop/Data Sets/Student Test Scores.xls")
View(ScoresData)

ScoresData dataset with Student, Test A and Test B columns — Output: the ScoresData dataset.

Running Paired t-Test: We need to find whether the marks/score of the students have changed over time or not. For this, we compare the mean of the two tests and see if they are statistically significantly different from each other or not. Here our null hypothesis is that both test scores are significantly similar.

t.test(ScoresData$`Test A`,ScoresData$`Test B`,mu=0,alternative = "two.sided",paired = T)

Paired t-test output for ScoresData Test A and Test B — Output: Paired t-test.

We find that the p-value comes out to be 1.217e-06 which is very less than 0.05 (significance level of 5%). Therefore, we reject the null hypothesis i.e. these test scores are significantly different from each other.

F-test

Different types of F-Tests have been discussed under the theoretical blog for F-Tests, however, the most common F test is One-Way ANOVA. We can use the Age data used in the Independent t-Test and find if the age of males and females is statistically significantly different from each other or not.

ANOVA1 <- aov(age~gender,data=AgeData)
ANOVA1

ANOVA call output for age by gender — Output: the ANOVA1 call.

We now use the summary command.

summary(ANOVA1)

summary(ANOVA1) output with F value and Pr(>F) — Output: summary(ANOVA1).

Correlation Coefficients

Import Dataset: To come up with correlation coefficients, we used the diamond dataset that we have used above for performing Two-Tailed t-Test.

DiamondData <- read_excel('C:/Users/user/Desktop/Data Sets/DiamondData.xls')
View(DiamondData)

DiamondData dataset re-imported for the correlation section — Output: the Diamond dataset.

Scatter Plot

We create a scatterplot to study the correlation between two variables. In this example we chose two variables 'WEIGHT' and 'PRICE' from our Diamond dataset.

plot(DiamondData$WEIGHT,DiamondData$PRICE,main="Scatter Plot for Correlation",xlab="weight",ylab="Price")

Scatter plot of Price against weight, titled Scatter Plot for Correlation — Output: scatter plot of Price against weight.

The scatterplot shows a positive correlation between the two variables and thus there is a direct relation between Price and Weight of the diamond. An increase in weight will correspond to an increase in the Price of that diamond.

Calculating Correlation Coefficient

If we need to properly quantify the relationship between the two variables then rather than going for a visual approach, we are required to calculate correlation coefficient. We do this by using a function called cor.

cor(DiamondData$WEIGHT,DiamondData$PRICE)

[1] 0.954146

Correlation Coefficient Matrix

If we have a lot of variables then rather than calculating the correlation coefficient for each combination of variables we can come up with Correlation Coefficient Matrix. Here all the diagonal values will be 1 while correlation coefficient will be there for all the combination of the numerical variables. There are two methods of calculating Correlation Coefficient and its matrix - Pearson and Spearman.

Pearson correlation coefficient: Correlation Coefficient matrix using Pearson's method.

cor(DiamondData[c('IDNO.','WEIGHT','PRICE')], use="complete.obs", method="pearson")

Pearson correlation coefficient matrix of IDNO, WEIGHT and PRICE — Output: Pearson correlation coefficient matrix.

Spearman correlation coefficient: Correlation Coefficient matrix using Spearman's method.

cor(DiamondData[c('IDNO.','WEIGHT','PRICE')], use="complete.obs", method="spearman")

Spearman correlation coefficient matrix of IDNO, WEIGHT and PRICE — Output: Spearman correlation coefficient matrix.

Covariance Matrix

Covariance is a measure of correlation (Note that the correlation coefficient is a scaled form of covariance thus we get Correlation Coefficients when we standardize the covariance) and has been mentioned in the theoretical blog of Correlation Coefficients.

cov(DiamondData[c('IDNO.','WEIGHT','PRICE')], use="complete.obs")

Covariance matrix of IDNO, WEIGHT and PRICE — Output: covariance matrix.

Chi-Square Test

Import Dataset: Unlike Correlation Coefficients, Chi-Square is used to test the level of association between two categorical variables. To perform a chi-square test in R, we use a hypothetical dataset where we have two categorical variables Gender and Smoke.

SmokeData <- read_excel('C:/Users/user/Desktop/Data Sets/SmokingData.xls')
View(SmokeData)

SmokeData dataset with ID, Gender and Smoke columns — Output: the SmokeData dataset.

Frequency Table

In the blog, Chi Square under the theory section, (where the chi-square value is calculated without the use of any application), the need and importance of frequency table has been explored.

table2 <- table(SmokeData$Gender,SmokeData$Smoke)
table2

Frequency table of Gender against Smoke — Output: frequency table2.

Running Chi-Square Test

Our objective is to find if there is any relation between Gender and the habit of Smoking and as both are categorical variables, we perform a chi-square test where our null hypothesis is that these two variables are independent (not related) of each other.

CHI <- chisq.test(table2,correct = T)
CHI

Pearson's Chi-squared test with Yates' continuity correction output — Output: the CHI test result.

We now use the attribute command.

attributes(CHI)

attributes(CHI) output listing names and class — Output: attributes(CHI).

We can now come up with the expected table as explored in the theory section.

CHI$expected

CHI expected table — Output: CHI$expected.

The p-value comes out to be 0.48251 and as the p-value is greater than 0.05 therefore, we fail to reject the null hypothesis, i.e. there is no statistically significant association between the two variables.

In this blog, we applied the concepts explored in the theory part of Inferential Statistics. R is a powerful tool and can be used for bivariate analysis using various inferential statistics. Various bi-variate analysis can be performed that have been explored in Descriptive Statistics in R and can be put to use to better understand the data. In the next section, R will be used to apply the various concepts of data preparation explored in section two of Theory - Data Exploration and Preparation.