Statistics in Data Science (Part 2)

Dec 6, 2021

This post is a continuation of my previous post:

Statistics in Data Science(Part 1)

commondatascientist.medium.com

Assuming the basics of hypothesis testing and inferential statistics from my previous post are clear, in this post we will discuss two industry-wide methods for deciding whether we “Reject the Null Hypothesis” or “Fail to reject the Null Hypothesis” because we do not have enough evidence to support the alternate hypothesis. Remember, we never say that we “Accept the Null Hypothesis”.

We will use the following example throughout.

You own multiple stores across India selling coolers, and you want to know the mean demand for cooler units per store per month during summer, because you expect heatwaves to push demand up this time. Historically, the mean number of coolers sold per store per month is 120, and you want to test whether the average this month differs from 120.
Some facts about the population:
Population mean (μ) = 120, standard deviation (σ) = 36
Let's formulate the null and alternate hypotheses for this example.
The null hypothesis is always the status quo, and the alternate hypothesis is its negation.
H₀: μ = 120, H₁: μ ≠ 120

Before going on to the methods, let's be clear about how we make a decision. We decide based on the critical region: if the sample mean lies within the critical region, we reject the null hypothesis, and if it lies outside, we fail to reject it. You can tell the position of the critical region from the sign in the alternate hypothesis. The right-side critical region starts at the UCV (upper critical value) and the left-side critical region starts at the LCV (lower critical value).

≠ in H₁ → Two-tailed test → Rejection region on both sides of the distribution
< in H₁ → Lower-tailed test → Rejection region on the left side of the distribution
> in H₁ → Upper-tailed test → Rejection region on the right side of the distribution
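
The mapping above can be sketched in a few lines of Python. This is only an illustration, not something from the original example: the function name `rejection_region` and the way the sign is passed in as a string are my own assumptions, and it works on the z scale using the standard library's `statistics.NormalDist` with a significance level of 5%.

```python
from statistics import NormalDist

def rejection_region(alpha, alternative):
    """Return (lower_z, upper_z) where the rejection region starts on the z scale.

    `alternative` is the sign in H1: "!=", "<" or ">" (hypothetical encoding).
    None means there is no rejection region on that side.
    """
    std_normal = NormalDist()  # standard normal: mean 0, standard deviation 1
    if alternative == "!=":
        # Two-tailed: alpha is split equally between both tails
        z = std_normal.inv_cdf(1 - alpha / 2)
        return (-z, z)
    if alternative == "<":
        # Lower-tailed: all of alpha sits in the left tail
        return (std_normal.inv_cdf(alpha), None)
    if alternative == ">":
        # Upper-tailed: all of alpha sits in the right tail
        return (None, std_normal.inv_cdf(1 - alpha))
    raise ValueError("alternative must be '!=', '<' or '>'")

lower, upper = rejection_region(0.05, "!=")
print(round(lower, 2), round(upper, 2))  # -1.96 1.96
```

For our example the sign in H₁ is ≠, so the rejection region sits in both tails.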

Now let us understand the most widely used method for making a decision, the p-value method

Let’s say that after this year's sales are over, you take a random sample of 36 stores, and the sample mean comes out to be 140.5 coolers, which differs from our null hypothesis value. Now the question arises: what is a p-value? In layman's terms, it tells us how compatible the observed data are with the null hypothesis. It is not the probability that the null hypothesis is true. The higher the p-value, the weaker the evidence against the null hypothesis, and the less likely we are to reject it.

To be technical, the p-value is the probability of observing a sample mean at least as extreme as the one we actually observed, assuming the null hypothesis is true. Let's see the graph below.

We can see in the graph that the higher the p-value, the closer the sample mean is to the population mean, and the higher the chance of not rejecting the null hypothesis.

Now let’s see the steps for making a decision using the p-value method.

1. Calculate the z-score (explained in the previous post) of the sample mean on the distribution
2. Calculate the p-value from the cumulative probability for the given z-score using the z-table
3. Compare the p-value (multiplied by 2 for a two-tailed test) with the given value of α (the significance level, which is 1 minus the confidence level). α is 5% if we want to be 95% confident. This is the value we will use for our example
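
As a rough sketch, steps 2 and 3 could look like this in Python. The helper names `p_value_from_z` and `decide` are hypothetical, the z-table lookup is replaced by `statistics.NormalDist().cdf`, and for the one-tailed case the code assumes the z-score lies on the side named in the alternate hypothesis.

```python
from statistics import NormalDist

def p_value_from_z(z, two_tailed=True):
    """Step 2: turn a z-score into a p-value using the standard normal CDF."""
    # Area in the tail beyond |z|. For a one-tailed test this assumes z is
    # in the direction stated by the alternate hypothesis (an assumption).
    tail_area = 1 - NormalDist().cdf(abs(z))
    return 2 * tail_area if two_tailed else tail_area

def decide(p_value, alpha=0.05):
    """Step 3: compare the p-value with the significance level alpha."""
    return "Reject H0" if p_value < alpha else "Fail to reject H0"

# Illustrative z-scores, not taken from the cooler example
print(decide(p_value_from_z(1.5)))  # Fail to reject H0
print(decide(p_value_from_z(2.5)))  # Reject H0
```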

Let’s see how to find the cumulative probability by reading the Z-table (https://www.ztable.net).

If the Z-score is positive, the sample mean is on the right side of the mean. For example, if the Z-score is +2.02, the table gives a cumulative probability of 0.9783 at this point.

For one-tailed test → p = 1 - 0.9783 = 0.0217
For two-tailed test → p = 2 * (1 - 0.9783) = 2 * 0.0217 = 0.0434

If the Z-score is negative, the sample mean is on the left side of the mean. For example, if the Z-score is -2.02, the table gives a cumulative probability of 0.0217 at this point.

For one-tailed test → p = 0.0217
For two-tailed test → p = 2 * 0.0217 = 0.0434
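
If you prefer not to read the table by hand, the same cumulative probabilities can be checked with Python's standard library. This is just a convenience sketch: `NormalDist().cdf` plays the role of the Z-table here, and the variable names are my own.

```python
from statistics import NormalDist

std_normal = NormalDist()  # standard normal distribution

right = std_normal.cdf(2.02)   # cumulative probability for Z = +2.02
left = std_normal.cdf(-2.02)   # cumulative probability for Z = -2.02

print(round(right, 4))  # 0.9783
print(round(left, 4))   # 0.0217

# By symmetry, the left-tail area at -2.02 equals the right-tail area at +2.02
print(round(1 - right, 4) == round(left, 4))  # True
```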

Coming back to our example

μx = μ = 120 and standard error = σ/√n = 36/√36 = 36/6 = 6
Z = (x̄ - μx) / (σ/√n) = (140.5 - 120)/6 = 20.5/6 ≈ 3.42
Cumulative probability from the Z-table for 3.42 is 0.99969
p-value for the two-tailed test = 2 * (1 - 0.99969) = 0.00062 = 0.062%

The p-value is less than our significance level of 5%, hence we “Reject the Null Hypothesis” in this case, which means we conclude that the average demand for coolers was not equal to 120 this time.
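
The whole calculation for the cooler example can be reproduced with a short script. It is a sketch using only the numbers stated in this post, and the variable names are my own. Because it uses the unrounded z-score instead of a two-decimal table lookup, the p-value it computes is very close to, but not exactly the same as, the table-based figure.

```python
from math import sqrt
from statistics import NormalDist

mu_0 = 120           # population mean under the null hypothesis
sigma = 36           # population standard deviation
n = 36               # number of stores in the sample
sample_mean = 140.5  # observed sample mean
alpha = 0.05         # significance level

standard_error = sigma / sqrt(n)           # 36 / 6 = 6.0
z = (sample_mean - mu_0) / standard_error  # 20.5 / 6, roughly 3.42

# Two-tailed p-value: twice the area beyond |z| under the standard normal
p_value = 2 * (1 - NormalDist().cdf(abs(z)))

print(standard_error)   # 6.0
print(p_value < alpha)  # True, so we reject the null hypothesis
```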

Critical Value Method

The first step of the critical value method is to find Zc. To do that, you calculate the cumulative probability at the UCV (upper critical value) from the value of α, and then use it to find the z-critical value (Zc) for the UCV.

As this is a two-tailed test, the 5% critical region is split into 2.5% on each side of the distribution, so the cumulative probability at the UCV is 1 - 0.025 = 0.975.
μx = μ = 120 and standard error = σ/√n = 36/√36 = 36/6 = 6
From the Z-table, the Z-score for a cumulative probability of 0.975 is 1.96, so Zc = 1.96
CV = μ ± (Zc * standard error), UCV = 120 + (1.96 * 6) = 131.76, LCV = 120 - (1.96 * 6) = 108.24. As our sample mean (140.5) is above the UCV, we “Reject the Null Hypothesis”.
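
Here is a minimal sketch of the critical value method for the same example. The variable names are my own, and the Z-table lookup is replaced by `NormalDist().inv_cdf`, which returns the z-score for a given cumulative probability.

```python
from math import sqrt
from statistics import NormalDist

mu_0, sigma, n = 120, 36, 36
sample_mean = 140.5
alpha = 0.05

standard_error = sigma / sqrt(n)  # 6.0

# Two-tailed test: cumulative probability at the UCV is 1 - alpha/2 = 0.975
z_c = NormalDist().inv_cdf(1 - alpha / 2)  # roughly 1.96

ucv = mu_0 + z_c * standard_error  # upper critical value
lcv = mu_0 - z_c * standard_error  # lower critical value

# Reject H0 if the sample mean falls in either tail of the critical region
reject = sample_mean > ucv or sample_mean < lcv

print(round(lcv, 2), round(ucv, 2), reject)  # 108.24 131.76 True
```

Both methods lead to the same decision, as they should, since they compare the same sample mean against the same distribution in two different ways.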

These are the two most widely used methods in the industry. I hope the example helps you understand them better. In the next post in this series, we will see some more industry demonstrations of hypothesis testing and try to build some intuition for A/B testing.

Written by Nirmal Maheshwari
