Statistics in Data Science (Part 1)

Nirmal Maheshwari
Nov 27, 2021

I was an aspiring data scientist for a long time. Now that I am part of the Data Science community, and seeing how fast this profession has grown over the past few years, I want to help other “Aspirers” build their knowledge and finally crack the interviews. Going forward, I will write about the main topics of how Data Science works in the industry, and I will try to be as practical as I can based on my current experience. Below are some of my articles written with the same aim in mind.

https://medium.com/@nirmal.maheshwari

The base of any Machine Learning problem or algorithm is the Maths behind it. Maths is a huge ocean in itself, so we will limit ourselves to the Statistics that is used in practice, and go into the Maths behind specific algorithms as and when required.

In this part, we will get familiar with the most used term in Statistics for Data Science: “Hypothesis Testing”. I assume you are familiar with high school mathematics, though I will refresh some of the terms with definitions when required.

Hypothesis: When we analyze a sample from a population (the analysis could be descriptive, inferential, or exploratory in nature), we get information from which we can make claims about the entire population. These are just claims; we can’t be sure they are actually true. Such a claim or assumption is called a hypothesis, and to verify a claim about a population parameter we perform hypothesis testing. Let me explain this with an easy example.

A company has 15K employees and wants to calculate the average commute time to the office. Here the population is the 15K employees, but it is not practical to collect the commute time of every employee and calculate the mean. So we take a sample from this population, say 100 employees, and calculate their average commute time. From this sample we make a claim about the complete population of 15K employees; this claim is known as a hypothesis. (Remember this example; we will use it below to make an inference about the population commute time.)

The claim about the population can come from experience, exploration, or something we call Inferential Statistics. Let us understand some terms before we go into further details of Hypothesis Testing.

Probability Density Function (PDF)

For a continuous random variable, such as the commute time above, we cannot assign a probability to an exact value the way we can for a discrete random variable (for a continuous variable that probability is zero). Instead, we talk about the probability of the variable falling within an interval, and a PDF helps us do exactly that.

Google Image (PDF Function)

In the above graph, the probability of an employee having a commute time of less than 28 minutes is the area under the curve between 20 minutes (the lowest commute time) and 28 minutes.
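The "area under the curve" idea can be sketched in a few lines of Python. The density used here is a made-up triangular shape (lowest commute 20 minutes, peak at 30, highest 50); it is only an assumption for illustration and is not the curve from the graph. The function names are also hypothetical.

```python
def triangular_pdf(x, low=20.0, mode=30.0, high=50.0):
    """Hypothetical commute-time density in minutes (an assumption, not the article's graph)."""
    if x < low or x > high:
        return 0.0
    if x <= mode:
        return 2.0 * (x - low) / ((high - low) * (mode - low))
    return 2.0 * (high - x) / ((high - low) * (high - mode))


def probability_between(pdf, a, b, steps=10_000):
    """Approximate the area under `pdf` between a and b using the trapezoid rule."""
    width = (b - a) / steps
    total = 0.5 * (pdf(a) + pdf(b))
    for i in range(1, steps):
        total += pdf(a + i * width)
    return total * width


# Probability of a commute shorter than 28 minutes under the assumed density
p_under_28 = probability_between(triangular_pdf, 20.0, 28.0)
# Sanity check: the total area under any PDF should be (close to) 1
p_total = probability_between(triangular_pdf, 20.0, 50.0)

print(f"P(commute < 28 min) ~ {p_under_28:.3f}")
print(f"Total area under the PDF ~ {p_total:.3f}")
```

With a different density the number changes, but the recipe stays the same: a probability is the area over an interval, never the height of the curve at a single point.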

Normal Distribution

The normal distribution is a continuous distribution described by a PDF, and it is the most used distribution in the industry. It is symmetric, and its mean, median, and mode (high school maths) all lie at the center.

1–2–3 Rule for Normal Distribution and Z-Score

Google Image (1–2–3 Rule)
  1. About 68% probability of the variable lying within 1 standard deviation of the mean
  2. About 95% (more precisely 95.4%) probability of the variable lying within 2 standard deviations of the mean
  3. About 99.7% probability of the variable lying within 3 standard deviations of the mean
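These three percentages can be checked with the standard library alone. The sketch below relies on the known identity P(|Z| < k) = erf(k/√2) for a standard normal variable Z; the function name `prob_within` is just an illustrative choice.

```python
import math


def prob_within(k):
    """Probability that a normal variable lies within k standard deviations of its mean."""
    return math.erf(k / math.sqrt(2))


for k in (1, 2, 3):
    print(f"within {k} standard deviation(s): {prob_within(k):.4f}")
```

This prints 0.6827, 0.9545 and 0.9973, which are the rounded 68%, 95% and 99.7% figures of the rule.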

A Z-score, also called the standard score, measures how many standard deviations a raw score lies below or above the population mean. In simple terms, the Z-score tells you how far a data point is from the mean and in which direction. It is calculated as Z = (X − μ)/σ, where μ is the mean and σ is the standard deviation.
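As a quick illustration of the formula, here is a minimal sketch. The population mean of 30 minutes and standard deviation of 5 minutes are assumed values chosen only to make the arithmetic easy.

```python
def z_score(x, mean, std_dev):
    """Number of standard deviations that x lies above (+) or below (-) the mean."""
    return (x - mean) / std_dev


# Assumed population: mean commute 30 minutes, standard deviation 5 minutes
print(z_score(40.0, 30.0, 5.0))  # a 40-minute commute is 2 standard deviations above the mean
print(z_score(25.0, 30.0, 5.0))  # a 25-minute commute is 1 standard deviation below the mean
```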

Sampling Distribution of Sample Mean and Central Limit Theorem (CLT)

We now know what taking a sample of a population means. If we take many samples of the same size and compute the mean of each one, the distribution of those sample means is known as the sampling distribution. Let us denote the mean of the sampling distribution as μx and the population mean as μ. Below is what the CLT states.

  1. For a sample size n > 30 (a common rule of thumb), the sampling distribution is approximately a normal distribution, even when the population itself is not normally distributed
  2. Sampling distribution’s standard deviation (standard error) = σ/√n, where σ is the population’s standard deviation and n is the sample size
  3. Sampling distribution’s mean (μx) = population mean (μ)
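A small simulation can make these three statements more tangible. The sketch below assumes a deliberately skewed (exponential) population with a mean of 30 minutes, which also gives it a standard deviation of 30; these numbers, the seed, and the number of repetitions are illustrative assumptions, not data from the example above.

```python
import random
import statistics

random.seed(42)  # fixed seed so the run is repeatable

POP_MEAN = 30.0     # assumed population mean in minutes (exponential => sigma is also 30)
SAMPLE_SIZE = 100   # n
NUM_SAMPLES = 2000  # how many samples we draw to build the sampling distribution


def sample_mean(n):
    """Draw n values from the skewed population and return their mean."""
    values = [random.expovariate(1.0 / POP_MEAN) for _ in range(n)]
    return statistics.fmean(values)


sample_means = [sample_mean(SAMPLE_SIZE) for _ in range(NUM_SAMPLES)]

mean_of_means = statistics.fmean(sample_means)  # should be close to the population mean
sd_of_means = statistics.stdev(sample_means)    # should be close to sigma / sqrt(n)
expected_se = POP_MEAN / SAMPLE_SIZE ** 0.5     # sigma / sqrt(n) = 30 / 10 = 3

print(f"mean of sample means: {mean_of_means:.2f} (population mean {POP_MEAN})")
print(f"std dev of sample means: {sd_of_means:.2f} (sigma/sqrt(n) = {expected_se})")
```

Plotting a histogram of `sample_means` would show the familiar bell shape, even though the underlying population is far from normal.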

Now let us try to estimate the average commute time in our employee example.

Sample size (n) = 100, sample mean (𝑋̅) = 32.2 mins, sample standard deviation (S) = 10 mins

Using the CLT, we can say that the sampling distribution of the mean commute time will have:
1. Mean = μ (unknown)
2. Standard error = σ/√n ≈ S/√n = 10/√100 = 1 min
3. Since n (100) > 30, the sampling distribution is approximately normal

Now let us make a claim about the population mean μ. From the 1–2–3 rule, the probability of the sample mean lying within 2 standard errors of the population mean is 95.4%. Equivalently, we can be 95.4% confident that the population mean lies within 2 standard errors of our sample mean, that is, within 32.2 ± 2 × 1 minutes, which gives the interval (30.2, 34.2), where Z* is 2. Let's clarify the terminology used in this claim.

  1. The probability associated with the claim is called the confidence level (here it is 95.4%)
  2. The maximum error made in the sample mean is called the margin of error (here it is 2 minutes)
  3. The final interval of values is called the confidence interval (here it is the range (30.2, 34.2))

In general, Confidence Interval = (𝑋̅ − Z*S/√n, 𝑋̅ + Z*S/√n), where 𝑋̅ is the sample mean, S is the sample standard deviation, n is the sample size, and Z* is the Z-score associated with a y% confidence level.
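The formula can be sketched in Python using the commute-time numbers from the example (sample mean 32.2, sample standard deviation 10, n = 100). The helper names `z_star` and `confidence_interval` are hypothetical; `NormalDist().inv_cdf` from the standard library is used to look up Z* for an arbitrary confidence level.

```python
from statistics import NormalDist


def z_star(confidence_level):
    """Two-sided Z* for a confidence level given as a fraction, e.g. 0.95."""
    return NormalDist().inv_cdf((1 + confidence_level) / 2)


def confidence_interval(sample_mean, sample_sd, n, z):
    """Return (lower, upper) = sample_mean -/+ z * S / sqrt(n)."""
    standard_error = sample_sd / n ** 0.5
    margin_of_error = z * standard_error
    return sample_mean - margin_of_error, sample_mean + margin_of_error


# The article's example with Z* = 2: roughly (30.2, 34.2)
low, high = confidence_interval(32.2, 10.0, 100, 2.0)
print(f"Z* = 2: ({low:.2f}, {high:.2f})")

# The same sample at an exact 95% confidence level (Z* is about 1.96)
low95, high95 = confidence_interval(32.2, 10.0, 100, z_star(0.95))
print(f"95% level: ({low95:.2f}, {high95:.2f})")
```

A higher confidence level gives a larger Z* and therefore a wider interval; a larger sample size shrinks the standard error and narrows it.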

Now let us go back to our main topic, Hypothesis Testing. In this part we will only see a basic introduction; in the next parts we will dive deeper with some examples.

Null Hypothesis and Alternate Hypothesis

  1. Null Hypothesis (H0): this is the status quo. Alternate Hypothesis (H1): the challenge to the status quo.
  2. H0 is always defined with = or ≤ or ≥, and H1 with ≠ or > or <
  3. We never accept the Null Hypothesis. We either “reject the Null Hypothesis” or “fail to reject the Null Hypothesis”, meaning we don’t have enough evidence to support the Alternate Hypothesis.

Example: The revenue of Amazon India in the year 2020 was at least 14 billion dollars. Let's create the Null and Alternate Hypotheses for this claim.

H0: Revenue ≥ 14, H1: Revenue < 14 (in billion dollars)
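The rule that H0 always carries the equality sign can be written as a tiny helper. This is only a sketch of that one rule: the function name `formulate` and the ASCII operator spellings (`>=`, `<=`, `!=`) are assumptions made for the illustration. If the claim contains an equality it becomes H0; otherwise the claim becomes H1 and its complement becomes H0.

```python
# Each operator mapped to its logical complement
COMPLEMENT = {"=": "!=", "!=": "=", "<=": ">", ">": "<=", ">=": "<", "<": ">="}
# Operators that are allowed in the null hypothesis (they all include equality)
NULL_OPERATORS = {"=", "<=", ">="}


def formulate(quantity, operator, value):
    """Return (H0, H1) as strings for a claim such as 'Revenue >= 14'."""
    claim = f"{quantity} {operator} {value}"
    opposite = f"{quantity} {COMPLEMENT[operator]} {value}"
    if operator in NULL_OPERATORS:
        return claim, opposite  # the claim itself is the null hypothesis
    return opposite, claim      # the claim becomes the alternate hypothesis


h0, h1 = formulate("Revenue", ">=", 14)
print("H0:", h0)  # H0: Revenue >= 14
print("H1:", h1)  # H1: Revenue < 14
```

For a claim such as "the average commute time is more than 30 minutes", the same helper would place the claim in H1 and "Commute <= 30" in H0.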

We now understand the basics of Hypothesis Testing. In the next part of the series, we will see different methods to make a decision about a hypothesis or to verify it. We will also work through many examples of Hypothesis Testing and discuss some “Statistical Tests” used in the industry.

Hope this helps all fellow “Aspirers”. Cheers :)

Update: Part 2 is available on Medium as “Statistics in Data Science (Part 2)” by Nirmal Maheshwari (Dec 2021).


Nirmal Maheshwari

A common data scientist who was once an aspiring one. Here to make it easier for other aspiring data scientists to crack the interviews.