In Data Patterns – II we introduced the normal distribution. In this post, we’re going to introduce a related concept called the z score. The whole point of constructing the normal distribution for a dataset is to figure out the mean of the dataset and the spread around the mean. The z score measures exactly how many standard deviations a data point is away from the mean.
Formula for z score
Let x be any data point from a given dataset. The z score of x tells us how far x is from the mean (μ), measured in standard deviations (σ). In other words, the z score counts how many times σ fits into the distance (x – μ).
Therefore, the formula is: z = (x – μ)/σ
Here is an example. In a “statistics for machine learning” class, the average score on the final exam was 62, give or take 5 (that is, with a standard deviation of 5). Sally’s score on the final exam was 92. How “normal” is Sally’s score? Karan’s final exam score was 52. How “normal” is Karan’s score?
μ = 62, σ = 5, x (for Sally) = 92, x (for Karan) = 52
z score (for Sally) = (x (for Sally) – μ)/σ = (92 – 62)/5 = 6.
This means that Sally’s score was a whopping 6 standard deviations away from the class average/mean. So, it was definitely not normal. It was among the rarest/least expected scores.
z score (for Karan) = (x (for Karan) – μ)/σ = (52 – 62)/5 = – 2.
Now, this means that Karan’s score was 2 standard deviations away from the class average/mean. It was not exactly normal. Normal would be within 1 standard deviation from the mean. But it’s not too rare either. The “-” sign on Karan’s z score indicates that his score (52) fell below the average mark (62).
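The two calculations above can be sketched in a few lines of Python. This is only an illustrative sketch: the function name z_score and the variable names are our own choices, not part of any library.

```python
def z_score(x, mean, std_dev):
    """Return how many standard deviations x lies from the mean.

    A negative result means x is below the mean.
    """
    if std_dev <= 0:
        raise ValueError("std_dev must be positive")
    return (x - mean) / std_dev


# Values taken from the exam example: mean 62, standard deviation 5.
mean, std_dev = 62, 5
sally_z = z_score(92, mean, std_dev)
karan_z = z_score(52, mean, std_dev)

print(sally_z)  # 6.0
print(karan_z)  # -2.0
```

The sign of the result carries the direction (above or below the mean), and its size carries the distance in standard deviations.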
Table of area under normal curve
Think back to the following figure:
We had explained that 68% of the datapoints fall within μ±σ.
95% of the datapoints fall within μ±2σ, and
99.7% of the datapoints fall within μ±3σ
The μ±σ bracket represents datapoints that would be considered “normal.”
The μ±2σ bracket includes datapoints that are not exactly “normal” but not completely unexpected either, and
The μ±3σ bracket includes even the most rare and unexpected datapoints.
Why do we bring up this figure again? Well, because the 68%, 95% and 99.7% figures are approximate. To know the exact percentage of points that fall under a given region of the normal curve, we need to look at a table constructed by statisticians.
The table is a little hard to see, but you can just zoom in for clarity. Look at the 1.0 entry. For 1.0, the value is 3413. This means that 34.13% of the points fall between the mean and 1 standard deviation above it. But earlier we said that the percentage of points that fall within 1 standard deviation of the mean is 68%, not 34.13%. You’re right. We did say that. We said that because we considered both sides of the mean, i.e., we considered 1 standard deviation above and below the mean, which gives us 34.13 + 34.13 = 68.26%.
Similarly, notice the 2.0 and 3.0 entries. For 2.0, the value is 4772, which means 47.72%. Doubling that, we get approximately 95%. Likewise, for 3.0, the value is 4986.5, which means 49.865%. Doubling this, we get approximately 99.7%.
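If you don’t have the table at hand, the same entries can be reproduced with Python’s standard library. The sketch below assumes the standard result that the area between the mean and z standard deviations above it equals 0.5 · erf(z/√2); the function name area_mean_to_z is our own.

```python
import math


def area_mean_to_z(z):
    """Fraction of a normal distribution lying between the mean and
    z standard deviations away from it (one side only)."""
    return 0.5 * math.erf(abs(z) / math.sqrt(2))


for z in (1.0, 2.0, 3.0):
    one_side = area_mean_to_z(z)
    both_sides = 2 * one_side  # the curve is symmetric around the mean
    print(f"z = {z}: one side = {one_side:.2%}, both sides = {both_sides:.2%}")
```

The one-sided values should agree with the 1.0, 2.0 and 3.0 table entries up to rounding, and the doubled values with the 68%, 95% and 99.7% figures.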
Ultimately, the z score (for a datapoint x, given μ) can be matched with its value in the area under the normal curve table to determine the % of points that fall between x and μ.
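As a rough illustration of that lookup, here is a sketch that skips the printed table and uses statistics.NormalDist from Python’s standard library (available from Python 3.8) to get the percentage of points between x and μ. The helper name percent_between_mean_and is hypothetical, and the numbers are the ones from the exam example.

```python
from statistics import NormalDist


def percent_between_mean_and(x, mean, std_dev):
    """Percentage of a normal distribution lying between the mean and x."""
    dist = NormalDist(mu=mean, sigma=std_dev)
    # cdf(mean) is exactly one half, so the area between the mean and x
    # is the distance of cdf(x) from 0.5.
    return abs(dist.cdf(x) - 0.5) * 100


# Karan's score of 52 has a z score of -2 (mean 62, standard deviation 5).
karan_pct = percent_between_mean_and(52, 62, 5)
print(f"{karan_pct:.2f}% of scores fall between Karan's score and the mean")
```

The result for Karan should match the 2.0 entry of the table up to rounding, since his score sits 2 standard deviations from the mean.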
We’ll do some examples applying the “area under the normal curve table” in the next write up. Thank you for reading.