In Data patterns – I we touched on the importance of data patterns in evidence-based sciences. We then introduced the absolute fundamentals of data patterns: the concepts of central tendency and dispersion. In this post, we discuss another fundamental concept: the normal distribution.
Concept of “normal”
The normal distribution is a bell-shaped curve, centered on the mean μ, with its width set by the standard deviation σ. It looks like this:
Let’s explain a bit. Notice that the bulk of the curve’s area is around μ, the mean. This is how most measured quantities in the real world behave. Take adult Indian male heights, for instance, and suppose for illustration that the mean is 5 feet 10 inches. Most of the values will fall close to that mark. Fewer will sit about 1 standard deviation (σ) away from μ. Even fewer will be about 2 standard deviations away. Finally, values that are 3 or more standard deviations away will be so few that they’d be considered extremely rare.
Using formulas, statisticians have been able to calculate the percentage of data points that are within the “normal” range, i.e. within 1 standard deviation of the mean. The number is about 68%. If we also include data points that are a little less “normal” in their value, and fall within 2 standard deviations of the mean, the percentage goes up to about 95%. Finally, if we also include the rarer points that lie up to 3 standard deviations from the mean, we cover about 99.7% of the data points. Only the remaining 0.3% or so, the truly “abnormal” ones, lie more than 3 standard deviations away.
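The three percentages above can be checked with a few lines of Python. This is only a sketch: it relies on the standard identity that the fraction of a normal distribution within k standard deviations of the mean equals erf(k/√2), and the function name fraction_within is made up for this post.

```python
import math

def fraction_within(k: float) -> float:
    """Fraction of a normal distribution within k standard deviations of the mean."""
    # P(|X - mu| <= k * sigma) = erf(k / sqrt(2)); it does not depend on mu or sigma.
    return math.erf(k / math.sqrt(2))

for k in (1, 2, 3):
    print(f"within {k} standard deviation(s): {fraction_within(k):.2%}")

# within 1 standard deviation(s): 68.27%
# within 2 standard deviation(s): 95.45%
# within 3 standard deviation(s): 99.73%
```

So 68%, 95% and 99.7% are rounded versions of 68.27%, 95.45% and 99.73%.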
Properties of the normal curve
The curve is symmetrical
This is important because moving above or below the mean by the same amount should reduce a data point’s likelihood by the same amount. For example, if 5 feet 10 inches is the average mark, then a height 2 inches above that mark should be exactly as likely as a height 2 inches below it. If this is not the case, then the data are skewed, and a normal curve centered on 5 feet 10 inches is not a good description of them.
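Here is a small illustration of that symmetry using Python's built-in statistics.NormalDist. The mean of 70 inches is just 5 feet 10 inches converted; the standard deviation of 3 inches is a made-up value chosen only for the example.

```python
from statistics import NormalDist

# 5 feet 10 inches = 70 inches; sigma = 3 inches is a hypothetical value.
heights = NormalDist(mu=70.0, sigma=3.0)

for offset in (1.0, 3.0, 6.0):
    above = heights.pdf(70.0 + offset)  # curve height `offset` inches above the mean
    below = heights.pdf(70.0 - offset)  # curve height `offset` inches below the mean
    print(f"{offset} in above: {above:.5f}   {offset} in below: {below:.5f}")
```

Each printed pair should match: the curve is just as high a given distance above the mean as it is the same distance below.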
The curve is unimodal
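There can only be one peak to the curve, and it sits at the mean, μ. There can’t be two peaks, so there can’t be two separate “typical” values competing with each other. In a normal curve the mean, the median and the mode all coincide at this single peak.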
The maximum ordinate occurs at the center
Ordinate is a technical term for y-value. If the data values are plotted on the x-axis, and their relative likelihoods (the probability density) are plotted on the y-axis, then the largest y-value will occur in the middle, right on the mean. Thus, the maximum ordinate occurs at the center.
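To see this numerically, we can evaluate the curve on a grid of heights and ask where it is tallest. As before, this is only a sketch: the mean of 70 inches comes from the 5 feet 10 inches example, while the 3 inch standard deviation and the 60 to 80 inch grid are assumptions picked for illustration.

```python
from statistics import NormalDist

# Hypothetical height distribution: mean 70 in, standard deviation 3 in.
heights = NormalDist(mu=70.0, sigma=3.0)

# Heights from 60.0 to 80.0 inches in half-inch steps.
grid = [60.0 + 0.5 * i for i in range(41)]

# The x-value whose ordinate (y-value) is the largest.
peak_x = max(grid, key=heights.pdf)
print(peak_x)  # 70.0
```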
The curve is asymptotic on the x-axis
The term “asymptotic” means that the curve gets closer and closer to the x-axis but never touches it. You can see that even at the ends, the curve never falls flat on the axis. In the normal model, this means that no value, however extreme, has a likelihood of exactly 0, since 0 would mean the value cannot occur at all. Even for values that fall more than 3 standard deviations away from the mean, there remains a very small, but nonzero, likelihood of occurrence. Thus, the curve never hits 0, even at the far edges of the x-axis. It “asymptotes” to the x-axis.
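A quick sketch of this behavior, using the standard normal curve (mean 0, standard deviation 1) from Python's statistics module. The particular distances 3, 5, 8 and 10 are arbitrary choices.

```python
from statistics import NormalDist

std = NormalDist()  # standard normal: mu = 0, sigma = 1

# Height of the curve at k standard deviations from the mean.
distances = (3, 5, 8, 10)
densities = [std.pdf(k) for k in distances]

for k, y in zip(distances, densities):
    print(f"{k} standard deviations out: {y:.3e}")
```

The values shrink very quickly but stay above 0. One caveat: on a computer, the result will eventually round down to 0.0 for very large distances. That is a limit of floating-point arithmetic, not of the curve itself.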
The total area under the curve covers 100 % of the data points
We’ve already seen that the ranges μ±σ, μ±2σ and μ±3σ cover about 68%, 95% and 99.7% of the points respectively. If you go beyond μ±3σ then you will cover even more points. Ultimately, the area under the entire curve from end to end, including the rarest of the rare occurrences, will cover 100% of the points.
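The idea that "area under the curve" and "percentage of points" are the same thing can be checked by adding up the area numerically. The sketch below uses a simple trapezoid rule. The helper area_under, the step count, and the height distribution (mean 70 inches, hypothetical standard deviation of 3 inches) are all assumptions made for this example.

```python
from statistics import NormalDist

def area_under(dist: NormalDist, lo: float, hi: float, steps: int = 20_000) -> float:
    """Trapezoid-rule estimate of the area under dist.pdf between lo and hi."""
    width = (hi - lo) / steps
    total = 0.5 * (dist.pdf(lo) + dist.pdf(hi))  # the two end points count half
    for i in range(1, steps):
        total += dist.pdf(lo + i * width)
    return total * width

# Hypothetical height distribution: mean 70 in, standard deviation 3 in.
MU, SIGMA = 70.0, 3.0
heights = NormalDist(mu=MU, sigma=SIGMA)

for k in (1, 2, 3, 6):
    area = area_under(heights, MU - k * SIGMA, MU + k * SIGMA)
    print(f"area within {k} standard deviation(s): {area:.6f}")
```

As the range widens, the area should climb towards 1, that is, 100% of the points.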
The curve is bilateral
Exactly 50% of the area falls above the mean, and exactly 50% falls below it.
The curve is the most frequently encountered data model in behavioral sciences
I love studying behavior. This should be no secret if you follow this blog. So I’d like to mention that the normal curve is one of the most useful ways to model data in the behavioral sciences. The reason is that many behavioral measures are approximately normally distributed. They have a mean/normal value, and the further a value deviates from the mean, the more it is treated as abnormal.
We’ll discuss more about the normal distribution in the next post. Thank you for reading.