Data can be anything. If you’re collecting heights, then it would be a bunch of measurements in feet and inches. Or, if you’re collecting house prices, then it would be a bunch of price values. Data is only valuable if you can see a pattern in it. Such a pattern can help predict how unknown values of the quantity in question (height, house price, etc.) might behave. Therefore, analyzing patterns in data is crucial to any evidence-based science.
We expect most kinds of data to show a central tendency. Take height, for example. Let’s say our sample consists only of adult Indian men, and suppose we expect their heights to be around 5 feet 10 inches. Then “5 feet 10 inches” represents the central “tendency” in our dataset, i.e., any height picked at random from the sample is expected to be “around” 5 feet 10 inches.
As another example, in my neighborhood, we expect the price of a house to be around 2-3 crore rupees. So, that’s the central “tendency” for house prices in my neighborhood. Any house picked at random from my neighborhood is expected to be priced “around” the 2-3 crore rupee mark.
In the end, central tendencies give us an idea about the “expected value” in a dataset.
Some popular measures of central tendency are:
- Mean (the average: the sum of the values divided by the number of values)
- Mode (The most frequently occurring value)
- Median (the middle value once the data is sorted)
Which measure of central tendency is most useful depends on the dataset.
Consider the following datasets:
1) 20000, 50000, 40000, 30000, 10000. Here, the median (the middle value once the data is sorted, 30000) represents the central tendency well. It also happens to equal the mean.
2) 20000, 20000, 20000, 120000, 20000. Here, the mode (the most frequently occurring value, 20000) is most representative of the central tendency, because the single outlier of 120000 pulls the mean up to 40000.
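3) 20000, 50000, 25000, 35000, 40000. Here, no value repeats, so there is no meaningful mode, and the values are spread fairly evenly with no outliers. In this case, the average value (the mean, 34000) is a good measure of the central tendency, and the median (35000) lies close to it.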
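The three measures for the datasets above can be checked with a short Python sketch. It assumes only the standard library `statistics` module; the names `datasets` and `summary` are illustrative, and treating "no value repeats" as "no mode" is a convention chosen here, not the only possible one.

```python
import statistics

# The three example datasets from the list above.
datasets = {
    "dataset 1": [20000, 50000, 40000, 30000, 10000],
    "dataset 2": [20000, 20000, 20000, 120000, 20000],
    "dataset 3": [20000, 50000, 25000, 35000, 40000],
}

summary = {}
for name, values in datasets.items():
    # Assumption: if no value repeats, we report "no mode" (an empty list).
    has_repeats = len(set(values)) < len(values)
    modes = statistics.multimode(values) if has_repeats else []
    summary[name] = {
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "modes": modes,
    }

for name, measures in summary.items():
    print(name, measures)
```

The means come out as 30000, 40000 and 34000, and the medians as 30000, 20000 and 35000. Only the second dataset has a mode (20000), and its mean is the one most distorted by an outlier.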
Measures of dispersion
Determining the central tendency of a dataset is clearly useful. But another useful characteristic of data is its dispersion, or spread: how spread out is the data? There are many measures of dispersion. One example is the range, which is the difference between the highest and lowest values in the dataset. We may ask “what is the price range of a Rolex watch?” The answer is the difference between the cheapest price and the most expensive price. As you can see, by measuring the range, we try to form an idea of how spread out the values are.
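As a minimal sketch, the range can be computed in Python as shown here. The helper name `value_range` is hypothetical, and the inputs are two of the example datasets from the central tendency section rather than real price data.

```python
def value_range(values):
    """Return the difference between the largest and smallest value."""
    return max(values) - min(values)

# Two of the example datasets from the central tendency section.
evenly_spread = [20000, 50000, 40000, 30000, 10000]
with_outlier = [20000, 20000, 20000, 120000, 20000]

print(value_range(evenly_spread))  # 50000 - 10000 = 40000
print(value_range(with_outlier))   # 120000 - 20000 = 100000
```

Because the range depends only on the two extreme values, a single outlier is enough to inflate it, as the second dataset shows.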
Another measure of dispersion is standard deviation. Consider the following datasets:
- 10, 30, -10, 20
- 10, 9, 8, 7, 6
Just by eyeballing the two datasets, you can tell that there is more dispersion in the first dataset than in the second. This should be reflected in their standard deviations, since standard deviation is a measure of dispersion. Indeed, it is: using the population standard deviation formula (the square root of the average squared deviation from the mean), we get approximately 14.79 for the first dataset and 1.414 for the second. Clearly, 14.79 indicates greater dispersion than 1.414.
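The two figures can be reproduced with a short Python sketch. It assumes the population form of the standard deviation (dividing by the number of values), which is the form that matches the quoted results. The function name `population_std` is illustrative, and the standard library's `statistics.pstdev` is used as a cross-check.

```python
import math
import statistics

first = [10, 30, -10, 20]
second = [10, 9, 8, 7, 6]

def population_std(values):
    """Square root of the mean squared deviation from the mean."""
    mean = sum(values) / len(values)
    variance = sum((x - mean) ** 2 for x in values) / len(values)
    return math.sqrt(variance)

for data in (first, second):
    # Both columns should agree: the hand-written formula and the library call.
    print(round(population_std(data), 3), round(statistics.pstdev(data), 3))
```

The sample standard deviation (`statistics.stdev`, which divides by one less than the number of values) gives somewhat larger figures, roughly 17.08 and 1.581, but the first dataset still shows far more dispersion than the second.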
Questions to think about
1) If you have data on the salaries of entry-level HR professionals, what is the best measure of central tendency: the mean, the median or the mode?
2) When you’re calculating income disparity in a region, which measures of central tendency and which measures of dispersion are most useful? Why?
Thank you for reading.