We learn about area in elementary school. The most fundamental notion is land area. Look at the figure below.
We can think of the rectangle as a slab of land. Its dimensions are 5 and 2, in terms of blocks, and the total area is 5×2, i.e., 10 blocks.
Land area is a fairly direct application of area. Even if your land isn’t perfectly rectangular, you can approximate it using rectangles, either by breaking it up into smaller rectangles or by approximating the whole stretch using one big rectangle. Either way, the dimensions of the rectangle(s) can give you an area estimate.
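To make the two approximation options concrete, here is a minimal sketch. The L-shaped plot and its dimensions are made up for illustration; they are not taken from the figure.

```python
# Hypothetical L-shaped plot, split into two rectangles.
# Each entry is (width, height), measured in blocks.
pieces = [(5, 2), (2, 3)]


def rectangle_area(width, height):
    """Area of a rectangle in square blocks."""
    return width * height


# Option 1: break the plot into smaller rectangles and add up their areas.
piecewise_estimate = sum(rectangle_area(w, h) for w, h in pieces)

# Option 2: cover the whole plot with one big 5-by-5 rectangle.
# This is quicker but overestimates, because it includes the empty corner.
bounding_estimate = rectangle_area(5, 5)

print(piecewise_estimate)  # 16
print(bounding_estimate)   # 25
```

The smaller the rectangles you are willing to use, the closer the first estimate gets to the true area.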
We learn this in elementary school. Fast forward to adulthood, and we see area being used to model interrelationships in data. How does a concept go from modelling physical spaces to modelling data behavior? That’s what school should have taught us after elementary school, probably in senior high school. But it didn’t, and it’s actually quite simple.
Let’s say we’re interested in depression data, and we characterize depression in terms of two primary dimensions, unhappiness and stress level. So, each person in the test sample would be an (x, y) data point where x = that person’s unhappiness, and y = that person’s stress level. After putting all the (x, y) points on a graph sheet, let’s say we obtain the following plot.
What is the area of that rectangle? In the land example, one dimension was width and the other was height. The unit was the same for both. Meters, kilometers, whatever it may have been, it was still the same for both dimensions. So, it made sense to compute width × height for the total area in terms of the land’s primary dimensions.
But here, the units for stress level and unhappiness are not the same. Let’s say the unit of stress level is sl, and the unit of unhappiness is uh. 1 unit of sl is not equal to 1 unit of uh. So then why does it make sense to conceptualize the relationship between them as sl × uh at any given region in that rectangle?
Think about it. The assumption is that stress and unhappiness are related. If you look closely, even the data shows that they are related. Except for a few data points near the top, which seem to go haywire, the rest show a pretty steady trend. As stress increases, so does unhappiness. Thus, it seems that the stress value can be derived from the unhappiness value. Therefore, you could say that stress is ultimately a function of unhappiness. In other words:
sl = f(uh)
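One way to make sl = f(uh) concrete is to fit a straight line through the points. This is only a sketch: the scores below are invented, and the choice of a straight line (ordinary least squares) for f is an assumption, not something the plot dictates.

```python
# Made-up (unhappiness, stress) scores, for illustration only.
uh = [1.0, 2.0, 3.0, 4.0, 5.0]
sl = [2.1, 3.9, 6.2, 7.8, 10.0]


def fit_line(xs, ys):
    """Least-squares fit of ys = slope * xs + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


slope, intercept = fit_line(uh, sl)


def f(u):
    """Estimated stress level for a given unhappiness score."""
    return slope * u + intercept


print(f"sl = {slope:.2f} * uh + {intercept:.2f}")
print(f"estimated stress at uh = 3: {f(3.0):.2f}")
```

Once f is known, a stress score can be expressed through an unhappiness score, which is the idea the text builds on.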
Redoing the plot with that in mind, we obtain the following:
Now, both dimensions are in terms of uh (unhappiness). When we talk about land, the distance covered along one dimension (width) and the distance covered along the other dimension (height) gives us the total area of that land. Similarly, here, unhappiness along one dimension (direct unhappiness) and unhappiness along the other dimension (unhappiness through stress) give us the total area of the depression characterized by unhappiness and stress.
Why is a rectangle the best approximator for area? Why not a circle (see the figure below)?
Well, for the above mass, a circle can work as a pretty good approximator. But the more data points we collect, the more the mass starts to look like this:
Look at the figure. The points represent sl (stress level) scores. It shows that when you collect more and more scores for sl, most of them collect around a “normal” value range. A few of them deviate from that range, as you can see on the sides.
This is true with most things in life. For example, there is a normal height range and if you collect height scores they will mostly collect around that normal height range. Only a few will deviate, here and there, to the sides. Similarly, weight, unhappiness ratings, so on and so forth.
The figure above is a curve, not a rectangle. So, when we say area, we’d be interested in the area under the curve. What does that tell you? This time the area tells you about the pattern along one dimension, not the relationship between two dimensions.
The pattern here is, well, maximum mass around the center/normal range and tapering on the sides (which shows very little mass when you move away from the normal range). This pattern with respect to one dimension is another aspect of data behavior. Thus, we see that area is once again a useful concept in modelling data behavior. This time, in modelling the pattern along one dimension rather than the relationship between two dimensions.
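The pattern described above can be seen in simulated scores. This sketch assumes the scores follow a normal distribution with a mean of 50 and a standard deviation of 10; both numbers are hypothetical and chosen only for illustration.

```python
import random

random.seed(0)  # fixed seed so the run is repeatable

# Assumed for illustration: scores are normally distributed.
MEAN, SD = 50.0, 10.0
scores = [random.gauss(MEAN, SD) for _ in range(10_000)]


def share_between(values, low, high):
    """Fraction of values that fall in the range [low, high)."""
    inside = sum(1 for v in values if low <= v < high)
    return inside / len(values)


# Share of scores in the "normal" range, one SD either side of the mean.
center = share_between(scores, MEAN - SD, MEAN + SD)
# Everything else sits in the two tails.
tails = 1.0 - center

print(f"within the normal range: {center:.2f}")
print(f"out in the tails:        {tails:.2f}")
```

Most of the mass lands in the central range, and only a minority of scores end up on the sides.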
We can approximate the “normal” curve like so:
Notice that we have used thinly sliced rectangles to approximate the area at different mass regions of the curve. It would be very hard to carry out the same approximation using circles. This should tell you why rectangular areas are more popular than circular areas, in general.
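The thin-rectangle idea can be sketched in a few lines. This example assumes the curve is the standard normal density (mean 0, standard deviation 1) and places each rectangle's height at the midpoint of its slice; the function names are made up for this sketch.

```python
import math


def normal_pdf(x, mean=0.0, sd=1.0):
    """Height of the normal curve at x."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def area_by_rectangles(f, low, high, slices):
    """Approximate the area under f on [low, high] with thin rectangles."""
    width = (high - low) / slices
    total = 0.0
    for i in range(slices):
        midpoint = low + (i + 0.5) * width
        total += f(midpoint) * width  # height × width of one rectangle
    return total


# Nearly all of the mass lies within 4 standard deviations of the center.
area = area_by_rectangles(normal_pdf, -4.0, 4.0, 1000)

# Mass in the "normal" range, within 1 standard deviation of the center.
center_area = area_by_rectangles(normal_pdf, -1.0, 1.0, 1000)

print(f"area from -4 to 4: {area:.4f}")
print(f"area from -1 to 1: {center_area:.4f}")
```

Each rectangle only needs a height and a width, which is what makes slicing a curve this way so convenient.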
Thus, we see why the area of a rectangle is such a popular feature in data science. It can be used to model all sorts of data behavior. In this particular post, we saw how area can be used to model the relationship between two dimensions of data, and also the pattern along one dimension.
Thank you for reading.