In many cases, the objective of machine learning is to reduce the error between the data prediction and the true prediction. For example, we might know that the height of an individual increases during adolescence. This is the true prediction. But the data we currently have may show that height increases during late childhood. This could happen for many reasons. For one, we may have more data from the late childhood group without knowing that’s the case, which is why the data seems to show that late childhood is the period when height increases. Getting back to the prime objective: it is to reduce the error between the data prediction and the true prediction, and that is what optimization theory is all about.

Initial objective function 

Look at the following function

F(p) = Ȳ – Y

This function outputs the error between Ȳ (the true prediction) and Y (the data prediction). It is the objective function because that’s the error we’re interested in; more specifically, it’s the error we want to minimize. Now, “p” in F(p) is the function parameter: a value that the function depends on, usually some characteristic of the data. For example, if our data is heights, then p may be the probability of a certain age given the height. Depending on the value of p, Ȳ and Y either move closer together or further apart.
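To make this concrete, here is a minimal sketch in Python. The heights data, the toy model height = p × age, and all the numbers below are hypothetical, chosen only to show what a function that depends on a parameter p looks like in code:

import numpy as np

# Hypothetical data: ages and the corresponding true heights (in cm).
ages = np.array([8.0, 10.0, 13.0, 15.0])
true_heights = np.array([128.0, 140.0, 158.0, 170.0])

def objective(p):
    # Y, the data prediction, from a toy model: height = p * age.
    data_prediction = p * ages
    # F(p) = Ȳ – Y, the error between true and data predictions.
    return true_heights - data_prediction

print(objective(11.0))  # the error at p = 11.0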

Gradient descent

“Gradient” is a calculus term. The gradient tells us which direction to move in to reach the maximum of a function, and if we reverse that direction, it points us toward the minimum. Using the gradient to move toward the minimum is called gradient descent. We want to reach the minimum value of the function because the function represents the error between the true prediction and the data prediction, and we want that error to be as small as possible.
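As a sketch of the mechanics in Python (the example function, learning rate, and starting point are my own arbitrary choices, not anything fixed by the theory):

def gradient_descent(grad, p, learning_rate=0.1, steps=100):
    # At each step, move p a little in the direction opposite the
    # gradient, i.e. downhill toward the minimum.
    for _ in range(steps):
        p = p - learning_rate * grad(p)
    return p

# Example: minimize F(p) = (p - 3)**2, whose gradient is 2 * (p - 3).
print(gradient_descent(lambda p: 2 * (p - 3), p=0.0))  # approaches 3.0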

Convexity and smoothness constraint

A convex function is one that has a pit: a single lowest point. The following function is convex:

It’s computationally much easier to find the minimum value of a function if it has a single pit. Therefore, the convexity constraint is essential in real-life optimization.
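As an aside, convexity can be checked numerically. A function f is convex if its value at the midpoint of any two points never exceeds the average of its values at those points; the following sketch (my own example, using f(x) = x²) tests that condition:

import numpy as np

def looks_convex(f, xs):
    # Convexity test: f((a + b) / 2) <= (f(a) + f(b)) / 2
    # must hold for every pair of sample points a and b.
    return all(f((a + b) / 2) <= (f(a) + f(b)) / 2
               for a in xs for b in xs)

xs = np.linspace(-5, 5, 21)
print(looks_convex(lambda x: x ** 2, xs))     # True: a single pit
print(looks_convex(lambda x: np.sin(x), xs))  # False: not convex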

A smooth convex function looks like a bowl in three dimensions and a U-shaped curve in two dimensions (see figure below).

Why a smooth convex function and not a sharp convex function? Even a sharp convex function has a minimum (a pit), so why insist on smoothness? Well, let’s look again at the sharp convex function.

If you put a ball into that function, it will roll down the slope so quickly that it rides up the other side instead of stopping at the minimum (the pit). After riding up the other side, it rolls back down with so much momentum that it rides up the original side again. This goes on and on, so the minimum is never actually attained. The ball never stops at the pit; it keeps riding up and down the sides.

Now, look at the smooth convex function.

If you drop a ball into this function, the momentum of its descent will be slowed by the curve. Thus, there will be far less riding up and down the sides before the ball settles into the pit.

This is an analogy, but it captures why gradient descent is easier on a smooth convex function than on a sharp one. Near the pit of a sharp function, the slope stays just as steep, so each step of gradient descent keeps overshooting the minimum; near the pit of a smooth function, the slope flattens out, so the steps shrink and the descent settles. Since we rely on gradient descent to point us toward the minimum, we don’t want to overload the computer with endless ride-up-and-down steps, which means we want the function to be smooth.
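We can also see this numerically. In the following sketch (my own toy example), |p| stands in for a sharp convex function and p² for a smooth one; with the same step size, the sharp function hops back and forth over the pit forever, while the smooth one settles:

import numpy as np

def descend(grad, p=2.25, learning_rate=0.3, steps=12):
    trail = [p]
    for _ in range(steps):
        p = p - learning_rate * grad(p)
        trail.append(round(p, 3))
    return trail

# Sharp convex function F(p) = |p|: the gradient is always +1 or -1, so
# the step size never shrinks and p oscillates around the pit at 0.
print(descend(lambda p: np.sign(p)))

# Smooth convex function F(p) = p**2: the gradient 2p shrinks near the
# pit, so the steps shrink too and p settles at the minimum.
print(descend(lambda p: 2 * p))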

Notice that the raw error Ȳ – Y can be negative, and its absolute value |Ȳ – Y| has a sharp corner at its pit, so it is not a smooth function. On the other hand, F(p) = (Ȳ – Y)² is smooth: it never goes negative, and its slope flattens gently toward the pit. Therefore, this becomes the final objective function.
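To tie it together, here is a small end-to-end sketch using the same hypothetical heights model as before, now trained by gradient descent on the final objective (the learning rate and step count are arbitrary choices). The gradient follows from the chain rule: d/dp (Ȳ – p·age)² = –2 (Ȳ – p·age) · age.

import numpy as np

# The same hypothetical data as before.
ages = np.array([8.0, 10.0, 13.0, 15.0])
true_heights = np.array([128.0, 140.0, 158.0, 170.0])

def objective(p):
    # Final objective: the squared error, summed over the data points.
    return np.sum((true_heights - p * ages) ** 2)

def gradient(p):
    # Chain rule: d/dp (Ȳ - p*age)**2 = -2 * (Ȳ - p*age) * age.
    return np.sum(-2 * (true_heights - p * ages) * ages)

p = 0.0
for _ in range(200):
    p -= 0.001 * gradient(p)

print(p, objective(p))  # p settles near the value that minimizes F(p)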

We’ll stop here, for now. Essentially, we talked about the objective function and two key constraints: convexity and smoothness. In the next one, we’ll cover more constraints that aim to further simplify the computational complications of error minimization.