Math behind the Gradient Descent Algorithm

Harjot Kaur
5 min read · Aug 6, 2022

Heads up: This piece is for folks with a basic understanding of the ML setup and some knowledge of derivatives.

Why do we care about the Gradient descent algorithm?

As we know, supervised machine learning algorithms are described as learning a target function (f) that best maps input variables (x) to an output variable (Y).

Y = f(x)

Given the above underlying principle, a typical supervised machine learning setup has five active agents at work. An ML environment must have:

(i) Data: input variables (x) mapped to an output variable (Y).

(ii) Model: Here we assume there exists a true relation between x & Y, which can be approximated using a functional form (Y = f(x)).

(iii) Parameter: Further, this function is governed by a parameter (w) used to make approximate predictions (Y = f(x, w)).

(iv) Learning algorithm: The above parameter (w) needs to be learnt from the data using an algorithm, e.g. Gradient Descent.

(v) Loss/objective function: Any learning from the data is guided by errors, also called the loss/cost/objective function, e.g. MSE.

In such a setup, the learning algorithm is often considered the mainstay for making accurate predictions. And the Gradient Descent Algorithm (GDA) is one of many algorithms for learning parameters from the data.

Therefore, let’s make an attempt at understanding the math behind how GDA works!

Gradient Descent Algorithm: An Explanation

The gradient descent algorithm is an optimization technique that follows the negative gradient of an objective function in order to locate the minimum of that function. It is technically called a first-order optimization algorithm, as it explicitly makes use of the first-order derivative of the target objective function. The functioning of GDA can be visualized as below:

Source: https://ml-cheatsheet.readthedocs.io/en/latest/gradient_descent.html

Food for thought!

Why does the algorithm HAVE to follow the NEGATIVE gradient of the loss function?

To find a local minimum of a function using gradient descent, we take steps proportional to the negative of the gradient of the function at the current point. In other words, we move in the direction opposite to the gradient. Have you ever wondered why?

Let’s attempt to understand this mathematically!

First, let us set up the notation and parameters.

(i) Let the randomly initialized vector of parameters be given as:
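$$\theta = [w, b]$$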

(ii) Let the change in the values of [w, b] be given as:
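$$\Delta\theta = [\Delta w, \Delta b]$$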

(iii) Now, we make a small movement of size η in the direction of Δθ to get the new θ. This can be given as:
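$$\theta_{new} = \theta + \eta \cdot \Delta\theta$$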

Source: Deep Learning IIT Ropar — Mitesh Khapra

This brings us to the most fundamental question: what is the right Δθ to use, or in other words, what is the optimal direction of movement for reaching the minimum?

The answer to this can be derived using the Taylor series.

A Taylor series is a representation of a function as an infinite sum of terms calculated from the values of the function’s derivatives at a single point.
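For a single-variable function f expanded around a point a, this takes the familiar form:

$$f(x) = f(a) + f'(a)(x-a) + \frac{f''(a)}{2!}(x-a)^2 + \cdots$$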

For ease of notation, let Δθ be u. Then, from the Taylor series, we have:
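$$L(\theta + \eta u) = L(\theta) + \eta \, u^{T}\nabla_{\theta}L(\theta) + \frac{\eta^{2}}{2!}\, u^{T}\nabla^{2}_{\theta}L(\theta)\, u + \cdots$$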

where,

L(θ + ηu) is the new loss after taking the step; L(θ) is the current loss at θ; η is the small step size in the direction of Δθ; u is Δθ; and ∇θ L(θ) is the gradient of the loss w.r.t. the parameters θ.

Now, since η is very small, the terms containing η² and higher powers will be very close to zero, hence we can ignore them. The above equation thus becomes:
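$$L(\theta + \eta u) \approx L(\theta) + \eta \, u^{T}\nabla_{\theta}L(\theta)$$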

Note that the new loss is less than the previous loss only if:
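$$L(\theta + \eta u) - L(\theta) = \eta \, u^{T}\nabla_{\theta}L(\theta) < 0$$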

This implies,
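$$u^{T}\nabla_{\theta}L(\theta) < 0$$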

We know that the dot product of two vectors is the product of their magnitudes and the cosine of the angle between them, i.e. a · b = |a||b| cos(β). Applying this to u and ∇θ L(θ) gives:
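$$\cos(\beta) = \frac{u^{T}\nabla_{\theta}L(\theta)}{|u| \, |\nabla_{\theta}L(\theta)|}$$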

Here, β is the angle between u and the gradient. We know that the value of cos(β) lies between −1 and 1. Using this in the above equation, we get:
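$$-1 \le \frac{u^{T}\nabla_{\theta}L(\theta)}{|u| \, |\nabla_{\theta}L(\theta)|} \le 1$$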

Denoting the (positive) denominator |u||∇θ L(θ)| by k, we get the condition below:
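$$-k \le u^{T}\nabla_{\theta}L(\theta) = k\cos(\beta) \le k$$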

Now, we want the difference in the loss function to be as negative as possible. Since k is positive, this difference is negative only when cos(β) is negative, and k·cos(β) reaches its minimum value when β is 180 degrees. Thus, the angle between u and the gradient should be 180 degrees; in other words, we should move in the direction opposite to the gradient for the maximum reduction in loss.

Gradient Descent Rule and the Parameter Update Equation

If you are at a certain value of θ = [w, b] and want to move to a new θ = [w, b] such that the new loss is less than the current loss, the Gradient Descent Algorithm tells us we should move in a direction opposite to the gradient.

The aforementioned rule can be given as:
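$$w_{t+1} = w_{t} - \eta \, \nabla_{w} L(w_{t}, b_{t})$$

$$b_{t+1} = b_{t} - \eta \, \nabla_{b} L(w_{t}, b_{t})$$

where t indexes the iteration and η is the learning rate (step size).

To see this rule in action, below is a minimal Python sketch assuming a single-feature linear model Y = w·x + b trained with an MSE loss; the synthetic data, learning rate, and number of steps are purely illustrative.

```python
import numpy as np

def gradients(w, b, x, y):
    """Gradients of the MSE loss L = mean((w*x + b - y)**2) w.r.t. w and b."""
    error = w * x + b - y
    dw = 2 * np.mean(error * x)
    db = 2 * np.mean(error)
    return dw, db

def gradient_descent(x, y, eta=0.1, steps=1000):
    w, b = 0.0, 0.0                   # initialize the parameters theta = [w, b]
    for _ in range(steps):
        dw, db = gradients(w, b, x, y)
        w -= eta * dw                 # move opposite to the gradient
        b -= eta * db
    return w, b

# Illustrative usage on noiseless synthetic data generated from y = 3x + 2
x = np.linspace(0, 1, 50)
y = 3 * x + 2
print(gradient_descent(x, y))         # approaches (3.0, 2.0)
```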

PS: If you’ve read the piece till here, then thank you for your patience, and I hope you’ve taken away something of value on GDA. In case you need a further drill-down on the equations or the concept, leave a message below or watch the detailed YouTube video (link mentioned under references).

