Sammelan Yogi

Gradient Descent: Determining Minima Since 1847

A shallow study of on old algorithm used to minimize error in a function

Sammelan Yogi
Gradient Descent: Determining Minima Since 1847

Gradient descent is an iterative algorithm to determine the minimum of any function. In machine learning problems, this algorithm can be used in the process of minimization of the cost function in the linear regression technique.

For uni-variate function f(x)

xi+1xiαJ(xi)xx_{i+1} \Rightarrow x_i - \alpha \frac{\partial J(x_i)}{\partial x}

The new value of x is iterated until we get the stable value for x and α determines the converging rate.

Now let's see how it helps to determine the minima of a function and the advantages we can take.

Suppose a function,

f(x)=x2f(x) = x^2

The function can be represented graphically as:

Quadratic Function

Let's now determine a tangent at any point of the equation.


As the two points (x, y) and (x', y') become closer and closer to each other, the difference between them tends to be zero. The slope of the line can be derived as:

dydx=yyxx=x2x2xx=x2(xδx)2x(xδx)x=xδx=x2x2+2xδxδx2δxδx2=0=2xδxδx=2x\begin{aligned} \frac{dy}{dx}&=\frac{y-y^{\prime}}{x-x^{\prime}} \\ &=\frac{x^2-x^{\prime2}}{x-x^{\prime}} \\ &=\frac{x^2-(x-\delta x)^2}{x-(x-\delta x)} \because x^{\prime} = x - \delta x \\ &=\frac{x^2 - x^2 + 2x \delta x - \delta x^2}{\delta x} \because \delta x^2=0\\ &=\frac{2x \delta x}{\delta x} \\ &=2x \end{aligned}

The slope of the equation at any point is either negative, positive, or zero. α is the learning rate which is always positive.

The iterative function,

xxαdf(x)dxx \Rightarrow x - \alpha \frac{df(x)}{dx}

chooses the new value for x, such that it: 

  • is left to the old value of x when the slope is positive, 

  • is right when the slope is negative 

  • and remains the same when the slope is 0

This process gradually takes us to the minimum of the function depending upon the learning rate with each iteration and continuously determining the value of x where y can be minimum.

Let's see how we can estimate a linear model which can predict the future values of any scattered data.

For that, we suppose a graph of data with any number of variables.

Scatter Plot

Let the red line be h(x). If h(x) is the best function that fits into the data points, then the error it generates while calculating the values for the y must be minimum.

Error is the distance between the actual value y and the value we estimated from our heuristic function h(x).

e=yh(x)e = | y - h(x) |

To give more importance to errors in specific data points, we can square the error term. The total sum of the square of errors is:

E=i=1m(yih(x))2E = \sum_{i=1}^m (y_i-h(x))^2

The cost function can be more generalized to

J(x1,xn)=minx1,xn12mi=1m[yih(x1,xn)] J(x_1,\dots x_n) = \min_{x_1,\dots x_n} \frac{1}{2m} \sum_{i=1}^m [y_i - h(x_1,\dots x_n)]

Now we need to determine the value for x1, x2, ... such that the value of the cost function is minimum, and we know how to get it. (ofc by Gradient Descent Algorithm)

repeat until convergence:

xi+1xiαJ(x1,x2...xn)xix_{i+1} \Rightarrow x_i - \alpha \frac{\partial J(x_1, x_2...x_n)}{\partial x_i}

We should be careful while selecting the value of α. If α is very big, the algorithm may not converge. In some cases, it may even diverge.

Learning rate demonstration

Next read

Made with React, Gatsby and DatoCMS by @smastrom

Contribute or star on GitHub