Gradient Descent
Gradient descent is an iterative algorithm for finding a (local) minimum of a differentiable function, say f(x). It is widely used in machine learning to minimize a cost function, for example when fitting a linear regression model.
For a uni-variate function f(x), the update rule is

x := x − α · f′(x)

This update is repeated until we get a stable value for x; the learning rate α determines how fast the iteration converges.
Now, how does this help us determine the minimum of a function, and what advantage can we take from it?
Suppose a function with a single minimum, say f(x) = x². Graphically, it is a U-shaped curve whose lowest point is the minimum we want to reach.
Let's now draw a tangent at an arbitrary point (x, y) on the curve.
Take a second point (x′, y′) on the curve close to (x, y). As the two points move closer and closer together, the difference between them tends to zero, and the slope of the line through them approaches the derivative:

slope = (y′ − y) / (x′ − x) → dy/dx as x′ → x
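For instance, applying this limit definition to the example f(x) = x² used above gives its slope at any point:

```latex
f'(x) = \lim_{h \to 0} \frac{(x+h)^2 - x^2}{h}
      = \lim_{h \to 0} \frac{2xh + h^2}{h}
      = \lim_{h \to 0} (2x + h)
      = 2x
```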
The slope at any point is either negative, positive, or zero, and the learning rate α is always positive. The iterative update

x := x − α · (dy/dx)

moves the new value of x to the left of the old value when the slope is positive, to the right when the slope is negative, and leaves it unchanged when the slope is zero. As the iterations continue, this gradually takes us toward the minimum of the function, at a pace set by the learning rate, and so finds the value of x where y is minimum.
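Here is a minimal sketch of this one-dimensional update rule in Python, assuming the example f(x) = x² with derivative f′(x) = 2x; the starting point, learning rate, and stopping tolerance are illustrative choices:

```python
# Minimal sketch of uni-variate gradient descent on f(x) = x**2,
# whose derivative is f'(x) = 2x. The function, starting point,
# and stopping tolerance here are illustrative assumptions.

def gradient_descent(grad, x0, alpha=0.1, tol=1e-8, max_iters=10_000):
    """Iterate x := x - alpha * grad(x) until the step becomes tiny."""
    x = x0
    for _ in range(max_iters):
        step = alpha * grad(x)
        x -= step
        if abs(step) < tol:   # x has stabilized
            break
    return x

# f(x) = x**2 has its minimum at x = 0.
x_min = gradient_descent(grad=lambda x: 2 * x, x0=5.0)
print(x_min)  # ~0.0
```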
How can we use this algorithm to our advantage, or more specifically, to estimate a linear model that can predict future values of scattered data?
Suppose we have a scatter plot of data with a single variable (there could be two or more).
Let the line fitted through the data be h(x). If h(x) is the function that best fits all the data points, then the error it makes when estimating the values of y must be minimal.
The error is the distance between the actual value y and the value estimated by our hypothesis function h(x).
Squaring the error term makes every error positive and gives more weight to data points with larger errors. The total sum of squared errors is

SSE = Σ_i ( h(x_i) − y_i )²
This can be generalized into the cost function

J(θ) = (1 / 2m) · Σ_i ( h(x_i) − y_i )²

where m is the number of data points and the factor 1/2 simply makes the derivative cleaner.
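To make the cost concrete, here is a small sketch that evaluates J for a candidate line h(x) = θ0 + θ1 · x; the data points and parameter values are illustrative assumptions:

```python
import numpy as np

# Squared-error cost J for a candidate line h(x) = theta0 + theta1 * x.
# The sample data points below are illustrative assumptions.
def cost(theta0, theta1, x, y):
    m = len(y)
    h = theta0 + theta1 * x              # predictions h(x_i)
    return ((h - y) ** 2).sum() / (2 * m)

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([3.1, 4.9, 7.2, 8.8])       # roughly y = 2x + 1
print(cost(1.0, 2.0, x, y))              # near-perfect fit -> small J
print(cost(0.0, 0.0, x, y))              # poor fit -> large J
```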
Now we need to determine the parameters θ0, θ1, ... of h(x) such that the value of the cost function is minimal, and we already know how to get there: by the gradient descent algorithm.
repeat until convergence {
    θ_j := θ_j − α · ∂J/∂θ_j    (update all θ_j simultaneously)
}
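Putting it together, here is a minimal sketch of batch gradient descent for single-feature linear regression under the squared-error cost above; the synthetic data and hyperparameters are illustrative choices:

```python
import numpy as np

# Minimal sketch: batch gradient descent for single-feature linear
# regression, h(x) = theta0 + theta1 * x, minimizing the squared-error
# cost J. The data and hyperparameters here are illustrative assumptions.

def fit_linear(x, y, alpha=0.01, iters=5000):
    m = len(y)
    theta0, theta1 = 0.0, 0.0
    for _ in range(iters):
        h = theta0 + theta1 * x          # predictions h(x_i)
        error = h - y                    # h(x_i) - y_i
        grad0 = error.sum() / m          # dJ/dtheta0
        grad1 = (error * x).sum() / m    # dJ/dtheta1
        # Simultaneous update of both parameters.
        theta0 -= alpha * grad0
        theta1 -= alpha * grad1
    return theta0, theta1

# Noisy data scattered around y = 2x + 1.
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 50)
y = 2 * x + 1 + rng.normal(0, 0.5, 50)
print(fit_linear(x, y))  # roughly (1.0, 2.0)
```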
We should be careful while selecting the value of α. If α is too small, the algorithm converges slowly; if α is too large, the algorithm may fail to converge or, in some cases, may diverge.