How Does Gradient Descent Actually Work: Gradient Descent vs Least Squares

How does Gradient Descent actually work? At a lower level, what’s happening? I was brushing over the details of Gradient Descent without really understanding it, or how it differs from Least Squares.

If you are having similar thoughts, I suggest this awesome video.

To summarize:

Say our data points (a=weight, b=height) are:

(0.5, 1.4)

(2.3, 1.9)

(2.9, 3.2)

Say we want:

Model: Linear regression
Cost function: SSE

This looks like:

Sum of Squared Residuals 
= Σ(Actual - Predicted)²
= (1.4 - (intercept + slope × 0.5))² + (1.9 - (intercept + slope × 2.3))² + (3.2 - (intercept + slope × 2.9))²
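
To make this concrete, here is a minimal Python sketch of this cost function for the three data points above; the intercept and slope values passed in at the end are just illustrative guesses, not anything prescribed by the post.

```python
# Sum of Squared Residuals for the three (weight, height) points above.
data = [(0.5, 1.4), (2.3, 1.9), (2.9, 3.2)]

def sse(intercept, slope):
    # (Actual - Predicted)^2, summed over every data point
    return sum((b - (intercept + slope * a)) ** 2 for a, b in data)

# Illustrative guesses for intercept and slope.
print(sse(0.0, 1.0))
```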

Imagine a 3D graph where the x-axis is the slope, the y-axis is the intercept, and the z-axis is the SSE. Our job is to find the (x,y) coordinates on this graph where z is at a minimum. Least Squares and Gradient Descent are simply two methods we can use to find this answer. How do these two methods differ? Both Least Squares and Gradient Descent attack this problem by taking the derivative of this equation with respect to the intercept and with respect to the slope. This results in 2 separate derivative equations. The difference is:
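As a sketch, those two derivative equations for our three data points can be written out by hand (via the chain rule) like this; the intercept and slope plugged in at the end are arbitrary values of my choosing, just to show the evaluation.

```python
# The gradient of the SSE: its partial derivatives with respect to
# the intercept and the slope.
data = [(0.5, 1.4), (2.3, 1.9), (2.9, 3.2)]

def gradient(intercept, slope):
    # d(SSE)/d(intercept) = sum of -2 * (Actual - Predicted)
    d_intercept = sum(-2 * (b - (intercept + slope * a)) for a, b in data)
    # d(SSE)/d(slope) = sum of -2 * a * (Actual - Predicted)
    d_slope = sum(-2 * a * (b - (intercept + slope * a)) for a, b in data)
    return d_intercept, d_slope

# Arbitrary intercept and slope, just to evaluate the two equations.
print(gradient(0.0, 1.0))
```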

  • Least Squares sets these derivatives equal to 0 and solves for the intercept and slope. This simply finds the minimum of our Cost Function (SSE) explicitly (see the first sketch after this list). Finding this minimum explicitly is not always feasible for highly complex functions.
  • Gradient Descent starts by selecting random values for the slope and intercept. It plugs these random values into the 2 derivative equations and multiplies each result by the Learning Rate to obtain step sizes. It then steps in the direction opposite the gradient, which is the (x,y) direction we need to move so as to most reduce z (the gradient itself points toward the steepest increase). In other words, the gradient tells us which nudges, or step sizes, we should make to our trainable parameters (slope, intercept) so as to most drastically reduce our Cost Function. Gradient Descent then takes these newly calculated values of (x,y) and repeats the process, recalculating new step sizes each time, until it finds a local minimum of the Cost Function (see the second sketch after this list). Keep in mind, this example covered a simple linear regression with only two parameters to train (x: slope, y: intercept), but often we need to find the minimum of a Cost Function with many more parameters than two, such as in a Neural Network.
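
Here is a minimal sketch of the Least Squares route for our three points: setting the two derivatives to zero and solving reduces, for simple linear regression, to the familiar mean-based formulas.

```python
# Least Squares: solve the two derivative equations set equal to zero.
data = [(0.5, 1.4), (2.3, 1.9), (2.9, 3.2)]

n = len(data)
mean_a = sum(a for a, _ in data) / n
mean_b = sum(b for _, b in data) / n

slope = (sum((a - mean_a) * (b - mean_b) for a, b in data)
         / sum((a - mean_a) ** 2 for a, _ in data))
intercept = mean_b - slope * mean_a

print(intercept, slope)  # the exact minimum of the SSE surface
```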
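And a minimal sketch of the Gradient Descent route for the same data; the starting values, Learning Rate, iteration count, and stopping tolerance are illustrative choices, not prescribed by anything above.

```python
# Gradient Descent: start from arbitrary values, repeatedly compute step
# sizes from the two derivatives, and nudge the parameters downhill.
data = [(0.5, 1.4), (2.3, 1.9), (2.9, 3.2)]

intercept, slope = 0.0, 1.0   # arbitrary starting guesses
learning_rate = 0.01          # illustrative choice

for _ in range(1000):
    # Plug current values into the two derivative equations (the gradient).
    d_intercept = sum(-2 * (b - (intercept + slope * a)) for a, b in data)
    d_slope = sum(-2 * a * (b - (intercept + slope * a)) for a, b in data)

    # Step size = derivative * Learning Rate; move opposite the gradient.
    step_intercept = learning_rate * d_intercept
    step_slope = learning_rate * d_slope
    intercept -= step_intercept
    slope -= step_slope

    # Stop once the steps become tiny (we are close to a minimum).
    if max(abs(step_intercept), abs(step_slope)) < 1e-6:
        break

print(intercept, slope)  # lands very close to the Least Squares answer
```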