Gradient Descent

 

Breaking Down Gradient Descent 

If Machine Learning were a living organism, gradient descent would be its pulse. This post is for anyone prepping for interviews or just getting serious about training models. My goal here is to lay out the update rule, the intuition behind it, and the math so you can internalize all of it in one glance.


What Even Is Gradient Descent?

At the core of any learning algorithm is minimizing a cost function, a measure of how wrong your model is. Gradient descent is how you get there. You don’t jump to the answer; you take small, calculated steps in the right direction.

Gradient descent is an optimization algorithm that adjusts model parameters (like weights) to minimize the loss function.
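To make “how wrong your model is” concrete, here’s a tiny sketch of a mean-squared-error cost for a one-weight model (the data and names are my own toy example, not anything tied to a specific library):

    import numpy as np

    # Toy data: y is roughly 3 * x, so the "right" weight is near 3.
    x = np.array([1.0, 2.0, 3.0, 4.0])
    y = np.array([3.1, 5.9, 9.2, 11.8])

    def cost(theta):
        # Mean squared error for the one-weight model: prediction = theta * x
        return np.mean((theta * x - y) ** 2)

    print(cost(0.0))  # far from the right weight -> big error (about 67)
    print(cost(3.0))  # near the right weight -> tiny error (about 0.03)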

 


The Rule We Live By

Here’s the key update step you’ll see in every derivation:

θⱼ := θⱼ - α * ∂J(θ)/∂θⱼ

Where:

  • θⱼ = model parameter (like a weight)

  • α = learning rate (controls step size)

  • J(θ) = cost/loss function

  • ∂J/∂θⱼ = gradient (slope at that point)

This rule says: take your current weight and nudge it in the opposite direction of the gradient. Why? Because the gradient points in the direction where the cost increases fastest, so stepping the opposite way is the quickest route down.
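Here’s a minimal sketch of that update rule in code, reusing the toy MSE setup from above (again, my own illustration):

    import numpy as np

    x = np.array([1.0, 2.0, 3.0, 4.0])
    y = np.array([3.1, 5.9, 9.2, 11.8])

    def grad(theta):
        # dJ/dtheta for the cost J(theta) = mean((theta * x - y)^2)
        return np.mean(2 * (theta * x - y) * x)

    theta = 0.0    # start from an arbitrary guess
    alpha = 0.01   # learning rate

    for _ in range(100):
        theta = theta - alpha * grad(theta)   # the update rule, line for line

    print(theta)   # lands near 3, the weight that minimizes the cost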




Visualizing It

Picture the classic diagram:

  • The bowl-shaped curve is the cost function J(θ)

  • The dot is your current parameter state

  • The derivative (slope) points uphill, so you step downhill

  • You stop when you hit the valley—the minimum
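If you don’t have the picture handy, this little sketch tells the same story with numbers, on a bowl I made up: J(θ) = (θ - 3)², whose minimum sits at θ = 3.

    def grad(theta):
        # Slope of the bowl J(theta) = (theta - 3)^2
        return 2 * (theta - 3)

    theta = 8.0   # the dot starts high up on the right side of the bowl
    alpha = 0.1

    for step in range(10):
        theta = theta - alpha * grad(theta)   # step downhill
        print(f"step {step}: theta = {theta:.3f}")
    # theta slides toward 3, and the steps shrink as the slope flattens out

Watch the step sizes: the slope gets smaller as you approach the valley, so the updates naturally slow down near the minimum.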


Learning Rate (α): The Dealbreaker

Your step size matters:

  • Too small? You crawl and waste time.

  • Too big? You overshoot and may never converge.

Choosing α is like picking your pace on a hike. Go steady, adjust as needed.
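A quick sketch of both failure modes on the same made-up bowl, J(θ) = (θ - 3)²:

    def grad(theta):
        # Slope of the bowl J(theta) = (theta - 3)^2
        return 2 * (theta - 3)

    def run(alpha, steps=20):
        theta = 8.0
        for _ in range(steps):
            theta = theta - alpha * grad(theta)
        return theta

    print(run(alpha=0.001))  # too small: after 20 steps theta has barely moved (about 7.8)
    print(run(alpha=0.5))    # well chosen here: jumps straight to 3
    print(run(alpha=1.1))    # too big: each step overshoots further and theta blows up (about 195)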


Bonus Intuition

If the slope is:

  • Positive, subtracting α * slope reduces θ

  • Negative, subtracting a negative makes θ larger, again moving downhill

It’s like rolling a ball down a hill, except you’re doing it one equation at a time.
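Two concrete numbers (picked by me) on the same bowl, J(θ) = (θ - 3)², make the sign argument tangible:

    alpha = 0.1

    # Right of the minimum: slope = 2 * (5 - 3) = +4, so theta shrinks toward 3
    theta = 5.0
    print(theta - alpha * 4)     # 4.6

    # Left of the minimum: slope = 2 * (1 - 3) = -4, so theta grows toward 3
    theta = 1.0
    print(theta - alpha * (-4))  # 1.4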


This visual and breakdown helped me demystify all those math-heavy papers.

—Kavana




