Logistic Regression Cost Function

A Deep Dive

Logistic regression is a popular classification algorithm, often used when we need to predict a binary outcome — for example, whether an email is spam (1) or not spam (0).

The goal of logistic regression is to find model parameters (weights) that produce probabilities close to the true labels. But how do we measure how "good" our parameters are? That’s where the cost function comes in.


Why Not Use Mean Squared Error (MSE)?

If we tried to use the Mean Squared Error from linear regression for classification, we’d run into trouble:

  • Logistic regression uses the sigmoid function to map a linear score to a probability.

  • The sigmoid is non-linear, and MSE would produce a non-convex cost surface for logistic regression.

  • A non-convex cost surface means gradient descent could get stuck in local minima, making optimization harder.

We need a cost function that is convex so gradient descent works reliably.
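To make the convexity point concrete, here is a minimal sketch (NumPy, with a single made-up one-feature training point) that sweeps one weight and checks the curvature of both losses numerically: the squared-error curve changes the sign of its curvature, while the cross-entropy loss derived later in this post does not.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical single training point, chosen only for illustration
x, y = 2.0, 0.0

ws = np.linspace(-3, 3, 61)                 # sweep a single weight w
p = sigmoid(ws * x)                         # model output for each w
mse  = (p - y) ** 2                         # squared-error loss
xent = -(y * np.log(p) + (1 - y) * np.log(1 - p))   # cross-entropy loss

# A convex curve never has a negative second finite difference.
d2_mse, d2_xent = np.diff(mse, 2), np.diff(xent, 2)
print("MSE curvature changes sign:", d2_mse.min() < 0 < d2_mse.max())
print("Cross-entropy stays convex:", (d2_xent >= -1e-12).all())
```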


From Likelihood to Cost

Instead of minimizing MSE, logistic regression uses Maximum Likelihood Estimation (MLE).

Our model outputs the probability:

\hat{y} = \sigma(z) = \frac{1}{1 + e^{-z}}

where

z = w^T x + b
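As a quick illustration, here is a minimal NumPy sketch of these two equations; the weights, bias, and input matrix are made up purely for demonstration.

```python
import numpy as np

def sigmoid(z):
    """Map a real-valued score to a probability in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def predict_proba(X, w, b):
    """Model output: sigma(w^T x + b) for each row of X."""
    return sigmoid(X @ w + b)

# Made-up weights, bias, and inputs, purely for illustration
X = np.array([[0.5, 1.2],
              [-1.0, 0.3]])
w = np.array([0.8, -0.4])
b = 0.1
print(predict_proba(X, w, b))   # two probabilities between 0 and 1
```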

For a given training example (x^{(i)}, y^{(i)}), where y \in \{0, 1\}:

  • If y = 1, we want \hat{y} to be close to 1.

  • If y = 0, we want \hat{y} to be close to 0.

The probability of the correct class can be written as:

P(y^{(i)} | x^{(i)}) = (\hat{y}^{(i)})^{y^{(i)}} \cdot (1 - \hat{y}^{(i)})^{(1 - y^{(i)})}

This works because:

  • If y = 1, the term becomes \hat{y}

  • If y = 0, the term becomes 1 - \hat{y}
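A tiny check of this expression (with a made-up prediction of 0.9) shows it picking out the right factor for each label:

```python
def prob_of_true_label(y_hat, y):
    # y_hat^y * (1 - y_hat)^(1 - y): equals y_hat when y == 1, and 1 - y_hat when y == 0
    return (y_hat ** y) * ((1 - y_hat) ** (1 - y))

print(prob_of_true_label(0.9, 1))   # 0.9
print(prob_of_true_label(0.9, 0))   # ≈ 0.1
```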


The Likelihood Function

For mm training examples, the likelihood is:

L(w, b) = \prod_{i=1}^m (\hat{y}^{(i)})^{y^{(i)}} (1 - \hat{y}^{(i)})^{(1 - y^{(i)})}

Maximizing this likelihood means finding parameters w, b that make the observed data most probable.
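In code this is just a product of the per-example probabilities above. A minimal sketch with made-up labels and model outputs:

```python
import numpy as np

def likelihood(y_hat, y):
    """Product over examples of y_hat^y * (1 - y_hat)^(1 - y)."""
    return np.prod((y_hat ** y) * ((1 - y_hat) ** (1 - y)))

y     = np.array([1, 0, 1, 1])          # made-up labels
y_hat = np.array([0.9, 0.2, 0.7, 0.6])  # made-up model outputs
print(likelihood(y_hat, y))             # 0.9 * 0.8 * 0.7 * 0.6 = 0.3024
```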


Log-Likelihood

Multiplying many small probabilities can cause numerical underflow. Because the logarithm is monotonic, maximizing the log of the likelihood is equivalent to maximizing the likelihood itself, and taking the log turns the product into a sum:

\ell(w, b) = \sum_{i=1}^m \left[ y^{(i)} \log \hat{y}^{(i)} + (1 - y^{(i)}) \log (1 - \hat{y}^{(i)}) \right]
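Using the same made-up toy example as above, the log-likelihood is just the sum of the per-example log terms (equivalently, the log of the likelihood computed earlier):

```python
import numpy as np

def log_likelihood(y_hat, y):
    """Sum over examples of y*log(y_hat) + (1 - y)*log(1 - y_hat)."""
    return np.sum(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

y     = np.array([1, 0, 1, 1])
y_hat = np.array([0.9, 0.2, 0.7, 0.6])
print(log_likelihood(y_hat, y))         # log(0.3024) ≈ -1.196
```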

From Maximization to Minimization

Optimization algorithms like gradient descent usually minimize rather than maximize. So we take the negative of the log-likelihood and average it over the m training examples:

J(w, b) = -\frac{1}{m} \sum_{i=1}^m \left[ y^{(i)} \log \hat{y}^{(i)} + (1 - y^{(i)}) \log (1 - \hat{y}^{(i)}) \right]

This is the logistic regression cost function — also called binary cross-entropy loss.
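A vectorized NumPy sketch of J(w, b) follows; the small eps clip is a common practical guard against log(0), an implementation detail rather than part of the formula.

```python
import numpy as np

def binary_cross_entropy(y_hat, y, eps=1e-12):
    """Average negative log-likelihood over m examples."""
    y_hat = np.clip(y_hat, eps, 1 - eps)   # keep log() finite
    return -np.mean(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

y     = np.array([1, 0, 1, 1])
y_hat = np.array([0.9, 0.2, 0.7, 0.6])
print(binary_cross_entropy(y_hat, y))    # ≈ 0.299 (the negated log-likelihood / 4)
```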


Why It Works

  • The cost approaches 0 as predictions approach the true labels (\hat{y} \to 1 for y = 1, \hat{y} \to 0 for y = 0).

  • The cost increases sharply as predictions move away from the truth.

  • It’s convex in the parameters w and b, so gradient descent won’t get stuck in local minima.
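Plugging a few hypothetical predictions for a positive example (y = 1) into the per-example cost makes this behaviour concrete:

```python
import numpy as np

# Per-example cost: -[y*log(y_hat) + (1 - y)*log(1 - y_hat)]
def example_cost(y_hat, y):
    return -(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

print(example_cost(0.99, 1))   # ≈ 0.01  confident and correct: tiny cost
print(example_cost(0.50, 1))   # ≈ 0.69  unsure: moderate cost
print(example_cost(0.01, 1))   # ≈ 4.61  confident and wrong: large cost
```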


Key Takeaway:
The logistic regression cost function isn’t arbitrary — it comes directly from the principles of probability and maximum likelihood estimation, and it’s designed to ensure a convex optimization problem.

