What is a Cost Function?
In simple terms, a cost function (or loss function) measures how "wrong" our model's predictions are compared to the actual correct answers. The goal of training a neural network is to find the set of parameters (weights and biases) that minimizes this cost function. The lower the cost, the better our model is performing.
Consider a training set of $m$ examples, where each example is a pair of an input $x$ and an output $y$, written $(x, y)$. Suppose the neural network has $L$ layers, and let $s_l$ denote the number of units (not counting the bias unit) in layer $l$.
There are 2 types of output:
Binary classification: $y$ is either 0 or 1, and the network has a single output unit, so $s_L = 1$ and $K = 1$.
Multi-class classification with $K$ classes: $y$ is a vector of length $K$ (for example, one-hot encoded), and the network has $K$ output units, so $s_L = K$.
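To make this notation concrete, here is a small sketch in Python. The layer sizes, variable names, and random data are made up purely for illustration, not taken from any particular dataset:

```python
import numpy as np

# Hypothetical layer sizes, chosen just to make the notation concrete.
# Layers are numbered 1..L; s_l is the number of units in layer l
# (not counting the bias unit).
layer_sizes = [3, 5, 5, 4]   # s_1 = 3 inputs, two hidden layers, s_L = 4 outputs
L = len(layer_sizes)         # total number of layers, L = 4
K = layer_sizes[-1]          # number of output units (multi-class: K = 4)

# A training set of m examples: each row of X is an input x^(i),
# each row of Y is the matching target y^(i) as a one-hot vector of length K.
m = 10
X = np.random.rand(m, layer_sizes[0])
Y = np.eye(K)[np.random.randint(K, size=m)]
```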
The Cost Function for Logistic Regression
Before we jump into neural networks, let's quickly review the regularized cost function for logistic regression, which is a single-layer neural network. The formula looks like this:

$$J(\theta) = -\frac{1}{m}\sum_{i=1}^{m}\left[y^{(i)}\log\left(h_\theta(x^{(i)})\right) + \left(1 - y^{(i)}\right)\log\left(1 - h_\theta(x^{(i)})\right)\right] + \frac{\lambda}{2m}\sum_{j=1}^{n}\theta_j^2$$
Let's break this down:
$J(\theta)$ is the cost function, which depends on our model's parameters, $\theta$.
$m$ is the number of training examples.
$y^{(i)}$ is the actual output for the i-th training example.
$h_\theta(x^{(i)})$ is the predicted output from our model for the i-th training example.
The first part, the sum of the log terms, is the primary cost. It penalizes the model when its predictions are wrong.
The second part, $\frac{\lambda}{2m}\sum_{j=1}^{n}\theta_j^2$, is the regularization term. It helps prevent our model from overfitting the training data by penalizing large parameter values. $\lambda$ is the regularization parameter.
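As a concrete reference, here is one way this cost could be computed in Python with NumPy. The function and variable names are my own, and it assumes the design matrix already has a leading column of ones for the bias term:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logistic_cost(theta, X, y, lam):
    """Regularized logistic regression cost J(theta).

    X     : (m, n+1) design matrix whose first column is all ones (bias)
    y     : (m,) vector of 0/1 labels
    theta : (n+1,) parameter vector
    lam   : regularization parameter lambda
    """
    m = len(y)
    h = sigmoid(X @ theta)  # predictions h_theta(x^(i)) for every example

    # Primary cost: average cross-entropy over the m examples
    cost = -(1.0 / m) * np.sum(y * np.log(h) + (1 - y) * np.log(1 - h))

    # Regularization: penalize every parameter except the bias term theta_0
    reg = (lam / (2.0 * m)) * np.sum(theta[1:] ** 2)

    return cost + reg
```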
Extending to Neural Networks
A neural network is essentially a collection of interconnected logistic regression units. The cost function for a neural network is a generalization of the logistic regression cost function. The key difference is that a neural network can have multiple output neurons, especially for multi-class classification problems.
For a neural network with K output units, the cost function is:

$$J(\Theta) = -\frac{1}{m}\sum_{i=1}^{m}\sum_{k=1}^{K}\left[y_k^{(i)}\log\left(\left(h_\Theta(x^{(i)})\right)_k\right) + \left(1 - y_k^{(i)}\right)\log\left(1 - \left(h_\Theta(x^{(i)})\right)_k\right)\right] + \frac{\lambda}{2m}\sum_{l=1}^{L-1}\sum_{i=1}^{s_l}\sum_{j=1}^{s_{l+1}}\left(\Theta_{ji}^{(l)}\right)^2$$
This looks a bit more intimidating, but it's built on the same principles. Let's break it down piece by piece:
The Main Cost Term:
The double summation $\sum_{i=1}^{m}\sum_{k=1}^{K}$ means we are summing the cost over all training examples ($m$) and all output neurons ($K$).
$y_k^{(i)}$ is the actual value for the k-th output neuron for the i-th example.
$\left(h_\Theta(x^{(i)})\right)_k$ is the predicted value from the k-th output neuron for the i-th example.
The logic within the brackets is identical to the logistic regression cost. We're just applying it to each output neuron individually and summing the results.
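Assuming we already have the network's outputs for every example (for instance, from a forward pass), the main term could be sketched like this. The matrix shapes and the function name are assumptions for illustration only:

```python
import numpy as np

def nn_cost_unregularized(H, Y):
    """Main cost term of the neural network cost function.

    H : (m, K) matrix of network outputs, H[i, k] = (h_Theta(x^(i)))_k
    Y : (m, K) matrix of targets,         Y[i, k] = y_k^(i)
    """
    m = Y.shape[0]
    # Double sum over examples i and output units k of the logistic cost
    return -(1.0 / m) * np.sum(Y * np.log(H) + (1 - Y) * np.log(1 - H))
```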
The Regularization Term:
This is where things get a little more complex because a neural network has many layers and many weights.
$\sum_{l=1}^{L-1}$ sums over all the layers of weights in the network. $L$ is the total number of layers; since each weight matrix $\Theta^{(l)}$ maps layer $l$ to layer $l+1$, there are only $L-1$ of them.
$\sum_{i=1}^{s_l}$ sums over all the neurons in layer $l$. $s_l$ is the number of units in layer $l$.
$\sum_{j=1}^{s_{l+1}}$ sums over all the neurons in layer $l+1$. $s_{l+1}$ is the number of units in layer $l+1$.
$\left(\Theta_{ji}^{(l)}\right)^2$ is the square of a single weight. Specifically, the weight connecting the i-th unit of layer $l$ to the j-th unit of layer $l+1$.
The triple summation simply means we're summing the squares of all the weights in the entire network. We do not regularize the bias terms ($\Theta_{j0}^{(l)}$), since they multiply the constant bias unit rather than an input feature; that's why the sum over $i$ starts at 1 instead of 0.
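Putting the triple summation into code, the regularization term could be written as below. This sketch assumes each weight matrix stores the bias weights in its first column, which is a common convention but an assumption here:

```python
import numpy as np

def nn_regularization(Thetas, lam, m):
    """Regularization term: sum of squared weights over every layer.

    Thetas : list of weight matrices; Thetas[l] has shape (s_{l+1}, s_l + 1),
             where column 0 holds the bias weights Theta_{j0}^{(l)}
    lam    : regularization parameter lambda
    m      : number of training examples
    """
    total = 0.0
    for Theta in Thetas:
        # Skip the first column (bias weights) when summing the squares
        total += np.sum(Theta[:, 1:] ** 2)
    return (lam / (2.0 * m)) * total
```

The full cost is then simply the sum of the main term and this regularization term.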
In essence, the neural network cost function is a powerful tool that combines the principles of logistic regression with a comprehensive regularization scheme to handle the complexity of multi-layered networks. By minimizing this function, we're able to find the optimal weights that allow our network to make accurate predictions.