Artificial Neural Networks - Backpropagation

The feature that most sets Artificial Neural Networks (ANNs) apart from other algorithms is their ability to minimize error through Backpropagation. ANNs themselves have been explored under the Modeling section; this article explores the mechanism that trains them, known as the Backward Propagation of Errors with Gradient Descent.

Derivatives

Backward propagation starts from the error of the network, which is measured using an error function. The gradient of this error function is calculated with respect to the weights of the neural network, working from the back (the output layer) to the front (the input layer). However, to understand how we derive the gradient, we should first understand how Gradient Descent works.

First, let’s start with an understanding of what we mean by a derivative. There are several ways of thinking about the derivative of a function; the simplest is to picture the function as a curve, where the derivative is the slope of the curve at a particular point and tells us how quickly the function is changing there.

Thus if we have a function f(x) at point x, the sign of the derivative indicates whether the function will increase or decrease as x increases. In the image below we can see that at point x₁ the derivative is positive, as increasing x from x₁ makes the function grow, while the derivative is negative at x₂, as increasing x from x₂ makes the function decrease.

An error function plotted on a 2-D graph, showing a curve with a minimum near zero; the tangent at point x1 has a positive slope while the tangent at point x2 has a negative slope

If we try to understand derivatives in the context of Backpropagation, then the above function is an error function. Here we need to understand derivatives in relation to the optimum value of x, which is the point where the error is at its minimum (ideally zero). Derivatives help us know whether our current value of x is lower or higher than this optimum. At x₁ the derivative is positive, which indicates that x₁ is bigger than the optimum value and should be decreased, while the negative derivative at x₂ tells us that x₂ is smaller than the required value and should be increased to attain the optimum value.
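The link between the sign of the derivative and the direction to move can be checked numerically. The sketch below is only an illustration: the curve `error_fn`, the helper `numerical_derivative`, and the points 1.5 and -2.0 are assumptions chosen for the example, not values from the figure.

```python
def error_fn(x):
    """Toy error curve (an assumption) with its minimum, zero error, at x = 0."""
    return x ** 2


def numerical_derivative(f, x, h=1e-6):
    """Central-difference estimate of the slope of f at x."""
    return (f(x + h) - f(x - h)) / (2 * h)


x1, x2 = 1.5, -2.0  # hypothetical points to the right and left of the minimum
slope_x1 = numerical_derivative(error_fn, x1)
slope_x2 = numerical_derivative(error_fn, x2)

# Positive slope: x1 is above the optimum, so x should be decreased.
print("slope at x1 is positive:", slope_x1 > 0)
# Negative slope: x2 is below the optimum, so x should be increased.
print("slope at x2 is negative:", slope_x2 < 0)
```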

In Artificial Neural Networks we discussed how a model made up of a single neuron acts as a logistic regression (classification problem) or linear regression (regression problem), which becomes evident when comparing the equation of an artificial neuron, Y = wx + b, with that of Linear Regression, Y = mx + b. Therefore, let’s first take an example of Linear Regression, where the fitted line provides us with Y′, the predicted values. We can calculate the error by subtracting the predicted value Y′ from the actual value Y, i.e. the error is Y - Y′. This error can be plotted on a graph, and the resulting curve can be described as an error function.

The cost function, aka loss function, sums up all the error that our model has generated across the n samples. The formula of the cost function is:

$$ J = \sum_{i=1}^{n} \left( Y_i - Y'_i \right)^2 $$

This equation can also be written by expanding Y′ as:

$$ J = \sum_{i=1}^{n} \left( Y_i - (m x_i + b) \right)^2 $$
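As a rough illustration of the expanded formula, the sketch below sums the squared errors of a line over a handful of points. The data (`xs`, `ys`) and the helper names `predict` and `cost` are made up for the example.

```python
def predict(m, b, x):
    """Prediction Y' of the line with slope m and intercept b."""
    return m * x + b


def cost(m, b, xs, ys):
    """Sum over all samples of (Y - (m*x + b)) squared."""
    return sum((y - predict(m, b, x)) ** 2 for x, y in zip(xs, ys))


# Hypothetical data lying exactly on the line y = 2x.
xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]

cost_perfect = cost(2.0, 0.0, xs, ys)  # the true line leaves no error
cost_poor = cost(1.0, 0.0, xs, ys)     # errors 1, 2, 3 give 1 + 4 + 9

print(cost_perfect, cost_poor)
```

Only m and b change between the two calls; the data stay fixed, which is why the cost is treated as a function of m and b.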

Our objective is to minimize this cost function, and this can be done by adjusting the values of m and b, as everything else is a constant. Our cost function has the same shape as Y = x² (here Y is our loss and x is our error), i.e. f(x) = x², and as shown above, when plotted on a graph, we can take a point, e.g. x₁, and find which direction to move in to reach the optimum value, which produces less error.

This decision of choosing the direction is taken through derivatives: we draw a tangent line to the curve at the current point, take its slope as the derivative, and based on the sign of the derivative decide the direction in which we need to move. We also have something known as the learning rate, which intuitively means how big a step we want to take in that direction in order to reach the optimum value. This whole process of using derivatives to minimize the cost function is known as gradient descent.
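A minimal sketch of this process on f(x) = x², assuming a starting point of 4.0, a learning rate of 0.1, and 50 steps (all three are arbitrary choices for illustration):

```python
def gradient(x):
    """Derivative of f(x) = x**2."""
    return 2 * x


x = 4.0              # assumed starting point
learning_rate = 0.1  # assumed step size
history = [x]

for _ in range(50):
    # Step against the slope: a positive derivative lowers x, a negative one raises it.
    x = x - learning_rate * gradient(x)
    history.append(x)

print("final x:", x)
```

With these settings each step multiplies x by 0.8, so x shrinks steadily towards the optimum at 0. A learning rate that is too large would overshoot the minimum instead.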

To minimize the error we need to find the optimum values of m and b, and for that we need to find how the error changes with a change in m and in b respectively. At each step we update m as m = m + Δm (here Δm, delta m, means the change in m) and b as b = b + Δb (here Δb, delta b, means the change in b), moving both closer to their optimum values.

As finding the derivative tells us which direction to move in order to minimize the error, we decide to find the derivative of the cost function. To do so, we simplify the cost function by removing the summation, as we have two ways of performing gradient descent: Batch Gradient Descent and Stochastic Gradient Descent. In the Batch method we consider all the inputs in a single go, while in Stochastic we compute the gradient using a single sample at a time. As we go for the stochastic approach here, we remove the summation and consider each error at a time. We also write Y - (mx + b) simply as ‘error’, which makes our cost function look like:
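The difference between the two views of the cost can be sketched as follows: the batch cost is the sum of the per-sample costs that the stochastic approach handles one at a time. The data and the starting values of m and b are assumptions for the example.

```python
def sample_cost(m, b, x, y):
    """Cost of a single sample: J = Error**2, with Error = y - (m*x + b)."""
    error = y - (m * x + b)
    return error ** 2


def batch_cost(m, b, xs, ys):
    """Cost over all samples at once (the summation kept in place)."""
    return sum(sample_cost(m, b, x, y) for x, y in zip(xs, ys))


# Hypothetical data from y = 2x + 1 and a deliberately poor starting line.
xs = [1.0, 2.0, 3.0]
ys = [3.0, 5.0, 7.0]
m, b = 1.0, 0.0

per_sample = [sample_cost(m, b, x, y) for x, y in zip(xs, ys)]
total = batch_cost(m, b, xs, ys)

print(per_sample, total)
```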

$$ J(m, b) = \text{Error}^2 $$

Here our cost J, which is a function of m and b, is equal to the square of the error. Now, to find the derivative of J relative to m, which basically means finding how J changes when m changes, we need to be clear on two more concepts related to calculus - the Power Rule and the Chain Rule.

Power Rule

If we have a function f(x) and it is equal to some power of x, i.e. xⁿ, where n is not equal to 0, then the derivative of f(x) will simply be nxⁿ⁻¹. So if our function is f(x) = x², then the derivative of f(x) will be 2x²⁻¹, which is 2x¹, which can simply be written as 2x. Similarly, when f(x) = x⁴, the derivative of f(x) will be 4x⁴⁻¹, which equals 4x³.
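The two examples above can be checked against a numerical slope estimate. This is only a sanity-check sketch; the helper names and the evaluation point x = 3.0 are assumptions.

```python
def power_rule_derivative(n, x):
    """Derivative of x**n by the power rule: n * x**(n - 1)."""
    return n * x ** (n - 1)


def numerical_derivative(f, x, h=1e-6):
    """Central-difference estimate of the slope of f at x."""
    return (f(x + h) - f(x - h)) / (2 * h)


x = 3.0  # arbitrary evaluation point
d_square = power_rule_derivative(2, x)  # 2x
d_fourth = power_rule_derivative(4, x)  # 4x^3

approx_square = numerical_derivative(lambda v: v ** 2, x)
approx_fourth = numerical_derivative(lambda v: v ** 4, x)

print(d_square, approx_square)
print(d_fourth, approx_fourth)
```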

Chain Rule

Suppose we have a function x which is equal to y², i.e. x = y², and a function y which is equal to z², i.e. y = z². As x depends on y and y depends on z, if we want to find the derivative of x relative to z we can find the derivative of x relative to y and then multiply it by the derivative of y relative to z. In other words, we can chain derivatives together to find the derivative between quantities that are not directly connected but are indirectly linked. Here the derivative of x relative to y is 2y, while the derivative of y relative to z is 2z, making the derivative of x relative to z equal to 2y × 2z (which, substituting y = z², is 4z³). We can also put this another way: if f grows a certain number of times as fast as g (e.g. twice as fast), and g grows a certain number of times as fast as x (e.g. twice as fast), then we know how quickly f grows relative to x (here 2 × 2 = 4 times as fast).
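The chained derivative for x = y² and y = z² can be compared with a direct numerical estimate. The value z = 1.5 and the step size h are assumptions for this sketch.

```python
def y_of_z(z):
    return z ** 2


def x_of_y(y):
    return y ** 2


z = 1.5  # arbitrary evaluation point
y = y_of_z(z)

dx_dy = 2 * y                # derivative of x = y**2 relative to y
dy_dz = 2 * z                # derivative of y = z**2 relative to z
dx_dz_chain = dx_dy * dy_dz  # chain rule: 2y * 2z

# Direct numerical estimate of how x changes when z changes.
h = 1e-6
dx_dz_numeric = (x_of_y(y_of_z(z + h)) - x_of_y(y_of_z(z - h))) / (2 * h)

print(dx_dz_chain, dx_dz_numeric)
```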

Backward Propagation Equation

Now coming back to the cost function, J(m, b) = Error²:

If we apply the power rule, we know that for f(x) = x², the derivative of f(x) will be 2x. As we are trying to find the derivative of the cost function, which can simply be written as Error², then as per the power rule, the derivative of the cost function with respect to the error is 2 × Error.

$$ \frac{\partial J}{\partial \text{Error}} = 2 \times \text{Error} $$

We have now found the derivative of the cost function with respect to the error. However, we cannot simply stop here; we require the chain rule. We know that J is a function of Error, while Error is a function of m and b. As per the chain rule, this makes the equation:

$$ \frac{\partial J}{\partial m} = \frac{\partial J}{\partial \text{Error}} \times \frac{\partial \text{Error}}{\partial m} $$

As we are looking for Δm, we need the derivative of the cost function relative to m, so we multiply the derivative of J relative to the error by the derivative of Error relative to m. Because Error depends on both m and b and we are focusing only on m, this second factor is a partial derivative, in which b is treated as a constant.

We know that Error = Y - (mx + b), which can be written as Y - xm − b, and as we are computing the partial derivative, apart from m everything is a constant, which means x, b and Y are constant. By applying the power rule to the term −xm we can say that its derivative is −1 × x × m⁰ = −x, and as constants don’t change, the derivative of Y and of b is 0, which brings the derivative to simply −x.

Thus the derivative of J relative to m is 2 × error × (−x) = −2 × error × x. Gradient descent moves against the gradient, so we subtract this derivative, scaled by the learning rate that decides how far we want to go in the direction of the optimum value: Δm = −learning rate × (−2 × error × x) = (2 × error × x) × learning rate. As our error and x will be multiplied by the learning rate, we can get rid of the 2 (treating the 2 as part of the learning rate produces the same results up to that rescaling and gives us more control over the process), making the final equation Δm = error × x × learning rate.

Similarly, the derivative of J relative to b is the derivative of J relative to the error multiplied by the partial derivative of Error relative to b, and as in the equation Y - mx − b, m, x and Y are constant, as per the power rule the value of the partial derivative comes out to be −1 (−1 × b⁰ = −1). Thus the derivative of J relative to b is −2 × error, and after subtracting it, scaling by the learning rate and absorbing the 2, Δb comes out to be error × 1 × learning rate, i.e. error × learning rate.

Thus m = m + Δm, where Δm is error × input × learning rate, while b = b + Δb, where Δb is error × learning rate.
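These two update rules are enough to fit a line one sample at a time. The sketch below is one possible reading of them, assuming noise-free data from y = 2x + 1, a learning rate of 0.05, and 2000 passes over the data. The function name `train` and all of these values are illustrative choices.

```python
def train(xs, ys, learning_rate=0.05, epochs=2000):
    """Stochastic gradient descent for Y = m*x + b using the update rules above."""
    m, b = 0.0, 0.0
    for _ in range(epochs):
        for x, y in zip(xs, ys):
            error = y - (m * x + b)          # Error = Y - (mx + b)
            m += error * x * learning_rate   # delta m = error * input * learning rate
            b += error * learning_rate       # delta b = error * learning rate
    return m, b


# Hypothetical noise-free data generated by y = 2x + 1.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]

m, b = train(xs, ys)
print("m =", round(m, 4), "b =", round(b, 4))
```

Because the data lie exactly on a line, m and b settle very close to 2 and 1. With noisy data the per-sample updates would keep jittering around the best fit instead.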

Understanding Backpropagation with an Example

To understand backpropagation from scratch, we will start with a three-layer model where we have one input (x₁), one hidden neuron h₁, and one output y₁. If we concentrate only on the weights between the hidden layer and the output layer, then we have one connection from h₁ to y₁ with weight wh₁y₁, which has a value of 0.1.

Network diagram with one input X1, one hidden neuron h1, and one output y1, connected by weight wh1y1 equal to 0.1

For a moment, presume that we had one input, which was used by the h₁ neuron to generate a result that was passed to the y₁ neuron, and after passing through the activation function gave an output of 0.6. As we are working in a supervised environment, we also have the correct labels to compare the result with. As per the correct label, the result should have been 1, which means we have an error of 0.4 (1 - 0.6). We can use this 0.4 (e₁) to adjust the weight wh₁y₁ in order to produce a better result (i.e. less error). Here, as the output is less than expected, we should increase the weight in order to produce an output closer to 1 (note that we are not considering the influence of bias for now). Thus we increase the weight in the direction of the error in order to minimize the error.

Now, if we increase the number of neurons in the hidden layer by one and introduce another neuron h₂, we will have another weight wh₂y₁, having a value of 0.2, and these two weights (wh₁y₁ and wh₂y₁) will be connected to the output neuron y₁.

Network diagram with one input X1, two hidden neurons h1 and h2, and one output y1, connected by weights wh1y1 equal to 0.1 and wh2y1 equal to 0.2

We will now be required to adjust two weights in order to minimize the error produced by y₁; however, we don’t know which weight is responsible for the error or to what extent each weight is participating in producing the error. For this, we consider the size of the weight, which means the higher the weight, the more responsible it is for the error. Therefore, the share of the error attributed to wh₁y₁ is (wh₁y₁ / (wh₁y₁ + wh₂y₁)) × e₁, and the share attributed to wh₂y₁ is (wh₂y₁ / (wh₁y₁ + wh₂y₁)) × e₁. As wh₂y₁ has the higher weight (0.2), it is more responsible for the error, roughly two-thirds, while wh₁y₁ carries one-third of the responsibility. Thus about 33% of e₁ (roughly 0.133) is assigned to wh₁y₁ and about 67% (roughly 0.267) to wh₂y₁. We adjust the ‘delta weights’ proportionately, based on the value of their weights, in order to minimize the error (e₁).
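The proportional split can be worked through with the numbers from the example (weights 0.1 and 0.2, output 0.6, label 1). The variable names are assumptions made for this sketch.

```python
w_h1_y1 = 0.1   # weight from h1 to y1
w_h2_y1 = 0.2   # weight from h2 to y1
e1 = 1.0 - 0.6  # error at y1: label minus output

total_weight = w_h1_y1 + w_h2_y1
share_h1 = w_h1_y1 / total_weight  # about one-third
share_h2 = w_h2_y1 / total_weight  # about two-thirds

error_h1 = share_h1 * e1  # part of e1 attributed to w_h1_y1
error_h2 = share_h2 * e1  # part of e1 attributed to w_h2_y1

print(round(share_h1, 3), round(share_h2, 3))
print(round(error_h1, 3), round(error_h2, 3))
```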

So far we have concentrated only on the second and third layers. If we now place two neurons in the input layer, we create a structure similar to the multilayer perceptrons discussed under Artificial Neural Networks.

Network diagram with two inputs X1 and X2, two hidden neurons h1 and h2, and one output y1, with four weights connecting the input and hidden layers

Now we have two inputs, x₁ and x₂. Thus we have four weights connecting the input layer to the hidden layer (wx₁h₁, wx₁h₂, wx₂h₁, wx₂h₂). It is important to understand that the error on y₁ can be manipulated by tweaking the weights connecting the hidden and output layer; however, the neurons in the hidden layer are also directly affected by the neurons in the input layer, as a set of weights connects the input and hidden layers. Therefore, tweaking these weights changes the values of the hidden neurons, which in turn affects the output and the error. We therefore need to know in which direction, and to what magnitude, the weights of the input and hidden layer should be tweaked.

So far we tuned the weights of the hidden-output layer based on the error found at the output layer, and we could do so because they were directly connected, so we had an error we could calculate and assign responsibility for to different weights in order to adjust them. However, the weights of the input-hidden layer are not directly connected to the output layer, and thus we need to calculate the error of the neurons in the hidden layer in order to tweak the weights of the input-hidden layer.

Thus we see a backward mechanism at work, where the error at the output layer is used to tweak the weights of the hidden layer, while the errors at the hidden layer are calculated to adjust the weights of the input layer.

To adjust the four weights of the input-hidden layer (wx₁h₁, wx₁h₂, wx₂h₁, wx₂h₂), we need to calculate the errors on h₁ and h₂, which we have already done above, according to which the error on h₁ is 33% of e₁ while the error on h₂ is 67% of e₁.

If we take the model structure discussed under multilayer perceptrons and make it more complex by placing two neurons in the output layer, where y₁ should predict the value 1 while y₂ should predict the value 0 (a binary classification problem), we can see how involved backpropagation can become. Suppose y₁ produces an output of 0.6, giving an error e₁ of 0.4 (1 - 0.6), while y₂ produces an output of 0.3, giving an error e₂ of −0.3 (0 - 0.3). The number of weights between the hidden and output layers has now increased from two (wh₁y₁, wh₂y₁) to four (wh₁y₁, wh₂y₁, wh₁y₂, wh₂y₂). While the weights wh₁y₁ and wh₂y₁ should be increased to minimize e₁, the weights wh₁y₂ and wh₂y₂ need to be decreased to minimize e₂.

Network diagram with two inputs, two hidden neurons h1 and h2, and two outputs y1 and y2, where y1 produces output 0.6 and y2 produces output 0.3

Therefore, the formula for calculating the error on h₁ and h₂ changes a bit, as we now have more weights (due to the increase in the number of neurons in the output layer) connecting the hidden and output layers:

Error h₁ = ((wh₁y₁ / (wh₁y₁ + wh₂y₁)) × e₁) + ((wh₁y₂ / (wh₁y₂ + wh₂y₂)) × e₂)

Error h₂ = ((wh₂y₁ / (wh₁y₁ + wh₂y₁)) × e₁) + ((wh₂y₂ / (wh₁y₂ + wh₂y₂)) × e₂)

Each neuron in the hidden layer thus takes on its share of the responsibility, as its error is computed as a proportion of each output error according to its weights. Now that we have the errors of the hidden layer neurons, we can use gradient descent to tweak the weights of the input-hidden layer. Another way of looking at the above formula is that the denominators normalise the weights into each output neuron so that they add up to 100%. However, as we will be multiplying all of this by a learning rate, we can drop the denominators: because the error is still multiplied by the weight, a larger weight still receives a larger share of that error. Thus the equations can be written as:

Error h₁ = (wh₁y₁ × e₁) + (wh₁y₂ × e₂)

Error h₂ = (wh₂y₁ × e₁) + (wh₂y₂ × e₂)
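Both versions of the hidden-layer error can be computed side by side. In the sketch below, e₁ = 0.4, e₂ = −0.3, wh₁y₁ = 0.1 and wh₂y₁ = 0.2 come from the example, while the values 0.3 and 0.1 for wh₁y₂ and wh₂y₂ are assumptions, since the text does not give them.

```python
# Weights from the hidden layer to the output layer.
w_h1_y1, w_h2_y1 = 0.1, 0.2  # from the example
w_h1_y2, w_h2_y2 = 0.3, 0.1  # hypothetical values, not given in the text

e1, e2 = 0.4, -0.3           # output errors from the example

# Normalised form: each output error is split in proportion to the weights.
norm_h1 = (w_h1_y1 / (w_h1_y1 + w_h2_y1)) * e1 + (w_h1_y2 / (w_h1_y2 + w_h2_y2)) * e2
norm_h2 = (w_h2_y1 / (w_h1_y1 + w_h2_y1)) * e1 + (w_h2_y2 / (w_h1_y2 + w_h2_y2)) * e2

# Simplified form: denominators dropped, error multiplied by the weight only.
simple_h1 = w_h1_y1 * e1 + w_h1_y2 * e2
simple_h2 = w_h2_y1 * e1 + w_h2_y2 * e2

print(round(norm_h1, 4), round(norm_h2, 4))
print(round(simple_h1, 4), round(simple_h2, 4))
```

With these particular numbers both forms agree on the sign of each hidden error, and the normalised errors add up to e₁ + e₂. The magnitudes differ between the two forms, and that difference is what the learning rate is left to absorb.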

Updating the Weights and Biases

We can now start calculating the gradients. We know that if we look at each neuron individually, it has an equation similar to that of a linear regression, and above we calculated the derivatives for m and b. Drawing a parallel between linear regression and a single-neuron network, m can be written as w (weight) in the context of neural networks, so we will be finding Δw, while b (the bias term) remains b. However, we need to update the formulas, as we are now in a multidimensional scenario, and unlike linear regression, where Y = mx + b, here we have an activation function, sigmoid, which makes the equation y = σ(wx + b). We should also remember that we are now dealing with matrices, as we are in a multidimensional problem.

It is important to note at this stage that the sigmoid function is:

$$ \sigma(z) = \left( 1 + e^{-z} \right)^{-1} = \frac{1}{1 + e^{-z}} $$

While the derivative of the sigmoid function is:

$$ \sigma'(x) = \frac{d}{dx} \left( 1 + e^{-x} \right)^{-1} = \frac{e^{-x}}{\left( 1 + e^{-x} \right)^{2}} = \frac{1}{1 + e^{-x}} \times \frac{e^{-x}}{1 + e^{-x}} = \sigma(x) \left( 1 - \sigma(x) \right) $$

(which is basically the sigmoid times one minus the sigmoid).
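The identity can be checked numerically with a few lines of Python. The function names and the test point z = 0.7 are assumptions for this sketch.

```python
import math


def sigmoid(z):
    """Sigmoid activation: 1 / (1 + e^(-z))."""
    return 1.0 / (1.0 + math.exp(-z))


def sigmoid_derivative(z):
    """Derivative of the sigmoid: sigmoid(z) * (1 - sigmoid(z))."""
    s = sigmoid(z)
    return s * (1.0 - s)


z = 0.7   # arbitrary test point
h = 1e-6
numeric = (sigmoid(z + h) - sigmoid(z - h)) / (2 * h)  # central difference

print(sigmoid(0.0), sigmoid_derivative(0.0))
print(sigmoid_derivative(z), numeric)
```

Writing the derivative in terms of the sigmoid itself is convenient during training, because the output of each neuron is already available from the forward pass.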

Therefore, if we are looking to update the weights connecting the hidden layer and the output layer (wh₁y₁, wh₂y₁, wh₁y₂, wh₂y₂), the change in these weights can be represented by:

Δw_{hidden→output} = learning rate × Error × H (the input provided by h₁ and h₂ to the output neurons) × the derivative of the output, i.e. σ(x)(1 - σ(x)).

Here we take the derivative of the output because we have an activation function (in our example, sigmoid). Unlike linear regression, where there was no activation, the chain rule now adds one more factor, the gradient of the activation, so the update is the learning rate multiplied by the error of the output, multiplied by the derivative of the output, multiplied by the input H. As the output has already passed through the sigmoid, σ(x) is simply the output, and we can rewrite the equation as:

Δw_{hidden→output} = Learning Rate × Error × (output(1 - output)) . H^T

(We write H as H^T because we transpose the input H, as it is a single-row matrix, and we need to transpose it before taking the dot product.)

We can similarly update the weights connecting the input and hidden layer (wx₁h₁, wx₁h₂, wx₂h₁, wx₂h₂). Here the errors will be the hidden layer errors (Error h₁ and Error h₂), the inputs will be x₁ and x₂ (written as X), and the derivative is taken of H, the output of the hidden layer (which was the input in the formula above). Thus the equation for updating these weights is:

Δw_{input→hidden} = Learning Rate × Hidden Layer Error × (H(1 - H)) . X^T

We now move on to updating the biases, and for this, as mentioned earlier, we do not consider the input. Thus we get the following equations for updating the biases:

Δb_{hidden→output} = Learning Rate × Error × (output(1 - output))

Δb_{input→hidden} = Learning Rate × Hidden Layer Error × (H(1 - H))
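The four update equations can be put together into a single backpropagation step for the 2-2-2 network. The sketch below uses plain Python lists in place of matrices. The inputs, the input-hidden weights, the weights wh₁y₂ and wh₂y₂, the zero biases, and the learning rate are all assumptions made for illustration, and the hidden error uses the simplified form without denominators.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def layer(inputs, weights, biases):
    """Sigmoid layer; weights[k][j] connects input j to neuron k."""
    return [sigmoid(sum(w * v for w, v in zip(row, inputs)) + b)
            for row, b in zip(weights, biases)]


def deltas(x, target, w_ih, b_h, w_ho, b_o, lr):
    """Weight and bias changes for one sample, following the equations above."""
    h = layer(x, w_ih, b_h)    # hidden activations H
    out = layer(h, w_ho, b_o)  # network outputs
    error = [t - o for t, o in zip(target, out)]
    # Hidden error: each output error multiplied by the connecting weight.
    hidden_error = [sum(w_ho[k][j] * error[k] for k in range(len(out)))
                    for j in range(len(h))]
    # Learning rate x error x sigmoid derivative; these are also the bias deltas.
    db_o = [lr * e * o * (1 - o) for e, o in zip(error, out)]
    db_h = [lr * e * a * (1 - a) for e, a in zip(hidden_error, h)]
    # Multiplying by the transposed layer input gives one delta per weight.
    dw_ho = [[g * a for a in h] for g in db_o]
    dw_ih = [[g * xi for xi in x] for g in db_h]
    return dw_ih, db_h, dw_ho, db_o


x = [1.0, 0.5]                    # hypothetical inputs x1, x2
target = [1.0, 0.0]               # y1 should be 1, y2 should be 0
w_ih = [[0.5, -0.4], [0.3, 0.8]]  # hypothetical input-hidden weights
w_ho = [[0.1, 0.2], [0.3, 0.1]]   # rows: weights into y1, weights into y2
b_h, b_o = [0.0, 0.0], [0.0, 0.0]

dw_ih, db_h, dw_ho, db_o = deltas(x, target, w_ih, b_h, w_ho, b_o, lr=0.5)
new_w_ho = [[w + d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(w_ho, dw_ho)]
print(dw_ho)
```

Because sigmoid outputs lie strictly between 0 and 1, the weights into y₁ (target 1) always receive positive deltas and the weights into y₂ (target 0) negative ones, in line with the directions described earlier.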

Network diagram with two inputs, two hidden neurons, two outputs, and bias nodes feeding into both the hidden layer and the output layer

Backward propagation is among the aspects of Artificial Neural Networks that make them distinct from other learning algorithms. It takes the error value and computes its partial derivative with respect to the weights and biases in each layer. This is done layer by layer, from the output layer back to the input layer. With the updated weights and biases we run forward again, compute the error, and repeat the process until the error reaches its minimum. Thus we update the weights and biases through backpropagation in order to decrease the error in the output.