Deep Learning - The Mathematics Behind Neural Network - Part 2
11/07/2024
Note: This blog is part of a learn-along series, so there may be updates and changes as we progress.
In the previous blog, we covered the foundational concepts of neural networks. In this post, we work through the mathematics behind a basic neural network with the structure illustrated below:
## Introduction to Neural Network Structure
- Input Nodes: $n_1$ and $n_2$
- Hidden Nodes: $n_3$ to $n_8$
- Output Node: $n_9$
Each node is associated with a bias, denoted as $b_i$, and each synapse (connection between nodes) has a weight, denoted as $w_i$. The initial input values are $x_1$ and $x_2$, and $\hat{y}$ represents the network's final output.
First, let’s give the inputs, weights and biases an initial value to better visualize what is happening.
$$x_1 = 0.23, \quad x_2 = 0.55$$
$$w_1 = 0.1,\ w_2 = 0.2,\ w_3 = 0.3,\ w_4 = 0.4,\ w_5 = 0.5,\ w_6 = 0.6,\ w_7 = 0.7,\ w_8 = 0.8,\ w_9 = 0.9$$
$$w_{10} = 0.1,\ w_{11} = 0.2,\ w_{12} = 0.3,\ w_{13} = 0.4,\ w_{14} = 0.5,\ w_{15} = 0.6,\ w_{16} = 0.7,\ w_{17} = 0.8,\ w_{18} = 0.9$$
$$b_1 = 0.1,\ b_2 = -0.4,\ b_3 = 0.3,\ b_4 = -0.5,\ b_5 = 0.5,\ b_6 = 0.6,\ b_7 = -0.7$$
## Forward Propagation Through Hidden Layers
Now, let's understand how the input values work their way through the neural network. Looking at node $n_3$ in the first hidden layer, we can see that it receives two inputs through its two synapses, and as we discussed in the previous blog, a node usually does two things with its inputs: it computes the total net input and then applies an activation function to that total to produce the node's output. Following common practice, we will use [ReLU](https://en.wikipedia.org/wiki/Rectifier_(neural_networks)) as the activation function for the nodes of the hidden layers and the Sigmoid function for the node of the output layer.
$$\text{net}_{n_3} = w_1 \cdot x_1 + w_4 \cdot x_2 + b_1$$
$$\text{net}_{n_3} = 0.1 \cdot 0.23 + 0.4 \cdot 0.55 + 0.1 = 0.343$$
$$\text{out}_{n_3} = \max(0, \text{net}_{n_3})$$
$$\text{out}_{n_3} = \max(0, 0.343) = 0.343$$
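To make this concrete, here is a minimal Python sketch of the single-node computation above, using the example's values (the variable names simply mirror the notation, not any particular library):

```python
# Net input and ReLU activation for node n3, using the example values.
x1, x2 = 0.23, 0.55
w1, w4, b1 = 0.1, 0.4, 0.1

net_n3 = w1 * x1 + w4 * x2 + b1   # total net input = 0.343
out_n3 = max(0.0, net_n3)         # ReLU: max(0, net) = 0.343
print(net_n3, out_n3)
```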
Here are the outputs for the rest of the nodes in hidden layer 1:
$$\text{out}_{n_4} = \max(0, -0.079) = 0$$
$$\text{out}_{n_5} = \max(0, 0.699) = 0.699$$
Now, the outputs of the nodes in hidden layer 1 become the inputs of the nodes in hidden layer 2, as shown in the diagram below.
After repeating the same steps to find the outputs of these nodes, we get the following:
$$\text{out}_{n_6} = \max(0, 0.0197) = 0.0197$$
$$\text{out}_{n_7} = \max(0, 1.1239) = 1.1239$$
$$\text{out}_{n_8} = \max(0, 1.3281) = 1.3281$$
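Continuing the sketch, each layer-2 node combines the three layer-1 outputs. The weight-to-synapse assignment used below ($w_7, w_{10}, w_{13}$ into $n_6$; $w_8, w_{11}, w_{14}$ into $n_7$; $w_9, w_{12}, w_{15}$ into $n_8$) is inferred from the totals above:

```python
# Layer-1 outputs computed above.
out_n3, out_n4, out_n5 = 0.343, 0.0, 0.699

def relu(z):
    return max(0.0, z)

# Each layer-2 node: weighted sum of the layer-1 outputs plus its bias, then ReLU.
out_n6 = relu(0.7 * out_n3 + 0.1 * out_n4 + 0.4 * out_n5 - 0.5)  # w7, w10, w13, b4 -> 0.0197
out_n7 = relu(0.8 * out_n3 + 0.2 * out_n4 + 0.5 * out_n5 + 0.5)  # w8, w11, w14, b5 -> 1.1239
out_n8 = relu(0.9 * out_n3 + 0.3 * out_n4 + 0.6 * out_n5 + 0.6)  # w9, w12, w15, b6 -> 1.3281
```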
## Computing Output Layer Activation
As mentioned above, we will be using the Sigmoid function as the activation function for the output layer, which in this case has only one node.
$$\text{net}_{n_9} = w_{16} \cdot \text{out}_{n_6} + w_{17} \cdot \text{out}_{n_7} + w_{18} \cdot \text{out}_{n_8} + b_7$$
$$\text{net}_{n_9} = 0.7 \cdot 0.0197 + 0.8 \cdot 1.1239 + 0.9 \cdot 1.3281 - 0.7 = 1.4082$$
$$\sigma(x) = \frac{1}{1 + e^{-x}}$$
$$\hat{y} = \text{out}_{n_9} = \sigma(\text{net}_{n_9}) = \frac{1}{1 + e^{-\text{net}_{n_9}}}$$
$$\hat{y} = \sigma(1.4082) = \frac{1}{1 + e^{-1.4082}} = 0.8035$$
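A small sketch of the output-layer step, again with the example's values:

```python
import math

# Output node n9: weighted sum of the layer-2 outputs, then Sigmoid.
out_n6, out_n7, out_n8 = 0.0197, 1.1239, 1.3281
w16, w17, w18, b7 = 0.7, 0.8, 0.9, -0.7

net_n9 = w16 * out_n6 + w17 * out_n7 + w18 * out_n8 + b7  # ~1.4082
y_hat = 1.0 / (1.0 + math.exp(-net_n9))                   # sigmoid -> ~0.8035
```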
## Calculating Total Error
Our next step is to calculate the total error of the neural network. This can be done using a variety of methods, but we will make use of the Squared Error with a multiplier of $\frac{1}{2}$ so that the derivative we take later on is much cleaner.
$$E(y, \hat{y}) = \frac{1}{2} \sum (y - \hat{y})^2$$
Here $y$ represents the ideal (target) output and $\hat{y}$ represents the actual output of the network. Let's assume that $y = 0.01$ to continue with our explanation.
$$E(y, \hat{y}) = \frac{1}{2}(y - \hat{y})^2$$
$$E(0.01, 0.8035) = \frac{1}{2}(0.01 - 0.8035)^2 = 0.315$$
If we had more than one output, we would have to calculate the error for each output and sum them to get the total error. But since we have only one output, we can say that $E_{total} = 0.315$.
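In code, the error computation is a one-liner:

```python
# Squared error with the 1/2 multiplier; with a single output this is E_total.
y, y_hat = 0.01, 0.8035
E_total = 0.5 * (y - y_hat) ** 2   # ~0.315
```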
## Backpropagation and Weight Updates
Next, we perform the backward pass, also known as backpropagation, which updates the weights and biases. This is done to bring the actual output closer to the target output, which reduces the total error in the process. Let's first try to update the weight $w_{16}$. Before we update it, we must know how much a change in $w_{16}$ affects the total error $E_{total}$.
$$\frac{\partial E_{total}}{\partial w_{16}}$$
If we apply the chain rule to $\frac{\partial E_{total}}{\partial w_{16}}$ we get:
$$\frac{\partial E_{total}}{\partial w_{16}} = \frac{\partial E_{total}}{\partial \text{out}_{n_9}} \cdot \frac{\partial \text{out}_{n_9}}{\partial \text{net}_{n_9}} \cdot \frac{\partial \text{net}_{n_9}}{\partial w_{16}}$$
$$E_{total} = \frac{1}{2} \sum (y - \hat{y})^2 = \frac{1}{2}(y - \text{out}_{n_9})^2$$
$$\frac{\partial E_{total}}{\partial \text{out}_{n_9}} = \text{out}_{n_9} - y = 0.8035 - 0.01 = 0.7935$$
$$\text{out}_{n_9} = \frac{1}{1 + e^{-\text{net}_{n_9}}}$$
$$\frac{\partial \text{out}_{n_9}}{\partial \text{net}_{n_9}} = \text{out}_{n_9}(1 - \text{out}_{n_9}) = 0.8035 \cdot (1 - 0.8035) = 0.1579$$
$$\text{net}_{n_9} = w_{16} \cdot \text{out}_{n_6} + w_{17} \cdot \text{out}_{n_7} + w_{18} \cdot \text{out}_{n_8} + b_7$$
$$\frac{\partial \text{net}_{n_9}}{\partial w_{16}} = \text{out}_{n_6} = 0.0197$$
Combining these we get:
$$\frac{\partial E_{total}}{\partial w_{16}} = (\text{out}_{n_9} - y) \cdot \text{out}_{n_9} \cdot (1 - \text{out}_{n_9}) \cdot \text{out}_{n_6}$$
$$\frac{\partial E_{total}}{\partial w_{16}} = (0.7935) \cdot (0.1579) \cdot 0.0197 = 0.0025$$
Now, we can update the weight $w_{16}$ using the learning rate $\eta$:
$$w_{16}^{new} = w_{16}^{old} - \eta \cdot \frac{\partial E_{total}}{\partial w_{16}}$$
Assuming a learning rate $\eta = 0.01$:
$$w_{16}^{new} = 0.7 - 0.01 \cdot 0.0025 = 0.699975$$
We can repeat this process for $w_{17}$ and $w_{18}$:
$$w_{17}^{new} = 0.8 - 0.01 \cdot 0.1408 = 0.798592$$
$$w_{18}^{new} = 0.9 - 0.01 \cdot 0.1664 = 0.898336$$
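Here is a short sketch of these three output-layer weight updates; the first two factors of the chain are shared, so they are computed once:

```python
# Shared part of the gradient: dE/dout_n9 * dout_n9/dnet_n9.
y, y_hat, eta = 0.01, 0.8035, 0.01
delta = (y_hat - y) * y_hat * (1 - y_hat)   # ~0.1253

# The remaining factor is the output of the node feeding each weight.
out_n6, out_n7, out_n8 = 0.0197, 1.1239, 1.3281
w16_new = 0.7 - eta * delta * out_n6        # ~0.699975
w17_new = 0.8 - eta * delta * out_n7        # ~0.798592
w18_new = 0.9 - eta * delta * out_n8        # ~0.898336
```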
Similarly, we update the biases. Let's start with $b_7$:
$$\frac{\partial E_{total}}{\partial b_7} = \frac{\partial E_{total}}{\partial \text{out}_{n_9}} \cdot \frac{\partial \text{out}_{n_9}}{\partial \text{net}_{n_9}} \cdot \frac{\partial \text{net}_{n_9}}{\partial b_7}$$
$$\text{net}_{n_9} = w_{16} \cdot \text{out}_{n_6} + w_{17} \cdot \text{out}_{n_7} + w_{18} \cdot \text{out}_{n_8} + b_7$$
$$\frac{\partial \text{net}_{n_9}}{\partial b_7} = 1$$
$$\frac{\partial E_{total}}{\partial b_7} = (0.7935) \cdot (0.1579) \cdot 1 = 0.1253$$
$$b_7^{new} = b_7^{old} - \eta \cdot \frac{\partial E_{total}}{\partial b_7}$$
$$b_7^{new} = -0.7 - 0.01 \cdot 0.1253 = -0.701253$$
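The bias update follows the same pattern, except the last factor of the chain is 1:

```python
# dE/db7 = delta * 1, so the update uses delta directly.
y, y_hat, eta = 0.01, 0.8035, 0.01
delta = (y_hat - y) * y_hat * (1 - y_hat)   # ~0.1253
b7_new = -0.7 - eta * delta                 # ~-0.701253
```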
## Iterative Training and Error Reduction
To update the weights and biases in the hidden layers, we need to propagate the error backward from the output layer to the hidden layers. We'll start by calculating the partial derivatives for the weights of the synapses going into the second hidden layer, and then move to the weights of the synapses going into the first hidden layer.
$$\frac{\partial E_{total}}{\partial w_7} = \frac{\partial E_{total}}{\partial \text{out}_{n_9}} \cdot \frac{\partial \text{out}_{n_9}}{\partial \text{net}_{n_9}} \cdot \frac{\partial \text{net}_{n_9}}{\partial \text{out}_{n_6}} \cdot \frac{\partial \text{out}_{n_6}}{\partial \text{net}_{n_6}} \cdot \frac{\partial \text{net}_{n_6}}{\partial w_7}$$
$$\frac{\partial E_{total}}{\partial \text{out}_{n_9}} = 0.7935$$
$$\frac{\partial \text{out}_{n_9}}{\partial \text{net}_{n_9}} = 0.1579$$
$$\frac{\partial \text{net}_{n_9}}{\partial \text{out}_{n_6}} = w_{16} = 0.7$$
$$\frac{\partial \text{out}_{n_6}}{\partial \text{net}_{n_6}} = \begin{cases} 1 & \text{if } \text{net}_{n_6} > 0 \\ 0 & \text{otherwise} \end{cases}$$
$$\frac{\partial \text{out}_{n_6}}{\partial \text{net}_{n_6}} = 1 \quad (\text{since } \text{net}_{n_6} = 0.0197 > 0)$$
$$\frac{\partial \text{net}_{n_6}}{\partial w_7} = \text{out}_{n_3} = 0.343$$
$$\frac{\partial E_{total}}{\partial w_7} = 0.7935 \cdot 0.1579 \cdot 0.7 \cdot 1 \cdot 0.343 = 0.03008$$
Now, update the weight $w_7$ using the learning rate $\eta = 0.01$:
$$w_7^{new} = w_7^{old} - \eta \cdot \frac{\partial E_{total}}{\partial w_7}$$
$$w_7^{new} = 0.7 - 0.01 \cdot 0.03008 = 0.699699$$
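As a sketch, the same chain for $w_7$ in code looks like this:

```python
# Gradient for w7: the output-layer chain gains two extra factors.
y, y_hat, eta = 0.01, 0.8035, 0.01
w16, net_n6, out_n3 = 0.7, 0.0197, 0.343

delta = (y_hat - y) * y_hat * (1 - y_hat)     # dE/dnet_n9 ~ 0.1253
dout6_dnet6 = 1.0 if net_n6 > 0 else 0.0      # ReLU derivative
dE_dw7 = delta * w16 * dout6_dnet6 * out_n3   # ~0.03008
w7_new = 0.7 - eta * dE_dw7                   # ~0.699699
```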
Similarly, for $w_1$ we apply the chain rule:
$$\frac{\partial E_{total}}{\partial w_1} = \frac{\partial E_{total}}{\partial \text{out}_{n_9}} \cdot \frac{\partial \text{out}_{n_9}}{\partial \text{net}_{n_9}} \cdot \frac{\partial \text{net}_{n_9}}{\partial \text{out}_{n_6}} \cdot \frac{\partial \text{out}_{n_6}}{\partial \text{net}_{n_6}} \cdot \frac{\partial \text{net}_{n_6}}{\partial \text{out}_{n_3}} \cdot \frac{\partial \text{out}_{n_3}}{\partial \text{net}_{n_3}} \cdot \frac{\partial \text{net}_{n_3}}{\partial w_1}$$
(Strictly speaking, since $\text{out}_{n_3}$ feeds $n_6$, $n_7$, and $n_8$, the full gradient sums a chain like this over all three paths; only the path through $n_6$ is written out here.)
By iterating this process (training), the total error decreases, and the neural network improves its task performance.
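As a rough end-to-end sketch of this iteration, here is a compact NumPy version of the same 2-3-3-1 network (ReLU hidden layers, Sigmoid output, half squared error). The weights are arranged as matrices and initialised randomly rather than with the hand-picked values above, so the exact numbers will differ, but the error shrinks in the same way:

```python
import numpy as np

# Weights as matrices (W1: 2x3, W2: 3x3, W3: 3x1) instead of the individual
# w1..w18 above, initialised randomly here rather than with the example values.
rng = np.random.default_rng(0)
W1, bias1 = rng.normal(size=(2, 3)), np.zeros(3)
W2, bias2 = rng.normal(size=(3, 3)), np.zeros(3)
W3, bias3 = rng.normal(size=(3, 1)), np.zeros(1)

x = np.array([[0.23, 0.55]])   # inputs x1, x2
y = np.array([[0.01]])         # target output
eta = 0.01                     # learning rate

relu = lambda z: np.maximum(0.0, z)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

for step in range(10_000):
    # Forward pass: two ReLU hidden layers, then a Sigmoid output.
    z1 = x @ W1 + bias1;  a1 = relu(z1)
    z2 = a1 @ W2 + bias2; a2 = relu(z2)
    z3 = a2 @ W3 + bias3; y_hat = sigmoid(z3)
    E_total = 0.5 * np.sum((y - y_hat) ** 2)

    # Backward pass: the same chain rule as above, applied layer by layer.
    d3 = (y_hat - y) * y_hat * (1 - y_hat)   # dE/dnet at the output node
    d2 = (d3 @ W3.T) * (z2 > 0)              # dE/dnet at hidden layer 2 (ReLU derivative)
    d1 = (d2 @ W2.T) * (z1 > 0)              # dE/dnet at hidden layer 1 (ReLU derivative)

    # Gradient-descent updates for every weight and bias.
    W3 -= eta * a2.T @ d3; bias3 -= eta * d3.sum(axis=0)
    W2 -= eta * a1.T @ d2; bias2 -= eta * d2.sum(axis=0)
    W1 -= eta * x.T @ d1;  bias1 -= eta * d1.sum(axis=0)

print(E_total, y_hat)  # the error shrinks and y_hat moves toward the target 0.01
```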
In this post, we’ve covered the mathematics behind a basic neural network, focusing on how the inputs, weights, and biases interact to produce the final output. We’ve walked through the process of forward propagation, calculating the output of each node, and applied the backpropagation algorithm to update the weights and biases, reducing the total error.
Until next time, signing off.