
Deep Learning - The Mathematics Behind Neural Network - Part 2

11/07/2024

Note: This blog is part of a learn-along series, so there may be updates and changes as we progress.

In the previous blog, we covered the foundational concepts of neural networks. In this post, we learn the mathematics behind a basic neural network structure as illustrated below:

Introduction to Neural Network Structure


Each node is associated with a bias, denoted as $b_i$, and each synapse (connection between nodes) has a weight, denoted as $w_i$. The initial input values are $x_1$ and $x_2$, and $\hat{y}$ represents the output value produced by the network.

Initializing Inputs, Weights, and Biases

First, let’s give the inputs, weights and biases an initial value to better visualize what is happening.


$$x_1 = 0.23, \quad x_2 = 0.55$$

$$w_1 = 0.1, \; w_2 = 0.2, \; w_3 = 0.3, \; w_4 = 0.4, \; w_5 = 0.5, \; w_6 = 0.6, \; w_7 = 0.7, \; w_8 = 0.8, \; w_9 = 0.9$$

$$w_{10} = 0.1, \; w_{11} = 0.2, \; w_{12} = 0.3, \; w_{13} = 0.4, \; w_{14} = 0.5, \; w_{15} = 0.6, \; w_{16} = 0.7, \; w_{17} = 0.8, \; w_{18} = 0.9$$

$$b_1 = 0.1, \; b_2 = -0.4, \; b_3 = 0.3, \; b_4 = -0.5, \; b_5 = 0.5, \; b_6 = 0.6, \; b_7 = -0.7$$

Forward Propagation Through Hidden Layers

Now, let’s understand how the input values work their way through the neural network. The node $n_3$, which is in hidden layer 1, receives two inputs through its two synapses. As discussed in the previous blog, a node does two things with its inputs: it finds its total net input, then applies an activation function to that net input to produce its output. Following common practice, we will use [ReLU](https://en.wikipedia.org/wiki/Rectifier_(neural_networks)) as the activation function for the nodes of the hidden layers and Sigmoid for the node of the output layer.

$$net_{n_3} = w_1 \cdot x_1 + w_4 \cdot x_2 + b_1$$

$$net_{n_3} = 0.1 \cdot 0.23 + 0.4 \cdot 0.55 + 0.1 = 0.343$$

$$out_{n_3} = \max(0, net_{n_3})$$

$$out_{n_3} = \max(0, 0.343) = 0.343$$

Here is the output for the rest of the nodes in the hidden layer 1:

$$out_{n_4} = \max(0, -0.079) = 0$$

$$out_{n_5} = \max(0, 0.699) = 0.699$$
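The hidden layer 1 values above can be reproduced with a few lines of Python. This is a minimal sketch rather than the author's code: the helper `relu` is a name introduced here, and the wiring used for $n_4$ and $n_5$ (which weights and biases feed which node) is an assumption inferred from the printed results, since only the formula for $n_3$ is stated explicitly.

```python
# Initial values from the post.
x1, x2 = 0.23, 0.55
w1, w2, w3, w4, w5, w6 = 0.1, 0.2, 0.3, 0.4, 0.5, 0.6
b1, b2, b3 = 0.1, -0.4, 0.3


def relu(z):
    """ReLU activation: max(0, z)."""
    return max(0.0, z)


# Total net input of each node in hidden layer 1.
# Assumption: w1..w3 leave x1 and w4..w6 leave x2, as in the n3 formula.
net_n3 = w1 * x1 + w4 * x2 + b1
net_n4 = w2 * x1 + w5 * x2 + b2
net_n5 = w3 * x1 + w6 * x2 + b3

# Node outputs after the activation function.
out_n3, out_n4, out_n5 = relu(net_n3), relu(net_n4), relu(net_n5)

print(round(net_n3, 4), round(net_n4, 4), round(net_n5, 4))
print(round(out_n3, 4), round(out_n4, 4), round(out_n5, 4))
```

Note that $n_4$ has a negative net input, so ReLU clamps its output to zero and it contributes nothing to the next layer.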

Now, the output of the nodes in the hidden layer 1 becomes the input of the nodes in the hidden layer 2 as shown in the diagram below.


And after repeating the same steps to find the output of the nodes we get the following outputs.


$$out_{n_6} = \max(0, 0.0197) = 0.0197$$

$$out_{n_7} = \max(0, 1.1239) = 1.1239$$

$$out_{n_8} = \max(0, 1.3281) = 1.3281$$
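The same calculation for hidden layer 2 can be sketched as follows. The post does not spell out these formulas, so the wiring here is an assumption: $w_7$ to $w_9$ leave $n_3$, $w_{10}$ to $w_{12}$ leave $n_4$, and $w_{13}$ to $w_{15}$ leave $n_5$. That layout was chosen because it reproduces the three values above.

```python
# Outputs of hidden layer 1, taken from the worked example.
out_n3, out_n4, out_n5 = 0.343, 0.0, 0.699

# Weights into hidden layer 2 and the biases of n6, n7, n8.
w7, w8, w9 = 0.7, 0.8, 0.9
w10, w11, w12 = 0.1, 0.2, 0.3
w13, w14, w15 = 0.4, 0.5, 0.6
b4, b5, b6 = -0.5, 0.5, 0.6


def relu(z):
    """ReLU activation: max(0, z)."""
    return max(0.0, z)


# Assumed wiring: each hidden layer 2 node takes one weight from each
# hidden layer 1 node.
net_n6 = w7 * out_n3 + w10 * out_n4 + w13 * out_n5 + b4
net_n7 = w8 * out_n3 + w11 * out_n4 + w14 * out_n5 + b5
net_n8 = w9 * out_n3 + w12 * out_n4 + w15 * out_n5 + b6

out_n6, out_n7, out_n8 = relu(net_n6), relu(net_n7), relu(net_n8)
print(round(out_n6, 4), round(out_n7, 4), round(out_n8, 4))
```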

Computing Output Layer Activation

As mentioned above, we will be using the Sigmoid function as the activation function for the nodes in the output layer, which in this case is only one node.

$$net_{n_9} = w_{16} \cdot out_{n_6} + w_{17} \cdot out_{n_7} + w_{18} \cdot out_{n_8} + b_7$$

$$net_{n_9} = 1.4082$$

$$\sigma(x) = \frac{1}{1 + e^{-x}}$$

$$\hat{y} = out_{n_9} = \sigma(net_{n_9}) = \frac{1}{1 + e^{-net_{n_9}}}$$

$$\hat{y} = \sigma(1.4082) = \frac{1}{1 + e^{-1.4082}} = 0.8035$$
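As a quick check of the output layer, here is a small Python sketch that evaluates the net input of $n_9$ and applies the Sigmoid function. The function name `sigmoid` is introduced here for illustration; the numeric inputs are the rounded values from the worked example.

```python
import math


def sigmoid(z):
    """Logistic sigmoid: 1 / (1 + e^(-z))."""
    return 1.0 / (1.0 + math.exp(-z))


# Outputs of hidden layer 2 (rounded values from the worked example).
out_n6, out_n7, out_n8 = 0.0197, 1.1239, 1.3281
w16, w17, w18 = 0.7, 0.8, 0.9
b7 = -0.7

net_n9 = w16 * out_n6 + w17 * out_n7 + w18 * out_n8 + b7
y_hat = sigmoid(net_n9)

print(round(net_n9, 4), round(y_hat, 4))
```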


Calculating Total Error

Our next step is to calculate the total error of the neural network. There are several ways to do this; we will use the squared error with a multiplier of $\frac{1}{2}$, which makes the derivative we take later cleaner.

$$E(y, \hat{y}) = \frac{1}{2}\sum (y - \hat{y})^2$$

Here $y$ represents the ideal (target) output and $\hat{y}$ represents the actual output of the network. Let’s assume that $y = 0.01$ to continue with our explanation.

$$E(y, \hat{y}) = \frac{1}{2} (y - \hat{y})^2$$

$$E(0.01, 0.8035) = \frac{1}{2} (0.01 - 0.8035)^2 = 0.315$$

If we had more than one output, we would have to calculate the error for each output and sum them to get the total error. But since we have only one output, we can say that $E_{total} = 0.315$.
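The error formula translates directly into code. The sketch below uses a hypothetical helper, `squared_error`, that accepts lists so the same function also covers the multi-output case mentioned above.

```python
def squared_error(targets, outputs):
    """Half the sum of squared differences between targets and outputs."""
    return 0.5 * sum((t - o) ** 2 for t, o in zip(targets, outputs))


# Single output: target y = 0.01, network output y_hat = 0.8035.
e_total = squared_error([0.01], [0.8035])
print(round(e_total, 4))
```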

Backpropagation and Weight Updates

Next, we do the backward pass, also known as backpropagation, which updates the weights and biases. This is done to bring the actual output closer to the ideal output, which reduces the total error in the process. Let’s first update the weight $w_{16}$. Before we update it, we must know how much a change in $w_{16}$ affects the total error $E_{total}$.

$$\frac{\partial E_{total}}{\partial w_{16}}$$

If we apply the chain rule to $\frac{\partial E_{total}}{\partial w_{16}}$ we get:

$$\frac{\partial E_{total}}{\partial w_{16}} = \frac{\partial E_{total}}{\partial out_{n_9}} \cdot \frac{\partial out_{n_9}}{\partial net_{n_9}} \cdot \frac{\partial net_{n_9}}{\partial w_{16}}$$

$$E_{total} = \frac{1}{2}\sum (y - \hat{y})^2 = \frac{1}{2} (y - out_{n_9})^2$$

$$\frac{\partial E_{total}}{\partial out_{n_9}} = out_{n_9} - y = 0.8035 - 0.01 = 0.7935$$

$$out_{n_9} = \frac{1}{1 + e^{-net_{n_9}}}$$

$$\frac{\partial out_{n_9}}{\partial net_{n_9}} = out_{n_9}(1 - out_{n_9}) = 0.8035 \cdot (1 - 0.8035) = 0.1579$$

$$net_{n_9} = w_{16} \cdot out_{n_6} + w_{17} \cdot out_{n_7} + w_{18} \cdot out_{n_8} + b_7$$

$$\frac{\partial net_{n_9}}{\partial w_{16}} = out_{n_6} = 0.0197$$

Combining these we get:

$$\frac{\partial E_{total}}{\partial w_{16}} = (out_{n_9} - y) \cdot out_{n_9} \cdot (1 - out_{n_9}) \cdot out_{n_6}$$

$$\frac{\partial E_{total}}{\partial w_{16}} = (0.7935) \cdot (0.1579) \cdot 0.0197 = 0.0025$$

Now, we can update the weight $w_{16}$ using the learning rate $\eta$:

$$w_{16}^{\text{new}} = w_{16}^{\text{old}} - \eta \cdot \frac{\partial E_{total}}{\partial w_{16}}$$

Assuming a learning rate $\eta = 0.01$:

$$w_{16}^{\text{new}} = 0.7 - 0.01 \cdot 0.0025 = 0.699975$$

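We can repeat this process for $w_{17}$ and $w_{18}$. Only the last factor of the chain changes: $out_{n_6}$ is replaced by $out_{n_7} = 1.1239$ and $out_{n_8} = 1.3281$ respectively.

$$\frac{\partial E_{total}}{\partial w_{17}} = (0.7935) \cdot (0.1579) \cdot 1.1239 = 0.1408$$

$$\frac{\partial E_{total}}{\partial w_{18}} = (0.7935) \cdot (0.1579) \cdot 1.3281 = 0.1664$$

$$w_{17}^{\text{new}} = 0.8 - 0.01 \cdot 0.1408 = 0.798592$$

$$w_{18}^{\text{new}} = 0.9 - 0.01 \cdot 0.1664 = 0.898336$$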
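The three output-layer weight updates share the factor $(out_{n_9} - y) \cdot out_{n_9} \cdot (1 - out_{n_9})$, which makes them easy to compute together. The sketch below is one way to organize that calculation; the names `delta_n9`, `gradients`, and `updated` are introduced here for illustration, and the inputs are the rounded values from the worked example.

```python
y = 0.01          # target output
out_n9 = 0.8035   # network output from the forward pass
eta = 0.01        # learning rate

# Factor shared by every weight feeding n9: dE/dout * dout/dnet.
delta_n9 = (out_n9 - y) * out_n9 * (1 - out_n9)

# Current weights and the hidden layer 2 output each one multiplies.
weights = {"w16": 0.7, "w17": 0.8, "w18": 0.9}
inputs = {"w16": 0.0197, "w17": 1.1239, "w18": 1.3281}

# dE/dw = delta_n9 * (output of the node the weight comes from).
gradients = {name: delta_n9 * inputs[name] for name in weights}

# Gradient descent step: w_new = w_old - eta * dE/dw.
updated = {name: weights[name] - eta * gradients[name] for name in weights}

for name in weights:
    print(name, round(gradients[name], 4), round(updated[name], 6))
```

The larger the output of the upstream node, the larger the gradient, which is why $w_{18}$ moves the most and $w_{16}$ barely changes.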


Similarly, we update the biases. Let’s start with $b_7$:

$$\frac{\partial E_{total}}{\partial b_7} = \frac{\partial E_{total}}{\partial out_{n_9}} \cdot \frac{\partial out_{n_9}}{\partial net_{n_9}} \cdot \frac{\partial net_{n_9}}{\partial b_7}$$

$$net_{n_9} = w_{16} \cdot out_{n_6} + w_{17} \cdot out_{n_7} + w_{18} \cdot out_{n_8} + b_7$$

$$\frac{\partial net_{n_9}}{\partial b_7} = 1$$

$$\frac{\partial E_{total}}{\partial b_7} = (0.7935) \cdot (0.1579) \cdot 1 = 0.1253$$

$$b_7^{\text{new}} = b_7^{\text{old}} - \eta \cdot \frac{\partial E_{total}}{\partial b_7}$$

$$b_7^{\text{new}} = -0.7 - 0.01 \cdot 0.1253 = -0.701253$$
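One way to gain confidence in a hand-derived gradient is to compare it with a numerical estimate. The sketch below does this for $b_7$ using a central finite difference. This check is not part of the original walkthrough, and the helper `error_for_bias` and the step size `h` are choices made here for illustration.

```python
import math


def sigmoid(z):
    """Logistic sigmoid: 1 / (1 + e^(-z))."""
    return 1.0 / (1.0 + math.exp(-z))


def weighted_sum():
    """Weighted hidden layer 2 outputs feeding n9 (values from the example)."""
    return 0.7 * 0.0197 + 0.8 * 1.1239 + 0.9 * 1.3281


def error_for_bias(b7, y=0.01):
    """Total error as a function of b7, with everything else held fixed."""
    return 0.5 * (y - sigmoid(weighted_sum() + b7)) ** 2


b7, eta, y = -0.7, 0.01, 0.01

# Analytic gradient from the chain rule: (out - y) * out * (1 - out) * 1.
out_n9 = sigmoid(weighted_sum() + b7)
analytic = (out_n9 - y) * out_n9 * (1 - out_n9)

# Numerical gradient: central difference with a small step h.
h = 1e-6
numeric = (error_for_bias(b7 + h) - error_for_bias(b7 - h)) / (2 * h)

b7_new = b7 - eta * analytic
print(round(analytic, 4), round(numeric, 4), round(b7_new, 6))
```

If the two gradients agree to several decimal places, the chain-rule derivation is very likely correct.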

Iterative Training and Error Reduction

To update the weights and biases in the hidden layers, we need to propagate the error backward from the output layer to the hidden layers. We’ll start by calculating the partial derivatives for the weights of the synapses going into the second hidden layer, and then move to the weights of the synapses going into the first hidden layer. Let’s take $w_7$, which connects $n_3$ to $n_6$, as the first example:

$$\frac{\partial E_{total}}{\partial w_7} = \frac{\partial E_{total}}{\partial out_{n_9}} \cdot \frac{\partial out_{n_9}}{\partial net_{n_9}} \cdot \frac{\partial net_{n_9}}{\partial out_{n_6}} \cdot \frac{\partial out_{n_6}}{\partial net_{n_6}} \cdot \frac{\partial net_{n_6}}{\partial w_7}$$

$$\frac{\partial E_{total}}{\partial out_{n_9}} = 0.7935$$

$$\frac{\partial out_{n_9}}{\partial net_{n_9}} = 0.1579$$

$$\frac{\partial net_{n_9}}{\partial out_{n_6}} = w_{16} = 0.7$$

$$\frac{\partial out_{n_6}}{\partial net_{n_6}} = \begin{cases} 1 & \text{if } net_{n_6} > 0 \\ 0 & \text{otherwise} \end{cases}$$

$$\frac{\partial out_{n_6}}{\partial net_{n_6}} = 1 \text{ (since } net_{n_6} = 0.0197 > 0)$$

$$\frac{\partial net_{n_6}}{\partial w_7} = out_{n_3} = 0.343$$

$$\frac{\partial E_{total}}{\partial w_7} = 0.7935 \cdot 0.1579 \cdot 0.7 \cdot 1 \cdot 0.343 = 0.03008$$

Now, update the weight $w_7$ using the learning rate $\eta = 0.01$:

$$w_7^{\text{new}} = w_7^{\text{old}} - \eta \cdot \frac{\partial E_{total}}{\partial w_7}$$

$$w_7^{\text{new}} = 0.7 - 0.01 \cdot 0.03008 = 0.6997$$
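The five-factor chain for $w_7$ can be written out in Python as a sanity check. In this sketch `relu_grad` is a helper name introduced here, and the inputs are the rounded values from the worked example. Note that the old value of $w_{16}$ (0.7) is used, matching the derivation above, so the backward pass is computed against the weights that produced the forward pass.

```python
def relu_grad(z):
    """Derivative of ReLU: 1 for positive net input, 0 otherwise."""
    return 1.0 if z > 0 else 0.0


eta = 0.01

d_error_d_out9 = 0.7935   # dE/dout_n9
d_out9_d_net9 = 0.1579    # dout_n9/dnet_n9 (sigmoid derivative)
w16_old = 0.7             # dnet_n9/dout_n6
net_n6 = 0.0197           # net input of n6, decides the ReLU derivative
out_n3 = 0.343            # dnet_n6/dw7

grad_w7 = (d_error_d_out9 * d_out9_d_net9 * w16_old
           * relu_grad(net_n6) * out_n3)
w7_new = 0.7 - eta * grad_w7

print(round(grad_w7, 5), round(w7_new, 4))
```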

Similarly, for $w_1$ we apply the chain rule. Because $out_{n_3}$ feeds all three nodes of hidden layer 2, the error reaches $w_1$ along three paths (through $n_6$, $n_7$, and $n_8$), and their contributions are summed:

$$\frac{\partial E_{total}}{\partial w_1} = \frac{\partial E_{total}}{\partial out_{n_9}} \cdot \frac{\partial out_{n_9}}{\partial net_{n_9}} \cdot \left( \sum_{j \in \{6, 7, 8\}} \frac{\partial net_{n_9}}{\partial out_{n_j}} \cdot \frac{\partial out_{n_j}}{\partial net_{n_j}} \cdot \frac{\partial net_{n_j}}{\partial out_{n_3}} \right) \cdot \frac{\partial out_{n_3}}{\partial net_{n_3}} \cdot \frac{\partial net_{n_3}}{\partial w_1}$$
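To make the $w_1$ case concrete, the sketch below evaluates the chain numerically. Since $out_{n_3}$ feeds $n_6$, $n_7$, and $n_8$, the code computes the contribution of the $n_6$ path on its own and then the total over all three paths. The wiring ($w_7$, $w_8$, $w_9$ leaving $n_3$) is an assumption inferred from the forward-pass values, and the names `paths` and `relu_grad` are introduced here for illustration.

```python
def relu_grad(z):
    """Derivative of ReLU: 1 for positive net input, 0 otherwise."""
    return 1.0 if z > 0 else 0.0


# dE/dout_n9 * dout_n9/dnet_n9, from the output-layer derivation.
delta_n9 = 0.7935 * 0.1579

x1 = 0.23       # dnet_n3/dw1
net_n3 = 0.343  # net input of n3, decides its ReLU derivative

# For each hidden layer 2 node reached from n3:
# (weight from the node into n9, the node's net input, weight from n3 to the node)
paths = {
    "n6": (0.7, 0.0197, 0.7),   # w16, net_n6, w7
    "n7": (0.8, 1.1239, 0.8),   # w17, net_n7, w8 (assumed wiring)
    "n8": (0.9, 1.3281, 0.9),   # w18, net_n8, w9 (assumed wiring)
}

# Middle three factors of the chain, one product per path.
contributions = {
    name: w_out * relu_grad(net) * w_in
    for name, (w_out, net, w_in) in paths.items()
}

# Factors shared by every path, before and after the middle section.
shared = delta_n9 * relu_grad(net_n3) * x1

grad_w1_via_n6 = shared * contributions["n6"]
grad_w1 = shared * sum(contributions.values())

print(round(grad_w1_via_n6, 5), round(grad_w1, 5))
```

The single-path value is only part of the gradient; using it alone would understate how strongly $w_1$ affects the error.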

By iterating this process (training), the total error decreases, and the neural network improves its task performance.
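To see the error fall over repeated iterations, the whole procedure can be put in a loop. The following is a minimal pure-Python sketch under stated assumptions, not the author's implementation: the nested-list layout (`W1`, `W2`, `W3`, `B1`, `B2`, `B3`), the helper names, the wiring of the weights, and the choice of 200 iterations are all introduced here for illustration.

```python
import math

# Assumed layout: W[i][j] is the weight from node i of one layer
# to node j of the next layer.
W1 = [[0.1, 0.2, 0.3],    # w1..w3, leaving x1
      [0.4, 0.5, 0.6]]    # w4..w6, leaving x2
W2 = [[0.7, 0.8, 0.9],    # w7..w9, leaving n3
      [0.1, 0.2, 0.3],    # w10..w12, leaving n4
      [0.4, 0.5, 0.6]]    # w13..w15, leaving n5
W3 = [0.7, 0.8, 0.9]      # w16..w18, into n9
B1 = [0.1, -0.4, 0.3]     # b1..b3
B2 = [-0.5, 0.5, 0.6]     # b4..b6
B3 = [-0.7]               # b7

x, y, eta = [0.23, 0.55], 0.01, 0.01


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def layer(inputs, weights, biases):
    """Net inputs of a fully connected layer."""
    return [sum(inputs[i] * weights[i][j] for i in range(len(inputs))) + biases[j]
            for j in range(len(biases))]


def train_step():
    """One forward pass, one backward pass, one gradient descent update."""
    # Forward pass.
    net1 = layer(x, W1, B1)
    out1 = [max(0.0, z) for z in net1]
    net2 = layer(out1, W2, B2)
    out2 = [max(0.0, z) for z in net2]
    y_hat = sigmoid(sum(o * w for o, w in zip(out2, W3)) + B3[0])
    error = 0.5 * (y - y_hat) ** 2

    # Backward pass: each delta is dE/dnet for a node, using the old weights.
    d3 = (y_hat - y) * y_hat * (1 - y_hat)
    d2 = [d3 * W3[j] * (1.0 if net2[j] > 0 else 0.0) for j in range(3)]
    d1 = [sum(d2[j] * W2[i][j] for j in range(3)) * (1.0 if net1[i] > 0 else 0.0)
          for i in range(3)]

    # Updates: weight gradient = delta of the target node * output of the source node.
    for j in range(3):
        W3[j] -= eta * d3 * out2[j]
        B2[j] -= eta * d2[j]
        B1[j] -= eta * d1[j]
        for i in range(3):
            W2[i][j] -= eta * d2[j] * out1[i]
        for i in range(2):
            W1[i][j] -= eta * d1[j] * x[i]
    B3[0] -= eta * d3
    return error


errors = [train_step() for _ in range(200)]
print(round(errors[0], 4), errors[-1] < errors[0])
```

Each call returns the error measured before that step's update, so the first entry matches the total error calculated earlier and later entries show how it shrinks.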

In this post, we’ve covered the mathematics behind a basic neural network, focusing on how the inputs, weights, and biases interact to produce the final output. We’ve walked through the process of forward propagation, calculating the output of each node, and applied the backpropagation algorithm to update the weights and biases, reducing the total error.

Until next time, signing off.