## This is a supporting notebook for https://www.hirahim.com/posts/backpropagation-by-example/. Please see that post for more details.

Let's start by importing the PyTorch library, since we'll be using it to check our work.

In [1]:
import torch

Next, we'll define our activation function. For our example, we'll use the simple [ReLU](https://pytorch.org/docs/stable/generated/torch.nn.ReLU.html), since both its output and derivative are easy to work with by hand.

In [2]:
def activation_function(input):
    relu = torch.nn.ReLU()
    return relu(input)
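
To see why ReLU is easy to work with by hand, here's a minimal plain-Python sketch of the function and its derivative. The helper names `relu` and `relu_derivative` are illustrative and not part of this notebook, and treating the derivative at exactly 0 as 0 is an assumption made for the sketch.

```python
def relu(x):
    """ReLU: pass positive values through, clamp everything else to 0."""
    return max(0.0, x)


def relu_derivative(x):
    """Slope of ReLU: 1 for positive inputs, 0 otherwise (0 at x == 0 by assumption)."""
    return 1.0 if x > 0 else 0.0


print(relu(0.55), relu_derivative(0.55))   # 0.55 1.0
print(relu(-2.0), relu_derivative(-2.0))   # 0.0 0.0
```

Every pre-activation value in this notebook is positive, so the ReLU derivative is `1` throughout the hand calculations.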

Next, we'll define our loss function. Here, we'll use the [mean squared error](https://pytorch.org/docs/stable/generated/torch.nn.MSELoss.html#torch.nn.MSELoss), since it's also easy to work with.

In [3]:
def loss_function(prediction, actual):
    mse_loss = torch.nn.MSELoss()
    return mse_loss(prediction, actual)
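
As a rough illustration of what the mean squared error computes, here's a plain-Python sketch of the loss and its derivative with respect to each prediction. The names `mse` and `mse_derivative` are hypothetical helpers for this sketch only, and it assumes the loss is averaged over all outputs.

```python
def mse(predictions, targets):
    """Mean of the squared differences between predictions and targets."""
    n = len(predictions)
    return sum((p - t) ** 2 for p, t in zip(predictions, targets)) / n


def mse_derivative(predictions, targets):
    """Partial derivative of the MSE with respect to each prediction."""
    n = len(predictions)
    return [2 * (p - t) / n for p, t in zip(predictions, targets)]


print(mse([0.55], [0.95]))              # roughly 0.16
print(mse_derivative([0.55], [0.95]))   # roughly [-0.8]
```

With a single output, `n` is 1, so the derivative reduces to `2(prediction - actual)`, which is the form used in the hand calculations.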

Finally, we'll establish our values. Normally, these would be random, but since we're working through it by hand, we'll want some fixed numbers to work through.

To get started, we'll be using the simplest possible neural network, one that has just one input neuron and one output neuron. (See [the post](https://www.hirahim.com/posts/backpropagation-by-example/) for a diagram.)

In [4]:
inputs = torch.tensor([0.5])
actual = torch.tensor([0.95])

output_layer_weights = torch.tensor([0.3]).requires_grad_()
output_layer_bias = torch.tensor([0.4]).requires_grad_()

We'll do a "forward pass" and calculate the prediction as `0.5*0.3+0.4` (`input*weight + bias`). This gives 0.55, and since that's positive, the ReLU leaves it unchanged, so our output prediction is 0.55.

In [5]:
output_layer_out = activation_function(inputs@output_layer_weights + output_layer_bias)
print("Prediction: ", output_layer_out)

Prediction:  tensor([0.5500], grad_fn=<ReluBackward0>)
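
The same forward pass can be sketched without PyTorch. This is only an illustration using plain floats; the short variable names (`x`, `w`, `b`, `z`) are assumptions of the sketch, not names used in the notebook.

```python
x, w, b = 0.5, 0.3, 0.4     # input, weight, and bias from the cell above

z = x * w + b               # weighted sum (pre-activation)
prediction = max(0.0, z)    # ReLU

print(round(prediction, 4))   # 0.55
```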


Next, we'll calculate our loss: `(0.55 - 0.95)^2` (prediction - actual, squared), giving us a loss of 0.16.

We'll also call `backward()` on the loss to calculate our gradients.

In [6]:
loss = loss_function(output_layer_out, actual)
loss.backward()
print("Loss: ", loss)

Loss:  tensor(0.1600, grad_fn=<MseLossBackward0>)


Let's do these by hand first.

The gradient for our weight is the partial derivative of the loss with respect to that weight:

```
2(0.55 - 0.95) * 1 * 0.5 = -0.4
```

Here, `2(0.55 - 0.95)` is the derivative of the MSE; `1` is the derivative of the ReLU (ReLU'(0.55)); and `0.5` is the input value, which is the derivative of `input*weight + bias` with respect to the weight.

The gradient for our bias is:

```
2(0.55 - 0.95) * 1 * 1 = -0.8
```

Where, once again, `2(0.55 - 0.95)` is the derivative of the MSE and `1` is the derivative of the ReLU (ReLU'(0.55)). The final `1` is the derivative of `input*weight + bias` with respect to the bias.
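
The two hand calculations above can be sketched in plain Python, one chain-rule factor per line. The variable names are illustrative assumptions; only the numbers come from the notebook.

```python
x, w, b, target = 0.5, 0.3, 0.4, 0.95

# Forward pass
z = x * w + b
prediction = max(0.0, z)

# Chain-rule factors
d_loss_d_pred = 2 * (prediction - target)   # derivative of the MSE (single output)
d_pred_d_z = 1.0 if z > 0 else 0.0          # derivative of the ReLU
d_z_d_w = x                                 # z = x*w + b, so dz/dw = x
d_z_d_b = 1.0                               # ... and dz/db = 1

weight_grad = d_loss_d_pred * d_pred_d_z * d_z_d_w
bias_grad = d_loss_d_pred * d_pred_d_z * d_z_d_b

print(round(weight_grad, 4), round(bias_grad, 4))   # -0.4 -0.8
```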

In [7]:
print("Output layer weight gradient: ", output_layer_weights.grad)
print("Output layer bias gradient: ", output_layer_bias.grad) 

Output layer weight gradient:  tensor([-0.4000])
Output layer bias gradient:  tensor([-0.8000])
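
Another way to sanity-check these gradients, independent of both the chain rule and PyTorch, is a finite-difference estimate: nudge each parameter slightly and see how the loss changes. The sketch below is an illustration only; `loss_for` is a hypothetical helper and the step size `eps` is an arbitrary choice.

```python
def loss_for(w, b, x=0.5, target=0.95):
    """Forward pass plus squared error for the one-neuron network."""
    prediction = max(0.0, x * w + b)
    return (prediction - target) ** 2


eps = 1e-6   # small step size (an arbitrary choice)
w, b = 0.3, 0.4

# Central differences: (f(p + eps) - f(p - eps)) / (2 * eps)
weight_grad = (loss_for(w + eps, b) - loss_for(w - eps, b)) / (2 * eps)
bias_grad = (loss_for(w, b + eps) - loss_for(w, b - eps)) / (2 * eps)

print(round(weight_grad, 4), round(bias_grad, 4))   # -0.4 -0.8
```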


With those done and lining up with PyTorch, we can use them to update the weight and bias, then do another forward pass to see if the loss is reduced (which it is: 0.16 -> 0.01).

In [8]:
learning_rate = 0.5

updated_output_layer_bias = output_layer_bias - learning_rate * output_layer_bias.grad.data
updated_output_layer_weights = output_layer_weights - learning_rate * output_layer_weights.grad.data
print("Updated weight: ", updated_output_layer_weights)
print("Updated bias: ", updated_output_layer_bias)

Updated weight:  tensor([0.5000], grad_fn=<SubBackward0>)
Updated bias:  tensor([0.8000], grad_fn=<SubBackward0>)


In [9]:
updated_output_layer_out = activation_function(inputs@updated_output_layer_weights + updated_output_layer_bias)
print("Updated prediction: ", updated_output_layer_out)

Updated prediction:  tensor([1.0500], grad_fn=<ReluBackward0>)


In [10]:
loss = loss_function(updated_output_layer_out, actual)
print("Updated loss: ", loss)

Updated loss:  tensor(0.0100, grad_fn=<MseLossBackward0>)
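
The update step itself is simple enough to check by hand: each parameter moves against its gradient, scaled by the learning rate. Here's a plain-Python sketch of the same arithmetic, with illustrative variable names.

```python
learning_rate = 0.5
x, target = 0.5, 0.95
w, b = 0.3, 0.4
weight_grad, bias_grad = -0.4, -0.8   # gradients computed above

# Step against the gradient
new_w = w - learning_rate * weight_grad
new_b = b - learning_rate * bias_grad

# Second forward pass with the updated parameters
new_prediction = max(0.0, x * new_w + new_b)
new_loss = (new_prediction - target) ** 2

print(round(new_w, 4), round(new_b, 4), round(new_prediction, 4), round(new_loss, 4))
# 0.5 0.8 1.05 0.01
```

Note that the new prediction (1.05) overshoots the target (0.95), but it lands closer than the original 0.55 did, which is why the loss still drops.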


Let's do another example, this time with an extra hidden layer, but still with one neuron each:

In [11]:
inputs = torch.tensor([0.5])
actual = torch.tensor([0.95])

layer1_weights = torch.tensor([0.3]).requires_grad_()
layer1_bias = torch.tensor([0.4]).requires_grad_()

output_layer_weights = torch.tensor([0.2]).requires_grad_()
output_layer_bias = torch.tensor([0.1]).requires_grad_()

And do another forward pass:

In [12]:
layer1_out = activation_function(inputs@layer1_weights + layer1_bias)
print("Hidden layer: ", layer1_out)

output_layer_out = activation_function(layer1_out@output_layer_weights + output_layer_bias)
print("Prediction: ", output_layer_out)

Hidden layer:  tensor([0.5500], grad_fn=<ReluBackward0>)
Prediction:  tensor([0.2100], grad_fn=<ReluBackward0>)


Once again, we'll calculate the loss:

In [13]:
loss = loss_function(output_layer_out, actual)
loss.backward()
print("Loss: ", loss)

Loss:  tensor(0.5476, grad_fn=<MseLossBackward0>)


Then the gradients. Here, we go layer by layer, starting with the output layer:

```
# Output Layer Weights
2(0.21 - 0.95) * 1 * 0.55 = -0.814

# Output Layer Bias
2(0.21 - 0.95) * 1 * 1 = -1.48
```

These should be fairly straightforward, since they use the same formula we saw before. For the hidden layer, it'll be:

```
# Hidden Layer Weights
2(0.21 - 0.95) * 0.2 * 1 * 0.5 = -0.148

# Hidden Layer Bias
2(0.21 - 0.95) * 0.2 * 1 * 1 = -0.296
```

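For these, we take the derivative of the MSE, times the output layer's weight (`0.2`), times the derivative of the hidden layer's ReLU (`1`), times the input value (`0.5`, or `1` if we're doing the bias). The derivative of the output layer's ReLU is also `1`, so it's left out of the formulas above.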
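
To make the layer-by-layer flow concrete, here's a plain-Python sketch of the same backward pass. The names `delta2` and `delta1` (the loss derivative with respect to each layer's pre-activation) are conventions assumed for this sketch, not names from the notebook.

```python
x, target = 0.5, 0.95
w1, b1 = 0.3, 0.4     # hidden layer
w2, b2 = 0.2, 0.1     # output layer

# Forward pass
z1 = x * w1 + b1
h = max(0.0, z1)      # hidden output
z2 = h * w2 + b2
pred = max(0.0, z2)   # prediction

# Backward pass, output layer first
delta2 = 2 * (pred - target) * (1.0 if z2 > 0 else 0.0)   # dLoss/dz2
w2_grad = delta2 * h
b2_grad = delta2

# Pass the error back through the output weight and the hidden ReLU
delta1 = delta2 * w2 * (1.0 if z1 > 0 else 0.0)           # dLoss/dz1
w1_grad = delta1 * x
b1_grad = delta1

print([round(g, 4) for g in (w2_grad, b2_grad, w1_grad, b1_grad)])
# [-0.814, -1.48, -0.148, -0.296]
```

Reusing `delta2` when computing `delta1` is the core idea of backpropagation: each layer's gradient builds on the one already computed for the layer after it.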

In [14]:
print('Gradient for output layer weights: ', output_layer_weights.grad)
print('Gradient for output layer bias: ', output_layer_bias.grad)
print('Gradient for hidden layer weights: ', layer1_weights.grad)
print('Gradient for hidden layer bias: ', layer1_bias.grad)

Gradient for output layer weights:  tensor([-0.8140])
Gradient for output layer bias:  tensor([-1.4800])
Gradient for hidden layer weights:  tensor([-0.1480])
Gradient for hidden layer bias:  tensor([-0.2960])


Moving on, we'll create a network with two neurons in the hidden layer. See [the post](https://www.hirahim.com/posts/backpropagation-by-example/) for how the formulas are worked out.

In [15]:
inputs = torch.tensor([0.5])
actual = torch.tensor([0.95])

layer1_weights = torch.tensor([[0.3, 0.2]]).requires_grad_()
layer1_bias = torch.tensor([0.4, 0.1]).requires_grad_()

output_layer_weights = torch.tensor([[0.1], [0.7]]).requires_grad_()
output_layer_bias = torch.tensor([0.3]).requires_grad_()

In [16]:
layer1_out = activation_function(inputs@layer1_weights + layer1_bias)
print("Hidden layer: ", layer1_out)

output_layer_out = activation_function(layer1_out@output_layer_weights + output_layer_bias)
print("Prediction: ", output_layer_out)

Hidden layer:  tensor([0.5500, 0.2000], grad_fn=<ReluBackward0>)
Prediction:  tensor([0.4950], grad_fn=<ReluBackward0>)


In [17]:
loss = loss_function(output_layer_out, actual)
loss.backward()
print("Loss: ", loss)

Loss:  tensor(0.2070, grad_fn=<MseLossBackward0>)


In [18]:
print('Gradient for output layer weights: ', output_layer_weights.grad)
print('Gradient for output layer bias: ', output_layer_bias.grad)
print('Gradient for hidden layer weights: ', layer1_weights.grad)
print('Gradient for hidden layer bias: ', layer1_bias.grad)

Gradient for output layer weights:  tensor([[-0.5005],
        [-0.1820]])
Gradient for output layer bias:  tensor([-0.9100])
Gradient for hidden layer weights:  tensor([[-0.0455, -0.3185]])
Gradient for hidden layer bias:  tensor([-0.0910, -0.6370])
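
As a rough cross-check of these numbers, the same chain rule can be applied per neuron in plain Python. This is a sketch with illustrative names and list-based weights; it isn't the derivation from the post, only the same arithmetic written out.

```python
x, target = 0.5, 0.95
w1, b1 = [0.3, 0.2], [0.4, 0.1]   # input -> hidden neurons 1 and 2
w2, b2 = [0.1, 0.7], 0.3          # hidden neurons 1 and 2 -> output

# Forward pass
z1 = [x * w + b for w, b in zip(w1, b1)]
h = [max(0.0, z) for z in z1]
z2 = sum(hi * w for hi, w in zip(h, w2)) + b2
pred = max(0.0, z2)

# Backward pass
delta2 = 2 * (pred - target) * (1.0 if z2 > 0 else 0.0)
w2_grad = [delta2 * hi for hi in h]      # one gradient per hidden output
b2_grad = delta2
delta1 = [delta2 * w * (1.0 if z > 0 else 0.0) for w, z in zip(w2, z1)]
w1_grad = [d * x for d in delta1]
b1_grad = delta1

print([round(g, 4) for g in w2_grad], round(b2_grad, 4))
# [-0.5005, -0.182] -0.91
print([round(g, 4) for g in w1_grad], [round(g, 4) for g in b1_grad])
# [-0.0455, -0.3185] [-0.091, -0.637]
```

Each hidden neuron gets its own share of the error, weighted by the output weight that connects it to the prediction (`0.1` and `0.7` here).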


Lastly, we'll create a more complicated network with two hidden layers containing two neurons each. See [the post](https://www.hirahim.com/posts/backpropagation-by-example/) for how the formulas are worked out.

In [19]:
inputs = torch.tensor([0.5])
actual = torch.tensor([0.95])

layer1_weights = torch.tensor([[0.3, 0.2]]).requires_grad_()
layer1_bias = torch.tensor([0.4, 0.1]).requires_grad_()

layer2_weights = torch.tensor([[0.1, 0.8], [0.6, 0.7]]).requires_grad_()
layer2_bias = torch.tensor([0.3, 0.2]).requires_grad_()

output_layer_weights = torch.tensor([[0.4], [0.6]]).requires_grad_()
output_layer_bias = torch.tensor([0.3]).requires_grad_()

In [20]:
layer1_out = activation_function(inputs@layer1_weights + layer1_bias)
print("Layer 1 output: ", layer1_out)

layer2_out = activation_function(layer1_out@layer2_weights + layer2_bias)
print("Layer 2 output: ", layer2_out)

output_layer_out = activation_function(layer2_out@output_layer_weights + output_layer_bias)
print("Prediction: ", output_layer_out)

Layer 1 output:  tensor([0.5500, 0.2000], grad_fn=<ReluBackward0>)
Layer 2 output:  tensor([0.4750, 0.7800], grad_fn=<ReluBackward0>)
Prediction:  tensor([0.9580], grad_fn=<ReluBackward0>)


In [21]:
loss = loss_function(output_layer_out, actual)
loss.backward()
print("Loss: ", loss)

Loss:  tensor(6.4001e-05, grad_fn=<MseLossBackward0>)


In [22]:
print('Gradient for output layer weights: ', output_layer_weights.grad)
print('Gradient for output layer bias: ', output_layer_bias.grad)
print('Gradient for hidden layer 2 weights: ', layer2_weights.grad)
print('Gradient for hidden layer 2 bias: ', layer2_bias.grad)
print('Gradient for hidden layer 1 weights: ', layer1_weights.grad)
print('Gradient for hidden layer 1 bias: ', layer1_bias.grad)

Gradient for output layer weights:  tensor([[0.0076],
        [0.0125]])
Gradient for output layer bias:  tensor([0.0160])
Gradient for hidden layer 2 weights:  tensor([[0.0035, 0.0053],
        [0.0013, 0.0019]])
Gradient for hidden layer 2 bias:  tensor([0.0064, 0.0096])
Gradient for hidden layer 1 weights:  tensor([[0.0042, 0.0053]])
Gradient for hidden layer 1 bias:  tensor([0.0083, 0.0106])
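
For completeness, here's a plain-Python sketch that reproduces these gradients with a few small list helpers. The helper names (`affine`, `backward`, `outer`, and so on) are hypothetical and exist only in this sketch; `weights[i][j]` is assumed to connect input `i` to neuron `j`, matching the tensor layout above.

```python
def relu(v):
    return [max(0.0, a) for a in v]

def relu_grad(v):
    return [1.0 if a > 0 else 0.0 for a in v]

def affine(v, weights, bias):
    """v @ weights + bias, where weights[i][j] links input i to neuron j."""
    return [sum(v[i] * weights[i][j] for i in range(len(v))) + bias[j]
            for j in range(len(bias))]

def backward(delta, weights):
    """Push delta back through a layer: delta @ weights.T."""
    return [sum(delta[j] * row[j] for j in range(len(delta))) for row in weights]

def outer(v, delta):
    """Weight gradient: entry [i][j] is input i times delta j."""
    return [[a * d for d in delta] for a in v]

x, target = [0.5], 0.95
w1, b1 = [[0.3, 0.2]], [0.4, 0.1]
w2, b2 = [[0.1, 0.8], [0.6, 0.7]], [0.3, 0.2]
w3, b3 = [[0.4], [0.6]], [0.3]

# Forward pass
z1 = affine(x, w1, b1)
h1 = relu(z1)
z2 = affine(h1, w2, b2)
h2 = relu(z2)
z3 = affine(h2, w3, b3)
pred = relu(z3)

# Backward pass: each delta is dLoss/dz for that layer
delta3 = [2 * (pred[0] - target) * relu_grad(z3)[0]]
delta2 = [d * g for d, g in zip(backward(delta3, w3), relu_grad(z2))]
delta1 = [d * g for d, g in zip(backward(delta2, w2), relu_grad(z1))]

grads = {
    "w3": outer(h2, delta3), "b3": delta3,
    "w2": outer(h1, delta2), "b2": delta2,
    "w1": outer(x, delta1), "b1": delta1,
}
for name, grad in grads.items():
    print(name, grad)
```

The printed values should agree with the PyTorch gradients above to about four decimal places, apart from floating-point noise in the trailing digits.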
