Understanding Backpropagation using Mountaineering as an analogy

Harjot Kaur
4 min read · Aug 11, 2022

Intuition behind the algorithm!

The backpropagation algorithm is the workhorse of learning in neural networks. It gives an expression for the partial derivative ∂L/∂w of the loss function L with respect to any weight w (or bias b) in the network, telling us how quickly the loss changes when we change the weights and biases. In layman’s language, Backpropagation is the process of tuning weights to enable the neural network to improve its prediction accuracy.

Let’s draw an analogy with mountaineering, to get the intuition right!

Source: https://www.istockphoto.com/fr/illustrations/mountain-climbing-team

Hypothetically, a team of 5 climbers is set to undertake a mountaineering expedition in bad weather conditions, where the front member is the lead climber, followed by 2 mid climbers and 2 porters (responsible for providing weather updates to the team). The weather information travels from the porters to the mid climbers and subsequently to the lead climber, so the team can move ahead as planned.

In this context, backpropagation is the lead climber giving climbing directions/feedback to the mid climbers, who in turn cascade it to the porters, so that the planned summit route is followed. Now let’s visualize this in a neural network setup.

Source: Author

The porters provide weather updates to both mid climbers, as indicated above (however, let’s assume one porter communicates more frequently than the other, emphasizing the importance of weights (w)). Similarly, both mid climbers pass the information on to the lead climber (again, assuming one mid climber communicates more than the other, thereby introducing weights (w) here as well). This process enables the lead climber to pursue the summit as planned (ŷ).

Any deviation from the plan (error) prompts the lead climber to propagate feedback back to the mid climbers and, in turn, to the porters.

Now, every time there is a deviation from the planned summit route, the lead climber readjusts his trust (‘weight’) in the mid climbers. Likewise, the mid climbers readjust their trust in the porters to stay in accordance with the lead climber’s expectations.

This propagation of feedback to the previous layers, in order to reduce the difference between the actual and expected result, is called Backpropagation.

And…the lead climber keeps changing his levels of trust (i.e. adjusting the weights) until the actual summit is in line with the expected plan (minimum error/loss).

Conceptually, this is all there is to Backpropagation.

Now comes the inevitable — the EQUATIONS!

Heads up: since this piece doesn’t intend to focus on the math, I’m only putting down the equations needed to capture the concept in its entirety.

First things first, let’s set up notations and parameters. The diagram below shows the weight on a connection from the fourth neuron in the second layer to the second neuron in the third layer of a network:

Source: http://neuralnetworksanddeeplearning.com/chap2.html

The weighted input to the activation function for neuron j in layer l can be written in terms of the weights, the biases, and the activations of the previous layer.
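In the notation of the linked chapter (a reconstruction: w^l_{jk} denotes the weight from the kth neuron in layer l−1 to the jth neuron in layer l, and b^l_j the bias of neuron j in layer l):

```latex
z^l_j = \sum_k w^l_{jk} \, a^{l-1}_k + b^l_j,
\qquad
a^l_j = \sigma\left(z^l_j\right)
```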

This brings us to the 3 main equations:

1. Backpropagation w.r.t. the output layer
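Written out in the notation of the linked chapter (a reconstruction), where δ^L denotes the vector of errors in the output layer L and C is the cost/loss:

```latex
\delta^L = \nabla_a C \odot \sigma'(z^L)
```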

This can be understood as an elementwise product of two vectors. The first term on the right, ∇_a C, expresses the rate of change of C with respect to the output activations. The second term on the right, σ′(z^L), measures how fast the activation function σ is changing at z^L.

2. Backpropagation w.r.t. the hidden layer
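Reconstructing again from the linked chapter, the error in a hidden layer l is obtained from the error in the layer after it:

```latex
\delta^l = \left( (w^{l+1})^T \delta^{l+1} \right) \odot \sigma'(z^l)
```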

(w^{l+1})^T, the transpose of the weight matrix for the (l+1)th layer, intuitively moves the error backward through the network, giving us a measure of the error at the output of the lth layer. We then take the Hadamard product with σ′(z^l). This moves the error backward through the activation function in layer l, giving us the error δ^l in the weighted input to layer l.

3. Backpropagation w.r.t. the weights and biases
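In the same notation (again a reconstruction from the linked chapter), the gradients with respect to the biases and weights follow directly from the errors and activations already computed:

```latex
\frac{\partial C}{\partial b^l_j} = \delta^l_j,
\qquad
\frac{\partial C}{\partial w^l_{jk}} = a^{l-1}_k \, \delta^l_j
```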

Both gradients are built entirely from quantities already computed above: the errors δ and the activations a. In matrix form, the weight gradient is the outer product of δ^l and a^{l−1} (rather than a dot product).

The Backpropagation Algorithm

In a nutshell, backpropagation is just a way of propagating the total loss back through the neural network to determine how much of the loss every node is responsible for, and subsequently updating the weights so as to reduce the loss, with larger adjustments going to the weights that contribute more to the error. The equations discussed above can be summarized as:

Source: http://neuralnetworksanddeeplearning.com/chap2.html
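As a concrete illustration of the algorithm summarized above, here is a minimal NumPy sketch for a small fully connected network with sigmoid activations and a quadratic cost. The layer sizes, variable names, and training example are illustrative choices of mine, not something prescribed in the article:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    return sigmoid(z) * (1.0 - sigmoid(z))

def backprop(x, y, weights, biases):
    """Gradients (dC/db, dC/dw) for one training pair (x, y), quadratic cost."""
    # Forward pass: store every weighted input z^l and activation a^l.
    activation, activations, zs = x, [x], []
    for w, b in zip(weights, biases):
        z = w @ activation + b
        zs.append(z)
        activation = sigmoid(z)
        activations.append(activation)

    # BP1: output error, delta^L = (a^L - y) * sigma'(z^L) for the quadratic cost.
    delta = (activations[-1] - y) * sigmoid_prime(zs[-1])
    grad_b = [np.zeros(b.shape) for b in biases]
    grad_w = [np.zeros(w.shape) for w in weights]
    grad_b[-1] = delta
    grad_w[-1] = delta @ activations[-2].T  # gradient w.r.t. weights: outer product

    # BP2: propagate the error backward through the hidden layers.
    for l in range(2, len(weights) + 1):
        delta = (weights[-l + 1].T @ delta) * sigmoid_prime(zs[-l])
        grad_b[-l] = delta
        grad_w[-l] = delta @ activations[-l - 1].T
    return grad_b, grad_w

# Usage: a tiny 2-3-1 network and one gradient-descent step.
rng = np.random.default_rng(0)
sizes = [2, 3, 1]
weights = [rng.standard_normal((m, n)) for n, m in zip(sizes[:-1], sizes[1:])]
biases = [rng.standard_normal((m, 1)) for m in sizes[1:]]
x, y = np.array([[0.5], [0.1]]), np.array([[1.0]])
grad_b, grad_w = backprop(x, y, weights, biases)
eta = 0.5  # learning rate
weights = [w - eta * gw for w, gw in zip(weights, grad_w)]
biases = [b - eta * gb for b, gb in zip(biases, grad_b)]
```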

Lastly, I hope this piece has added some value for beginners and given them a better grasp of the algorithm. For a detailed treatment of the math, you can refer to http://neuralnetworksanddeeplearning.com/chap2.html; I’ve mostly picked up the equations from that source. Also, if you have any questions/feedback, feel free to write back. Thanks.


Harjot Kaur

I'm a strategy and analytics expert passionate about simplifying ML and AI. My medium articles aim to demystify these technologies and share their applications.