From: 3blue1brown
The calculus of the backpropagation algorithm relies heavily on the chain rule [00:00:25]. This application of the chain rule in machine learning has a distinct feel compared to how it’s typically approached in introductory calculus courses [00:00:28].
Understanding Cost Function Sensitivity
The primary goal is to determine how sensitive a network’s cost function is to its various weights and biases [00:00:49]. This understanding is crucial for making adjustments that efficiently decrease the cost function during the training process [00:00:55].
Simple Network Example (Single Neuron Per Layer)
To illustrate, consider an extremely simple network where each layer contains only a single neuron [00:00:43].
The following terms are used:
- a(L): the activation of the last neuron, in layer L [00:01:05].
- a(L-1): the activation of the previous neuron, in layer L-1 [00:01:10].
- y: the desired output value for a given training example [00:01:27].
- C0: the cost of the network for a single training example, calculated as (a(L) - y)^2 [00:01:32].
- w(L): the weight connecting the previous neuron to the last neuron [00:01:49].
- b(L): the bias of the last neuron [00:01:55].
- z(L): a special name given to the weighted sum w(L) * a(L-1) + b(L) before it’s passed through a nonlinear activation function (e.g., sigmoid or ReLU) [00:02:01].
The flow of computation can be conceptualized as: w(L), a(L-1), b(L) → z(L) → a(L) → C0 (along with y) [00:02:10].
Applying the Chain Rule
The goal is to determine the derivative of C with respect to w(L), written ∂C/∂w(L), which measures how sensitive the cost function is to small changes in this weight [00:02:41].
A tiny nudge to w(L) causes a nudge to z(L), which in turn nudges a(L), which directly influences the cost C [00:03:11]. The chain rule breaks this down:
∂C/∂w(L) = (∂C/∂a(L)) * (∂a(L)/∂z(L)) * (∂z(L)/∂w(L)) [00:03:45]
Let’s compute the relevant derivatives for this simple case:
- ∂C/∂a(L): this derivative is 2 * (a(L) - y) [00:04:07]. Its size is proportional to the difference between the network’s output and the desired output, so when that difference is large, even slight changes to the final activation have a significant impact on the cost [00:04:13].
- ∂a(L)/∂z(L): this is simply the derivative of the chosen nonlinear activation function (e.g., sigmoid) [00:04:27].
- ∂z(L)/∂w(L): this derivative is a(L-1) [00:04:37], meaning the influence of a small weight adjustment on the last layer depends on the strength of the previous neuron’s activation [00:04:53].
Gradient Components
- Sensitivity to Bias b(L): the sensitivity of the cost to the bias b(L) is found almost identically, just replacing ∂z(L)/∂w(L) with ∂z(L)/∂b(L) in the chain rule expression [00:05:46]. The derivative ∂z(L)/∂b(L) comes out to be 1 [00:05:58].
- Sensitivity to Previous Activation a(L-1): to propagate backwards, it’s also important to understand how sensitive the cost function is to the activation of the previous layer, a(L-1) [00:06:10]. The relevant derivative, ∂z(L)/∂a(L-1), comes out to be w(L) [00:06:15]. This allows the chain rule to be iterated backwards to determine the sensitivity to earlier weights and biases [00:06:33].
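The three sensitivities share their first two chain-rule factors and differ only in the last one, which the following sketch makes explicit (same assumed sigmoid and sample values as before):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Assumed sample values for the single-neuron chain.
w_L, a_prev, b_L, y = 2.0, 0.5, -1.0, 1.0

# Forward pass.
z_L = w_L * a_prev + b_L
a_L = sigmoid(z_L)

# Shared upstream factor: ∂C/∂a(L) * ∂a(L)/∂z(L), i.e. ∂C/∂z(L).
# (For sigmoid, ∂a/∂z = a * (1 - a).)
dC_dz = 2.0 * (a_L - y) * a_L * (1.0 - a_L)

# The three sensitivities differ only in the final factor.
dC_dw = dC_dz * a_prev    # ∂z(L)/∂w(L) = a(L-1)
dC_db = dC_dz * 1.0       # ∂z(L)/∂b(L) = 1
dC_da_prev = dC_dz * w_L  # ∂z(L)/∂a(L-1) = w(L), feeds the backwards recursion
```

The last quantity, dC_da_prev, is what lets the same computation repeat one layer earlier.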
Extending to Multiple Neurons Per Layer
While the single-neuron example is simple, the process scales well to layers with multiple neurons [00:06:47]. The core equations remain largely the same, requiring only additional indices to track specific neurons [00:06:51].
- Activation Indexing: a(L) gains a subscript, e.g., a(L)j, to indicate which neuron within layer L it represents [00:06:59]; conventionally, k indexes neurons in layer L-1 and j indexes neurons in layer L [00:07:07].
- Cost Function: the cost function becomes a sum of squared differences over all neurons in the last layer: sum over j of (a(L)j - yj)^2 [00:07:15].
- Weight Indexing: weights gain additional indices to specify their source and destination neurons, e.g., w(L)jk for the weight connecting neuron k in layer L-1 to neuron j in layer L [00:07:33].
- Derivative Complexity: the chain-rule expression for the derivative with respect to a specific weight still looks essentially the same [00:08:15]. However, the derivative of the cost with respect to an activation in layer L-1 becomes more complex, because that neuron influences the cost through multiple paths (through every neuron in the next layer), and these influences must be summed up [00:08:32].
Once the sensitivity of the cost function to activations in a given layer is known, the process can be repeated backwards for all weights and biases feeding into that layer [00:09:03]. These chain rule expressions provide the derivatives that determine each component in the gradient vector, which is used to minimize the network’s cost by repeatedly stepping “downhill” [00:09:23]. This is the fundamental mechanism behind how neural networks learn [00:09:15].
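As a sanity check on the "stepping downhill" picture, repeatedly nudging the weight and bias against their gradient components should shrink the cost. A toy loop for the single-neuron chain (the starting values and learning rate are assumptions):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Assumed starting point for the single-neuron chain.
w, b = 2.0, -1.0
a_prev, y = 0.5, 1.0
lr = 0.5  # learning rate: how far to step downhill each iteration

def forward(w, b):
    a = sigmoid(w * a_prev + b)
    return a, (a - y) ** 2

_, cost_before = forward(w, b)
for _ in range(100):
    a, _ = forward(w, b)
    dC_dz = 2.0 * (a - y) * a * (1.0 - a)  # shared chain-rule factor
    w -= lr * dC_dz * a_prev               # gradient component ∂C/∂w
    b -= lr * dC_dz                        # gradient component ∂C/∂b
_, cost_after = forward(w, b)
assert cost_after < cost_before  # the cost moved downhill
```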