From: 3blue1brown

The calculus of the backpropagation algorithm relies heavily on the chain rule [00:00:25]. This application of the chain rule in machine learning has a distinct feel compared to how it’s typically approached in introductory calculus courses [00:00:28].

Understanding Cost Function Sensitivity

The primary goal is to determine how sensitive a network’s cost function is to its various weights and biases [00:00:49]. This understanding is crucial for making adjustments that efficiently decrease the cost function during the training process [00:00:55].

Simple Network Example (Single Neuron Per Layer)

To illustrate, consider an extremely simple network where each layer contains only a single neuron [00:00:43].

The following terms are used:

  • a(L): Activation of the last neuron in layer L [00:01:05].
  • a(L-1): Activation of the previous neuron in layer L-1 [00:01:10].
  • y: The desired output value for a given training example [00:01:27].
  • C0: The cost of the network for a single training example, calculated as (a(L) - y)^2 [00:01:32].
  • w(L): The weight connecting the previous neuron to the last neuron [00:01:49].
  • b(L): The bias of the last neuron [00:01:55].
  • z(L): A special name given to the weighted sum w(L) * a(L-1) + b(L) before it’s passed through a nonlinear activation function (e.g., sigmoid or ReLU) [00:02:01].

The flow of computation can be conceptualized as: w(L), a(L-1), and b(L) → z(L) → a(L) → C0, with y entering at the final step [00:02:10].
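As a concrete illustration of this chain, here is a minimal Python sketch of the forward computation, assuming a sigmoid nonlinearity and made-up values for the weight, bias, activation, and target:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

# Made-up values for the single-neuron-per-layer network (not from the video)
w_L = 0.6      # weight w(L)
b_L = 0.1      # bias b(L)
a_prev = 0.8   # previous activation a(L-1)
y = 1.0        # desired output

z_L = w_L * a_prev + b_L   # weighted sum z(L)
a_L = sigmoid(z_L)         # activation a(L)
C0 = (a_L - y) ** 2        # cost for this single training example
```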

Applying the Chain Rule

The goal is to determine the derivative of C0 with respect to w(L) (∂C0/∂w(L)), which measures how sensitive the cost function is to small changes in this weight [00:02:41].

A tiny nudge to w(L) causes a nudge to z(L), which in turn nudges a(L), which directly influences the cost C0 [00:03:11]. The chain rule breaks this down into a product of three derivatives:

∂C0/∂w(L) = (∂C0/∂a(L)) * (∂a(L)/∂z(L)) * (∂z(L)/∂w(L)) [00:03:45]

Let’s compute each of these derivatives for this simple case (the sketch after this list verifies them numerically):

  • ∂C0/∂a(L): This derivative is 2 * (a(L) - y) [00:04:07]. Its size is proportional to the difference between the network’s output and the desired output, so when that difference is large, even slight changes to a(L) have a significant impact on the cost [00:04:13].
  • ∂a(L)/∂z(L): This is simply the derivative of the chosen nonlinear function, evaluated at z(L) (e.g., σ'(z(L)) for a sigmoid) [00:04:27].
  • ∂z(L)/∂w(L): This derivative is a(L-1) [00:04:37]. In other words, how much a small adjustment to this weight influences the last layer depends on the strength of the previous neuron’s activation [00:04:53].
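Putting the three factors together, a minimal sketch (same assumed sigmoid setup and values as above) computes the product and checks it against a finite-difference estimate of ∂C0/∂w(L):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

w_L, b_L, a_prev, y = 0.6, 0.1, 0.8, 1.0   # same assumed values as above

z_L = w_L * a_prev + b_L
a_L = sigmoid(z_L)

# The three chain rule factors
dC_da = 2.0 * (a_L - y)        # ∂C0/∂a(L) = 2 * (a(L) - y)
da_dz = a_L * (1.0 - a_L)      # ∂a(L)/∂z(L) = σ'(z(L)) for a sigmoid
dz_dw = a_prev                 # ∂z(L)/∂w(L) = a(L-1)
dC_dw = dC_da * da_dz * dz_dw  # ∂C0/∂w(L)

# Sanity check: compare against a symmetric finite-difference estimate
def cost(w):
    return (sigmoid(w * a_prev + b_L) - y) ** 2

eps = 1e-6
numeric = (cost(w_L + eps) - cost(w_L - eps)) / (2 * eps)
assert abs(dC_dw - numeric) < 1e-8
```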

Gradient Components

  • Sensitivity to Bias b(L): The sensitivity of the cost to the bias b(L) is almost identical to that of the weight; the chain rule expression simply swaps ∂z(L)/∂w(L) for ∂z(L)/∂b(L) [00:05:46]. The derivative ∂z(L)/∂b(L) comes out to be 1 [00:05:58].
  • Sensitivity to Previous Activation a(L-1): To propagate backwards, it’s also important to know how sensitive the cost function is to the activation of the previous layer, a(L-1) [00:06:10]. The relevant factor in the chain rule expression, ∂z(L)/∂a(L-1), comes out to be w(L) [00:06:15]. This allows the chain rule to be iterated backwards to determine the sensitivity to earlier weights and biases [00:06:33]. Both sensitivities are sketched in code after this list.
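Continuing the assumed setup from the earlier sketches, these two remaining sensitivities differ from ∂C0/∂w(L) only in their last factor:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

w_L, b_L, a_prev, y = 0.6, 0.1, 0.8, 1.0   # same assumed values as above
z_L = w_L * a_prev + b_L
a_L = sigmoid(z_L)

dC_da = 2.0 * (a_L - y)            # ∂C0/∂a(L)
da_dz = a_L * (1.0 - a_L)          # ∂a(L)/∂z(L)

dC_db = dC_da * da_dz * 1.0        # ∂C0/∂b(L), since ∂z(L)/∂b(L) = 1
dC_daprev = dC_da * da_dz * w_L    # ∂C0/∂a(L-1), since ∂z(L)/∂a(L-1) = w(L)
```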

Extending to Multiple Neurons Per Layer

While the single-neuron example is simple, the process scales well to layers with multiple neurons [00:06:47]. The core equations remain largely the same, requiring only additional indices to track specific neurons [00:06:51].

  • Activation Indexing: a(L) is augmented with a subscript, e.g., a(L)j, to indicate which neuron within layer L it represents [00:06:59]. By convention, k indexes neurons in layer L-1 and j indexes neurons in layer L [00:07:07].
  • Cost Function: The cost function is updated to sum the squares of differences for all neurons in the last layer: Sum over j of (a(L)j - yj)^2 [00:07:15].
  • Weight Indexing: Weights gain additional indices to specify source and destination neurons, e.g., w(L)jk for the weight connecting neuron k in L-1 to neuron j in L [00:07:33].
  • Derivative Complexity: The chain-rule derivative expression for a specific weight still looks essentially the same [00:08:15]. However, the derivative of the cost with respect to an activation in the previous layer (L-1) becomes more complex, because that neuron influences the cost through multiple paths (i.e., through every neuron in the next layer), and these influences must be summed up [00:08:32], as sketched below.
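Here is a minimal sketch of one backward step for a layer with several neurons, using plain Python lists and assumed toy values (j indexes layer L, k indexes layer L-1); note how the sensitivity to each a(L-1)k sums over every neuron j in layer L:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

# Assumed toy dimensions and values: 2 neurons in layer L-1, 3 in layer L
a_prev = [0.5, 0.9]                         # a(L-1)k
W = [[0.2, -0.4],                           # W[j][k] = w(L)jk
     [0.7,  0.1],
     [-0.3,  0.5]]
b = [0.0, 0.1, -0.2]                        # b(L)j
y = [1.0, 0.0, 1.0]                         # desired outputs yj
J, K = len(W), len(a_prev)

# Forward pass: z(L)j = sum over k of w(L)jk * a(L-1)k, plus b(L)j
z = [sum(W[j][k] * a_prev[k] for k in range(K)) + b[j] for j in range(J)]
a = [sigmoid(zj) for zj in z]

# Per-neuron chain rule factors
dC_da = [2.0 * (a[j] - y[j]) for j in range(J)]    # ∂C0/∂a(L)j
da_dz = [a[j] * (1.0 - a[j]) for j in range(J)]    # σ'(z(L)j)

# Weight and bias gradients look just like the single-neuron case
dC_dW = [[dC_da[j] * da_dz[j] * a_prev[k] for k in range(K)] for j in range(J)]
dC_db = [dC_da[j] * da_dz[j] for j in range(J)]

# ∂C0/∂a(L-1)k: each a(L-1)k influences the cost through every neuron j,
# so those paths must be summed
dC_daprev = [sum(dC_da[j] * da_dz[j] * W[j][k] for j in range(J))
             for k in range(K)]
```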

Once the sensitivity of the cost function to activations in a given layer is known, the process can be repeated backwards for all weights and biases feeding into that layer [00:09:03]. These chain rule expressions provide the derivatives that determine each component in the gradient vector, which is used to minimize the network’s cost by repeatedly stepping “downhill” [00:09:23]. This is the fundamental mechanism behind how neural networks learn [00:09:15].
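To close the loop, here is a hypothetical gradient-descent loop for the single-neuron network, reusing the derivatives from the sketches above; the learning rate and iteration count are assumptions for illustration:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

w_L, b_L, a_prev, y = 0.6, 0.1, 0.8, 1.0   # same assumed values as above
learning_rate = 0.5                        # assumed hyperparameter

for step in range(100):
    z_L = w_L * a_prev + b_L
    a_L = sigmoid(z_L)
    common = 2.0 * (a_L - y) * a_L * (1.0 - a_L)  # ∂C0/∂a(L) * ∂a(L)/∂z(L)
    w_L -= learning_rate * common * a_prev        # step against ∂C0/∂w(L)
    b_L -= learning_rate * common * 1.0           # step against ∂C0/∂b(L)

print((sigmoid(w_L * a_prev + b_L) - y) ** 2)     # cost shrinks toward 0
```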