From: 3blue1brown

This article explores the fundamental concepts behind how neural networks learn, focusing on the role of the cost function and gradient descent, which sets the stage for understanding backpropagation.

Neural Network Structure Recap

A typical neural network for handwritten digit recognition consists of:

  • An input layer with 784 neurons, determined by the 28x28 pixel grid of the input image, where each pixel’s grayscale value (between 0 and 1) dictates a neuron’s activation [00:37:00].
  • Hidden layers, in this case, two hidden layers each with 16 neurons [01:09:00].
  • An output layer with 10 neurons, where the brightest activation corresponds to the recognized digit [01:29:00].

The activation for each neuron in subsequent layers is based on a weighted sum of activations from the previous layer, plus a bias, composed with an activation function like sigmoid or ReLU [00:51:00]. This specific network configuration has approximately 13,000 adjustable weights and biases that define its behavior [01:15:00].
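The layer-to-layer rule above can be sketched in a few lines. This is a minimal illustration, not the actual network from the video: the weights, biases, and layer sizes below are made-up toy values, and sigmoid is used as the activation function.

```python
import math

def sigmoid(x):
    # Squashes any real number into the range (0, 1)
    return 1.0 / (1.0 + math.exp(-x))

def layer_activations(weights, biases, prev_activations):
    """Compute one layer's activations: sigmoid(weighted sum + bias).
    weights[j][k] connects neuron k of the previous layer to neuron j here."""
    return [
        sigmoid(sum(w * a for w, a in zip(row, prev_activations)) + b)
        for row, b in zip(weights, biases)
    ]

# Toy layer: 2 neurons fed by 3 inputs (hypothetical numbers)
weights = [[0.5, -0.2, 0.1],
           [0.3,  0.8, -0.5]]
biases = [0.0, -1.0]
print(layer_activations(weights, biases, [1.0, 0.5, 0.2]))
```

Stacking this function once per layer (784 → 16 → 16 → 10) gives the full forward pass of the digit-recognition network.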

The Learning Problem

The goal of neural network learning is to find an algorithm that allows the network to adjust its weights and biases to improve performance on training data [02:04:00]. This typically involves showing the network numerous images of handwritten digits with their corresponding labels, such as those from the MNIST database [01:52:00][02:34:00]. The hope is that the learned patterns generalize to unseen images [02:10:00].

Initially, weights and biases are set randomly, leading to poor performance [03:19:00]. The process of learning ultimately boils down to a calculus exercise: finding the minimum of a specific function [02:53:00].

The Cost Function

To guide the learning process, a “cost function” is defined. This function quantifies how “lousy” the network’s current performance is [03:36:00].

For a single training example, the cost is calculated by summing the squares of the differences between the network’s output activations and their desired values (e.g., 1 for the correct digit, 0 for others) [03:51:00]. A small sum indicates correct and confident classification, while a large sum signifies poor performance [04:05:00].

The overall cost function for the network is the average cost across all tens of thousands of training examples [04:18:00]. This function takes all 13,000 (or so) weights and biases as inputs and outputs a single number representing the overall “badness” of these parameters [04:53:00].
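The two definitions above (per-example cost, then the average over the training set) can be written directly. The output activations below are invented for illustration; only the formula comes from the source.

```python
def example_cost(outputs, desired):
    # Sum of squared differences between output activations and targets
    return sum((o - d) ** 2 for o, d in zip(outputs, desired))

def total_cost(all_outputs, all_desired):
    # Average the per-example cost over the whole training set
    costs = [example_cost(o, d) for o, d in zip(all_outputs, all_desired)]
    return sum(costs) / len(costs)

# A confident, nearly correct "3" vs. its one-hot target (made-up activations)
outputs = [0.0, 0.1, 0.0, 0.9, 0.0, 0.1, 0.0, 0.0, 0.0, 0.0]
desired = [0, 0, 0, 1, 0, 0, 0, 0, 0, 0]
print(example_cost(outputs, desired))  # small sum -> confident, correct output
```

In the real setting, `outputs` would come from running the network's forward pass, so `total_cost` is implicitly a function of all ~13,000 weights and biases.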

Minimizing the Cost Function: Gradient Descent

The challenge is to tell the network how to change its weights and biases to improve. This is achieved through gradient descent [10:20:00].

Core Idea of Gradient Descent

Imagine a simple function with one input and one output. To find its minimum, one can start at any input, determine the slope, and step in the direction that lowers the output (left for a positive slope, right for a negative slope) [05:51:00]. Repeating this process iteratively leads to a local minimum [06:11:00]. This is analogous to a ball rolling down a hill [06:20:00].
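The one-input version of this procedure is short enough to write out. This is a sketch with arbitrary choices for the step size, iteration count, and test function; the slope is estimated numerically rather than derived by hand.

```python
def descend_1d(f, x, step=0.1, iters=100, h=1e-6):
    """Repeatedly step against the slope of a one-input function."""
    for _ in range(iters):
        slope = (f(x + h) - f(x - h)) / (2 * h)  # numerical derivative at x
        x -= step * slope  # positive slope -> step left, negative -> step right
    return x

# f(x) = (x - 2)^2 has its minimum at x = 2; start far away and roll downhill
print(descend_1d(lambda x: (x - 2) ** 2, x=-5.0))
```

Because the function is a simple bowl here, this converges to the global minimum; on a bumpier function the same loop would settle into whichever local minimum the starting point rolls into.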

For functions with multiple inputs (like our 13,000 weights and biases), the concept extends to finding the steepest “downhill” direction. In multivariable calculus, the gradient of a function points in the direction of steepest ascent [07:26:00]. Therefore, the negative of the gradient points in the direction of steepest descent, indicating how to change the inputs to decrease the function most quickly [07:39:00]. The length of the gradient vector indicates the steepness of that slope [07:47:00].

The gradient descent algorithm involves computing this gradient vector, taking a small step in the downhill direction, and repeating the process [08:17:00]. For a network with 13,000 weights and biases, the negative gradient is a vector in an “insanely huge input space” that indicates how to “nudge” each parameter for the most rapid decrease in the cost function [08:33:00].
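That loop (compute the gradient, step downhill, repeat) generalizes to many inputs. The sketch below estimates each partial derivative by nudging one parameter at a time; real training instead uses backpropagation to get the gradient efficiently, and the three-parameter "cost" here is a stand-in for the real 13,000-input function.

```python
def numerical_gradient(f, params, h=1e-6):
    # Estimate each partial derivative by nudging one parameter at a time
    grad = []
    for i in range(len(params)):
        bumped, dipped = params[:], params[:]
        bumped[i] += h
        dipped[i] -= h
        grad.append((f(bumped) - f(dipped)) / (2 * h))
    return grad

def gradient_descent(f, params, step=0.1, iters=200):
    # Repeatedly nudge every parameter along the negative gradient
    for _ in range(iters):
        grad = numerical_gradient(f, params)
        params = [p - step * g for p, g in zip(params, grad)]
    return params

# Toy "cost" with minimum at (1, -2, 3), standing in for the real cost function
cost = lambda p: (p[0] - 1) ** 2 + (p[1] + 2) ** 2 + (p[2] - 3) ** 2
print(gradient_descent(cost, [0.0, 0.0, 0.0]))
```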

Smoothness of the Cost Function

It is crucial for the cost function to have a smooth output to allow for finding a local minimum by taking small, incremental steps downhill [09:59:00]. This is why artificial neurons use continuously ranging activations, rather than binary active/inactive states like biological neurons [10:09:00].

The Role of Backpropagation

The algorithm for efficiently computing this gradient, which is central to how a neural network learns, is called backpropagation [09:23:00]. While this video introduces the need for calculating the gradient to minimize the cost function, the detailed mechanics of backpropagation itself will be covered in the next video [09:29:00].

The gradient vector’s components tell us two things:

  1. The sign indicates whether the corresponding input (weight or bias) should be nudged up or down [10:45:00].
  2. The relative magnitudes of the components reveal which changes matter more, i.e., which weights or biases have a greater impact on the cost function [10:55:00]. This means the gradient vector “encodes the relative importance of each weight and bias” [11:19:00].
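Reading off the sign and relative magnitude of each component can be shown concretely. The negative-gradient values below are entirely hypothetical, as are the parameter names; the point is only how such a vector would be interpreted.

```python
# Hypothetical negative-gradient components for a handful of parameters
neg_gradient = {"w_1": 0.32, "w_2": -0.05, "b_1": 3.2, "w_3": -0.9}

# Sort by magnitude: the biggest components are the changes that matter most
for name, nudge in sorted(neg_gradient.items(), key=lambda kv: -abs(kv[1])):
    direction = "increase" if nudge > 0 else "decrease"
    print(f"{name}: {direction} (relative importance {abs(nudge):.2f})")
```

Here `b_1` dominates: nudging it up decreases the cost far more than any change to `w_2`, which is essentially irrelevant at this point in the landscape.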

In summary:

  • The network is a function transforming pixel inputs to digit outputs, defined by weighted sums [12:22:00].
  • The cost function takes the network’s weights and biases as input and provides a single measure of its “lousiness” based on training examples [12:30:00].
  • The gradient of the cost function indicates how to adjust weights and biases to most rapidly change (decrease) the cost, effectively showing which changes matter most [12:42:00].

Network Performance and Limitations

After initializing weights and biases randomly and adjusting them repeatedly using gradient descent, the described network (two hidden layers, 16 neurons each) can classify about 96% of new, unseen images correctly [13:02:00][13:14:00]. With minor tweaks to the hidden layer structure, this can improve to 98% accuracy [13:36:00].

However, despite good classification performance, the network often doesn’t learn the intuitive patterns (like edges or loops) that humans might expect in its hidden layers [14:17:00]. Visualizing the weights of connections to second-layer neurons shows patterns that appear almost random, rather than clear edges [14:24:00]. This suggests the network finds a “happy local minimum” in the 13,000-dimensional weight space that doesn’t align with human-interpretable features [14:53:00].

Furthermore, this simple network can confidently misclassify random noise as a digit, indicating it hasn’t learned to recognize what a digit isn’t, only what it is within its tightly constrained training setup [15:09:00].

This “old technology” from the 1980s and 90s is a starting point, but newer, more sophisticated networks have evolved beyond these limitations [16:17:00].

Further Learning

For deeper engagement with this material, Michael Nielsen’s free and publicly available book on deep learning and neural networks is highly recommended, offering code and data for the example discussed [17:04:00]. Other resources include Chris Olah’s blog posts and articles on Distill [17:27:00].

Discussions with experts like Lisha Li highlight ongoing research into how modern image recognition networks learn. One paper showed that deep neural networks can memorize randomly shuffled labeled datasets with the same training accuracy as properly labeled ones, raising the question of whether they learn structure or just memorize [17:48:00]. However, subsequent research indicated that training on structured datasets leads to faster convergence to an accurate local minimum, suggesting that networks do find smarter solutions with real data [18:54:00].