From: 3blue1brown

Backpropagation is the algorithm neural networks use to learn: it works out how the weights and biases should be adjusted to minimize a cost function [00:00:04]. The core task is computing the negative gradient of this cost function, which points in the direction of change that decreases the cost most efficiently [00:01:26].
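
As a minimal sketch of what "stepping along the negative gradient" means, here is a toy gradient descent update in Python. The function and parameter names (`gradient_descent_step`, `cost_gradient`, `learning_rate`) are illustrative assumptions, not something defined in the video:

```python
import numpy as np

def gradient_descent_step(params, cost_gradient, learning_rate=0.1):
    """Nudge the parameters in the direction of the negative gradient."""
    return params - learning_rate * cost_gradient(params)

# Example: minimize C(w) = ||w||^2, whose gradient is 2w.
w = np.array([3.0, -2.0])
for _ in range(50):
    w = gradient_descent_step(w, lambda p: 2 * p)
print(w)  # ends up very close to the minimum at the origin
```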

The Challenge of Full Gradient Descent

In principle, the way weights and biases are adjusted for a single gradient descent step should depend on every single training example [00:03:21]. A true gradient descent step would involve calculating how every one of tens of thousands of training examples wishes to nudge the weights and biases, and then averaging those desired changes [00:10:00].
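
A hedged sketch of what such a full step would look like, assuming a helper `per_example_grad` that returns the cost gradient contributed by a single training example (this helper and the variable names are assumptions for illustration, not the video's code):

```python
import numpy as np

def full_batch_step(params, per_example_grad, xs, ys, learning_rate=0.1):
    """One 'true' gradient descent step: average every example's desired nudge."""
    grads = [per_example_grad(params, x, y) for x, y in zip(xs, ys)]
    return params - learning_rate * np.mean(grads, axis=0)
```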

However, in practice, it takes computers an extremely long time to sum up the influence of every training example for each gradient descent step [00:09:33]. This computational slowness necessitates a more efficient approach [00:11:04].

Introduction to Stochastic Gradient Descent (SGD)

To address the computational inefficiency, a commonly adopted technique is Stochastic Gradient Descent (SGD) [00:10:31].

The Mini-Batch Technique

Instead of using all training data for each step, the training data is first randomly shuffled [00:09:45]. Then, it is divided into multiple mini-batches, for instance, each containing 100 training examples [00:09:48]. Each step of the learning process is then computed based on a single mini-batch [00:09:52].
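
One possible way to implement the shuffle-and-split step, assuming the training data lives in NumPy arrays (the function and variable names are illustrative):

```python
import numpy as np

def make_mini_batches(xs, ys, batch_size=100, seed=None):
    """Shuffle the data, then carve it into mini-batches of `batch_size` examples."""
    rng = np.random.default_rng(seed)
    indices = rng.permutation(len(xs))
    xs, ys = xs[indices], ys[indices]
    return [(xs[i:i + batch_size], ys[i:i + batch_size])
            for i in range(0, len(xs), batch_size)]
```

Each learning step then uses one of these batches in place of the full dataset.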

Benefits and Characteristics of SGD

  • Approximation: A step computed using a mini-batch is not the actual gradient of the total cost function, as the true gradient depends on all training data [00:09:56]. Therefore, it is not the most efficient step downhill [00:10:03] (see the toy check after this list).
  • Computational Speedup: Despite being an approximation, each mini-batch provides a “pretty good” approximation of the gradient and, more importantly, offers a significant computational speedup [00:10:05].
  • Trajectory: If the network’s trajectory on the cost surface were plotted, SGD would resemble a “drunk man stumbling aimlessly down a hill but taking quick steps” [00:10:17]. This contrasts with the “carefully calculating man determining the exact downhill direction” of full gradient descent, who takes very slow and careful steps [00:10:21].
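
A toy check of the approximation claim, using a simple least-squares cost rather than the video's network (all data, names, and sizes here are illustrative assumptions):

```python
import numpy as np

# Cost C(w) = mean((X @ w - y)^2) over 10,000 synthetic examples.
rng = np.random.default_rng(0)
X = rng.normal(size=(10_000, 5))
y = X @ np.array([1.0, -2.0, 0.5, 3.0, -1.0]) + 0.1 * rng.normal(size=10_000)
w = rng.normal(size=5)

def grad(X_part, y_part, w):
    return 2 * X_part.T @ (X_part @ w - y_part) / len(y_part)

full = grad(X, y, w)               # gradient over all 10,000 examples
batch = grad(X[:100], y[:100], w)  # gradient over one mini-batch of 100

cosine = full @ batch / (np.linalg.norm(full) * np.linalg.norm(batch))
print(cosine)  # typically close to 1: the mini-batch step still points roughly downhill
```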

By repeatedly iterating through all mini-batches and making these adjustments, the network will converge towards a local minimum of the cost function [00:11:14], enabling it to perform effectively on training examples [00:11:21].
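
Putting the pieces together, here is a sketch of the overall SGD loop under the same assumptions as above (a hypothetical `per_example_grad` helper and NumPy arrays for the training data):

```python
import numpy as np

def sgd(params, per_example_grad, xs, ys, epochs=10,
        batch_size=100, learning_rate=0.1, seed=None):
    """Each epoch: reshuffle the data, split it into mini-batches, step once per batch."""
    rng = np.random.default_rng(seed)
    for _ in range(epochs):
        order = rng.permutation(len(xs))
        for start in range(0, len(xs), batch_size):
            batch = order[start:start + batch_size]
            grads = [per_example_grad(params, xs[i], ys[i]) for i in batch]
            params = params - learning_rate * np.mean(grads, axis=0)
    return params
```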