From: 3blue1brown

The normal distribution, also known as a bell curve or a Gaussian distribution, is one of the most prominent distributions in all of probability [00:00:24]. Even when individual events are chaotic and random, it’s possible to make precise statements about a large number of events, particularly how their outcomes are distributed [00:00:05].

This distribution is common and appears in various seemingly unrelated contexts [00:00:40]. For example, if you plot the heights of a large number of people in a similar demographic, those heights tend to follow a normal distribution [00:00:46]. Similarly, the number of distinct prime factors for a large swath of very big natural numbers closely tracks a certain normal distribution [00:00:53].

The Central Limit Theorem (CLT) is a key concept in probability theory that explains why the normal distribution is so common [00:01:05].

Illustrating with the Galton Board

A Galton board is a popular demonstration that illustrates the normal distribution [00:00:00], [00:00:20].

Simplified Model

An overly simplified model of the Galton board helps illustrate the Central Limit Theorem [00:01:56]:

  • Each ball falls onto a central peg [00:02:00].
  • It has a 50-50 chance of bouncing to the left (-1) or to the right (+1) [00:02:05].
  • After each bounce, the ball falls onto a peg in the row below, where it again faces the same 50-50 choice [00:02:14].
  • For a board with five rows, the ball makes five random choices between +1 and -1 [00:02:27].
  • The final position of the ball is essentially the sum of these numbers [00:02:35].
  • Repeating this process for many balls gives a sense of how likely each bucket (representing a sum) is [00:02:46], [00:03:10].
  • The numbers are simple enough to calculate the probability of each sum explicitly; the resulting pattern of probabilities is reminiscent of Pascal’s triangle [00:03:23], [00:03:30] (a simulation sketch follows this list).
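
A minimal Python sketch of this simplified model (the ball and row counts are arbitrary choices, not from the video):

```python
import random
from collections import Counter

def final_position(rows: int) -> int:
    """Sum of `rows` independent 50-50 choices between -1 and +1."""
    return sum(random.choice((-1, 1)) for _ in range(rows))

# Drop many balls and count how often each bucket (final sum) is hit.
balls, rows = 100_000, 5
counts = Counter(final_position(rows) for _ in range(balls))
for position in sorted(counts):
    print(f"{position:+d}: {counts[position] / balls:.4f}")
```

With five rows the possible sums are the odd numbers from -5 to +5, and the printed frequencies approach the binomial probabilities 1/32, 5/32, 10/32, 10/32, 5/32, 1/32 from Pascal’s triangle.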

The core idea of the Central Limit Theorem is that as the size of the sum increases (e.g., more rows of pegs), the distribution describing where that sum will fall looks more and more like a bell curve [00:03:56], [00:04:06].

Simulations of Sums

A random variable is a random process where each outcome is associated with a number [00:04:19]. For instance, a peg bounce gives -1 or +1 [00:04:29], and rolling a die gives a value from 1 to 6 [00:04:38]. The Central Limit Theorem claims that as you add together more samples of a random variable, the distribution of that sum increasingly resembles a bell curve [00:04:45], [00:05:01].

This holds true even for starting distributions that are not uniform, such as a weighted die [00:06:20], [00:06:27]. For example, if a die’s distribution is skewed towards lower values, sums of 10 samples from it will still emerge as a bell curve, albeit slightly skewed [00:06:35], [00:07:07]. As the number of dice in each sum increases (e.g., from 2 to 15), the resulting distribution becomes more and more symmetric and bell-shaped [00:07:21]. This phenomenon also occurs if the initial distribution is bimodal (e.g., most probability on 1 and 6) [00:08:07].
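
As a sketch of this experiment, assuming a hypothetical weighting skewed toward lower faces (the exact weights are illustrative, not from the video):

```python
import random
from collections import Counter

faces = [1, 2, 3, 4, 5, 6]
weights = [0.30, 0.25, 0.20, 0.12, 0.08, 0.05]  # hypothetical skew toward low faces

def sum_of_rolls(n_dice: int) -> int:
    """Sum of n_dice samples drawn from the weighted die."""
    return sum(random.choices(faces, weights=weights, k=n_dice))

# With more dice per sum, the histogram of sums grows more
# symmetric and bell-shaped despite the skewed starting distribution.
for n_dice in (2, 15):
    counts = Counter(sum_of_rolls(n_dice) for _ in range(100_000))
    print(n_dice, [counts[s] for s in sorted(counts)])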

Mean and Standard Deviation

To describe the Central Limit Theorem quantitatively, it’s necessary to understand the concepts of mean and standard deviation [00:11:25].

Mean (μ)

The mean of a distribution, denoted by the Greek letter mu (μ), captures its center of mass [00:11:43]. It’s calculated as the expected value of a random variable, which is the sum of (probability of outcome × value of variable) for all possible outcomes [00:11:51].

If the mean of an initial distribution is μ, then the mean of the sum of n such variables will be n times μ [00:13:16]. This explains why the distributions of sums drift to the right as n grows (assuming μ is positive) [00:13:39].
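
In symbols, for a discrete random variable X with outcomes x occurring with probability p(x):

```latex
\mu = E[X] = \sum_{x} p(x)\,x,
\qquad
E[X_1 + X_2 + \cdots + X_n] = n\mu
```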

Variance and Standard Deviation (σ)

To measure how spread out a distribution is, variance and standard deviation are used [00:12:10].

  • Variance: Calculated by taking the expected value of the squared difference between each possible value and the mean [00:12:20]. Squaring ensures positive numbers and simplifies the math [00:12:31].
  • Standard Deviation (σ): Denoted by the Greek letter sigma (σ), it’s the square root of the variance [00:12:53]. This provides a measure of spread that can be interpreted as a distance [00:12:59].

A key fact is that the variance for the sum of independent random variables is the sum of their individual variances [00:13:50]. If you sum n independent realizations of the same random variable, the variance of the sum is n times the variance of the original variable [00:14:20]. Consequently, the standard deviation of the sum is the square root of n times the original standard deviation (σ√n) [00:14:29]. This means distributions spread out more slowly than their mean drifts [00:14:56].
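
In symbols, for independent samples X_1, ..., X_n of a variable X with mean μ:

```latex
\operatorname{Var}(X) = E\!\left[(X - \mu)^2\right],
\qquad \sigma = \sqrt{\operatorname{Var}(X)} \\
\operatorname{Var}(X_1 + \cdots + X_n) = n \operatorname{Var}(X),
\qquad \sigma_{\text{sum}} = \sigma\sqrt{n}
```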

The Normal Distribution Formula

The general idea of the Central Limit Theorem is that if you realign all distributions of sums so their means line up and then rescale them so their standard deviations are 1, they approach a certain universal shape [00:15:08], [00:15:30]. This shape is described by the normal distribution formula [00:15:54].

The building blocks of the formula:

  • Exponential Decay: e^(-x) describes exponential decay [00:16:06].
  • Symmetry: Using e^(-|x|) creates decay in both directions with a sharp point [00:16:18]. Using e^(-x^2) creates a smoother, basic bell curve shape decaying in both directions [00:16:29].
  • Scaling (Standard Deviation): Introducing a constant into the exponent, as in e^(-c·x^2), or equivalently writing it as e^(-(1/2) * (x/σ)^2), allows stretching or squishing the curve horizontally [00:16:38]. With the latter form, σ becomes the standard deviation of the distribution [00:17:20], [00:17:24].
  • Normalization (Area = 1): For a valid probability distribution, the area under the curve must be 1 [00:17:38]. The area under e^(-x^2) is √π [00:18:21], so the function must be divided by √π [00:18:38] (a quick numeric check of this integral follows this list). Combined with the standard deviation scaling, the overall normalizing factor becomes 1 / (σ * √(2π)) [00:18:43], [00:18:52], [00:18:56], which ensures the total area is 1 [00:19:06].
  • Shifting (Mean): Subtracting a constant μ from x ((x - μ)^2) allows sliding the graph left or right to prescribe the mean of the distribution [00:19:25].
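
As referenced in the normalization bullet above, here is a quick numeric check that the area under e^(-x^2) is √π:

```python
import math

# Riemann sum of e^(-x^2) over [-10, 10]; the tails beyond contribute ~e^(-100).
dx = 0.001
area = sum(math.exp(-(i * dx) ** 2) for i in range(-10_000, 10_000)) * dx
print(area, math.sqrt(math.pi))  # both ~1.7724538
```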

The complete formula for a normal distribution, parameterized by its mean (μ) and standard deviation (σ), is:

f(x) = (1 / (σ * √(2π))) * e^(-(1/2) * ((x - μ)/σ)^2) [00:19:40]

The special case where σ = 1 and μ = 0 is called the standard normal distribution [00:19:15].

Probability Density Functions

Unlike discrete distributions, continuous distributions like the normal distribution are described by probability density functions (PDFs) [00:17:47]. You don’t ask for the probability of a particular point, but rather the probability that a value falls between two different values, which is equal to the area under the curve between those values [00:17:50], [00:22:05].
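
A small sketch of this, using the standard-library error function to compute areas under a normal curve (the example interval is arbitrary):

```python
import math

def normal_cdf(x: float, mu: float = 0.0, sigma: float = 1.0) -> float:
    """Area under the normal PDF from -infinity up to x."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))

def prob_between(a: float, b: float, mu: float = 0.0, sigma: float = 1.0) -> float:
    """P(a <= X <= b): the area under the curve between a and b."""
    return normal_cdf(b, mu, sigma) - normal_cdf(a, mu, sigma)

# Probability that a standard normal value lands between 0 and 2:
print(prob_between(0.0, 2.0))  # ~0.4772
```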

Quantifying the Central Limit Theorem

If an underlying random variable has a mean μ and standard deviation σ, then for a sum of n such variables:

  • The mean of the sum is n * μ [00:20:10].
  • The standard deviation of the sum is σ * √n [00:20:22].

The Central Limit Theorem states that as the sum size n increases, the distribution of the sum, when normalized, tends towards the standard normal distribution [00:21:02]. Normalization involves:

  1. Subtracting the expected mean (n * μ) from the sum, so the new expression has a mean of zero [00:21:02].
  2. Dividing by the expected standard deviation (σ * √n), which rescales units so the standard deviation of the new expression is one [00:21:06].
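
In symbols, writing S_n = X_1 + ... + X_n for the sum, the normalized expression is:

```latex
Z_n = \frac{S_n - n\mu}{\sigma\sqrt{n}}
```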

This transformed expression tells “how many standard deviations away from the mean is this sum?” [00:21:24].

The Magic of CLT

For a sufficiently large sum (e.g., 50 values), changing the distribution of the initial underlying random variable has essentially no effect on the shape of the plot of the normalized sum [00:23:16]. All the specific information and nuance of the initial distribution gets “washed away,” and the sum tends towards the single universal shape of the standard normal distribution [00:23:31].

Formal Statement

The rigorous statement of the Central Limit Theorem is: consider the value obtained by summing n different instantiations of a variable, then shifted and rescaled so that its mean is 0 and its standard deviation is 1. If you consider the probability that this value falls between two given real numbers, a and b, and take the limit of that probability as the size of your sum goes to infinity, then that limit is equal to the integral (area) under a standard normal distribution between those two values [00:24:09].
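
In symbols, with S_n = X_1 + ... + X_n as above:

```latex
\lim_{n \to \infty}
P\!\left( a \le \frac{S_n - n\mu}{\sigma\sqrt{n}} \le b \right)
= \int_a^b \frac{1}{\sqrt{2\pi}}\, e^{-x^2/2}\, dx
```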

Applications and Rules of Thumb

For normal distributions, there’s a handy rule of thumb:

  • About 68% of values fall within one standard deviation of the mean [00:25:31].
  • About 95% of values fall within two standard deviations of the mean [00:25:35].
  • About 99.7% of values fall within three standard deviations of the mean [00:25:42].
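
A quick Monte Carlo sanity check of this rule, sampling from a standard normal:

```python
import random

samples = [random.gauss(0.0, 1.0) for _ in range(1_000_000)]
for k in (1, 2, 3):
    frac = sum(abs(x) <= k for x in samples) / len(samples)
    print(f"within {k} standard deviation(s): {frac:.4f}")
# expect roughly 0.6827, 0.9545, 0.9973
```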

Example: Rolling 100 Dice

If you roll a fair die 100 times and add the results, you can find a range where you are 95% sure the sum will fall [00:25:12].

  1. Mean of a single die roll (μ): (1/6 * 1) + (1/6 * 2) + … + (1/6 * 6) = 3.5 [00:26:07].
  2. Standard deviation of a single die roll (σ): Calculated from variance, it’s approximately 1.71 (variance is 2.92) [00:26:19].
  3. Mean of 100 rolls: 100 * μ = 100 * 3.5 = 350 [00:26:42].
  4. Standard deviation of 100 rolls: √100 * σ = 10 * 1.71 = 17.1 [00:26:47].
  5. 95% range: Two standard deviations from the mean.
    • Lower bound: 350 - (2 * 17.1) = 350 - 34.2 = 315.8 (approx. 316) [00:26:53].
    • Upper bound: 350 + (2 * 17.1) = 350 + 34.2 = 384.2 (approx. 384) [00:27:01].

Therefore, you are 95% sure the sum will fall between 316 and 384 [00:27:07].
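
A minimal sketch that reproduces this calculation and adds a Monte Carlo sanity check (the trial count is arbitrary):

```python
import math
import random

# Mean and standard deviation of a single fair die roll.
faces = range(1, 7)
mu = sum(faces) / 6                          # 3.5
var = sum((x - mu) ** 2 for x in faces) / 6  # ~2.92
sigma = math.sqrt(var)                       # ~1.71

n = 100
mean_sum = n * mu                      # 350
sigma_sum = math.sqrt(n) * sigma       # ~17.1
lo, hi = mean_sum - 2 * sigma_sum, mean_sum + 2 * sigma_sum
print(f"95% interval: [{lo:.1f}, {hi:.1f}]")  # ~[315.8, 384.2]

# Sanity check: fraction of simulated 100-die sums inside that interval.
trials = 100_000
hits = sum(lo <= sum(random.randint(1, 6) for _ in range(n)) <= hi
           for _ in range(trials))
print(f"empirical coverage: {hits / trials:.3f}")  # ~0.95
```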

If you divide the sum by 100, it represents the empirical average of 100 die rolls [00:27:18], [00:27:21]. The interval then tells you the range you expect for that empirical average, which should be around 3.5 [00:27:37]. The Central Limit Theorem allows computation of how close to the expected value the empirical average will likely be [00:27:51].

Assumptions and Limitations of the Central Limit Theorem

The Central Limit Theorem has three underlying assumptions:

  1. Independence: All variables being added must be independent of each other [00:28:18]. The outcome of one process does not influence another [00:28:22].
  2. Identically Distributed: All variables must be drawn from the same distribution [00:28:27]. These two assumptions are often lumped together as IID (Independent and Identically Distributed) [00:28:42].
  3. Finite Variance: The variance computed for these variables must be finite [00:30:04]. This is automatic when there are only finitely many outcomes, but for distributions with an infinite set of outcomes the sum defining the variance can diverge to infinity [00:30:10]. If the variance is infinite, the sum might not tend towards a normal distribution, even if the first two assumptions hold [00:30:30].

The actual Galton board violates the first two assumptions: a ball’s bounce off one peg is not independent of the next, and the distribution of outcomes off each peg might not be the same [00:28:50]. The simplified model is necessary for it to be a true example of the Central Limit Theorem [00:29:29]. While the real Galton board appears to produce a normal distribution, this might be due to generalizations of the theorem that relax these assumptions [00:29:38]. It’s important to be cautious about assuming a variable is normally distributed without justification [00:29:54].

Further topics include exploring why this particular function is the one that distributions tend towards, and the role of pi and circular symmetry in its formula [00:30:48].