From: hu-po

Relative entropy, also known as Kullback-Leibler (KL) Divergence, is a fundamental concept frequently encountered in machine learning, particularly in reinforcement learning papers [00:03:38]. It serves as a measure of the difference between two probability distributions [00:26:57].

Origin of KL Divergence

The term “KL Divergence” originates from two individuals, Kullback and Leibler, who introduced the concept in their 1951 paper, “On Information and Sufficiency” [01:11:21].

  • Solomon Kullback (1907-1994): Supervised a team of about 60 people at the NSA and contributed to the development of memory technologies like magnetic tape [01:51:00].
  • Richard Leibler (1914-2003): Celebrated for his work on the Venona code-breaking project and served as director of the Communication Research Division at the National Security Agency [01:58:00].

Both Kullback and Leibler were code-breakers who worked at the NSA post-World War II [01:11:00]. Their work was part of the nascent field of information theory, which aimed to use mathematical principles to crack codes [01:12:00]. Kullback notably helped crack the Enigma machine in Britain [01:26:00].

Why “Relative Entropy” is Preferred

While commonly known as KL Divergence, the term “relative entropy” is considered more descriptive [01:10:00]. The use of personal names in mathematical concepts, like “Kullback-Leibler Divergence,” can be seen as vanity names that don’t aid in understanding the underlying idea [01:12:00]. In contrast, “relative entropy” directly implies its connection to information theory and entropy, thereby helping to explain the concept [01:10:00].

Mathematical Basis: Connection to Entropy

Relative entropy is directly related to cross-entropy and entropy:

D_KL(P ‖ Q) = H(P, Q) − H(P) [01:38:00]

Where:

  • D_KL(P ‖ Q) is the relative entropy (KL Divergence) from Q to P.
  • H(P, Q) is the cross-entropy of P and Q.
  • H(P) is the entropy of P.

Because the cross-entropy equals the KL Divergence plus the entropy of P, minimizing cross-entropy is equivalent to minimizing KL Divergence plus a constant. For practical purposes, minimizing one therefore serves to minimize the other, especially when the entropy of P is fixed or unimportant to the optimization [01:56:00].
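
As a quick sanity check of this identity, the following minimal Python sketch (illustrative distributions only, no ML framework assumed) confirms that the cross-entropy equals the KL Divergence plus the entropy of P:

```python
import math

# Two illustrative discrete distributions over the same three outcomes.
P = [0.5, 0.3, 0.2]   # "true" distribution
Q = [0.4, 0.4, 0.2]   # approximating distribution

entropy_P = -sum(p * math.log(p) for p in P)                  # H(P)
cross_entropy = -sum(p * math.log(q) for p, q in zip(P, Q))   # H(P, Q)
kl_PQ = sum(p * math.log(p / q) for p, q in zip(P, Q))        # D_KL(P || Q)

# H(P, Q) = D_KL(P || Q) + H(P), up to floating-point error.
assert abs(cross_entropy - (kl_PQ + entropy_P)) < 1e-12
```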

Types of Entropy

The term “entropy” is overloaded, referring to several distinct but related concepts [01:04:00]:

  1. Thermodynamic entropy [01:06:00]
  2. Entropy in classical statistical mechanics [01:08:00]
  3. Entropy in quantum statistical mechanics [01:09:00]
  4. Entropy in information theory [01:11:00]
  5. Algorithmic entropy [01:18:00]

Relative entropy refers specifically to entropy in the information-theoretic sense [01:11:00].

Information Theory Entropy (Shannon Entropy)

Entropy in information theory was developed by Claude Shannon [01:54:00]. At its core, entropy is a measure of information or uncertainty [01:57:00].

John Von Neumann advised Shannon to use the term “entropy” because a similar concept existed in physics (thermodynamics and classical statistical mechanics), and Ludwig Boltzmann had used the letter ‘H’ to describe it [01:08:00].

  • Fair Coin vs. Biased Coin Example (reproduced in the sketch after this list):
    • A fair coin flip (50% heads, 50% tails) yields maximum uncertainty and therefore maximum entropy (e.g., 1 bit of entropy for a binary outcome) [01:13:00].
    • A biased coin flip (e.g., 70% heads, 30% tails) yields less uncertainty and therefore lower entropy (approximately 0.88 bits) [01:21:00].
    • Entropy can thus be thought of as a measure of expected surprise, while relative entropy measures the “expected excess surprise” incurred by assuming the wrong distribution [01:26:00].
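
The two entropy values quoted above can be reproduced with a few lines of Python (a minimal sketch, not from the source video):

```python
import math

def entropy_bits(probs):
    """Shannon entropy in bits; terms with p == 0 contribute nothing."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

print(entropy_bits([0.5, 0.5]))   # fair coin   -> 1.0 bit
print(entropy_bits([0.7, 0.3]))   # biased coin -> ~0.881 bits
```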

This concept extends to algorithmic entropy, where the entropy of a string of symbols is the length of the shortest computer program that prints it out [02:51:00]. This relates to how code-breakers would efficiently encode messages: more frequent symbols require fewer bits to encode for maximum efficiency [02:22:00].
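
As a rough illustration of that coding intuition, the sketch below (with made-up symbol frequencies, not from the source video) assigns each symbol an ideal code length of −log2(p) bits, so frequent symbols get short codes and the average code length equals the entropy:

```python
import math

# Hypothetical symbol frequencies (illustrative values only).
freqs = {"e": 0.40, "t": 0.30, "q": 0.25, "z": 0.05}

# An ideal code assigns roughly -log2(p) bits to a symbol of probability p,
# so frequent symbols get short codes and rare symbols get long ones.
for sym, p in freqs.items():
    print(f"{sym}: {-math.log2(p):.2f} bits")

# The average length of such a code equals the entropy of the distribution.
print("entropy:", -sum(p * math.log2(p) for p in freqs.values()), "bits/symbol")
```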

Classical Statistical Mechanics Entropy (Boltzmann Entropy)

Boltzmann’s work in classical statistical mechanics connected the concept of entropy to probability [02:38:00]. His famous equation, S = k log W, inscribed on his gravestone, expresses entropy in terms of the number of possible microscopic arrangements; it implies that in any random process the final arrangement of matter is more probable than the initial one, which links entropy to the second law of thermodynamics [02:41:00].

The idea of entropy can be traced back to the 1800s with the Carnot cycle and the study of heat and gases [02:52:00]. Boltzmann’s probabilistic approach to entropy, which describes how molecules in a gas tend to become more random and fill their container, influenced Von Neumann, who then advised Shannon, leading to the development of information-theoretic entropy [02:52:00].

Practical Use in Machine Learning

KL Divergence is widely used in machine learning:

  • Regularization: It acts as a regularizer in loss functions to prevent a model’s policy (a neural network’s probability distribution over actions or outputs) from drifting too far from a reference policy [02:25:00].
    • For example, in reinforcement learning algorithms like DeepSeek R1 and Kimi K1.5, a small beta-weighted KL term is added to the optimization objective [03:52:00]. This keeps the updated policy close to the previous policy, preventing drastic changes during iterative training [02:57:00] (see the sketch after this list).
    • It also appears in variational autoencoders, via the KL term in the auto-encoding variational Bayes objective [05:05:00].
  • Distance-like Measure: It quantifies how inefficient it is to assume distribution Q when the true distribution is P [07:09:00]. It is not symmetric (D_KL(P ‖ Q) ≠ D_KL(Q ‖ P) in general), so the order of the arguments matters and it is not a true distance metric [09:02:00]. A KL Divergence of zero indicates identical distributions [08:48:00].
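
The following PyTorch sketch illustrates the regularization idea from the first bullet. It is a generic beta-weighted KL penalty, not the exact objective from DeepSeek R1 or Kimi K1.5, and the names (beta, kl_regularized_loss, etc.) are purely illustrative:

```python
import torch
import torch.nn.functional as F

beta = 0.01  # illustrative KL weight (hyperparameter)

def kl_regularized_loss(task_loss, logits, ref_logits):
    """Add a beta-weighted KL(current policy || reference policy) penalty.

    logits, ref_logits: unnormalized scores over actions/tokens, shape (batch, num_actions).
    """
    log_p = F.log_softmax(logits, dim=-1)       # current policy, log-probabilities
    log_q = F.log_softmax(ref_logits, dim=-1)   # frozen reference policy
    # D_KL(p || q) = sum_x p(x) * (log p(x) - log q(x)), averaged over the batch
    kl = (log_p.exp() * (log_p - log_q)).sum(dim=-1).mean()
    return task_loss + beta * kl

# Usage sketch with random tensors standing in for real model outputs.
logits = torch.randn(4, 10, requires_grad=True)
ref_logits = torch.randn(4, 10)
loss = kl_regularized_loss(torch.tensor(0.0), logits, ref_logits)
loss.backward()  # gradients now include the KL regularization term
```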

Implementation Details: Numerical Stability

Implementing KL Divergence requires careful handling of numerical stability issues due to the nature of the formula D_KL(P ‖ Q) = Σ_x P(x) log(P(x) / Q(x)) [03:37:00]:

  • Division by Zero: If Q(x) is zero, the ratio P(x) / Q(x) is undefined [03:40:00].
  • Logarithm of Zero: If the ratio P(x) / Q(x) is zero (i.e., when P(x) is zero), taking its logarithm yields negative infinity [03:47:00].
  • Overflow Issues: KL Divergence can become very large if the difference between P and Q is significant, leading to overflow, especially with lower-precision data types like float16 [03:57:00].

To mitigate these issues, machine learning models often work in a logarithmic scale [04:29:00]. Logarithms convert multiplication to addition and division to subtraction (log(a / b) = log(a) − log(b)), simplifying derivatives and preventing underflow or overflow by “smoothing out” very large or very small numbers [04:49:00].
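
A minimal sketch of the log-space trick (plain Python, illustrative distributions): by storing log-probabilities, the ratio P(x) / Q(x) never has to be formed explicitly:

```python
import math

# Store log-probabilities instead of probabilities.
log_p = [math.log(0.5), math.log(0.3), math.log(0.2)]
log_q = [math.log(0.4), math.log(0.4), math.log(0.2)]

# p * log(p / q) becomes exp(log_p) * (log_p - log_q): the division turns into
# a subtraction, so the ratio P(x)/Q(x) is never computed directly.
kl = sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))
print(kl)
```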

KL Divergence Implementations in ML Frameworks

JAX

JAX’s kl_div function handles numerical stability by using promote_args_inexact and jnp.where to ensure that P and Q values are not zero before the logarithm and division are applied [03:32:00]. Deeper in its implementation, kl_div calls rel_entr (relative entropy), indicating a preference for the more descriptive name [03:41:00]. For complex probability distributions where analytical summation is impractical, KL Divergence is often estimated using Monte Carlo techniques [03:25:00].
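
The Monte Carlo idea can be sketched with basic jax.numpy operations (an illustrative estimator, not JAX's own kl_div implementation): sample from P and average the log-probability difference.

```python
import jax
import jax.numpy as jnp

# D_KL(P || Q) = E_{x ~ P}[log p(x) - log q(x)], so drawing samples from P and
# averaging the log-probability difference gives an unbiased estimate.
p = jnp.array([0.5, 0.3, 0.2])   # "true" categorical distribution
q = jnp.array([0.4, 0.4, 0.2])   # approximating distribution

key = jax.random.PRNGKey(0)
samples = jax.random.categorical(key, jnp.log(p), shape=(100_000,))  # x ~ P

mc_estimate = jnp.mean(jnp.log(p[samples]) - jnp.log(q[samples]))
exact = jnp.sum(p * jnp.log(p / q))
print(mc_estimate, exact)  # the estimate should be close to the exact value
```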

PyTorch

PyTorch offers multiple implementations: nn.functional.kl_div and nn.KLDivLoss [03:53:00]. The functional API directly calls a C++ implementation that applies the relative entropy formula, with a conditional check (if log_target ... else) to handle targets already in log-space, avoiding redundant log computations [03:35:00]. Additionally, torch/distributions/kl.py contains numerous KL Divergence implementations tailored to specific distribution types (e.g., Beta, Binomial, Categorical), dispatched through torch.distributions.kl_divergence in an object-oriented approach [03:50:00].
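
A small usage sketch of the two PyTorch entry points mentioned above (illustrative tensors; note that F.kl_div expects its first argument in log-space and, by default, its target as probabilities):

```python
import torch
import torch.nn.functional as F
from torch.distributions import Categorical, kl_divergence

p = torch.tensor([0.5, 0.3, 0.2])   # target ("true") distribution
q = torch.tensor([0.4, 0.4, 0.2])   # model distribution

# Functional API: the input must be log-probabilities; the target is given as
# probabilities here (pass log_target=True if it is already in log-space).
kl_functional = F.kl_div(q.log(), p, reduction="sum")

# Object-oriented API: dispatches to a distribution-pair-specific implementation.
kl_distributions = kl_divergence(Categorical(probs=p), Categorical(probs=q))

print(kl_functional, kl_distributions)  # both compute D_KL(P || Q)
```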

TensorFlow

TensorFlow, though once dominant, has seen a significant decline in usage, with PyTorch now holding the largest market share in research papers [03:42:00].

TinyGrad

TinyGrad, a machine learning framework focused on minimizing lines of code, does not have a dedicated KL Divergence implementation [03:44:00]. This design choice is based on the fact that minimizing cross-entropy is mathematically similar to minimizing KL Divergence (differing only by a constant related to the entropy of the true distribution) [03:48:00]. Therefore, users can achieve similar regularization effects using cross-entropy, adhering to TinyGrad’s philosophy of code simplicity [03:54:00].
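
A short sketch illustrates why this substitution works (PyTorch is used here only for autograd; this is not TinyGrad code): when the target distribution P is fixed, cross-entropy and KL Divergence differ by the constant H(P) and therefore produce identical gradients.

```python
import torch
import torch.nn.functional as F

p = torch.tensor([0.5, 0.3, 0.2])            # fixed target distribution P
logits = torch.randn(3, requires_grad=True)  # model parameters producing Q

log_q = F.log_softmax(logits, dim=-1)

cross_entropy = -(p * log_q).sum()           # H(P, Q)
kl = (p * (p.log() - log_q)).sum()           # D_KL(P || Q) = H(P, Q) - H(P)

grad_ce = torch.autograd.grad(cross_entropy, logits, retain_graph=True)[0]
grad_kl = torch.autograd.grad(kl, logits)[0]

# H(P) does not depend on the model, so both objectives give the same gradient.
assert torch.allclose(grad_ce, grad_kl)
```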