From: hu-po

Probabilistic computing, also referred to as thermodynamic computing, represents a new class of computer designed to address the growing demand for computation as traditional digital hardware approaches fundamental limits like the end of Moore’s Law and Dennard scaling [04:41:45]. These specialized, unconventional machines sit “in between” quantum computers and digital computers [07:28:44].

Instead of relying on binary ones and zeros like digital computers [07:35:00] or quantum states like quantum computers [07:45:00], probabilistic computers leverage natural processes, specifically the vibration of matter (atoms and molecules), to perform computation [08:24:28]. This approach is well-suited for tasks involving randomness, such as probabilistic and generative AI [09:11:00], and can be exploited for linear algebra computations [09:16:03].
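
To make the linear algebra connection concrete, here is a minimal toy sketch (plain numpy, not Normal Computing’s hardware or code) of the underlying idea: noisy “thermal” dynamics settle into a Boltzmann distribution whose mean is A^{-1}b, so time-averaging the trajectory approximately solves the linear system Ax = b.

```python
import numpy as np

rng = np.random.default_rng(0)

# Symmetric positive-definite system A x = b (a stand-in for a physical energy landscape).
A = np.array([[3.0, 0.5],
              [0.5, 2.0]])
b = np.array([1.0, -1.0])

# Overdamped Langevin dynamics: dx = -(A x - b) dt + sqrt(2) dW.
# The stationary (Boltzmann) distribution is a Gaussian with mean A^{-1} b,
# so the long-run time average of the trajectory estimates the solution.
dt, n_steps, burn_in = 1e-3, 200_000, 20_000
x = np.zeros(2)
running_sum = np.zeros(2)
for step in range(n_steps):
    x += -(A @ x - b) * dt + np.sqrt(2 * dt) * rng.normal(size=2)
    if step >= burn_in:
        running_sum += x

print("Langevin estimate:", running_sum / (n_steps - burn_in))
print("Direct solve:     ", np.linalg.solve(A, b))
```

On a physical device this relaxation would be carried out by the hardware itself; the CPU loop above merely stands in for it.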

Key Players

  • Normal Computing

    • A startup that launched from stealth [05:59:01].
    • They are actively publishing papers and open-sourcing their code, which distinguishes them from some competitors [09:35:00].
    • Their mission is to make AI universally scalable [09:41:00].
    • The paper discussed, “Thermodynamic Natural Gradient Descent,” was released by Normal Computing on May 22, 2024 [02:49:00].
    • They use Texas Instruments chips for digital-to-analog conversion [03:11:00]. This choice is significant due to the geopolitical advantage of U.S.-based chip manufacturing [03:26:00].
  • Extropic

    • Considered a more famous probabilistic computing company [10:50:00].
    • Noted for generally not releasing their work or publishing papers, unlike Normal Computing [10:57:00].

How it Works: Thermodynamic Natural Gradient Descent (TNGD)

Probabilistic computing enables Thermodynamic Natural Gradient Descent (TNGD), a novel algorithm designed to perform second-order optimization for neural networks [07:08:00]. This is a significant advancement over common first-order methods like SGD (Stochastic Gradient Descent) and Adam, which dominate deep learning largely because second-order methods have been too expensive on digital hardware [06:26:00].

First vs. Second Order Optimization

  • First Order Methods (e.g., SGD, Adam):

    • Rely on the first derivative (slope) of the loss landscape [17:45:00].
    • They take small steps down the steepest path [14:53:00].
    • Effective for simple loss landscapes but can struggle with complex surfaces or get stuck in local minima [43:51:00].
    • More efficient to compute on digital computers compared to second-order methods [24:08:00].
  • Second Order Methods (e.g., Natural Gradient Descent - NGD):

    • Utilize the second derivative (curvature) of the loss landscape [18:07:00].
    • This provides more information about the shape of the landscape, allowing for better, potentially larger, gradient steps [18:50:00].
    • NGD explicitly accounts for the curvature of the loss landscape using the Fisher Information Matrix (or its empirical approximation) [38:43:00].
    • The Fisher Information Matrix serves as a substitute for the Hessian (second derivative matrix) in NGD [41:05:05].
    • The Hessian and Jacobian (first derivative matrix) are used to compute the Fisher Information Matrix [45:56:00].
    • While theoretically converging in fewer iterations [23:18:00], NGD has historically been computationally expensive on digital hardware (O(N^3) runtime, O(N^2) memory) [52:23:00]; a minimal numerical contrast with a first-order step is sketched just after this list.
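
To make the contrast concrete, the sketch below (illustrative numpy only; the damping term and variable names are assumptions, not the paper’s code) compares a single first-order step with a natural-gradient step that preconditions the gradient by the inverse of a damped Fisher matrix.

```python
import numpy as np

def sgd_step(theta, grad, lr=0.1):
    """First-order update: move along the raw gradient."""
    return theta - lr * grad

def ngd_step(theta, grad, fisher, lr=0.1, damping=1e-3):
    """Second-order (natural gradient) update: precondition the gradient with
    the inverse of a damped Fisher matrix. The dense solve is what costs
    O(N^3) time and O(N^2) memory on digital hardware."""
    nat_grad = np.linalg.solve(fisher + damping * np.eye(len(grad)), grad)
    return theta - lr * nat_grad

# Ill-conditioned quadratic loss 0.5 * theta^T F theta, where curvature matters.
fisher = np.diag([100.0, 1.0])   # stand-in for the (empirical) Fisher matrix
theta = np.array([1.0, 1.0])
grad = fisher @ theta            # gradient of the quadratic loss

print("SGD step:", sgd_step(theta, grad))           # overshoots the steep direction
print("NGD step:", ngd_step(theta, grad, fisher))   # rescales each direction by curvature
```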

The Hybrid Digital-Analog System

TNGD is implemented on a hybrid digital-analog system consisting of a GPU (digital) and an SPU (Stochastic Processing Unit, analog/thermodynamic) [25:17:00]; a toy simulation of the full loop is sketched after the numbered list below.

  1. GPU’s Role:

    • Stores the neural network model architecture [26:12:00].
    • Handles standard digital tasks like data loading and calculating the first-order gradient (∇L) as well as the Jacobian and Hessian matrices [28:51:00].
  2. SPU’s Role:

    • The core thermodynamic computer [25:56:00].
    • Receives the gradient, Jacobian, and Hessian from the GPU via a Digital-to-Analog Converter (DAC) [28:00:00].
    • These inputs modify the dynamics of the SPU, which then “evolves” physically, dissipating heat as it relaxes [28:14:00].
    • The SPU settles towards an equilibrium state defined by a Boltzmann distribution [55:27:00].
    • This equilibrium state provides an estimate of the natural gradient [55:54:00].
    • Samples are taken from the SPU and sent back to the GPU via an Analog-to-Digital Converter (ADC) [28:05:00].
  3. Asynchronous Operation:

    • The GPU and SPU can operate asynchronously [28:36:00] because the “clock speed” of nature (the rate at which molecules vibrate and dissipate heat) is incredibly fast – significantly faster than a GPU’s clock speed [59:40:00].
    • Crucially, numerical evidence shows that accurate samples of the natural gradient can be taken even before the SPU fully reaches equilibrium, without significantly harming performance [58:22:00].
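
The loop below is a rough, purely illustrative simulation of that hybrid setup. It assumes the SPU can be modeled as noisy linear (Ornstein-Uhlenbeck-style) dynamics whose equilibrium mean is a damped natural gradient; the SimulatedSPU class and its methods are invented for this sketch and are not Normal Computing’s API.

```python
import numpy as np

class SimulatedSPU:
    """Toy stand-in for the analog Stochastic Processing Unit (SPU).

    The state x follows noisy linear dynamics whose equilibrium (Boltzmann)
    distribution has mean (F + damping*I)^{-1} g, i.e. an estimate of the
    (damped) natural gradient. The state is deliberately NOT reset between
    calls, mimicking the "momentum" carried over between batches.
    """

    def __init__(self, dim, dt=1e-3, noise_scale=0.05, damping=1e-3, seed=0):
        self.x = np.zeros(dim)
        self.dt = dt
        self.noise_scale = noise_scale
        self.damping = damping
        self.rng = np.random.default_rng(seed)

    def evolve_and_sample(self, fisher, grad, n_steps=2000):
        """Upload grad/Fisher (the DAC step), let the dynamics relax toward
        equilibrium, and read back a time-averaged sample (the ADC step)."""
        A = fisher + self.damping * np.eye(len(grad))
        samples = []
        for _ in range(n_steps):
            drift = -(A @ self.x - grad) * self.dt
            diffusion = self.noise_scale * np.sqrt(self.dt) * self.rng.normal(size=len(grad))
            self.x = self.x + drift + diffusion
            samples.append(self.x.copy())
        # Average the later samples; even without full equilibration this
        # gives a usable natural-gradient estimate.
        return np.mean(samples[len(samples) // 2:], axis=0)

# "GPU side": a tiny quadratic loss 0.5 * theta^T F theta stands in for a network.
fisher = np.diag([50.0, 1.0])          # stand-in for the empirical Fisher
theta = np.array([2.0, 2.0])
spu = SimulatedSPU(dim=2)

for step in range(5):
    grad = fisher @ theta                            # digital: backprop-style gradient
    nat_grad = spu.evolve_and_sample(fisher, grad)   # analog: natural-gradient estimate
    theta = theta - 0.2 * nat_grad                   # digital: parameter update
    print(f"step {step}: theta = {theta}")
```

Note that the state is carried over between calls rather than reset, loosely mirroring the emergent momentum effect discussed below.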

Advantages of TNGD

  • Computational Efficiency: TNGD drastically reduces the runtime and memory complexity of NGD from O(N^3) runtime and O(N^2) memory to a linear O(N) for both, matching the efficiency of SGD/Adam [52:53:00].
  • Momentum Effect: The physical nature of the SPU introduces an emergent “momentum” effect [12:29:00]. When a new batch of data is fed, residual thermal/vibrational information from previous batches remains [15:57:00]. This built-in history helps overcome local minima, similar to explicit momentum terms in algorithms like Adam [13:22:00]. A non-zero delay time can even improve performance due to this effect [12:20:00].
  • Smooth Interpolation: The time the SPU is allowed to evolve acts as a dial [01:01:50]: zero evolution time is equivalent to SGD, while infinite time approaches NGD. This allows for a smooth interpolation between first-order and second-order optimization [01:01:11], as illustrated in the short calculation below.
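
That interpolation can be seen in a noiseless toy model of the SPU relaxation (an illustrative assumption, not the paper’s derivation): starting from x(0) = 0 and relaxing as dx/dt = -(F x - g), the state is x(t) = (I - e^{-F t}) F^{-1} g, which is approximately t·g (the raw gradient direction) for small t and tends to the natural gradient F^{-1} g as t grows.

```python
import numpy as np

def finite_time_estimate(fisher, grad, t):
    """Noiseless relaxation x(t) = (I - exp(-F t)) F^{-1} g for a symmetric,
    positive-definite F, computed via an eigendecomposition."""
    eigvals, eigvecs = np.linalg.eigh(fisher)
    g_eig = eigvecs.T @ grad
    x_eig = (1.0 - np.exp(-eigvals * t)) / eigvals * g_eig
    return eigvecs @ x_eig

fisher = np.diag([10.0, 1.0])
grad = np.array([1.0, 1.0])

for t in (1e-3, 0.1, 1.0, 100.0):
    print(f"t = {t:7.3f}: estimate = {finite_time_estimate(fisher, grad, t)}")

print("gradient direction (SGD-like):", grad)
print("natural gradient (NGD):       ", np.linalg.solve(fisher, grad))
```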

Current State and Challenges

  • Early Development: The technology is still in its early stages, with companies like Normal Computing building “demos” and proof-of-concept setups [03:09:00]. These are not yet production-ready systems for training large models like Llama 4 [03:29:00].
  • Thermal Isolation: Probabilistic computers are highly sensitive to external perturbations, requiring thermal insulation to prevent interference with the atomic vibrations [03:05:00].
  • Simulated vs. Real Hardware: The paper’s reported performance relies on a simulated TNGD, not actual physical hardware [01:18:50]. This is a common practice in early-stage quantum computing research as well [01:19:15]. The simulation itself can be computationally intensive, requiring high-end GPUs like the Nvidia A100 [01:22:50].
  • Performance Metrics: While TNGD outperforms Adam and other second-order optimizers in toy problems (e.g., a CNN on MNIST, DistilBERT fine-tuning on SQuAD) [01:08:52], the practical impact depends on the future availability and scalability of physical analog thermodynamic computers [01:17:07].
  • Analog-to-Digital Conversion Bottleneck: Currently, the slowest part of the hybrid system is the conversion of analog information from the SPU back into digital information for the GPU [01:04:46]. This suggests a potential future where all computations might be performed in thermodynamic/probabilistic space, eliminating the need for digital components for certain tasks [01:05:01]. However, this is likely decades away [01:05:34].

Potential Future

Probabilistic computing is unlikely to immediately replace all digital computers [01:05:51]. It may first find its niche in specialized data centers for AI training, where its advantages in second-order optimization for complex loss landscapes are most beneficial [01:05:54]. Given how compute-constrained AI training has become, the ability to train models more efficiently represents a significant market opportunity [01:38:09].