From: jimruttshow8596
Early Origins and Deductive Frameworks
The concept of neural networks emerged from a surprising origin in the 1940s, initially as deductive frameworks rather than inductive ones [00:06:33], [00:10:19], [00:10:27]. This development was a “weird conjunction” of two eccentric figures: Warren McCulloch and Walter Pitts [00:08:04], [00:08:14], [00:10:01].
- Warren McCulloch was a neurophysiologist at Yale interested in grounding epistemological theory in neurons [00:08:23], [00:08:31], [00:08:38], [00:09:49].
- Walter Pitts was a “young weird genius” who, at age 12, found errors in Whitehead and Russell’s Principia Mathematica while hiding in a library [00:08:48], [00:09:04], [00:09:17]. He later met McCulloch in Chicago [00:09:36], [00:09:45].
In 1943, McCulloch and Pitts co-authored a paper that drew on George Boole’s The Laws of Thought (1854) and Principia Mathematica to describe how a brain might reason propositionally [00:09:56], [00:10:01], [00:10:03], [00:10:08], [00:10:12], [00:10:15]. Their work was much closer to what is now considered symbolic AI [00:10:29].
Connection to Statistics
The development of statistics, originating from efforts to reduce error in celestial mechanics measurements, led to an inductive approach to understanding reality [00:07:02], [00:07:07], [00:07:10], [00:07:15]. Key figures include:
- David Hume (18th century): Focused on associations, believing humans were not rational enough to understand the world through deduction [00:06:41], [00:06:46], [00:06:48], [00:06:50].
- R. A. Fisher (1920s): Developed deep concepts like “sufficiency” for parameter estimation [00:07:27], [00:07:31], [00:07:55].
- Neyman and Pearson (1930s): Developed hypothesis testing [00:07:43], [00:07:46].
These inductive mathematical frameworks focused on parameter estimation, a principle modern neural networks also utilize [00:07:57], [00:08:00].
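To make the parameter-estimation link concrete, here is a minimal sketch of the inductive recipe the statisticians formalized: pick a parametric model, then choose the parameters that best fit the observations. The data, model, and learning rate below are invented for illustration; neural network training is the same loss-minimization loop applied to millions or billions of parameters.

```python
import numpy as np

# Invented data: noisy observations of y = 2.0 * x + 1.0
rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, size=200)
y = 2.0 * x + 1.0 + 0.1 * rng.standard_normal(200)

# Model: y ≈ w * x + b; estimate (w, b) by minimizing the mean squared error
w, b = 0.0, 0.0
learning_rate = 0.1
for _ in range(500):
    residual = (w * x + b) - y
    w -= learning_rate * 2.0 * np.mean(residual * x)   # gradient of MSE w.r.t. w
    b -= learning_rate * 2.0 * np.mean(residual)       # gradient of MSE w.r.t. b

print(f"estimated w = {w:.2f}, b = {b:.2f}")  # should land near 2.00 and 1.00
```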
The AI Winter and its Overcoming
Neural networks faced a period known as the “AI Winter” [00:11:58].
- Marvin Minsky and Seymour Papert’s Perceptrons (1969): This influential book criticized the mathematical capabilities of perceptrons, in particular their inability to handle problems that are not linearly separable, such as the XOR function (a minimal demonstration follows this list) [00:12:09], [00:12:11], [00:12:14], [00:12:16], [00:12:18].
- The “Too Many Parameters” Problem: A core argument was that overcoming these limitations would require “deep neural nets,” which were deemed impossible to train due to the sheer number of parameters (e.g., “a hundred parameters” was considered infeasible) [00:12:23], [00:12:25], [00:12:27], [00:12:29].
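To see the limitation Minsky and Papert highlighted: no single hyperplane separates the XOR outputs, so the classic perceptron learning rule can never reach perfect accuracy on it, while a linearly separable function like AND is learned easily. A toy sketch (the learning rate and epoch count are arbitrary):

```python
import numpy as np

def perceptron_accuracy(X, y, epochs=100, lr=0.1):
    """Train a single-layer perceptron with the classic update rule; return its accuracy."""
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            pred = 1 if xi @ w + b > 0 else 0
            w += lr * (yi - pred) * xi
            b += lr * (yi - pred)
    return np.mean((X @ w + b > 0).astype(int) == y)

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
print("AND:", perceptron_accuracy(X, np.array([0, 0, 0, 1])))  # converges to 1.0
print("XOR:", perceptron_accuracy(X, np.array([0, 1, 1, 0])))  # stays below 1.0: no separating hyperplane exists
```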
This stagnation persisted for decades; even in 2002, neural networks with only three or four layers were the norm [00:12:42], [00:12:43], [00:12:46].
The Rise of Modern Deep Learning
The landscape began to change significantly with the fusion of neural network concepts with statistical methods in the late 1970s and 1980s [00:10:41], [00:10:44].
- Backpropagation (Backprop): Developed in the late 1980s, backprop made it possible to train networks composed of differentiable functions (see the first sketch after this list) [00:15:34], [00:15:37], [00:15:39], [00:15:41]. While a bottleneck for small models, it became a “non-problem” for very large models thanks to the “miracle of ultra high dimensionality” [00:15:43], [00:15:46], [00:15:48], [00:16:03].
- Support Vector Machines (SVMs): Popular in the 1990s, these offered a more formally mathematical version of the same underlying ideas, using “shattering operations” to separate data points with hyperplanes (see the second sketch after this list) [00:19:50], [00:19:51], [00:19:56], [00:19:59], [00:20:01]. SVMs scaled quadratically with the size of their input [00:20:06], [00:20:08], but may see a comeback as computing power increases [00:19:50], [00:19:51], [00:20:12], [00:20:20].
- Graphics Processing Units (GPUs) and Big Data: The breakthrough use of GPUs and the availability of massive datasets from the 1990s onward fundamentally transformed the field [00:10:53], [00:10:55], [00:12:46], [00:12:48], allowing models with orders of magnitude more parameters to be trained [00:12:52], [00:12:53]. Modern deep learning models use simple activation functions like ReLU because they are cheap to compute on GPUs, even though they lack biological realism [00:11:41], [00:11:43], [00:11:45].
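As a concrete illustration of the backprop bullet above and of the ReLU activation, here is a hand-written forward/backward pass for a one-hidden-layer network that learns the XOR function the bare perceptron could not. The layer sizes, learning rate, and seed are arbitrary choices for the sketch:

```python
import numpy as np

rng = np.random.default_rng(1)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0.0], [1.0], [1.0], [0.0]])

# One hidden layer of 8 ReLU units, one linear output unit
W1 = rng.standard_normal((2, 8)) * 0.5
b1 = np.zeros(8)
W2 = rng.standard_normal((8, 1)) * 0.5
b2 = np.zeros(1)
lr = 0.1

for step in range(2000):
    # Forward pass
    h_pre = X @ W1 + b1
    h = np.maximum(h_pre, 0.0)          # ReLU: cheap and GPU-friendly
    out = h @ W2 + b2
    loss = np.mean((out - y) ** 2)      # mean squared error

    # Backward pass: the chain rule applied layer by layer (backprop)
    d_out = 2 * (out - y) / len(X)
    d_W2 = h.T @ d_out
    d_b2 = d_out.sum(axis=0)
    d_h = d_out @ W2.T
    d_pre = d_h * (h_pre > 0)           # ReLU derivative is 0 or 1
    d_W1 = X.T @ d_pre
    d_b1 = d_pre.sum(axis=0)

    # Gradient descent update
    W1 -= lr * d_W1; b1 -= lr * d_b1
    W2 -= lr * d_W2; b2 -= lr * d_b2

print("final loss:", round(float(loss), 4))
print("predictions:", out.ravel().round(2))  # typically close to [0, 1, 1, 0]
```

The same chain-rule bookkeeping, automated and run on GPUs over billions of parameters, is what modern frameworks scale up.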
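And a sketch of the hyperplane idea behind SVMs, here using scikit-learn’s linear-kernel SVC on two made-up Gaussian clusters (the data and parameters are purely illustrative):

```python
import numpy as np
from sklearn.svm import SVC

# Two illustrative 2-D clusters, one per class
rng = np.random.default_rng(2)
X = np.vstack([rng.normal(loc=-2.0, scale=0.5, size=(50, 2)),
               rng.normal(loc=+2.0, scale=0.5, size=(50, 2))])
y = np.array([0] * 50 + [1] * 50)

# A linear-kernel SVM finds a maximum-margin separating hyperplane w.x + b = 0
clf = SVC(kernel="linear", C=1.0).fit(X, y)
w, b = clf.coef_[0], clf.intercept_[0]
print("hyperplane normal w:", w.round(2), " offset b:", round(float(b), 2))
print("support vectors used:", len(clf.support_vectors_))
print("training accuracy:", clf.score(X, y))  # 1.0 for such well-separated clusters
```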
Current State and Future Directions
Today, models like GPT-4 have on the order of 1.3 trillion parameters [00:12:58], [00:13:00], [00:33:56]. These “superhuman models” have effectively addressed the problem of induction, finding high-dimensional regularities in complex domains [00:14:49], [00:14:51], [00:15:21], [00:15:25].
However, current deep learning models have limitations:
- Computational Cost: Training large language models (LLMs) requires hundreds of thousands of kilowatt-hours of electricity and millions of dollars [00:33:58], [00:34:00], [00:34:05].
- Arithmetic Weakness: GPT-4, despite its size, struggles with basic arithmetic beyond a few digits, performing worse than a 50-year-old HP-35 calculator with 1KB of memory [00:34:10], [00:34:11], [00:34:13], [00:34:16], [00:34:50], [00:35:04], [00:35:08], [00:35:10].
- Data Efficiency: Humans and animals learn from vastly smaller amounts of data compared to deep learning models [00:23:31], [00:23:35], [00:23:47], [00:23:50], [00:24:36].
- Lack of Insight: While powerful for prediction, these models offer little initial insight into the underlying mechanisms [00:05:01], [00:06:00], [00:06:03].
- Representational Extraction: The internal representations in LLMs are difficult to extract and interpret [00:28:18], [00:28:20].
Despite these limitations, innovative approaches in AI research are combining deep learning with other methods:
- Cognitive Synergy: Approaches like those from the OpenCog group and SingularityNET aim to integrate deep learning with genetic algorithms, symbolic AI, mathematical provers, and solvers [00:35:36], [00:35:51], [00:35:55], [00:35:58], [00:36:01], [00:36:03]. This seeks to leverage the strengths of each, for instance, combining deep learning’s perceptual power with deep mathematical skills [00:36:21], [00:36:23], [00:36:26].
- Deep Neural Networks as Pre-processors for Science: Miles Cranmer’s work in cosmology uses graph neural networks to infer physical laws from astronomical data [00:28:32], [00:28:35], [00:28:38], [00:28:42]. This involves:
- Explicitly encoding particle interactions into the neural net’s topology [00:29:10], [00:29:11], [00:29:13], [00:29:16].
- Sparsifying and quantizing the network [00:29:20], [00:29:23], [00:29:26].
- Applying a genetic algorithm to perform symbolic regression on the quantized data, leading to the discovery of algebraic formulas like Newton’s laws or new encodings for dark energy [00:29:36], [00:29:38], [00:29:42], [00:29:48], [00:29:51], [00:29:55], [00:29:58], [00:30:01]. This represents a new way of doing science, where the neural network handles prediction, and symbolic regression provides understanding [00:31:01], [00:31:04], [00:31:06], [00:31:10].
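This is not Cranmer’s actual pipeline (which uses graph neural networks and far more sophisticated symbolic-regression tooling), but a toy sketch of the final step can convey the idea: a genetic-style search over small expression trees that tries to recover an inverse-square law from synthetic data. All primitives, population sizes, and data below are invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(3)

# Synthetic "observations": force between two unit masses at distance r, F = 1 / r**2
r = np.linspace(0.5, 5.0, 100)
F = 1.0 / r**2

# Candidate expressions are small trees built from these primitives
OPS = {
    "add": np.add,
    "sub": np.subtract,
    "mul": np.multiply,
    "div": lambda a, b: a / np.where(np.abs(b) < 1e-9, 1e-9, b),  # guarded division
}
TERMINALS = ["r", 1.0, 2.0]

def random_tree(depth=3):
    if depth == 0 or rng.random() < 0.3:
        return TERMINALS[rng.integers(len(TERMINALS))]
    op = list(OPS)[rng.integers(len(OPS))]
    return (op, random_tree(depth - 1), random_tree(depth - 1))

def evaluate(tree, r):
    if tree == "r":
        return r
    if isinstance(tree, float):
        return np.full_like(r, tree)
    op, left, right = tree
    return OPS[op](evaluate(left, r), evaluate(right, r))

def fitness(tree):
    return np.mean((evaluate(tree, r) - F) ** 2)  # lower is better

def mutate(tree, depth=3):
    # Replace a random subtree with a fresh random one
    if not isinstance(tree, tuple) or rng.random() < 0.3:
        return random_tree(depth)
    op, left, right = tree
    if rng.random() < 0.5:
        return (op, mutate(left, depth - 1), right)
    return (op, left, mutate(right, depth - 1))

# Simple elitist evolutionary loop: keep the best trees, mutate them
population = [random_tree() for _ in range(200)]
for generation in range(60):
    population.sort(key=fitness)
    parents = population[:40]
    children = [mutate(p) for p in parents for _ in range(4)]
    population = parents + children

best = min(population, key=fitness)
print("best expression:", best)            # with luck, equivalent to 1 / (r * r)
print("mean squared error:", fitness(best))
```

In the workflow described above, the symbolic regression runs on the sparsified, quantized network rather than on raw synthetic data; the neural network does the predicting, and the recovered formulas do the explaining.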
There is a growing recognition that techniques previously “forfeited” may return [00:18:20], [00:18:23]. For example, genetic algorithms could make a comeback in building large language models, as they are “embarrassingly parallel” and do not require the unity of memory that gradient descent methods do [00:19:05], [00:19:06], [00:19:09], [00:19:12], [00:19:20], [00:19:21], [00:19:23], [00:19:25]. This suggests that the current era of deep learning may lead to new scientific principles and effective laws for explaining adaptive reality [00:47:09], [00:47:12], [00:47:15].
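A final toy sketch of the “embarrassingly parallel” point: in a genetic algorithm, each candidate’s fitness is evaluated independently, with no shared gradient state, so the evaluation step maps directly onto a process pool. The fitness function and hyperparameters here are stand-ins:

```python
import numpy as np
from multiprocessing import Pool

def fitness(candidate):
    """Stand-in fitness: each candidate is scored independently of all others."""
    # In a real system this could be an expensive model evaluation.
    return -np.sum((candidate - 0.5) ** 2)

def evolve(population, rng, keep=16):
    # Selection plus mutation; no gradients, no shared memory between evaluations
    with Pool() as pool:
        scores = pool.map(fitness, population)        # the embarrassingly parallel step
    ranked = [c for _, c in sorted(zip(scores, population), key=lambda t: -t[0])]
    parents = ranked[:keep]
    children = [p + 0.05 * rng.standard_normal(p.shape)
                for p in parents
                for _ in range(len(population) // keep - 1)]
    return parents + children

if __name__ == "__main__":
    rng = np.random.default_rng(4)
    population = [rng.uniform(0.0, 1.0, size=32) for _ in range(64)]
    for _ in range(20):
        population = evolve(population, rng)
    print("best fitness:", max(fitness(c) for c in population))  # climbs toward 0
```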