From: jimruttshow8596
Introduction
The field of machine learning (ML), particularly deep neural networks, represents a significant shift in scientific methodology, moving towards data-driven approaches with practical applications, often outpacing traditional theory-driven science [00:02:52]. This “bifurcation” in science highlights a division between paradigms focused on prediction and those on understanding [00:03:03].
Data-Driven vs. Theory-Driven Science
All scientific endeavors begin with data collection [00:02:12]. A distinction can be made, however, between “fine-grained paradigms of prediction,” which utilize large models with practical value, and “coarse-grained paradigms of understanding,” focused on fundamental theories [00:02:52]. Historically, physical science was fortunate to enjoy a conjunction of the two, with fundamental theory proving very useful in practice [00:03:05].
However, in certain complex domains, data science has achieved significant breakthroughs where traditional theory-driven methods have fallen short. Two prominent examples include:
- Protein Folding: AlphaFold, despite being developed by a relatively small group with “a shitload of computation,” vastly outperformed previous methods of protein folding without providing direct theoretical insight [00:04:47]. Traditional computational chemistry efforts, which had considered protein folding the “Fermat’s Last Theorem” of the field, struggled for decades [00:04:26].
- Language Understanding: Traditional computational linguistics, relying on rule-based parsers, made limited progress in understanding the “only marginally lawful nature of human language” [00:05:13]. In contrast, “Transformer Technologies” coupled with “Brute Force data and Brute Force computation” have yielded “unbelievably powerful language models” [00:05:38]. These models provide little initial insight into mechanisms but generate numerous examples that can be analyzed through induction and abduction [00:06:03].
Historical Context of Neural Networks
The origins of neural networks are not rooted in induction, as commonly perceived today, but in deductive frameworks from the 1940s [00:06:33].
- Early Development: The modern understanding of induction stems from David Hume in the 18th century, who grounded knowledge in association, arguing that neither humans nor the world they build are products of deductive reason [00:06:46].
- Statistics and Error Reduction: The development of statistics in the 19th and early 20th centuries by figures like Fisher, Neyman, and Pearson, arose from the need to reduce error in measurements in fields like celestial mechanics [00:07:02]. These inductive mathematical frameworks focused on parameter estimation, a concept still central to modern neural networks [00:07:54].
- McCulloch and Pitts: In the 1940s, Warren McCulloch (a neurophysiologist interested in grounding epistemology in neurons) and Walter Pitts (a young genius interested in logic) converged their interests [00:08:17]. Pitts, at age 12, had found errors in Whitehead and Russell’s Principia Mathematica, an attempt to ground mathematics in logic [00:09:00]. Their 1943 paper was a “weird conjunction” of Boole’s laws of thought and the Principia, aiming to understand how a brain might reason propositionally [00:10:01] (a minimal sketch of such a threshold-logic neuron follows this list). Thus, the early history of neural networks was about deduction and logic, closer to symbolic AI [00:10:19].
- Fusion and Growth: The fusion of these deductive neural network concepts with inductive statistical methods began in the 1970s and 1980s [00:10:42]. The real breakthrough came with “big data” in the 1990s and, later, graphics processing units (GPUs), leading to the complex landscape of technical phenomena seen today [00:10:53].
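To make the deductive flavor of the 1943 model concrete, here is a minimal sketch of McCulloch-Pitts-style threshold units computing propositional logic. The weights and thresholds are illustrative choices of mine, not values taken from the paper.

```python
# Toy McCulloch-Pitts threshold units: a weighted sum of binary inputs compared
# against a threshold, from which logic gates fall out directly.
# Weights and thresholds below are illustrative, not from the 1943 paper.

def mp_neuron(inputs, weights, threshold):
    """Fire (return 1) iff the weighted sum of binary inputs reaches the threshold."""
    total = sum(w * x for w, x in zip(weights, inputs))
    return 1 if total >= threshold else 0

def AND(a, b): return mp_neuron([a, b], [1, 1], 2)
def OR(a, b):  return mp_neuron([a, b], [1, 1], 1)
def NOT(a):    return mp_neuron([a], [-1], 0)

if __name__ == "__main__":
    for a in (0, 1):
        for b in (0, 1):
            print(f"a={a} b={b}  AND={AND(a, b)}  OR={OR(a, b)}  NOT a={NOT(a)}")
```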
Evolution and Breakthroughs in Neural Networks
Early neural networks faced significant challenges:
- AI Winter: The “AI winter” was partly due to criticisms of neural networks’ mathematical capabilities, notably the inability of single-layer perceptrons to solve problems that are not linearly separable, such as the XOR function [01:11:56]. Marvin Minsky and Seymour Papert’s 1969 book on perceptrons argued that overcoming these limitations would require deep, multi-layer neural networks, which were then considered infeasible because of the “too many parameters to train” problem [01:12:10].
- Backpropagation: The development of backpropagation in the late 1980s was a crucial technological advancement [01:15:34]. It works only on differentiable functions, and on small models gradient descent tends to stall in local optima, which initially made it a bottleneck [01:15:39]. On very large models this becomes a non-issue, because “if you have… a hundred thousand dimensions there’s always one that’s pointing down” [01:15:53]. This “miracle of ultra-high dimensionality” allows gradient descent to work on many more problems than previously thought [01:16:03]. (The sketch after this list shows a tiny network learning XOR by gradient descent.)
- GPU Breakthrough: The ability to scale neural networks to many layers (e.g., eight or nine) became possible only after the GPU breakthrough of the mid-2000s [01:12:46]. This enabled models like GPT-4, which reportedly has around 1.3 trillion parameters [01:13:00].
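As a toy illustration of both points above, the sketch below trains a tiny two-layer network on XOR with hand-written backpropagation: no single perceptron can represent XOR, but a small differentiable network learns it by gradient descent. The hidden size, learning rate, and iteration count are arbitrary assumptions and may need adjusting for a given random seed.

```python
import numpy as np

# XOR is not linearly separable: no single weighted sum plus threshold can
# produce the target column y. A tiny two-layer sigmoid network trained by
# gradient descent (backpropagation written out by hand) can.

rng = np.random.default_rng(0)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

W1 = rng.normal(0, 1, (2, 4)); b1 = np.zeros((1, 4))   # input -> hidden
W2 = rng.normal(0, 1, (4, 1)); b2 = np.zeros((1, 1))   # hidden -> output
lr = 1.0

for step in range(10000):
    # forward pass
    h = sigmoid(X @ W1 + b1)
    out = sigmoid(h @ W2 + b2)
    # backward pass for mean squared error (backpropagation)
    d_out = (out - y) * out * (1 - out)
    d_h = (d_out @ W2.T) * h * (1 - h)
    W2 -= lr * h.T @ d_out;  b2 -= lr * d_out.sum(axis=0, keepdims=True)
    W1 -= lr * X.T @ d_h;    b1 -= lr * d_h.sum(axis=0, keepdims=True)

# Should approach [0, 1, 1, 0]; if it stalls (rare), try another seed.
print(np.round(out, 3).ravel())
```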
Superhuman Models and Complexity
These massive models, termed “superhuman models,” operate in a space “way beyond human understanding” [01:13:05]. They demonstrate a phenomenon where, contrary to classical statistical theory, adding parameters to a model, even after an initial worsening of test performance (the “statistical uncanny valley”), eventually leads to better generalization [01:14:32] (a rough numerical sketch of this pattern appears below). These models, in a sense, confront the problem of induction that Hume pointed out: induction is always contingent and cannot guarantee certainty [01:14:55]. The success of superhuman models suggests that “there are regularities in the high dimensions” of complex phenomena, revealing something about the nature of complexity itself [01:15:21].
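The “add parameters, get worse, then get better” pattern is often illustrated with random-feature regression fit by minimum-norm least squares; the sketch below follows that recipe. All settings (sample size, noise level, feature counts) are assumptions of mine, and the dip-then-recover shape of the printed test errors may need some tuning to show up cleanly.

```python
import numpy as np

# Rough numerical sketch of "more parameters eventually generalize better":
# random ReLU features plus the minimum-norm least-squares fit. Test error
# typically spikes near n_features ~ n_train and then falls again as the
# model becomes heavily overparameterized. A toy, not a study.

rng = np.random.default_rng(1)
n_train, n_test, d = 40, 500, 5

def target(X):
    return np.sin(X @ np.ones(d))          # arbitrary smooth ground truth

X_tr = rng.normal(size=(n_train, d)); y_tr = target(X_tr) + 0.1 * rng.normal(size=n_train)
X_te = rng.normal(size=(n_test, d));  y_te = target(X_te)

for n_feat in [5, 10, 20, 40, 80, 160, 640, 2560]:
    W = rng.normal(size=(d, n_feat))       # fixed random projection
    b = rng.normal(size=n_feat)
    F_tr = np.maximum(X_tr @ W + b, 0)     # random ReLU features
    F_te = np.maximum(X_te @ W + b, 0)
    coef = np.linalg.pinv(F_tr) @ y_tr     # minimum-norm least squares
    test_mse = np.mean((F_te @ coef - y_te) ** 2)
    print(f"{n_feat:5d} features  test MSE = {test_mse:.3f}")
```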
Insights from Human Learning and Limitations of Deep Learning
Despite their computational power, current deep learning models exhibit significant differences from human learning:
- Data Efficiency: The training data from which a human learns language has an estimated footprint of only about 1.5 megabytes, whereas large language models train on hundreds of gigabytes, a difference of roughly five orders of magnitude [01:21:40]. Noam Chomsky’s “poverty of the stimulus” argument highlights that humans learn language with far less raw data, suggesting strong priors or innate abilities [01:22:23]. This efficiency might be due to constant, albeit subtle, instruction and feedback from parents and peers, which acts as a “reward signal” [01:22:51].
- Generalization from Few Examples: Humans and animals possess algorithms that allow them to “lever up from much much much smaller levels of examples” [01:23:50]. For instance, a human can learn to play a new war game after a few thousand sessions, drawing general principles, while an AI using evolutionary approaches might require millions of plays of a single game to achieve similar proficiency [01:24:11]. This “qualitatively utterly distinct” learning process suggests “a whole lot we don’t yet know” about human intelligence [01:25:12].
- Arithmetic Weakness: Despite their scale, GPT-4 and similar models struggle with elementary arithmetic, being reliable only up to roughly two-digit additions, something an HP-35 calculator from the 1970s (with about 1K of ROM) handled exactly [01:34:09]. This illustrates that these models outsource fundamental capabilities to tools rather than internalizing them, unlike human intelligence [01:37:38].
- Heuristic Induction: Humans excel at “heuristic induction”—finding unconscious heuristics that efficiently solve problems, like an outfielder catching a baseball by maintaining a consistent angle [01:38:28]. Deep learning is not particularly good at the explicit creation of such heuristics [01:39:04].
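The outfielder example can be made concrete with an idealized kinematic toy: the fielder never solves the projectile equations, they simply stand wherever the ball’s elevation angle matches the angle they first locked onto, and that position collapses onto the landing point as the ball comes down. The numbers are arbitrary and the fielder is given no speed limit.

```python
import math

# Idealized gaze-heuristic toy: hold the ball's elevation angle constant.
# Geometrically, that puts the fielder at ball_x + ball_z / tan(theta0),
# which converges to the landing point exactly when the ball lands.

dt, g = 0.01, 9.81
bx, bz = 0.0, 25.0          # ball position (m), starting near the top of its arc
vx, vz = 18.0, 0.0          # ball velocity (m/s)
fx = 50.0                   # fielder's position on the ground (m)
theta0 = math.atan2(bz, fx - bx)   # gaze angle the fielder will hold constant

t = 0.0
while bz > 0.0:
    t += dt
    bx += vx * dt
    vz -= g * dt
    bz += vz * dt
    if bz > 0.0:
        fx = bx + bz / math.tan(theta0)   # move so the gaze angle stays at theta0

print(f"ball lands near x = {bx:.1f} m after {t:.2f} s; fielder ends at x = {fx:.1f} m")
```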
Neural Networks as Pre-processors for Science
A new approach to science involves using deep neural networks as “pre-processors for parsimonious science” [01:42:07].
- Symbolic Regression: Researchers like Miles Cranmer are using deep neural networks (specifically graph neural networks, which encode explicit process topologies such as particle interactions) to infer scientific laws [01:29:00]. By taking a “lossy encoding” of the data from the neural network and then running a genetic algorithm to perform “symbolic regression,” algebraic formulas can be produced [01:29:31]; a toy sketch of this two-stage pipeline follows this list. This has led to the discovery of new parsimonious encodings for phenomena like dark energy [01:29:55].
- Separating Prediction and Understanding: This method allows for a division of labor: the large neural network handles prediction, while the simplified, sparser model derived from it provides understanding [01:30:59]. This mirrors older statistical techniques like principal component analysis or factor analysis, where high-dimensional data is reduced to “lower dimensional manifolds” to build “mechanistic causal Theory” [01:31:21]. However, this new application extends to complex phenomena like language, social institutions, or protein folding, which were previously intractable [01:31:55]. These models benefit from human expertise in establishing constraints and priors (e.g., symmetries, conservation laws in protein folding) [01:32:38].
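The sketch below is a deliberately simplified stand-in for this two-stage pipeline: a small scikit-learn MLP plays the opaque predictor, and a brute-force search over a handful of candidate formulas stands in for the genetic symbolic-regression step (Cranmer’s actual work uses graph neural networks and a genetic-programming system such as PySR). The hidden “law” and the candidate expressions here are invented for illustration.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

# Toy "neural net as pre-processor, parsimonious formula afterwards" pipeline.

rng = np.random.default_rng(0)
X = rng.uniform(0.5, 3.0, size=(2000, 2))
y = X[:, 0] * X[:, 1] ** 2 + 0.01 * rng.normal(size=2000)   # hidden law: x1 * x2^2

# Step 1: fit an accurate but opaque predictor (the "lossy encoding" of the data).
net = MLPRegressor(hidden_layer_sizes=(64, 64), max_iter=3000, random_state=0)
net.fit(X, y)

# Step 2: score a small space of algebraic formulas against the *network's*
# predictions; a real system would evolve expressions and penalize complexity.
grid = rng.uniform(0.5, 3.0, size=(500, 2))
target = net.predict(grid)
x1, x2 = grid[:, 0], grid[:, 1]
candidates = {
    "x1 + x2":      x1 + x2,
    "x1 * x2":      x1 * x2,
    "x1 * x2**2":   x1 * x2 ** 2,
    "x1**2 * x2":   x1 ** 2 * x2,
    "exp(x1 - x2)": np.exp(x1 - x2),
}
for name, values in sorted(candidates.items(),
                           key=lambda kv: np.mean((kv[1] - target) ** 2)):
    print(f"{name:>14s}  MSE vs. network = {np.mean((values - target) ** 2):.4f}")
```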
Constructs and Schemas
Complex systems, unlike purely physical ones, “encode reality,” reflecting adaptive history [01:48:06]. They contain “simulacra” or “mirrors of reality,” much like a human brain containing 3.5 billion years of evolutionary history [01:48:24]. These systems find “parsimonious encodings” or “schemas” of coarse-grained history [01:48:51].
- Complex Systems as Theorizers: A “complex system theory is a theory about theorizers” [01:49:49]. Every element within a complex system acts as a theorizer of its world; for example, a bacterium following glucose has a “theory that following glucose is a useful thing to do” [01:50:00].
- Deep Learning and Theory Creation: Training a deep neural network is akin to creating a “theory of the phenomenon” and building a “complex system” or an “organism” [01:50:27]. These schemas must be robust, evolvable, extensible, and composable [01:51:06]. Research on “compositional semantics” has shown that trained neural networks can possess an “internal compositional encoding,” which is closer to the concept of schema than a simple, massive diffuse encoding [01:51:30]. Techniques like variational autoencoders achieve this by bottlenecking the network to capture regularities in specific layers [01:52:31].
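A minimal PyTorch sketch of that bottleneck idea: the encoder squeezes the input into just two latent variables, the decoder must reconstruct from those alone, and a KL penalty keeps the latent code compact. The layer sizes, synthetic data, and loss weighting are placeholder assumptions, not any published architecture.

```python
import torch
from torch import nn

class TinyVAE(nn.Module):
    """Variational autoencoder with a deliberately narrow (2-dimensional) bottleneck."""
    def __init__(self, in_dim=20, hidden=64, latent=2):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU())
        self.to_mu = nn.Linear(hidden, latent)
        self.to_logvar = nn.Linear(hidden, latent)
        self.dec = nn.Sequential(nn.Linear(latent, hidden), nn.ReLU(),
                                 nn.Linear(hidden, in_dim))

    def forward(self, x):
        h = self.enc(x)
        mu, logvar = self.to_mu(h), self.to_logvar(h)
        z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)  # reparameterization
        return self.dec(z), mu, logvar

model = TinyVAE()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
data = torch.randn(512, 20)     # stand-in data; real inputs would have structure

for epoch in range(200):
    recon, mu, logvar = model(data)
    recon_loss = ((recon - data) ** 2).mean()
    kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
    loss = recon_loss + 0.1 * kl    # reconstruct well *and* keep the code compact
    opt.zero_grad(); loss.backward(); opt.step()

print(f"final loss: {loss.item():.3f}")
```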
Creativity and Scientific Discovery with AI
The role of AI and language models in creativity and scientific discovery is complex:
- Compositional Creativity: Compared to the “vast herd of humans,” large language models can be very creative in compositional tasks, such as generating clever plot ideas or interesting dialogue, especially when their capabilities are orchestrated [01:55:40]. They can achieve results comparable to paid professionals with significantly less human labor [01:56:18].
- Analytical and Synthesizing Capabilities: These models are not reliable “libraries” for factual data (as they can “fabulate like crazy”) but are strong “quick and dirty analysts and synthesizers” [01:57:51]. They excel at comparing and contrasting complex concepts, demonstrating a high level of analytical skill in tasks like comparing literary works or political theories [01:58:22].
- Limitations in Breakthroughs: However, these models, in their current form, are unlikely to achieve “true scientific revolutions” or breakthroughs like Einstein’s visualization of relativity [01:59:01]. Creativity in science often arises from “not being a part of the herd” and operating under “bandwidth limitation and constraints” [01:59:31].
- Historical Precedent: The history of human invention shows that constraints often lead to breakthroughs, such as Kepler and Newton deriving simpler laws from complex planetary data [01:00:17]. Similarly, the development of calculus was a human response to dealing with complicated data sets under limited computational power [01:01:00]. Imposing limitations on models might be the next stage in their evolution, pushing them towards “true scientific revolutions” rather than mere predictive ones [01:02:42].
Occam’s Razor and Meta-Occam
- Occam’s Razor: This heuristic, formulated by the medieval Scholastic William of Ockham, holds that “one should not advocate or generate a plurality without necessity” [01:09:40]. In science, it means choosing the simplest explanation with the fewest parameters [01:10:00]. The physical sciences apply this principle extensively, producing “beautiful simple things” like Dirac’s theory of relativistic quantum mechanics [01:10:38].
- Meta-Occam: In complex domains, where high-dimensional systems often seem irreducible, Occam’s Razor may not apply directly to the final object [01:10:48]. “Meta-Occam” is the notion that “the parsimony is in the process not in the final object” [01:11:21]. Darwin’s theory of evolution by natural selection, for example, is extremely parsimonious as a process, explaining the complexity of both a worm and an elephant with the same simple principles [01:11:01]. Machine learning, particularly reinforcement learning, shares this quality, being mathematically equivalent to natural selection [01:11:42] (a small numerical illustration of this kinship follows this list).
- Physics vs. Complexity Science: Physics offers “infinite models for minimal objects” (e.g., the multiverse hypothesis invoked to explain fine-tuning) [01:14:02]. Complexity science, conversely, offers parsimonious Meta-Occam processes (like natural selection or reinforcement learning) that generate very non-parsimonious objects [01:14:55]. Complexity science thus seeks Meta-Occam processes: relatively simple rule systems with open-ended properties, much like alphabets generating language [01:15:08]. Mendeleev’s periodic table is an example of recognizing a “quasi-harmonic melodic geometric pattern” without understanding the underlying atomic structure, predicting elements that had not yet been discovered [01:16:09].
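One standard way to see the kinship between reinforcement-style learning and natural selection (a textbook correspondence offered as illustration, not the episode’s own derivation) is that the discrete replicator dynamic and a simple multiplicative-weights reinforcement update coincide for small learning rates. The payoffs, starting shares, and learning rate below are arbitrary.

```python
import numpy as np

# Replicator dynamics (shares grow in proportion to relative fitness) next to a
# multiplicative-weights "reinforcement" rule (exponentiate reward, renormalize).
# To first order in the learning rate the two updates are the same, so the
# trajectories stay close.

fitness = np.array([1.0, 1.4, 0.8])        # fixed payoffs for three strategies
p_repl = np.array([0.6, 0.2, 0.2])         # population shares (selection)
p_rl = p_repl.copy()                       # action probabilities (learning agent)
eta = 0.05                                 # selection strength / learning rate

for step in range(200):
    mean_f = p_repl @ fitness
    p_repl = p_repl * (1 + eta * (fitness - mean_f))   # discrete replicator step
    p_repl /= p_repl.sum()
    w = p_rl * np.exp(eta * fitness)                   # multiplicative-weights step
    p_rl = w / w.sum()

print("replicator            :", np.round(p_repl, 3))
print("multiplicative weights:", np.round(p_rl, 3))
```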
Existential Risk and Regulation
The discussion around the “existential risk” of AI and language models is often characterized by “undisciplined” and “neurotic” narratives [01:17:29].
- Real Risks vs. Overheated Speculation: While there are significant risks, including Eliezer Yudkowsky-style “paperclip maximizer” scenarios, these are not imminent [01:17:56]. Current “overheated speculations” are largely seen as “marketing” to secure resources for longer-term problems [01:18:46].
- Types of Risks:
  - Misuse of Narrow AI: People doing “bad things with narrow AI,” such as building state-of-the-art police states using facial recognition and data tracking, as seen in China [01:19:03]. While not existential, these pose a significant risk to the world [01:19:28].
  - “Idiocracy Risk”: As AI becomes more capable, humans may “delegate our capacities to the machines” and stop investing in developing intellectual skills, leading to a societal de-evolution [01:19:57]. This raises concerns about the fragility of the technosphere; a major solar flare could disable infrastructure, leaving a de-skilled populace vulnerable [01:20:30].
  - Acceleration of “Game A”: AI could accelerate the current “game A” (the status quo, heading towards unsustainability) by making manufacturing cheaper, raw material extraction easier, and fostering technological advancements [01:21:23]. This could halve the time humanity has to address global challenges [01:21:49].
- Historical Precedent for Regulation: History offers precedents for managing technological risks:
  - Genetic Engineering: Recombinant DNA (1970s) and CRISPR (2010s) saw self-imposed moratoriums rather than stringent regulation, yet major disasters were averted [01:22:28].
  - Nuclear Weapons: After the first atomic bomb in 1945, there were early proposals for international control of nuclear materials (under Truman) and, after the Cuban Missile Crisis, non-proliferation treaties in the 1960s [01:23:41].
  - Automobile: Over 100 years, automobile fatalities per mile driven have fallen by 95% through incremental regulations such as traffic lights, seatbelts, airbags, and driving tests [01:23:47].
- Call for Empirical Approach: The current debate around AI and language models needs “empirically informed discussion” and “small regulatory interventions” rather than “super Draconian proposals” or “science fiction prognostication” [01:24:36].
Future Outlook
- Rapid Advancement: The rapid acceleration of AI and language-model development means we are on a steep part of the S-curve of technological progress [01:25:03]. The cost of building models is dropping quickly, with small state-of-the-art models now costing as little as $1,000 to train [01:26:50].
- Phase Changes: There is speculation about new “phase changes” in AI, such as GPT-5 being trained on video, which might allow it to “induce physics” and gain a “completely new way” of understanding reality [01:26:00].
- Information Agents: One potential positive application of these technologies is the development of “info agents” [01:28:40]. These AI agents would filter information, curate content, and interact on behalf of individuals, buffering them from the overwhelming “flood of sludge” on the internet [01:28:47]. This could lead to an ecosystem of mutual curation and more controlled information flow, similar to how spam filters solved the email meltdown of the mid-1990s [01:31:07].
- Human Adaptation: While some may choose to “counter technology by ignoring it,” history shows that humans adapt to dangerous technologies incrementally, as seen with fire and language [01:32:50]. The future may see a bifurcation, with some individuals becoming “completely hypnotized” by advanced attention-hijacking technologies, while others develop “a whole different pattern away from it,” demonstrating humanity’s adaptive capacity [01:34:04].