From: jimruttshow8596

The distinction between theory-driven and data-driven science has become a focal point of discussion in modern scientific discourse, prompting questions about a potential “bifurcation” in scientific methodology [02:29:00] [02:37:00]. While some initially view data-driven science as a new paradigm, it is argued that science has historically always started with data collection [02:11:00].

Historical Context and the Bifurcation

Historically, physical science benefited from a “very lucky conjunction” where fundamental theory was also highly useful [03:05:00]. However, there might be a growing divergence between “fine-grained paradigms of prediction” (large models with practical value) and “coarse-grained paradigms of understanding” [02:48:00]. This suggests a future where, for any given topic, there might be two distinct approaches: one for understanding and another for application [03:32:00].

Origins of Inductive Frameworks

The problem of induction, as articulated by David Hume in the 18th century, arose from his belief that humans and the world are not rational in a purely deductive sense, which led him to focus on associations [06:41:00]. Interestingly, statistics, the mathematical machinery for reducing error in measurements (as in celestial mechanics), emerged from the most deductive empirical science: classical physics [07:02:00]. In the 1920s, statisticians such as R.A. Fisher developed concepts like “sufficiency” (a statistic that captures everything a sample can say about the parameter being estimated) [07:27:00]. By the 1930s, Jerzy Neyman and Egon Pearson had developed hypothesis testing [07:43:00]. These early inductive mathematical frameworks focused primarily on parameter estimation [07:51:00].
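
As a brief formal aside (the episode does not spell this out), Fisher’s “sufficiency” is usually stated via the Fisher–Neyman factorization criterion: a statistic is sufficient when, once you know it, the rest of the sample says nothing further about the parameter.

```latex
% Fisher–Neyman factorization criterion: a statistic T(X) is sufficient for \theta
% exactly when the likelihood splits into a factor that depends on the data only
% through T and a factor that does not involve \theta at all.
f(x \mid \theta) \;=\; g\bigl(T(x), \theta\bigr)\, h(x)
```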

The Rise of Data-Driven Science

Modern data-driven science is characterized by its ability to solve complex problems with massive computation and data, often without providing direct theoretical insight [05:59:00].

Examples of Data-Driven Successes

  • AlphaFold: This system demonstrates the power of data science by solving the protein folding problem, a long-standing challenge in computational chemistry [04:04:00]. Despite its success, AlphaFold provides “zero theoretical insight” into the underlying mechanisms [05:01:00]. It leverages vast computation to crack a problem that theoretical estimates had placed “grossly far out” of reach without quantum computing [04:40:00]. However, it is also noted that AlphaFold incorporates human expertise, such as symmetries, conservation laws, and amino acid distance matrices, which establish constraints and priors [32:33:00].
  • Large Language Models (LLMs): Traditional computational linguistics, which relied on building parsers with complex rules, largely failed to produce machines that understand human language [05:13:00]. Relatively simple architectures like the Transformer, combined with brute-force data and computation, have created “unbelievably powerful language models” [05:38:00]. These models, like GPT-4 with an estimated 1.3 trillion parameters, initially offer “little insight” into mechanisms [06:03:00] [13:00:00].
    • A key observation is the massive data discrepancy: the information needed for a human to acquire language is estimated at roughly 1.5 megabytes, while LLMs train on hundreds of gigabytes of text, a difference of about five orders of magnitude (see the quick arithmetic check after this list) [21:40:00]. This highlights Noam Chomsky’s “poverty of the stimulus” argument for innate language abilities [22:28:00]. However, human language acquisition involves constant instruction and feedback (reward signals) from parents and peers, a form of reinforcement learning absent from training on raw data [22:46:00].
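
A quick arithmetic check of the “five orders of magnitude” claim, taking the episode’s 1.5 MB figure at face value and assuming, purely for illustration, about 300 GB of training text:

```python
import math

human_bytes = 1.5e6       # ~1.5 MB estimated to encode human language knowledge (from the episode)
llm_corpus_bytes = 3e11   # ~300 GB of training text, an illustrative assumption

ratio = llm_corpus_bytes / human_bytes
print(f"ratio = {ratio:.1e}")                              # 2.0e+05
print(f"orders of magnitude = {math.log10(ratio):.1f}")    # ~5.3
```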

Superhuman Models and the Uncanny Valley

The concept of “superhuman models” refers to models that, once parameters are added beyond the “statistical uncanny valley” (the regime where models typically generalize poorly out of sample), begin to perform well again [14:32:00]. In a sense, these models “solve the problem of induction” that David Hume first identified: although induction can never guarantee certainty, sufficiently large models generalize reliably in practice [14:53:00]. Their success suggests that complex phenomena possess high-dimensional regularities [15:21:00].
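
The “statistical uncanny valley” described here corresponds to what the machine-learning literature calls double descent. The toy experiment below (my construction, not from the episode) fits minimum-norm least squares on random Fourier features; it typically shows test error peaking when the number of features approaches the number of training points and falling again as the model becomes heavily over-parameterized.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_data(n):
    x = rng.uniform(-1, 1, size=n)
    y = np.sin(2 * np.pi * x) + 0.3 * rng.normal(size=n)   # noisy 1-D target
    return x, y

def random_features(x, p, freqs, phases):
    # Random Fourier features: phi_j(x) = cos(w_j * x + b_j)
    return np.cos(np.outer(x, freqs[:p]) + phases[:p])

n_train, p_max = 40, 800
x_tr, y_tr = make_data(n_train)
x_te, y_te = make_data(500)
freqs = rng.normal(scale=5.0, size=p_max)
phases = rng.uniform(0, 2 * np.pi, size=p_max)

for p in [5, 20, 40, 80, 200, 800]:              # p = n_train sits in the "uncanny valley"
    Phi_tr = random_features(x_tr, p, freqs, phases)
    Phi_te = random_features(x_te, p, freqs, phases)
    w = np.linalg.pinv(Phi_tr) @ y_tr            # minimum-norm least-squares fit
    test_mse = np.mean((Phi_te @ w - y_te) ** 2)
    print(f"p = {p:4d}   test MSE = {test_mse:.3f}")
```

Exact numbers depend on the random seed, but the spike near p = n_train and the recovery at much larger p are the typical pattern.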

The effectiveness of gradient descent in these ultra-high-dimensional spaces, even with non-differentiable functions, is attributed to the sheer number of dimensions, ensuring that a downward gradient is always available [15:50:00]. This contrasts with earlier computational models, where even 47 parameters were considered “garbage in, garbage out” [16:27:00].
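
One way to make the “a downward gradient is always available” intuition concrete (a toy numerical illustration of my own, not from the episode) is to model the curvature at a random critical point with a random symmetric Hessian: roughly half of its eigenvalues are negative at every size, so the probability that a high-dimensional critical point offers no descent direction at all shrinks extremely fast with dimension.

```python
import numpy as np

rng = np.random.default_rng(1)

for dim in [2, 10, 100, 1000]:
    # Model the Hessian at a random critical point as a symmetric Gaussian matrix.
    A = rng.normal(size=(dim, dim))
    H = (A + A.T) / np.sqrt(2 * dim)
    eigvals = np.linalg.eigvalsh(H)
    frac_descent = np.mean(eigvals < 0)   # fraction of directions with negative curvature
    print(f"dim = {dim:5d}   fraction of descent directions = {frac_descent:.2f}")
```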

Theory-Driven Science and its Challenges

Theory-driven science aims for parsimonious, comprehensible explanations, often expressible concisely [13:54:00].

Limitations of Current AI in Understanding

Despite their power, current LLMs demonstrate limitations that highlight the ongoing relevance of theoretical understanding:

  • Arithmetic: LLMs like GPT-4, with trillions of parameters, struggle with basic arithmetic, performing worse than a 1970s calculator with 1KB of ROM [34:10:00].
  • Lack of Sentience/True Intelligence: These models are not sentient; they are purely feed-forward networks without intrinsic feedback loops or self-modification capabilities [27:51:00]. Their apparent “intelligence” can be viewed as outsourcing capabilities to tools [37:38:00].
  • Creativity and Discovery: LLMs are described as “pure herd” models, akin to libraries that are excellent reference materials for established knowledge but not “discovery engines” [54:47:00]. True scientific breakthroughs often come from operating outside the “herd” of conventional wisdom [55:17:00]. While they can produce “creative” work by most human standards (e.g., screenwriting, analytical comparisons), they lack the geometric or conceptual understanding needed for Einstein-level breakthroughs [58:48:00].

Towards Consilience: New Paradigms

The challenge is to achieve consilience between the predictive power of data-driven models and the explanatory power of theory [20:33:00].

Symbolic Regression as a Bridge

A promising approach involves using deep neural networks as “pre-processors for parsimonious science” [27:39:00]. This process, as demonstrated by Miles Cranmer in cosmology, involves the following steps (a minimal code sketch follows the list):

  1. Training a special type of neural network (e.g., graph neural nets that explicitly encode particle interactions) on large datasets [29:00:00].
  2. Sparsifying and quantizing the trained network, effectively creating a “lossy encoding” of the data [29:20:00].
  3. Applying a genetic algorithm to perform symbolic regression on this reduced representation, generating algebraic formulas that encode the regularities [29:36:00]. This method has led to the discovery of new parsimonious encodings for phenomena like dark energy, allowing for both prediction (via the large neural net) and understanding (via the derived formulas) [29:55:00] [31:01:00].
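
The sketch below is a deliberately toy version of this pipeline, not Cranmer’s code: it trains a small neural network as the prediction engine, queries it on a coarse grid as a crude stand-in for sparsification, and substitutes a brute-force search over a tiny expression grammar (with a parsimony penalty) for a full genetic algorithm.

```python
import numpy as np
from itertools import product
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)

# Step 0: data generated from a hidden law, y = 3*x1*x2 + noise (unknown to the search).
X = rng.uniform(-2, 2, size=(500, 2))
y = 3.0 * X[:, 0] * X[:, 1] + 0.05 * rng.normal(size=500)

# Step 1: train a neural network as the opaque but accurate prediction engine.
net = MLPRegressor(hidden_layer_sizes=(64, 64), max_iter=3000, random_state=0).fit(X, y)

# Step 2 (stand-in for sparsifying/quantizing): query the net on a coarse grid,
# treating its outputs as a reduced, lossy summary of what it has learned.
grid = np.array(list(product(np.linspace(-2, 2, 9), repeat=2)))
target = net.predict(grid)

# Step 3 (stand-in for genetic-algorithm symbolic regression): score a tiny
# grammar of candidate formulas by fit plus a crude parsimony penalty.
candidates = {
    "c*x1":      lambda g: g[:, 0],
    "c*x2":      lambda g: g[:, 1],
    "c*(x1+x2)": lambda g: g[:, 0] + g[:, 1],
    "c*x1*x2":   lambda g: g[:, 0] * g[:, 1],
    "c*x1**2":   lambda g: g[:, 0] ** 2,
}

best = None
for name, basis_fn in candidates.items():
    basis = basis_fn(grid)
    c = basis @ target / (basis @ basis)          # least-squares coefficient for this shape
    mse = np.mean((c * basis - target) ** 2)
    score = mse + 0.01 * len(name)                # parsimony penalty on formula length
    if best is None or score < best[0]:
        best = (score, name, c)

print(f"recovered formula: {best[1]} with c = {best[2]:.2f}")   # expect roughly 3*x1*x2
```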

Constructs and Theories as Adaptive Systems

From a complex systems perspective, all systems, including organisms and scientific theories, can be seen as “theorizers” or “constructs” [49:54:00]. A complex system is one that, when opened, reveals “simulacra” or “mirrors of reality,” encoding its adaptive history [48:10:00]. Similarly, physical theories are propositional schemas of reality [49:27:00]. Training a deep neural net is akin to creating a theory or a rule system – a complex system itself [50:27:00]. The internal representations of neural nets can be “composable,” meaning they possess a kind of internal compositional encoding similar to “schemas” or “constructs” [51:48:00].

Occam’s Razor and Meta Occam

Occam’s razor is the principle that one should choose the simpler explanation of a phenomenon over a more complex one that offers no additional insight [09:40:00]. In physics, this leads to parsimonious theories of objects (e.g., Dirac’s theory of quantum mechanics) [10:30:00].

However, in complex domains, where high-dimensional data appears irreducible, Occam’s razor does not apply directly [10:43:00]. This is where “Meta Occam” comes into play: in certain areas of inquiry, “the parsimony is in the process, not in the final object” [11:19:00]. For example, Darwin’s theory of evolution by natural selection is a highly parsimonious process that can explain the generation of arbitrarily complicated objects, from a worm to an elephant, without requiring a more complex theory for more complex objects [11:00:00]. Reinforcement learning in machine learning and natural selection are mathematically equivalent processes that share this characteristic of generating complexity from simple, reinforcing principles [11:39:00]. Complexity science, therefore, can be viewed as “the search for meta Occamist processes” [11:08:00].
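
The formal kinship alluded to here can be illustrated (the specific pairing below is a gloss, not a derivation from the episode) by setting discrete-time replicator dynamics next to the multiplicative-weights update used in simple reinforcement and online-learning schemes: both re-weight a population of options multiplicatively by realized fitness or payoff and then renormalize.

```latex
% Discrete-time replicator dynamics (natural selection): the share x_i of type i,
% with fitness f_i, grows in proportion to its fitness and is renormalized.
x_i^{t+1} \;=\; x_i^{t}\,\frac{f_i^{t}}{\sum_j x_j^{t} f_j^{t}}

% Multiplicative-weights update (a simple reinforcement / online-learning rule):
% the weight w_i of option i is scaled by its payoff r_i and renormalized.
w_i^{t+1} \;=\; w_i^{t}\,\bigl(1 + \eta\, r_i^{t}\bigr),
\qquad
p_i^{t+1} \;=\; \frac{w_i^{t+1}}{\sum_j w_j^{t+1}}
```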

Role of Constraints in Scientific Breakthroughs

Historically, scientific revolutions and creative breakthroughs have often been spurred by “bandwidth limitation and constraints” rather than excess power [59:31:00].

  • Tycho Brahe and Kepler: If Brahe had possessed massive computing power and telescopes, he might not have hired Kepler, whose calculations were crucial for developing Kepler’s laws, which Newton later simplified [59:46:00].
  • Calculus: The development of calculus (Newton’s method of fluxions) was a human response to dealing with complex data sets, a way of “bottlenecking the phenomenon into increasingly simple sets of causal relationships” [01:00:56].
  • Mendeleev’s Periodic Table: Mendeleev developed the periodic table based on observing repeating patterns, predicting unknown elements, without understanding the underlying atomic structure (protons, electrons) [01:15:43]. It’s conjectured that a purely data-driven approach with full microscopic data might not have yielded such a “neat taxonomy” [01:16:38].
  • Darwin: Darwin, not a “big data person,” relied on anecdotes and was steeped in the 19th-century obsession with “design” [01:07:22]. His “astounding idea” of natural selection was a simple, elegant process that had been conceptually available for centuries [01:06:40].

This implies that providing too much computing power and data to humans might have hindered progress in earlier times [01:00:36]. The next stage for machine learning might involve “hobbling” models or incrementally decaying them to find the “absolute minimum that they can sustain as predictive engines” and thereby reveal underlying structures [01:02:42].
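
What “hobbling” might look like in practice can be sketched with a toy example (my own construction, using a plain linear model as a stand-in for a large network; this is not a procedure described in the episode): iteratively prune the smallest-magnitude weights, refit what remains, and watch how far the model can be thinned before prediction collapses.

```python
import numpy as np

rng = np.random.default_rng(2)

# A toy linear "model" in which only a handful of weights carry real structure.
n, d = 200, 50
X = rng.normal(size=(n, d))
w_true = np.zeros(d)
w_true[:5] = [3.0, -2.5, 4.0, 1.5, -3.5]          # the structurally important weights
y = X @ w_true + 0.1 * rng.normal(size=n)

w = np.linalg.lstsq(X, y, rcond=None)[0]          # dense fitted weights
keep = np.ones(d, dtype=bool)

def mse(mask, weights):
    return np.mean((X[:, mask] @ weights[mask] - y) ** 2)

print(f"weights kept = {keep.sum():3d}   MSE = {mse(keep, w):.4f}")

# Iterative magnitude pruning: repeatedly drop the smallest surviving weights,
# refit the rest, and track how thin the model can get before it breaks.
while keep.sum() > 2:
    smallest = np.argsort(np.abs(np.where(keep, w, np.inf)))[:3]
    keep[smallest] = False
    w = np.zeros(d)
    w[keep] = np.linalg.lstsq(X[:, keep], y, rcond=None)[0]
    print(f"weights kept = {keep.sum():3d}   MSE = {mse(keep, w):.4f}")

# The error stays near the noise floor until the structurally important weights
# start to go, at which point it jumps: the minimal predictive core is exposed.
```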

Challenges and Risks Associated with Advanced AI

The rapid advancement of AI, particularly LLMs, presents several societal and existential risks:

  • Misuse of Narrow AI: The immediate risk comes from people using narrow AI for malicious purposes, such as building surveillance states like China’s, which can track individuals physically and digitally [01:19:03].
  • “Idiocracy” Risk: As AI becomes more capable, humans may delegate more intellectual tasks to machines, potentially leading to a societal devolution where fundamental skills are lost, similar to the movie Idiocracy [01:19:34]. The modern technosphere, though an achievement, is fragile; events like a massive solar flare could collapse the grid, leaving a de-skilled populace vulnerable [01:20:25].
  • Accelerating “Game A”: AI could accelerate the current “Game A” (status quo systems driving consumption and growth beyond Earth’s carrying capacity), pushing society towards collapse faster [01:21:20].
  • Flood of Sludge: LLMs contribute to an exponential increase in low-quality information online, such as fake news sites and spam, due to reduced content creation costs [01:28:11].
  • Attentional Depletion: Social media, though offering useful connections, constantly depletes human attentional resources [01:30:22].

Historical Precedent and Regulation

Historically, humanity has managed risks from powerful technologies such as genetic engineering (e.g., the Asilomar conference, CRISPR), nuclear weapons (non-proliferation treaties), and automobiles (traffic lights, seat belts, airbags, driving tests), the last of which produced roughly a 95% reduction in fatalities per mile driven since the 1920s [01:22:26]. This suggests that empirical, incrementally informed regulation is more effective than “super Draconian proposals” [01:24:32].

Potential Solutions and Opportunities

  • Cognitive Synergy: An approach that combines different AI paradigms—deep learning, genetic algorithms, and symbolic AI (e.g., math machines, provers)—to leverage their respective strengths and address limitations [01:35:36]. This could allow for the combination of “perceptual equivalent power of deep learning” with “deep mathematical skills” [01:36:19].
  • Information Agents: The “flood of sludge” could drive the natural evolution of personal “info agents” that filter and curate electronic information on behalf of individuals [01:28:48]. Using technologies such as latent semantic vector-space databases and LLMs for summarization and curation, these agents could create an ecosystem of “mutual curation” among users and expert curators [01:29:06], acting as a “God’s Own spam filter” for all electronic content [01:31:31]. (A minimal sketch of the filtering idea follows this list.)
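
A minimal sketch of the filtering idea (my own construction; TF-IDF cosine similarity stands in for a latent semantic vector database, and no LLM summarizer is included): score incoming items against a profile built from content the user has already endorsed, and surface only the closest matches.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Items the user (or a trusted curator) has previously endorsed.
endorsed = [
    "symbolic regression recovers algebraic laws from trained neural networks",
    "statistical mechanics as a coarse-grained theory of many-particle systems",
]

# Incoming stream: a mix of relevant material and low-quality "sludge".
incoming = [
    "new preprint distills a graph neural network into a closed-form force law",
    "you will not believe this one weird trick to get rich fast",
    "double descent and generalization in over-parameterized neural networks",
    "celebrity gossip roundup with miracle diet tips",
]

# Embed everything in a shared vector space (TF-IDF as a crude stand-in for a
# learned semantic embedding) and build a single user-profile vector.
vec = TfidfVectorizer().fit(endorsed + incoming)
profile = np.asarray(vec.transform(endorsed).mean(axis=0))
scores = cosine_similarity(vec.transform(incoming), profile).ravel()

# Surface only the items closest to the user's interests.
for score, item in sorted(zip(scores, incoming), reverse=True):
    verdict = "KEEP" if score > 0.05 else "drop"
    print(f"{verdict}  {score:.2f}  {item}")
```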

Conclusion: A New Cycle of Science

The relationship between science and technology has often been cyclical: practical mechanics and the steam engine (technology) preceded thermodynamics (science), just as Edison’s tinkering preceded a deeper scientific understanding of electricity [01:43:56]. However, modern technologies like microelectronics and GPS required science to lead technology [01:44:23]. The current era, with the rise of powerful data-driven artifacts like LLMs, might sideline traditional science if society prioritizes utility over understanding [01:47:00].

The key question then becomes: “If large language models are the steam engines of the 21st century, what is the statistical mechanics of the 21st century?” [01:46:16] This could lead to new principles for explaining adaptive reality and new “effective laws” for the complex world [01:47:12], for instance a “real theory of how the market works” that goes beyond quantitative improvements on gambling [01:47:37]. This dynamic interplay between technology and fundamental understanding points towards a future in which new forms of scientific inquiry emerge, driven by the capabilities of advanced models.