From: redpointai
Arvind Narayanan, a computer science professor at Princeton, focuses on distinguishing hype from substance in AI through his newsletter and book, “AI Snake Oil” [00:00:06]. He discusses the state of agents, where evaluations succeed or fail, challenges in coordinating them, and the implications of AI for policymakers [00:00:14].
Challenges in AI Evaluation
Domains of Success and Struggle
Impressive results from reasoning models are primarily observed in domains with “clear correct answers,” such as math, coding, and certain scientific tasks [00:01:14]. A key open question is how far this impressive performance can generalize beyond these narrow domains [00:01:29].
Historically, reinforcement learning generated excitement a decade ago for its performance in games like Atari, but it failed to generalize much beyond those narrow domains [00:01:36]. A similar outcome is possible for today’s reasoning models [00:02:06]. Another possibility, however, is that improved reasoning capabilities, such as code writing, could extend to systems that reason about retrieving information from the internet for fields like law or medicine [00:02:11].
Limitations of Benchmarks
Focusing solely on benchmark results can be misleading [00:03:21]. The notion of “construct validity” highlights that what a benchmark measures may subtly differ from what is actually desired in the real world [00:02:45]. For example, SWE-bench, a benchmark developed by Princeton colleagues, uses real GitHub issues rather than “Olympiad-style coding problems” [00:02:57]. However, GitHub issues are still “a far cry from the messy context of real-world software engineering” [00:03:17].
Dramatic improvements on benchmarks like SWE-bench do not necessarily translate into dramatic improvements in human productivity [00:03:39]. Passing bar or medical exams, for instance, does not equate to the full range of skills required to be a lawyer or a doctor [00:03:46].
Real-World vs. Benchmark Success
Domain-specific, real-world evaluations are needed, alongside user feedback [00:03:56]. “Uplift studies,” randomized controlled trials in which one group uses a tool and another does not, can measure productivity impacts [00:04:10]. LLMs might perform well on diagnosis tasks but struggle with natural patient interaction or eliciting information, which is crucial for effective medical practice [00:04:28].
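To make the uplift-study idea concrete, here is a minimal sketch of the core analysis, assuming simulated task-completion times for a control group and a tool-using group; the group sizes, times, and effect size are invented purely for illustration.

```python
"""Minimal sketch of an "uplift study": a randomized controlled trial that
compares task productivity with and without an AI tool. All numbers here
are simulated, purely for illustration."""
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Simulated hours to complete a fixed set of tasks for each participant.
control = rng.normal(loc=10.0, scale=2.0, size=50)    # no AI tool
treatment = rng.normal(loc=8.5, scale=2.0, size=50)   # with AI tool

# Estimated uplift: relative reduction in completion time.
uplift = (control.mean() - treatment.mean()) / control.mean()

# Two-sample t-test: is the difference unlikely under "no effect"?
t_stat, p_value = stats.ttest_ind(control, treatment)

print(f"mean control  : {control.mean():.2f} h")
print(f"mean treatment: {treatment.mean():.2f} h")
print(f"estimated uplift: {uplift:.1%}, p-value: {p_value:.4f}")
```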
Inference Scaling Flaws
Arvind Narayanan’s paper, “Inference Scaling Flaws,” investigates how reasoning models might “go off the rails” [00:05:06]. The paper specifically examines a scaling method in which a generative model is paired with a verifier (e.g., unit tests for coding, automated theorem checkers for math) [00:05:49]. The hope is that such traditional, non-stochastic verifiers could be perfect, allowing the model to generate millions of candidate solutions until one passes [00:06:14].
However, in reality, verifiers can be imperfect (e.g., unit tests may have imperfect coverage) [00:06:27]. The research shows that if the verifier is imperfect, inference scaling cannot go very far, sometimes saturating within as few as 10 model invocations instead of millions [00:06:43]. This implies significant challenges for scaling models in domains without easy or perfect verifiers, such as law or medicine [00:07:09].
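A toy simulation can illustrate why an imperfect verifier caps the benefit of resampling: once the verifier admits false positives, the accuracy of the accepted answer saturates well below 100% no matter how many candidates are generated. The probabilities below are invented, not taken from the paper.

```python
"""Toy simulation of resampling with an imperfect verifier.
A generator emits candidate solutions that are correct with probability
P_CORRECT; the verifier accepts every correct candidate but also wrongly
accepts incorrect ones at rate FALSE_POSITIVE. We resample until the
verifier accepts something (or the budget runs out) and track how often
the accepted answer is actually correct. Numbers are illustrative only."""
import random

P_CORRECT = 0.05        # generator's chance of producing a correct solution
FALSE_POSITIVE = 0.02   # verifier's chance of accepting an incorrect one
TRIALS = 20_000

def run(budget: int) -> float:
    """Fraction of trials whose first verifier-accepted answer is correct."""
    successes = 0
    for _ in range(TRIALS):
        for _ in range(budget):
            correct = random.random() < P_CORRECT
            accepted = correct or random.random() < FALSE_POSITIVE
            if accepted:
                successes += correct
                break
    return successes / TRIALS

for budget in (1, 10, 100, 1000):
    print(f"budget={budget:>4}  accuracy of accepted answer: {run(budget):.3f}")
# Accuracy saturates near P_CORRECT / (P_CORRECT + (1 - P_CORRECT) * FALSE_POSITIVE)
# (~0.72 with these numbers) rather than approaching 1, no matter how many
# samples are drawn.
```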
Evaluating Agentic AI
“Agentic AI” isn’t a single category [00:07:34]. One type is a tool that generates a report or first draft for an expert user, which can be a time-saving tool despite potential flaws [00:07:50]. Another type autonomously takes actions on behalf of the user, such as booking flights [00:08:21].
Current State and Limitations
Booking flight tickets is considered a “worst-case example for an AI product” because understanding user preferences is difficult and often requires many rounds of iteration [00:08:48]. An agent might likewise struggle with those preferences and end up asking numerous questions, producing the same frustration as today’s online booking systems [00:09:48].
A major concern for autonomous agents is the high cost of errors [00:10:01]. Even a 1% error rate for tasks like booking flights is intolerable [00:10:07]. Early agentic systems have shown such failures, like ordering food to the wrong address, leading to complete loss of user trust [00:10:15].
Distinguishing Applications
A key difference lies between producing generative outputs for human review (low error cost) and automating actions on behalf of the user (high error cost) [00:10:25]. There needs to be more focus on the human-computer interaction component for generative AI systems [00:10:56].
Current evaluations for agents resemble those for chatbots: “static benchmarks” with relatively realistic tasks such as fixing software engineering issues or navigating web environments [00:17:50]. However, these are “not working that well” [00:18:15].
One limitation is the “capability reliability gap” [00:18:17]. A 90% score on a benchmark does not indicate whether the agent reliably completes 90% of task types or instead fails unpredictably on 10% of attempts at any task, potentially taking costly actions along the way [00:18:27]. Current benchmarks therefore provide little information about whether an agent can be trusted for real-world use [00:18:48].
Safety Considerations
Safety should be a component of every benchmark, not just specialized ones [00:19:08]. Some web benchmarks involve agents performing “stateful actions on real websites,” which could lead to spam or unintended consequences once agents become more capable [00:19:20].
Simulated environments for web benchmarks lose much of the nuance of real websites [00:19:53]. Agent frameworks like AutoGPT can take unintended actions online, such as posting questions on Stack Overflow, demonstrating a lack of basic safety controls [00:20:01]. Currently, the only way to prevent such actions is for the agent to escalate every single action to a human user, which amounts to babysitting [00:20:31].
Human-in-the-Loop Approaches
Narayanan’s team is building and improving benchmarks by creating an “AI agent Zoo” in which different agents work collaboratively on tasks [00:15:36], in contrast to competitive benchmarks where agents work in isolation [00:15:50]. In one case, agents were tasked with writing jokes; even for such a simple task they generated millions of tokens, often making progress by coming to understand their environment, tools, and collaborators [00:16:15]. The agents were integrated into Slack and given blogging tools to summarize what they learned [00:16:30].
The Role of Humans
Benchmarks should be seen as a “necessary but not sufficient condition” for evaluation [00:21:44]. Agents that perform well on benchmarks should then be used with “human in the loop” in semi-realistic environments [00:21:52]. The challenge is to keep humans in the loop without requiring them to babysit every single step [00:21:57].
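One way to keep a human in the loop without approving every step is to gate only actions that change external state. The sketch below uses a hypothetical action interface and approval flow; it illustrates the general pattern rather than any system described in the episode.

```python
"""Sketch of a selective human-in-the-loop gate: read-only actions run
automatically, while stateful actions (posting, purchasing, emailing) are
escalated for human approval. Action names and the approval flow are
hypothetical, for illustration only."""
from dataclasses import dataclass

# Actions assumed to change external state and therefore require approval.
STATEFUL_ACTIONS = {"post_comment", "send_email", "purchase", "submit_form"}

@dataclass
class Action:
    name: str
    arguments: dict

def execute(action: Action) -> str:
    # Placeholder for the agent's real tool call.
    return f"executed {action.name} with {action.arguments}"

def ask_human(action: Action) -> bool:
    # Placeholder: in practice this could surface in Slack, email, or a UI.
    answer = input(f"Approve {action.name}({action.arguments})? [y/N] ")
    return answer.strip().lower() == "y"

def run_with_gate(action: Action) -> str:
    """Run read-only actions directly; escalate stateful ones to a human."""
    if action.name in STATEFUL_ACTIONS and not ask_human(action):
        return f"blocked {action.name}: human approval not given"
    return execute(action)

if __name__ == "__main__":
    print(run_with_gate(Action("search_web", {"query": "flight prices"})))
    print(run_with_gate(Action("purchase", {"item": "flight ticket"})))
```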
The “Jagged Frontier” idea suggests that models will be calculator-like in some areas (better than humans) but lack common sense in others, necessitating human-AI hybridization [00:23:41]. It’s unclear whether to integrate agents into existing human collaboration tools (e.g., Slack, email) or to build new ones [00:24:06]. Visualizing and interpreting the potentially million-token logs of agent actions for high-level insight is an area of active work, with frameworks like “Human Layer” emerging [00:24:28].
Future of AI Evaluation
Rethinking Benchmarks
A metric that measures the percentage of tasks an agent can accomplish multiple times in a row, rather than the conventional pass@k (which requires only one success among k attempts), is a more interesting and useful metric for reliability [00:21:20].
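The sketch below contrasts the two metrics on invented per-task outcomes: the conventional pass@k (at least one success among k attempts) versus requiring k successes in a row, sometimes written pass^k; the task names and results are made up.

```python
"""Sketch contrasting two reliability metrics over repeated attempts per task:
- pass@k: the task counts if at least one of k attempts succeeds;
- pass^k ("k in a row"): the task counts only if all k attempts succeed.
Outcome data below is invented purely for illustration."""

# Per-task outcomes across k = 5 independent attempts (True = success).
results = {
    "fix_github_issue_1": [True, True, True, True, True],
    "fix_github_issue_2": [True, False, True, True, False],
    "book_flight":        [False, False, True, False, False],
}

def pass_at_k(attempts): return any(attempts)    # one success is enough
def pass_power_k(attempts): return all(attempts) # every attempt must succeed

n = len(results)
print("pass@k :", sum(pass_at_k(a) for a in results.values()) / n)    # 1.00
print("pass^k :", sum(pass_power_k(a) for a in results.values()) / n) # 0.33
```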
Academic Contributions
Academia has a crucial role in AI model evaluation and benchmarking, especially for aspects beyond “pure technical innovation” [00:34:52]. This includes understanding applications, societal impacts, and ensuring positive outcomes [00:35:01]. Academia can also serve as a “counterweight to industry interests,” similar to the relationship between medical researchers and the pharmaceutical industry [00:35:14].
AI in Science
AI for science is a “very hot area,” but early claims of revolutionary discoveries are often “overblown” and have turned out to have flaws when others tried to reproduce them [00:36:16]. Nonetheless, AI is already significantly impacting scientists, serving as a “thinking partner” for critiquing ideas and enhancing literature search through semantic search [00:36:44].
Generalization and Economic Impact
Evaluating AI progress in terms of ROI (return on investment) should account for both the rapid decrease in per-token inference costs and the increase in inference-time compute [00:15:00]. Token usage will likely continue to grow, more than offsetting falling per-token prices and leading to higher overall inference costs [00:15:28].
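A toy calculation shows how total inference spend can rise even while per-token prices fall, if token usage grows faster; the annual price decline and usage growth below are assumptions for illustration, not figures from the episode.

```python
"""Toy calculation: cost per task = tokens used x price per token.
If per-token prices fall ~3x per year but reasoning-style workloads grow
token usage ~10x per year, overall spend still rises. Numbers are invented."""

price_per_million_tokens = 10.0   # dollars, year 0 (assumed)
tokens_per_task_millions = 0.1    # year 0 (assumed)

for year in range(4):
    cost_per_task = price_per_million_tokens * tokens_per_task_millions
    print(f"year {year}: price ${price_per_million_tokens:6.2f}/M tok, "
          f"{tokens_per_task_millions:6.1f}M tok/task, "
          f"cost/task ${cost_per_task:8.2f}")
    price_per_million_tokens /= 3     # assumed annual price decline
    tokens_per_task_millions *= 10    # assumed growth in inference-time compute
```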
Narayanan’s paper with Sayash Kapoor, “AI as Normal Technology,” argues that AI will not necessarily change everything in the next two years [00:55:54]. Its impact, like the internet’s, will unfold over decades [00:56:05]. While the internet transformed how almost every cognitive task is performed, its impact on GDP has been minimal, because new bottlenecks emerge when old ones are eliminated [00:46:51]. Similarly, AI may transform workflows without producing massive GDP increases in the short term [00:47:34].
Instead of focusing on “AGI,” Narayanan prefers to think in terms of transformative economic impact, such as effects on GDP [00:52:12]. His view is that such impacts are “decades out,” not years [00:52:19].
The Future of Information Access
A “weird prediction” is that younger users will be trained to expect chatbots as the primary way of accessing information, mediated by a “fundamentally statistical tool that could hallucinate” [00:52:37]. This shift demands preparing people with tools for fact-checking when necessary [00:53:02]. The idea of searching websites for authoritative sources might become as anachronistic as going to a library [00:53:18].