From: redpointai

Evaluating AI models is crucial for understanding their capabilities and limitations. Current evaluations show impressive results in specific domains, but challenges remain in generalizing performance and ensuring real-world utility [00:14:00].

Current State of AI Performance

AI models have demonstrated impressive results in domains with clear, correct answers, such as math, coding, and certain scientific tasks [01:17:00]. This progress is expected to continue in these areas [01:22:00].

Limitations of Benchmarks

Despite impressive benchmark scores, there are significant limitations in how these translate to real-world performance:

  • Construct Validity: Benchmarks may subtly differ from what is truly desired in real-world applications [02:47:00]. For example, SWE-bench, developed by Princeton colleagues, is considered a good benchmark because it uses real GitHub issues, yet it is still a “far cry from the messy context of real-world software engineering” [02:57:00].
  • Failure to Generalize: Historically, technologies like reinforcement learning, which showed promise in narrow domains like games, failed to generalize broadly [01:36:00]. A major open question is how far current impressive performance can generalize beyond clear-answer domains [01:31:00].
  • Real-world vs. Exam Performance: AI models score highly on tests like the bar exam or medical licensing exams, but being a lawyer or doctor involves far more than taking exams, such as eliciting information from clients or patients [03:46:00]. Dramatic improvements on benchmarks do not always translate into dramatic improvements in human productivity [03:39:00].

Inference Scaling Flaws

A paper titled “Inference Scaling Flaws” investigated the scaling of reasoning models, particularly when pairing a generative model with an imperfect verifier (e.g., unit tests in coding, automated theorem checkers in math) [05:06:00]. The hope was that traditional logic-based verifiers could be perfect, allowing models to generate millions of solutions until one passes tests [06:14:00]. However, in reality, verifiers often have imperfect coverage [06:30:00]. If the verifier is imperfect, inference scaling cannot get “very far,” sometimes saturating within only about 10 invocations of the model, rather than millions [06:43:00]. This has implications for scaling models into domains without easy verifiers, such as law or medicine [07:09:00].
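
To make the saturation point concrete, here is a minimal simulation sketch; it is not taken from the paper, and the generator and verifier rates are hypothetical assumptions. With a perfect verifier, more sampling always helps; with even a small false-accept rate, the probability that the first accepted solution is actually correct is capped, no matter how many samples are drawn.

```python
import random

def sample_until_accepted(p_correct=0.2, false_accept=0.05, max_draws=1000):
    """Resample a generator until an imperfect verifier accepts a solution.

    p_correct: chance a generated solution is truly correct (hypothetical)
    false_accept: chance the verifier accepts a wrong solution (hypothetical)
    Returns whether the first accepted solution was actually correct.
    """
    for _ in range(max_draws):
        correct = random.random() < p_correct
        if correct or random.random() < false_accept:  # verifier says "pass"
            return correct
    return False

def accepted_and_correct_rate(trials=100_000, **kwargs):
    """Fraction of runs whose first accepted solution is actually correct."""
    return sum(sample_until_accepted(**kwargs) for _ in range(trials)) / trials

p, f = 0.2, 0.05
print("simulated:", accepted_and_correct_rate(p_correct=p, false_accept=f))
# Analytic ceiling: each draw is accepted with prob p + (1-p)*f, and is
# correct given acceptance with prob p / (p + (1-p)*f) -- independent of
# how many draws are allowed, which is why inference scaling saturates.
print("ceiling:", p / (p + (1 - p) * f))
```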

Evaluating AI Agents

Evaluating AI agents presents unique challenges. The current state of agent evaluation resembles that of chatbots: static benchmarks with relatively realistic tasks [17:50:50].

Key limitations include:

  • Capability-Reliability Gap: Benchmarks often provide no information about reliability. A 90% score might mean the agent reliably accomplishes 9 out of 10 task types every single time, or that it fails 10% of the time on any given task, potentially leading to costly actions like booking the wrong flight ticket [18:20:00] (see the sketch after this list). Either way, the score alone gives little information about whether an agent can actually be used productively [18:55:00].
  • Safety Concerns: Safety should be an integral part of every benchmark, not just safety-specific ones [19:05:00]. Some web benchmarks have agents act on real websites, which can lead to spam or other unintended actions [19:20:00]. Early agentic systems have already shown such failures, like ordering DoorDash to the wrong address, where even a 1-in-N error rate is intolerable [10:13:00]. Frameworks like AutoGPT have attempted to post questions on Stack Overflow; absent basic safety controls, the only way to prevent such actions was to have the agent escalate every action to a human for babysitting [20:01:00].
  • Lack of Realism: Simulated environments for web benchmarks lose much of the nuance of real websites, leaving a gap between simulation and the live web with no middle ground [19:53:00].
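
The capability-reliability gap is easy to see with a toy simulation (illustrative numbers only, not from the episode): two agents both score 90% on a benchmark, but one always solves 9 of 10 task types while the other fails 10% of the time on every task.

```python
import random

random.seed(0)
TASKS = range(10)  # ten benchmark task types

def deterministic_agent(task: int) -> bool:
    """Always solves tasks 0-8, never solves task 9: exactly 90% score."""
    return task != 9

def flaky_agent(task: int) -> bool:
    """Solves any task with probability 0.9: also a ~90% score."""
    return random.random() < 0.9

def score(agent, runs=1000):
    """Aggregate benchmark score across all tasks."""
    return sum(agent(t) for t in TASKS for _ in range(runs)) / (10 * runs)

def per_task_rates(agent, runs=1000):
    """Success rate on each individual task type."""
    return [sum(agent(t) for _ in range(runs)) / runs for t in TASKS]

for name, agent in [("deterministic", deterministic_agent), ("flaky", flaky_agent)]:
    rates = per_task_rates(agent)
    print(f"{name}: benchmark score {score(agent):.2f}, "
          f"per-task success rates range {min(rates):.2f}-{max(rates):.2f}")
```

The deterministic agent could be deployed safely on the nine task types it handles; the flaky agent cannot be trusted on any task without oversight, even though the two are indistinguishable by aggregate score.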

The AI Agent Zoo

A research team is building an “AI Agent Zoo” in which different AI agents collaborate on tasks in a shared environment [15:38:00]. This offers a different way of evaluating agents, focused on collaboration rather than isolated competition [15:47:00]. Even on a simple task like writing a joke (the results so far are “awful”), these agents can generate millions of tokens as they work through understanding their environment, tools, and collaborators, and producing summaries for one another [16:00:00]. This suggests that overall inference costs are likely to increase significantly [17:18:00].
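
A minimal sketch of such a collaboration loop, with a toy stand-in for the LLM calls (the agent interface and token accounting here are assumptions, not the team's actual setup): because every agent re-reads the growing shared log at each step, total token consumption compounds quickly even for a trivial task.

```python
class Agent:
    """Toy stand-in for an LLM-backed agent in a shared environment."""
    def __init__(self, name: str):
        self.name = name
        self.tokens_used = 0

    def step(self, shared_log: list) -> None:
        # Read the whole environment (the shared log), "think", and post
        # a summary for collaborators; charge tokens for context + output.
        context_tokens = sum(len(entry.split()) for entry in shared_log)
        summary = f"[{self.name}] summary after reading {len(shared_log)} entries"
        self.tokens_used += context_tokens + len(summary.split())
        shared_log.append(summary)

def run_zoo(agents, steps=100):
    """Round-robin collaboration on a single shared task."""
    shared_log = ["task: write a joke"]
    for _ in range(steps):
        for agent in agents:
            agent.step(shared_log)
    return sum(a.tokens_used for a in agents)

agents = [Agent("alpha"), Agent("beta"), Agent("gamma")]
print("total tokens consumed:", run_zoo(agents))  # grows ~quadratically in steps
```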

Future of Evaluation

Benchmarks should be treated as a necessary but not sufficient condition for evaluation [21:44:00]. More informative measures probe reliability, for example by repeating the same task multiple times (pass@k-style metrics) [21:14:00]; a sketch follows below. The ideal approach is human-in-the-loop evaluation in semi-realistic environments for agents that already perform well on benchmarks [21:52:00]. The challenge is keeping humans in the loop without requiring constant babysitting [21:57:00], a challenge familiar to managers of junior employees [22:01:00].
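
For concreteness, a standard way to compute repeated-run metrics is the unbiased pass@k estimator popularized by the HumanEval paper; a stricter "all k runs must succeed" variant is shown alongside it as a reliability-flavored contrast. The episode does not pin down which formulation is meant, so treat this pairing as an assumption.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased estimate of P(at least one of k runs succeeds),
    given c observed successes out of n total runs (HumanEval-style)."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

def pass_all_k(n: int, c: int, k: int) -> float:
    """Estimate of P(all k runs succeed), treating runs as i.i.d.
    with empirical success rate c/n -- a reliability-oriented contrast."""
    return (c / n) ** k

# An agent that succeeded on 9 of 10 runs looks nearly perfect under
# pass@k, but much weaker when every one of k runs must succeed.
print(pass_at_k(10, 9, 5))   # 1.0
print(pass_all_k(10, 9, 5))  # ~0.59
```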

Policy Implications and Diffusion

Policy discussions often focus too much on innovation and too little on the diffusion of technology [26:09:00]. Diffusion is how a country adopts a new technology and reorganizes its institutions, laws, and norms to best leverage it [26:28:00]. While the U.S. may be doing well relative to other regions, the intensity of AI adoption (average usage time per week) is still low, and diffusion may be proceeding more slowly than PC adoption did [27:23:00]. This could be because AI is not yet useful to many people, or it could be addressed by policy interventions such as integrating AI education into curricula [28:30:00].

Learning from Past AI Waves

Lessons from past waves of AI show that when technologies drive consequential decisions, public outcry and regulation tend to follow [31:33:00]. The focus should be on what regulation should look like to balance safety, rights, and the benefits of AI [32:14:00]. A key requirement for regulation is explainability: understanding the data a model was trained on and the audits performed on it, so that statements can be made about its expected behavior in new settings, rather than a “neat mathematical explanation” of every neuron [33:01:00].

Role of Academia

Academia has a crucial role in AI, especially in areas beyond pure technical innovation [34:52:00]. This includes:

  • Interdisciplinary Research: Thinking about AI applications across various disciplines and its societal impacts [35:00:00].
  • Counterweight to Industry: Academia can serve as a counterweight to industry interests, similar to the relationship between medical research and the pharmaceutical industry [35:14:00]. While much of computer science focuses on producing ideas for industry, a portion should explicitly aim to provide an independent perspective [35:47:00].
  • Areas of Interest:
    • AI for Science: Despite some overblown claims, AI is already having a significant impact on how scientists work, serving as a “thinking partner” for critiquing ideas, enhancing literature search through semantic search, and enabling domain-specific tools [36:12:00].
    • AI and Human Minds: Research explores the ethical reasoning of models, learning from human minds to build AI, and using AI as a tool to understand human minds better [38:28:00].

Impact on Education

The future of education with AI will likely land closer to “not that much will change” [40:16:00]. As with the early excitement around online courses, AI will certainly be used, but the core value of education lies in creating the social preconditions for learning, motivation, and individualized feedback, which AI struggles to fully recreate [40:28:00]. The impact on children is high-variance: it can be positive for wealthier kids with supervision (e.g., AI tutoring apps like Khan Academy's or custom-built learning apps), while for others it could lead to addiction, much as social media has [42:07:00]. Most AI learning for kids is expected to happen outside traditional schooling, as schools may remain hesitant to adopt AI [44:24:00].

Broader Societal Implications

The impact of AI on society may reconcile both the optimistic and skeptical views, much as the internet did [46:23:00]. The internet transformed almost every cognitive task, yet its overall impact on GDP has been minimal because new bottlenecks emerged [46:51:00]. With AI, many cognitive tasks may become automated. This could transform the definition of “work”: human jobs might shift toward “AI control”, supervising AI systems and making the value-based decisions that AI cannot [48:20:00].

One “weird prediction” is that younger generations may come to expect chatbots as the primary way of accessing information, viewing traditional search (clicking through websites) as akin to going to a library [52:34:00]. This requires equipping users with fact-checking tools, given chatbots' statistical nature and potential for hallucination [52:56:00].

From an investment perspective, the push and pull between decreasing per-token inference costs and increasing inference-time compute (the amount of computation needed per query) makes the future hard to predict [15:00:00]. Still, token usage is likely to keep increasing in a way that more than compensates for falling per-token costs [15:28:00].
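
A toy calculation makes the push and pull explicit (all numbers are hypothetical assumptions): even if per-token prices fall 3x per year, tokens consumed per query growing 5x per year means total spend per query still rises.

```python
# Hypothetical numbers: falling per-token price vs. rising tokens per query.
price_per_million = 10.0   # dollars per million tokens, year 0 (assumed)
tokens_per_query = 1_000   # tokens consumed per query, year 0 (assumed)

for year in range(4):
    cost = tokens_per_query * price_per_million / 1_000_000
    print(f"year {year}: {tokens_per_query:>9,} tokens/query -> ${cost:.4f}/query")
    price_per_million /= 3   # per-token price falls 3x per year (assumed)
    tokens_per_query *= 5    # tokens per query grow 5x per year (assumed)
```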

Overhyped vs. Underhyped

Ultimately, naming the specific application of AI, rather than calling everything “AI,” would bring more clarity to the discourse and reduce hype [54:32:00].