From: aidotengineer
The current landscape of AI engineering highlights significant challenges in AI development, particularly regarding the practical deployment and evaluation of AI agents [00:26:00]. While there is considerable interest in AI agents from various sectors, the ambitious visions for their capabilities are far from being realized [01:56:00]. A key underlying issue is the difficulty in evaluating agents, which has direct implications for their cost-effectiveness and successful real-world application [02:40:00].
Failures Due to Unverified Claims and Misleading Performance
Several instances illustrate the financial and reputational risks associated with overstating AI agent capabilities:
- DoNotPay Lawsuit: A US startup, DoNotPay, claimed to automate legal work, even offering a million dollars for a lawyer to use their AI in a Supreme Court case [02:54:00]. However, the Federal Trade Commission (FTC) later fined DoNotPay hundreds of thousands of dollars because its performance claims were found to be entirely false [03:15:00].
- Legal Tech Hallucinations: Even well-established legal tech firms like LexisNexis and Westlaw, despite claiming “hallucination-free” legal report generation, were found by Stanford researchers to hallucinate in up to a third, or at least a sixth, of cases [03:41:00]. These hallucinations sometimes completely reversed the original legal text’s intent [04:07:00].
- Sakana.ai’s Exaggerated Claims: Sakana.ai claimed to have built an AI research scientist capable of fully automating open-ended scientific research [04:29:00]. However, testing at Princeton on a simplified benchmark (CORE-Bench) revealed that leading agents could reliably reproduce fewer than 40% of paper results, far from automating “all of science” [05:13:00]. Sakana’s claims about optimizing CUDA kernels likewise vastly overestimated performance: the agent was “hacking the reward function” rather than producing real improvements, again reflecting a lack of rigorous evaluation [06:14:00].
These examples underscore that agent evaluation is a challenging problem that must be treated as a “first-class citizen” in the AI engineering toolkit if such failures are to be avoided [07:00:00].
Cost as a First-Class Citizen in AI Agent Evaluation
A significant technical challenge in AI agent development and evaluation stems from the fundamental differences between models and agents:
- Open-Ended Actions and Unbounded Costs: Unlike large language models (LLMs), whose evaluation costs are bounded by context window length, agents take open-ended actions in real-world environments, so there is no inherent ceiling on their operational costs [08:21:00]. Agents can call other sub-agents, recurse, or make numerous LLM calls in loops, leading to unpredictable expenditures [08:28:00]. Cost must therefore be treated as a primary metric alongside accuracy or performance [08:37:00].
- Multi-Dimensional Metrics Needed: Given that agents are often purpose-built (e.g., a coding agent cannot be evaluated on a web agent benchmark), the industry needs to develop meaningful multi-dimensional metrics for evaluation rather than relying on single benchmarks [09:02:00]; a cost-aware evaluation loop is sketched after this list.
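To make the cost dimension concrete, here is a minimal sketch of an evaluation loop that records cost alongside accuracy and enforces an explicit budget. The run_agent callable, the task format, and the per-token prices are illustrative assumptions, not HAL’s actual interface or real provider pricing.

```python
# Minimal sketch of an evaluation loop that reports accuracy AND cost.
# `run_agent`, the task format, and the prices below are illustrative
# assumptions, not the actual HAL interface or real provider pricing.
from dataclasses import dataclass

PRICE_PER_M_INPUT = 3.00    # USD per 1M input tokens (assumed)
PRICE_PER_M_OUTPUT = 15.00  # USD per 1M output tokens (assumed)

@dataclass
class TaskResult:
    correct: bool
    input_tokens: int
    output_tokens: int

    @property
    def cost(self) -> float:
        return (self.input_tokens * PRICE_PER_M_INPUT
                + self.output_tokens * PRICE_PER_M_OUTPUT) / 1_000_000

def evaluate(run_agent, tasks, budget_usd: float = 50.0) -> dict:
    """Run the agent over tasks, tracking total spend and stopping at the
    budget, since agent loops have no inherent cost ceiling."""
    results, spent = [], 0.0
    for task in tasks:
        if spent >= budget_usd:
            break  # open-ended actions mean an explicit budget is essential
        result: TaskResult = run_agent(task)  # hypothetical agent callable
        spent += result.cost
        results.append(result)
    accuracy = sum(r.correct for r in results) / max(len(results), 1)
    return {"accuracy": accuracy, "total_cost_usd": round(spent, 2),
            "tasks_run": len(results)}
```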
The Holistic Agent Leaderboard (HAL)
The Princeton Holistic Agent Leaderboard (HAL) aims to address these issues by enabling automated, multi-dimensional agent evaluations that report cost alongside accuracy [09:51:00]. For example, on the CORE-Bench leaderboard, models with similar accuracy can differ enormously in cost: a run that costs $664 with one model can be matched in performance by a Claude model at a fraction of the price, making the cost-effective choice obvious for AI engineers [10:07:00].
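A small illustration of how an engineer might act on such a leaderboard: keep only the Pareto-optimal (accuracy, cost) entries, then pick the cheapest model that clears an accuracy floor. The model names and numbers below are placeholders, not actual HAL figures.

```python
# Choosing a model from (accuracy, cost) pairs: keep only Pareto-optimal
# entries, then pick the cheapest one that meets an accuracy floor.
# The entries below are illustrative placeholders, not real HAL results.
leaderboard = [
    {"model": "model_a", "accuracy": 0.38, "cost_usd": 664.0},
    {"model": "model_b", "accuracy": 0.37, "cost_usd": 57.0},
    {"model": "model_c", "accuracy": 0.30, "cost_usd": 12.0},
]

def pareto_frontier(entries):
    """Drop any entry for which another is at least as accurate and cheaper."""
    return [e for e in entries
            if not any(o["accuracy"] >= e["accuracy"] and o["cost_usd"] < e["cost_usd"]
                       for o in entries)]

def pick(entries, min_accuracy):
    candidates = [e for e in pareto_frontier(entries) if e["accuracy"] >= min_accuracy]
    return min(candidates, key=lambda e: e["cost_usd"], default=None)

print(pick(leaderboard, min_accuracy=0.35))  # -> model_b: similar accuracy, far lower cost
```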
Are LLMs Becoming Too Cheap to Meter?
While the cost of running LLMs has drastically decreased (e.g., Text-Davinci-003 in 2022 vs. GPT-4o mini today, a drop of over two orders of magnitude), this does not negate the importance of cost management in AI projects [10:57:00]. For scalable applications, the cost of AI agents remains significant [11:19:00]. Furthermore, for AI engineers, prototyping costs can quickly escalate to thousands of dollars if not carefully accounted for [11:35:00].
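As a back-of-the-envelope illustration of how prototyping spend escalates, the sketch below multiplies assumed per-token prices by assumed usage; every number is a placeholder to be replaced with your own provider’s prices and your own call volumes.

```python
# Back-of-the-envelope prototyping cost estimate. All numbers are assumptions
# for illustration; substitute your provider's actual prices and your usage.
price_per_m_input = 3.00      # USD per 1M input tokens (assumed)
price_per_m_output = 15.00    # USD per 1M output tokens (assumed)

runs_per_day = 200            # debugging iterations while prototyping
llm_calls_per_run = 30        # agents loop, so each run makes many calls
input_tokens_per_call = 4_000
output_tokens_per_call = 800

daily_cost = runs_per_day * llm_calls_per_run * (
    input_tokens_per_call / 1e6 * price_per_m_input
    + output_tokens_per_call / 1e6 * price_per_m_output
)
print(f"~${daily_cost:,.0f}/day, ~${30 * daily_cost:,.0f}/month")
# With these assumptions: roughly $144/day, i.e. thousands of dollars per month.
```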
The Jevons Paradox
The Jevons Paradox suggests that as the cost of a resource (like LLM inference) decreases, its overall usage and, consequently, total expenditure will increase [11:47:00]. Historical examples include the increased use of coal when mining costs dropped, and the expansion of bank branches and tellers despite the introduction of ATMs [11:53:00]. This implies that even with lower per-call costs, the aggregate cost of running AI agents is likely to continue increasing in the foreseeable future [12:24:00].
The Misleading Nature of Benchmarks and Funding
An additional challenge is the overreliance on static benchmarks, which can be highly misleading about real-world agent performance [13:00:00]. Venture capitalists have funded companies like Cosine and Cognition based on their impressive performance on benchmarks like SWE-Bench [13:08:00]. Cognition, for instance, raised funding at a $2 billion valuation primarily because its agent, Devin, performed well on SWE-Bench [13:16:00].
However, benchmark performance rarely translates into real-world success [13:32:00]. When the team at Answer.AI tested Devin on real tasks over a month of use, it succeeded at only 3 out of 20 tasks [13:46:00]. This highlights the need for humans in the evaluation loop, specifically domain experts who proactively edit the criteria used for LLM-based evaluations, to ensure more reliable results [14:27:00].
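One way to put domain experts in the loop is to keep the grading rubric as plain, editable data rather than burying it in a prompt. The sketch below assumes a hypothetical call_llm wrapper around whichever judge model is used; it illustrates the pattern and is not a reference implementation from the talk.

```python
# Sketch of an LLM-as-judge evaluation whose grading criteria are plain data
# that domain experts can review and edit over time. `call_llm` is a
# hypothetical wrapper around whichever judge model you use.
import json

# Draft rubric, iterated on by domain experts rather than frozen in a prompt.
CRITERIA = [
    "Every cited statute or case actually exists.",
    "The answer does not reverse the meaning of the quoted text.",
    "Each factual claim is supported by the provided documents.",
]

def judge(task: str, agent_answer: str, call_llm) -> dict:
    prompt = (
        "Grade the AI agent's answer against the explicit criteria.\n"
        f"Task: {task}\nAnswer: {agent_answer}\nCriteria:\n"
        + "\n".join(f"- {c}" for c in CRITERIA)
        + '\nReturn JSON like {"pass": true, "failed_criteria": []}.'
    )
    verdict = json.loads(call_llm(prompt))
    # Failures (or anything ambiguous) are routed to a human reviewer, who can
    # also amend CRITERIA when the rubric itself turns out to be wrong.
    verdict["needs_human_review"] = not verdict.get("pass", False)
    return verdict
```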
Capability vs. Reliability: The Core of Product Failure
A critical distinction for understanding why agents fail in the real world is between capability and reliability [14:46:00].
- Capability refers to what a model could do at some point, akin to pass@k accuracy (at least one of k sampled answers is correct) [14:54:00].
- Reliability means consistently getting the correct answer every single time [15:10:00].
For consequential decisions in real-world applications, reliability is paramount [15:15:00]. While LLMs are highly capable, mistaking capability for a reliable end-user experience leads to product failures [15:29:00]. Bridging the gap from 90% capability to 99.999% reliability is the job of the AI engineer [15:52:00]. The failures of products like the Humane AI Pin and the Rabbit R1 are attributed to developers not anticipating how catastrophic a lack of reliability would be [16:03:00]. For example, a personal assistant that orders food correctly only 80% of the time is a catastrophic product failure [16:15:00].
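The gap between capability and reliability is easy to see with a little arithmetic. Assuming independent attempts with a fixed per-attempt success rate (a simplification, and the 0.9 and 0.8 figures are illustrative), pass@k climbs toward 100% while the probability of succeeding every single time collapses:

```python
# Capability vs. reliability in numbers. With per-attempt success rate p,
# the chance that at least one of k attempts succeeds (capability, pass@k)
# grows with k, while the chance that every attempt succeeds (reliability)
# shrinks. The 0.9 and 0.8 figures are illustrative.
p, k = 0.9, 10
pass_at_k = 1 - (1 - p) ** k   # ~0.9999999999: looks essentially solved
all_k_correct = p ** k         # ~0.35: most users eventually hit a failure
print(round(pass_at_k, 10), round(all_k_correct, 3))

# The 80%-accurate food-ordering assistant: after 5 orders, most users
# have already been burned at least once.
print(round(1 - 0.8 ** 5, 3))  # ~0.672 chance of at least one failure
```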
Even proposed solutions like verifiers (e.g., unit tests) are imperfect: leading coding benchmarks such as HumanEval and MBPP contain false positives, meaning incorrect code can still pass their unit tests [16:47:00]. As a result, measured performance can bend downward as the number of attempts grows, because each additional attempt increases the chance of accepting a wrong answer that happens to pass the tests [17:01:00].
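A simplified model makes the verifier problem concrete. Assume i.i.d. attempts, a per-attempt probability that the generated code is genuinely correct, and a fixed false-positive rate at which incorrect code still passes the tests; both numbers below are illustrative assumptions, not measured benchmark rates.

```python
# Simplified model of retrying against an imperfect verifier (unit tests with
# false positives). Assumes i.i.d. attempts, per-attempt probability
# `p_correct` of a genuinely correct solution, and false-positive rate `fpr`
# at which an incorrect solution nonetheless passes the tests.
def p_some_wrong_attempt_passes(p_correct: float, fpr: float, k: int) -> float:
    """P(at least one incorrect attempt passes the tests within k tries)."""
    q = (1 - p_correct) * fpr          # per attempt: wrong, yet passes
    return 1 - (1 - q) ** k

for k in (1, 10, 50):
    print(k, round(p_some_wrong_attempt_passes(p_correct=0.3, fpr=0.1, k=k), 3))
# 1 -> 0.07, 10 -> 0.516, 50 -> 0.973: if you accept any passing attempt,
# more attempts make it increasingly likely you ship code that only looks correct.
```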
The Shift to Reliability Engineering
The challenge in creating effective AI agents lies in developing software optimizations and abstractions for inherently stochastic components like LLMs [17:29:00]. This is a system design problem, not merely a modeling problem [17:41:00]. AI engineering needs to embrace a mindset more akin to reliability engineering than traditional software or machine learning engineering [17:54:00].
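In practice, that reliability-engineering mindset looks like wrapping the stochastic LLM call in deterministic scaffolding: timeouts, bounded retries with backoff, output validation, and an explicit fallback. The sketch below is one generic way to do this; call_llm and validate are hypothetical stand-ins for your own stack.

```python
# Minimal reliability-engineering sketch: wrap a stochastic LLM call in
# timeouts, bounded retries with backoff, output validation, and a fallback.
# `call_llm` and `validate` are hypothetical stand-ins for your own stack.
import time

def reliable_call(call_llm, prompt, validate, max_retries=3, timeout_s=30):
    """Return a validated response or an explicit failure, never an unchecked answer."""
    for attempt in range(max_retries):
        try:
            response = call_llm(prompt, timeout=timeout_s)  # stochastic component
            if validate(response):                          # cheap deterministic check
                return response
        except TimeoutError:
            pass                                            # treat like any failed attempt
        time.sleep(2 ** attempt)                            # exponential backoff
    return None  # explicit failure the caller must handle, e.g. escalate to a human
```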
A historical precedent is the ENIAC computer from 1946. Initially, its 17,000 vacuum tubes failed so frequently that the machine was unavailable half the time [18:24:00]. The engineers’ primary focus for the first two years was improving its reliability to a usable point [18:42:00]. This history offers a clear guide for AI engineers: their main job is to fix the reliability issues that plague agents built on stochastic models, ensuring that this next wave of computing is as dependable as possible for end users [18:58:00]. This mindset shift is crucial to the successful development of AI agents and to establishing best practices for building agents that genuinely improve productivity [19:18:00].