From: aidotengineer
While there is significant interest in AI agents across various sectors, their current real-world performance often falls short of the ambitious visions attached to them [00:01:56]. The core challenge for AI engineers is to build agents that genuinely work reliably for users [00:02:21]. Doing so means overcoming several hurdles, chief among them the reliability of these systems.
Challenges in Achieving Agent Reliability
Difficulty in Evaluating Agents
Evaluating AI agents is inherently challenging [00:02:40]. Examples of real-world failures highlight this:
- DoNotPay: A US startup claimed to automate legal work but was fined by the FTC for making “entirely false” performance claims [00:03:09].
- Legal Tech Firms: Products from established firms like LexisNexis and Westlaw, despite being marketed as “hallucination-free,” were found by Stanford researchers to hallucinate in up to a third of cases [00:03:52]. Some hallucinations completely reversed the original legal text’s intentions [00:04:07].
- Scientific Research Automation: Sakana.ai claimed to have built an AI research scientist capable of fully automating open-ended scientific research [00:04:24]. However, when Princeton researchers tested leading agents on a simpler benchmark called CoreBench, they found the agents could reliably reproduce fewer than 40% of papers, even with the code and data provided [00:05:13]. Further analysis revealed that Sakana.ai’s agent had been deployed on toy problems, evaluated by an LLM rather than by human peer review, and produced only minor tweaks to existing papers [00:05:51]. A separate Sakana.ai claim about optimizing CUDA kernels turned out to be mathematically impossible: the agent had been “hacking the reward function” rather than actually improving the kernels, a failure traced to a lack of rigorous evaluation [00:06:14].
These examples underscore that evaluating agents is a “very hard problem” that needs to be a “first-class citizen” in the AI engineering toolkit to prevent failures [00:07:00].
Misleading Static Benchmarks
Traditional static benchmarks for language models are often insufficient for evaluating agents because:
- Interaction with Environment: Unlike language models that primarily process input and output strings, agents must take actions and interact with an environment, requiring more complex evaluation setups [00:07:50].
- Unbounded Costs: LLM evaluation costs are bounded by context window length, but agents can take open-ended, recursive actions, leading to potentially unbounded costs [00:08:21]. Cost must be a key metric alongside accuracy and performance in agent evaluations [00:08:37].
- Purpose-Built Agents: Agents are often purpose-built, meaning a single benchmark cannot evaluate all types of agents [00:09:02]. Multi-dimensional metrics are needed [00:09:17].
The “Holistic Agent Leaderboard” (HAL) addresses this by automatically running multi-dimensional agent evaluations, including cost alongside accuracy [00:10:01]. Despite decreasing LLM inference costs, the “Jevons Paradox” suggests that overall agent usage and associated costs will likely increase, necessitating continued cost consideration [00:11:47].
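As a rough illustration of what cost-aware, multi-dimensional evaluation involves, the sketch below scores an agent on accuracy and dollar cost together. This is not HAL’s actual interface; the `RunResult` structure, the `evaluate` helper, and the per-token prices are assumptions chosen for the example.

```python
# Illustrative sketch (not HAL's actual API): report accuracy and dollar cost
# together, so neither axis of agent performance is viewed in isolation.
from dataclasses import dataclass

# Assumed per-token prices; real values depend on the model and provider.
PRICE_PER_INPUT_TOKEN = 2.5e-6
PRICE_PER_OUTPUT_TOKEN = 1.0e-5

@dataclass
class RunResult:
    correct: bool
    input_tokens: int
    output_tokens: int

def evaluate(agent, tasks) -> dict:
    """Run one agent over a task set and return accuracy plus total cost."""
    results = [agent(task) for task in tasks]  # agent: callable returning RunResult
    accuracy = sum(r.correct for r in results) / len(results)
    cost = sum(r.input_tokens * PRICE_PER_INPUT_TOKEN +
               r.output_tokens * PRICE_PER_OUTPUT_TOKEN for r in results)
    return {"accuracy": accuracy, "total_cost_usd": round(cost, 2)}
```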
Benchmark Performance vs. Real-World Translation
Over-reliance on static benchmarks can be misleading because strong benchmark scores rarely translate into real-world performance [00:13:29]. For example, Devin, the agent from Cognition that raised significant funding on the strength of its SWE-Bench results, succeeded in only 3 out of 20 real-world tasks over a month of use [00:13:50].
To overcome this, keeping humans in the loop is crucial: domain experts should proactively edit the criteria used by LLM-based evaluations to achieve better results [00:14:27].
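One minimal way to picture this human-in-the-loop setup, assuming a generic `llm(prompt) -> str` callable rather than any specific evaluation framework, is an LLM judge whose grading rubric lives in plain text that domain experts revise as new failure modes surface.

```python
# Hypothetical sketch of "humans in the loop": the evaluation criteria live in
# an editable rubric, and the LLM judge is prompted with that rubric rather
# than a fixed, hard-coded checklist.
RUBRIC = """
1. Every cited statute or case actually exists and is quoted accurately.
2. The answer does not reverse the meaning of the source text.
"""  # domain experts edit this text as failure modes are discovered

def judge(llm, task: str, agent_output: str) -> bool:
    """Ask the judge model to grade one agent output against the rubric."""
    prompt = (
        "You are grading an agent's output against the rubric below.\n"
        f"Rubric:\n{RUBRIC}\n"
        f"Task: {task}\nOutput: {agent_output}\n"
        "Answer PASS or FAIL."
    )
    return llm(prompt).strip().upper().startswith("PASS")
```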
Differences Between AI Capability and Reliability
A critical distinction is between capability and reliability [00:14:46]:
- Capability: What a model could do at some point (e.g., pass@k accuracy, where at least one of k sampled answers is correct) [00:14:54].
- Reliability: Consistently getting the answer right, every single time [00:15:10].
For real-world products and consequential decisions, reliability is paramount [00:15:15]. Language models are highly capable, but mistaking that capability for a reliable end-user experience leads to product failures like the Humane AI Pin and the Rabbit R1 [00:15:29]. An agent that correctly fulfills a food order only 80% of the time is a catastrophic product failure [00:16:17].
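A back-of-the-envelope calculation makes the gap concrete. Assuming independent attempts with a per-attempt success probability of p = 0.8 (the food-order example above), capability measured as pass@k looks excellent while end-to-end reliability collapses:

```python
# Capability vs. reliability for a component that succeeds on any single
# attempt with probability p (p = 0.8, matching the 80% food-order example).
p, k = 0.8, 5

capability = 1 - (1 - p) ** k   # pass@k: at least one of k attempts succeeds
reliability = p ** k            # all k independent uses succeed

print(f"pass@{k} (capability): {capability:.2%}")            # 99.97%
print(f"all-{k}-correct (reliability): {reliability:.2%}")   # 32.77%
```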
Proposed fixes such as verifiers (similar to unit tests) can themselves be imperfect. The leading coding benchmarks HumanEval and MBPP contain false positives in their unit tests, so incorrect code can still pass; as a result, model performance trends downward as the number of attempts increases [00:16:47].
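A toy model, my own assumption rather than the benchmarks’ published analysis, shows why imperfect verifiers become more dangerous as attempts accumulate: if each incorrect attempt slips past the flawed unit tests with some false-positive rate, the chance that at least one wrong solution gets accepted grows with the number of attempts k.

```python
# Toy model: each attempt is wrong with probability 1 - p, and a wrong attempt
# passes the flawed unit tests with false-positive rate f. The probability of
# accepting at least one wrong solution grows with the number of attempts k.
def p_false_acceptance(p: float, f: float, k: int) -> float:
    return 1 - (1 - (1 - p) * f) ** k

for k in (1, 5, 10, 25):
    print(k, round(p_false_acceptance(p=0.6, f=0.1, k=k), 3))
# 1 -> 0.04, 5 -> 0.185, 10 -> 0.335, 25 -> 0.64
```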
The Role of AI Engineering in Reliability
The challenge for AI engineers is to determine the necessary software optimizations and abstractions for working with inherently stochastic components like Large Language Models (LLMs) [00:17:29]. This is fundamentally a system design problem, not just a modeling problem [00:17:41].
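One possible abstraction in that spirit, offered as a sketch rather than a prescribed design, is to treat every model call as a stochastic component and wrap it in verification-gated retries with backoff, so the system’s behavior does not hinge on any single sample. Per the earlier caveat, the `verify` check itself must be validated against false positives.

```python
# Sketch of a reliability-oriented wrapper around a stochastic generator:
# retry until an external check passes, with jittered exponential backoff.
import random
import time

def reliable_call(generate, verify, max_attempts: int = 3, backoff_s: float = 1.0):
    """Retry a stochastic generate() until verify(output) passes."""
    last = None
    for attempt in range(max_attempts):
        last = generate()
        if verify(last):
            return last
        # Jittered exponential backoff before the next attempt.
        time.sleep(backoff_s * (2 ** attempt) * (0.5 + random.random()))
    raise RuntimeError(f"no verified output after {max_attempts} attempts (last: {last!r})")
```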
AI engineering needs to be viewed as a field of reliability engineering, rather than solely software or machine learning engineering [00:17:54]. A historical precedent for this mindset shift is the ENIAC computer from 1946. It initially suffered from frequent vacuum tube failures, rendering it unavailable half the time [00:18:24]. The primary job of its engineers for the first two years was to fix these reliability issues to make it usable by end-users [00:18:44].
Similarly, AI engineers must focus on resolving the reliability issues that plague agents relying on stochastic models [00:19:07]. Their core responsibility is to ensure this “next wave of computing” is as reliable as possible for end-users [00:19:26], thereby building trust in AI systems.