From: aidotengineer
There is significant interest in AI agents from various sectors, including product development, industry, academia, and research [00:31:00]. Many believe that AI models will increasingly function as small components within larger products and systems rather than being deployed directly, which is likely how AI will look in the near future [00:48:00]. Swyx defines an AI agent as a system in which language models control the flow [01:05:00]. Even tools like ChatGPT and Claude are rudimentary agents, with input/output filters, task execution capabilities, and tool-calling functions [01:13:00]. Agents are already widely used and successful, with mainstream offerings like OpenAI’s Operator and Deep Research capable of carrying out open-ended tasks and writing lengthy reports [01:28:00].
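To make that definition concrete, here is a minimal, hypothetical sketch (not any vendor’s actual API) of an agent loop in which the language model controls the flow: on each step the model decides whether to call a tool or return a final answer, and the loop is explicitly bounded.

```python
# Minimal agent-loop sketch: the LLM decides the next action (hypothetical API).

def call_llm(messages):
    """Placeholder for a chat-completion call; expected to return a dict like
    {"action": "tool", "tool": "search", "input": "..."} or
    {"action": "final", "answer": "..."}."""
    raise NotImplementedError

TOOLS = {
    "search": lambda query: f"results for {query!r}",  # stand-in tool
}

def run_agent(task, max_steps=10):
    messages = [{"role": "user", "content": task}]
    for _ in range(max_steps):
        decision = call_llm(messages)          # the model controls the flow
        if decision["action"] == "final":
            return decision["answer"]
        tool_output = TOOLS[decision["tool"]](decision["input"])
        messages.append({"role": "tool", "content": tool_output})
    return None  # open-ended loops must be bounded
```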
However, the more ambitious visions for AI agents, often depicted in science fiction, are far from being realized in the real world [01:56:00].
Challenges in Developing AI Agents
There are three primary reasons why agents do not yet work effectively, hindering their potential [02:28:00]:
1. Evaluating Agents is Genuinely Hard
When attempts have been made to productionize agents, they have often failed [02:46:00].
- DoNotPay: This US startup claimed to automate legal work and even offered a million dollars to any lawyer willing to argue a case before the US Supreme Court while using its AI through an earpiece [02:51:00]. However, the Federal Trade Commission (FTC) later fined DoNotPay hundreds of thousands of dollars because its performance claims were found to be entirely false [03:09:00].
- LexisNexis and Westlaw: These leading legal technology firms launched products claiming “hallucination-free” legal report generation and reasoning [03:34:00]. However, Stanford researchers found that in up to a third of cases (and at least a sixth), these language models “hallucinated,” sometimes completely reversing the original legal text’s intentions or creating made-up paragraphs [03:52:00].
- Sakana.ai’s AI Scientist: Sakana.ai claimed to have built an AI research scientist capable of fully automating open-ended scientific research [04:20:00]. Princeton’s team created a benchmark called CORE-Bench to test this claim. Its tasks are simpler than real-world scientific research, asking agents only to reproduce a paper’s results when given the code and data [04:37:00]. They found that even the best agents could not reliably automate this step, reproducing fewer than 40% of the papers [05:09:00]. A 40% reproducibility rate is still a significant boost for researchers, but claiming full automation of science is premature [05:25:00]. Further analysis revealed that Sakana.ai’s AI Scientist was deployed on “toy problems,” evaluated by an LLM rather than by human peer review, and produced only minor tweaks on existing papers [05:49:00]. Sakana.ai also made an impressive but ultimately false claim about optimizing CUDA kernels: a 150x speedup, which would have meant running 30 times faster than the H100’s theoretical maximum [06:11:00]. The error stemmed from a lack of rigorous evaluation; the agent was found to be “hacking the reward function” rather than genuinely improving the kernels [06:46:00].
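CORE-Bench-style reproduction tasks boil down to running the provided code and checking whether the agent recovers the paper’s reported numbers. The checker below is an illustrative sketch under that framing, with made-up metric names and tolerance, not CORE-Bench’s actual scoring code.

```python
# Illustrative reproduction check: does the agent's run match the paper's
# reported results within a tolerance? (Not CORE-Bench's actual scoring code.)

def reproduction_score(reported: dict, reproduced: dict, rel_tol=0.05) -> float:
    """Fraction of reported metrics the agent reproduced within rel_tol."""
    matched = 0
    for name, reported_value in reported.items():
        value = reproduced.get(name)
        if value is not None and abs(value - reported_value) <= rel_tol * abs(reported_value):
            matched += 1
    return matched / len(reported)

# Example: the agent reproduced the accuracy number but missed the F1 score.
paper = {"accuracy": 0.91, "f1": 0.88}
agent_run = {"accuracy": 0.90}
print(reproduction_score(paper, agent_run))  # 0.5
```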
These examples highlight that evaluating agents is a very difficult problem that needs to be a first-class citizen in the AI engineering toolkit to avoid such failures [07:00:00].
2. Static Benchmarks Can Be Misleading
Static benchmarks often prove misleading regarding the actual performance of agents [07:18:00]. While language model evaluations typically involve an input and an output string, agents require interaction with an environment, making evaluation construction much harder [07:48:00].
- Unbounded Actions and Cost: Unlike LLMs, where evaluation costs are bounded by context window length, agents can take open-ended actions, potentially involving recursive calls to sub-agents or LLMs in loops [08:21:00]. Therefore, cost must be considered alongside accuracy or performance in all agent evaluations [08:37:00].
- Purpose-Built Agents: Agents are often purpose-built (e.g., a coding agent cannot be evaluated on a web agent benchmark), making it challenging to construct meaningful, multi-dimensional metrics instead of relying on a single benchmark [09:02:00].
- Princeton’s Holistic Agent Leaderboard (HAL): To address these issues, Princeton developed HAL, which automatically runs agent evaluations on 11 different benchmarks and reports cost alongside accuracy [09:51:00]. For instance, on CORE-Bench, a Claude 3.5 model scored roughly the same as OpenAI’s O1 models at a small fraction of the cost (the O1 runs came to $664), making the more cost-effective option obvious for AI engineers (a cost-accuracy comparison sketch follows this list) [10:07:00].
- The Jevons Paradox: While the cost of running LLM inference has dropped drastically (GPT-4o mini is two orders of magnitude cheaper than Text-Davinci-003 [10:57:00]), the Jevons Paradox suggests that overall spending will keep rising [11:47:00]. This 19th-century economic observation holds that as a resource (like coal) becomes cheaper to use, demand grows enough that total consumption increases [11:51:00]. The same pattern appeared with ATMs: by lowering the cost of operating a branch, they led banks to open more branches and ultimately employ more tellers overall [12:06:00]. Cost therefore must be accounted for in agent evaluations for the foreseeable future [12:30:00].
- Benchmark Performance vs. Real-World Translation: Overreliance on static benchmarks can be misleading [13:00:00]. For example, Cognition raised funding at a $2 billion valuation largely because its agent, Devin, performed well on SWE-bench [13:16:00]. However, in real-world testing by Answer.AI, over a month of use Devin succeeded at only 3 out of 20 tasks [13:50:00].
- Human-in-the-Loop Validation: To overcome the limitations of static benchmarks, one proposed fix is to keep human domain experts in the loop, having them proactively edit the criteria the LLM judge uses; this leads to better results (a judge-rubric sketch also follows this list) [14:08:00].
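Because cost has to sit alongside accuracy, a HAL-style comparison reduces to recording both numbers for every run and keeping only the models on the cost-accuracy Pareto frontier. The sketch below illustrates that selection step; the model names and figures are placeholders, not HAL’s published results.

```python
# Cost-aware model selection sketch: keep runs that are not dominated on both
# accuracy and cost. Model names and numbers are illustrative placeholders.

runs = [
    {"model": "model-a", "accuracy": 0.38, "cost_usd": 660.0},
    {"model": "model-b", "accuracy": 0.37, "cost_usd": 60.0},
    {"model": "model-c", "accuracy": 0.30, "cost_usd": 110.0},  # dominated by model-b
    {"model": "model-d", "accuracy": 0.20, "cost_usd": 15.0},
]

def pareto_frontier(runs):
    """Keep a run unless some other run is at least as accurate and cheaper."""
    return sorted(
        (r for r in runs
         if not any(o["accuracy"] >= r["accuracy"] and o["cost_usd"] < r["cost_usd"]
                    for o in runs)),
        key=lambda r: r["cost_usd"],
    )

for r in pareto_frontier(runs):
    print(f'{r["model"]}: accuracy={r["accuracy"]:.2f}, cost=${r["cost_usd"]:.0f}')
# model-c never prints: model-b is both more accurate and cheaper.
```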
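One way to implement the human-in-the-loop idea is to keep the judge’s criteria in a plain file that domain experts edit directly and rebuild the judge prompt from it on every run. The sketch below assumes a hypothetical `call_llm` helper and a `rubric.txt` file; neither is part of any specific evaluation framework.

```python
# Sketch: an LLM judge whose rubric is a plain-text file that domain experts
# edit directly; the evaluation prompt is rebuilt from it on every run.
# `call_llm` is a placeholder, not a specific vendor API.

from pathlib import Path

def call_llm(prompt: str) -> str:
    raise NotImplementedError  # placeholder for a chat-completion call

def judge(task: str, agent_output: str, rubric_path: str = "rubric.txt") -> str:
    rubric = Path(rubric_path).read_text()  # expert-edited criteria
    prompt = (
        "You are grading an AI agent's output.\n"
        f"Criteria (written and maintained by domain experts):\n{rubric}\n\n"
        f"Task: {task}\nAgent output: {agent_output}\n"
        "Answer PASS or FAIL and cite the specific criterion that decided it."
    )
    return call_llm(prompt)
```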
3. Confusion Between Capability and Reliability
A significant challenge is the misunderstanding between an AI model’s capability and its reliability [14:46:00].
- Capability (Pass@K Accuracy): This refers to what a model can do at least some of the time, meaning at least one of the K answers the model outputs is correct [14:54:00]. By this measure, language models are already capable of many things [15:24:00].
- Reliability (Consistent Accuracy): This means consistently getting the right answer every single time [15:10:00]. For consequential decisions in the real world, reliability is paramount [15:15:00].
- The 99.999% Gap: The methods used to train models to 90% capability do not necessarily lead to the “five nines” (99.999%) of reliability [15:37:00]. Closing this gap is the job of an AI engineer [15:52:00]. Failures of products like the Humane AI Pin and Rabbit R1 are attributed to developers not anticipating that a lack of reliability would doom the product [16:03:00]. For example, a personal assistant that correctly orders food only 80% of the time is a catastrophic product failure [16:15:00]. (A small calculation after this list makes the gap concrete.)
- Imperfect Verifiers: Verifiers such as unit tests have been proposed as a way to improve reliability, but in practice they are imperfect [16:28:00]. The leading coding benchmarks HumanEval and MBPP contain false positives in their unit tests, meaning incorrect code can still pass [16:49:00]. Once these false positives are accounted for, measured performance bends downward as the number of attempts grows, because each additional attempt gives a wrong answer another chance to slip past the flawed tests (the second sketch after this list illustrates the effect) [17:01:00].
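The gap between pass@K capability and consistent reliability can be made concrete with a little arithmetic. The 90% per-call accuracy below is an illustrative figure, not a measurement of any particular model.

```python
# Capability vs. reliability for an illustrative 90%-per-call model.

def pass_at_k(p: float, k: int) -> float:
    """Capability: probability that at least one of k attempts is correct."""
    return 1 - (1 - p) ** k

def all_correct(p: float, k: int) -> float:
    """Reliability: probability that k consecutive calls are all correct."""
    return p ** k

p = 0.9
print(f"pass@10:           {pass_at_k(p, 10):.6f}")    # ~1.0: looks highly capable
print(f"10 calls in a row: {all_correct(p, 10):.3f}")  # ~0.349: far from "five nines"
```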
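To see why measured performance bends downward with more attempts, consider an agent that retries until a flawed test suite accepts an answer. With per-attempt correctness `p` and a verifier false-positive rate `f` (the chance a wrong answer passes), the probability that the answer it finally ships is wrong grows with the attempt budget. The numbers below are illustrative, not drawn from HumanEval or MBPP.

```python
# Effect of a flawed verifier: the agent retries until the tests pass,
# but wrong answers slip through with probability f (illustrative numbers).

def prob_ships_wrong(p: float, f: float, k: int) -> float:
    """Probability the first test-passing answer within k attempts is wrong.
    Assumes correct answers always pass; wrong ones pass with probability f."""
    total, not_done = 0.0, 1.0
    for _ in range(k):
        total += not_done * (1 - p) * f  # wrong answer sneaks past the tests
        not_done *= (1 - p) * (1 - f)    # wrong and correctly rejected: retry
    return total

p, f = 0.3, 0.1
for k in (1, 5, 20):
    print(k, round(prob_ships_wrong(p, f, k), 3))
# 1 0.07, 5 0.17, 20 0.189 -> more attempts, more wrong-but-passing answers shipped
```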
Future Prospects and a Mindset Shift
The challenge for AI engineers is to figure out which software optimizations and abstractions are needed to work with inherently stochastic components like LLMs [17:29:00]. This is primarily a system design problem: engineers must work around the constraints of a stochastic system [17:41:00].
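As one example of the kind of abstraction this points toward (a sketch, not a prescription from the talk), a stochastic LLM call can be wrapped in a bounded retry-and-verify layer with an explicit fallback, so the rest of the system sees a predictable interface. `call_llm` and `verify` stand in for application-specific pieces.

```python
# Sketch of a reliability wrapper around a stochastic component: retry a
# bounded number of times, verify each result, and fall back explicitly.
# `call_llm` and `verify` are placeholders for application-specific pieces.

from typing import Callable, Optional

def reliable_call(
    call_llm: Callable[[str], str],
    verify: Callable[[str], bool],
    prompt: str,
    max_attempts: int = 3,
    fallback: Optional[str] = None,
) -> str:
    for _ in range(max_attempts):
        try:
            result = call_llm(prompt)
        except Exception:
            continue              # transient failure: retry
        if verify(result):        # only ship outputs that pass checks
            return result
    if fallback is not None:
        return fallback           # degrade predictably, never silently
    raise RuntimeError("no verified result within the attempt budget")
```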
AI engineering needs to be viewed more as a reliability engineering field than a software or machine learning engineering field [17:54:00]. This necessitates a mindset shift for AI engineers [18:03:00]. Historically, the birth of computing faced similar challenges; the 1946 ENIAC computer, with its 17,000 vacuum tubes, was unavailable half the time due to frequent failures [18:22:00]. The engineers’ primary job in the first two years was to fix these reliability issues to make the computer usable [18:42:00].
Similarly, the true job of AI engineers is not just to create excellent products but to fix the reliability issues that plague every agent built upon stochastic models [18:58:00]. This reliability shift in mindset is crucial for ensuring the next wave of computing is as reliable as possible for end-users [19:18:00].