From: redpointai
Arvind Narayanan, a computer science professor at Princeton, focuses on distinguishing between hype and substance in AI, a theme explored in his newsletter and book “AI Snake Oil” [00:00:05]. This perspective is particularly relevant when discussing the state of AI agents [00:00:12].
Current Effectiveness and Limitations of AI Agents
Impressive results from reasoning models are primarily observed in domains with “clear correct answers,” such as math, coding, and certain scientific tasks [00:01:17]. While this progress is expected to continue, the extent to which this performance can generalize to other domains remains an open question [00:01:29].
Historically, excitement around reinforcement learning (RL) ten years ago for games like Atari did not generalize broadly to other domains [00:01:38]. This raises a question for current reasoning models: will they similarly struggle to generalize outside narrow domains, or will their improved reasoning capabilities allow for broader application, such as in law or medicine, by leveraging internet information [00:02:06]?
Benchmarks vs. Real-World Productivity
Current benchmarks, like SWE-bench (developed by Narayanan’s Princeton colleagues), aim for realism by using real GitHub issues instead of “toy” Olympiad-style coding problems [00:03:00]. However, even these are a “far cry from the messy context of real-world software engineering” [00:03:17].
While thousands of people use these models productively, dramatic improvements on benchmarks don’t necessarily translate to dramatic improvements in human productivity [00:03:39]. Narayanan likens this to models performing well on bar or medical exams, noting that “being a lawyer or doctor [is] not just constantly taking those exams” [00:03:51].
Future evaluations will require:
- Domain-specific, real-world assessments [00:03:56].
- Uplift studies, which are randomized controlled trials measuring the productivity impact of giving people access to a tool [00:04:10].
- Considering tasks beyond core diagnosis, like natural patient interaction and eliciting information, where LLMs currently struggle [00:04:30].
Inference Scaling Flaws
Research on “inference scaling flaws” highlights limitations when pairing a generative model with an imperfect verifier (e.g., unit tests in coding or automated theorem checkers in math) [00:05:06]. The hope is that perfect logic-based verifiers could allow models to generate millions of solutions until one passes tests [00:06:20]. However, in reality, unit tests may have imperfect coverage [00:06:30]. This research shows that if the verifier is imperfect, inference scaling can be severely limited, sometimes saturating within just 10 invocations instead of millions [00:06:46].
This has significant implications for scaling models into domains without easy verifiers, such as law, medicine, or accounting, where human oversight is imperfect and costly [00:07:09].
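To make the saturation effect concrete, here is a minimal Monte Carlo sketch of “generate until the verifier accepts.” The numbers (a 5% chance that a single generation is correct, a 10% false-positive rate for the verifier) are illustrative assumptions, not figures from the research: the point is simply that with a perfect verifier the accuracy of this loop keeps climbing with more samples, while even a modest false-positive rate causes it to plateau after a handful of attempts.

```python
import random

def best_of_k_accuracy(k, p_correct=0.05, fp_rate=0.10, trials=20_000, seed=0):
    """Estimate how often "sample until the verifier accepts, up to k tries"
    commits to a genuinely correct solution.

    p_correct : chance a single generated solution is actually correct (assumed)
    fp_rate   : chance the imperfect verifier accepts a *wrong* solution (assumed);
                a perfect verifier corresponds to fp_rate = 0
    """
    rng = random.Random(seed)
    successes = 0
    for _ in range(trials):
        for _ in range(k):
            correct = rng.random() < p_correct
            # The verifier always accepts correct solutions here, and wrongly
            # accepts incorrect ones with probability fp_rate.
            accepted = correct or (rng.random() < fp_rate)
            if accepted:
                successes += correct  # we commit to the first accepted answer
                break
    return successes / trials

if __name__ == "__main__":
    for k in (1, 5, 10, 100, 1000):
        print(f"k={k:4d}  accuracy={best_of_k_accuracy(k):.3f}")
```

With these assumed rates, the loop plateaus around 35% accuracy within a few dozen attempts, because the first solution the flawed verifier accepts is increasingly likely to be a false positive; with fp_rate set to 0, the same loop approaches 100% as k grows.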
Hype vs. Reality: Agentic AI
AI agents are not a single category [00:07:37].
Where Agents Work (Generative Systems)
One type of agentic AI that shows promise is a tool that assists experts by generating reports or first drafts (e.g., Google Deep Research) [00:07:42]. The user, presumably an expert, reviews the output, knowing it may have flaws, but still benefits from the time savings and first-draft capability [00:08:02]. Narayanan sees the product-market fit here as “pretty well motivated” [00:08:19].
Where Agents Struggle (Autonomous Action)
Another type of agentic AI involves systems that autonomously take actions on a user’s behalf, such as booking flight tickets [00:08:24]. Narayanan argues that flight booking is “almost the worst case example for an AI product” in terms of product-market fit [00:08:50]. The difficulty lies in the system understanding user preferences, which often emerge only during an iterative search process (e.g., 10-15 rounds of iteration) [00:09:09]. An autonomous agent would have to ask a barrage of questions up front, leading to much the same user frustration as current manual processes [00:09:48].
Furthermore, the “cost of errors is high” for autonomous actions [00:10:01]. An error rate of even “one in ten attempts is completely intolerable” if it means booking the wrong flight or ordering DoorDash to the wrong address [00:10:07].
The key differences between effective and struggling applications are:
- Generative outputs for user review (low cost of errors) vs. automating actions on user’s behalf (high cost of errors) [00:10:25].
- The challenge of “eliciting the user preferences” is often half the battle [00:10:42].
Narayanan suggests a greater focus on the “human-computer interaction” component, beyond purely technical problems [00:10:56]. He notes that the optimism for agents comes from chatbots gradually evolving to become agentic, performing searches and running code, suggesting a gradual evolution in complexity rather than a single “killer app” [00:11:11]. The lack of a clear definition for “agent” further complicates understanding their progress [00:11:38].
Challenges in AI Agent Evaluation
The current state of agent evaluations is akin to that of chatbots, relying heavily on static benchmarks like SWE-bench [00:17:50]. While these benchmarks aim for realism (e.g., fixing software issues, navigating simulated web environments), they have limitations:
- Capability-Reliability Gap: A 90% score on a benchmark doesn’t clarify whether the agent reliably succeeds at 9 out of 10 tasks or fails 10% of the time at any given task, with potentially costly actions (like booking the wrong flight) [00:18:20]. Benchmarks currently provide little information on this reliability for real-world use [00:18:50]. The “pass@k” metric, where the same task is attempted multiple times to measure consistency, is a step towards addressing this [00:21:23] (see the sketch after this list).
- Safety: Safety should be integral to every benchmark, not just specific ones [00:19:08]. Some web benchmarks involve agents taking “stateful actions on real websites,” which could lead to spam or unintended consequences [00:19:24]. Frameworks like AutoGPT have demonstrated this risk, attempting actions like posting questions on Stack Overflow without human intent [00:20:01]. Basic safety controls, beyond requiring constant human babysitting, are not yet integrated into agent evaluation [00:20:41].
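The gap between those two readings of a 90% score is easy to show with a toy simulation (an illustration of the idea, not the methodology of any particular benchmark; the per-task success probabilities below are made up). Agent A solves nine of ten tasks every time and always fails the tenth; Agent B solves every task 90% of the time. Their averaged single-attempt scores are identical, but repeating each task k times and asking “did at least one attempt succeed?” (pass@k) versus “did every attempt succeed?” pulls them apart:

```python
import random

def repeated_attempts(task_success_probs, k, trials=5_000, seed=0):
    """Run an agent k times on each task and report two averages:
       pass@k     -- at least one of the k attempts succeeded
       all-k pass -- every one of the k attempts succeeded (a rough reliability proxy)
    task_success_probs: assumed per-task probability of success on a single attempt.
    """
    rng = random.Random(seed)
    at_least_one = every_one = total = 0
    for p in task_success_probs:
        for _ in range(trials):
            attempts = [rng.random() < p for _ in range(k)]
            at_least_one += any(attempts)
            every_one += all(attempts)
            total += 1
    return at_least_one / total, every_one / total

if __name__ == "__main__":
    agent_a = [1.0] * 9 + [0.0]   # reliably solves 9 of 10 tasks, never the tenth
    agent_b = [0.9] * 10          # solves every task 90% of the time
    for name, probs in (("A", agent_a), ("B", agent_b)):
        p_any, p_all = repeated_attempts(probs, k=5)
        print(f"agent {name}: pass@5 = {p_any:.2f}, all-5 pass = {p_all:.2f}")
```

Agent A comes out around 0.90 on both measures, while Agent B reaches nearly 1.00 on pass@5 but only about 0.59 on the all-5 measure, which is exactly the kind of reliability information a single aggregate benchmark score hides.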
The “middle ground” between simulated environments (lacking real-world nuance) and letting agents loose on the internet is challenging [00:20:51]. Narayanan suggests that being good at a benchmark should be a “necessary but not sufficient condition” [00:21:46]. Agents should then be used in “semi-realistic environments” with human supervision, focusing on finding ways to keep the human in the loop without constant babysitting [00:21:54].
Future of AI Agents
Collaborative Agents and Human-Agent Teams
Narayanan’s team built an “AI agent Zoo” where different agents collaborate on tasks, like writing jokes (though the jokes were “awful,” the point was collaboration) [00:15:36]. This highlights that agents are “more naturally collaborative” than competitive in isolation [00:15:53]. Even for simple tasks, agents generate millions of tokens to understand their environment, tools, and collaborators, making progress but also incurring high inference costs [00:16:47].
The future will involve teams of humans and agents working together [00:23:30]. The “Jagged Frontier” idea suggests that models will excel at certain things (like calculators) but lack the common sense of a child in other areas, necessitating hybridization with human capabilities [00:23:41]. Open questions include whether to integrate agents into existing human collaboration tools (like Slack, email, blogging) or build new ones [00:24:04]. New tools are needed for humans to visualize and interpret the millions of tokens of an agent’s actions (e.g., Human Layer framework) [00:24:24].
Form Factor and Ubiquitous AI
The “right kind of form factor for AI for most everyday uses” is uncertain [00:12:13]. AI might be constantly monitoring and offering improvements in conversations and workplaces, integrated into workflows [00:12:26]. This could range from special-purpose apps (like ChatGPT) to integration within existing software (like Photoshop’s AI features) or even agents monitoring screenshots and phone activity [00:12:40].
Narayanan expresses interest in AI integrated into smart glasses (like Meta Ray-Ban), where AI can see everything the user sees without device mediation [00:13:25]. Examples include remembering object locations (e.g., lost keys) or real-time language translation in foreign countries [00:14:16]. The battery life of current devices (e.g., 2 hours) is a significant constraint [00:14:03].
Underhyped Applications
Beyond the hyped applications, “boring things that are not sexy to talk about” can bring significant economic value [00:50:06]. Examples include:
- Summarizing hours of C-SPAN meetings for legal professionals [00:50:14].
- Translating old codebases (like COBOL) to modern languages [00:53:49].
These applications unlock “enormous value” but are often overlooked in the hype cycle [00:53:54].
Policy and Societal Implications
Export controls on AI hardware are more effective than controls on models, which are shrinking in size and becoming ever harder to keep from diffusing [00:25:22]. There’s a tendency to focus too much on “innovation” and too little on “diffusion”: the process of adopting technology and reorganizing institutions, laws, and norms to take advantage of it [00:26:09].
Rapid adoption of generative AI has been reported (e.g., 40% usage), but the “intensity of adoption” (e.g., half an hour to three hours per week) is relatively low compared to past technologies like the PC [00:27:39]. This could be because AI is not yet as immediately useful for many people, or due to policy gaps.
Narayanan highlights the need for education on using AI productively and avoiding pitfalls [00:28:46]. Students are often hesitant to use AI, viewing it as a “cheating tool” [00:29:05]. Policies should make it easier for teachers to upskill and teach AI literacy to students, from K-12 to college level [00:29:30].
AI in Education and Inequality
AI is unlikely to fundamentally change education, similar to how online courses didn’t replace classrooms [00:40:16]. The core value of education lies in social preconditions for learning, motivation, connections, caring, and individualized feedback [00:40:56].
For children, AI presents a “really high variance” in outcomes [00:42:09]. For wealthier families with time and resources to monitor usage, AI can be “enormously positive” (e.g., Khan Academy, custom phonics apps, time-telling apps created instantly with Claude’s artifacts feature) [00:42:31]. However, for other children, AI poses risks of addiction and negative impacts, especially as schools are likely to remain “jittery about AI” [00:44:49]. This could exacerbate inequality, where privileged children benefit from personalized learning outside school while others face addiction issues [00:44:50].
The idea of AI as a democratizing force, making luxuries like personal assistants or tutors accessible, is challenged by the need for supervision, especially for children, and potentially high costs for complex queries [00:45:00].
Lessons from Past Technologies
- Internet: The internet transformed “almost every cognitive task” but had minimal impact on GDP [00:46:51]. Eliminating one bottleneck often just creates new ones [00:47:29]. Similarly, AI could transform workflows without massive GDP growth [00:47:47].
- Industrial Revolution: This era fundamentally transformed the nature of work, moving from manual labor to what we now consider work (e.g., cognitive tasks) [00:47:50]. AI could similarly automate many cognitive tasks, shifting human work towards “AI control” and “AI alignment and safety”—supervising AI due to a lack of trust in its autonomous moral judgments [00:48:20].
Overhyped vs. Underhyped (Quickfire Round)
- Overhyped: Agents (“the hype is kind of out of control”) [00:50:01].
- Underhyped: “Boring things that are not sexy to talk about but can bring a lot of economic value” [00:50:06], such as AI summarizing C-SPAN meetings [00:50:14].
Future of AI Agents by 2025
Narayanan predicts that by the end of 2025, there will still be “relatively few applications where AI is autonomously doing things for you” [00:51:56], though agentic workflows for generative tasks will continue to increase [00:51:50].
AGI Timeline
Instead of AGI, Narayanan prefers to discuss when AI will have transformative economic impacts, such as a massive impact on GDP [00:52:12], predicting this is “decades out,” not years away [00:52:19].
Weirdest Prediction
Younger generations will be trained to expect chatbots as the primary way of accessing information, mediated by a “fundamentally statistical tool that could hallucinate” [00:52:37]. This will make traditional search (clicking through to websites for authoritative sources) akin to going to a library for future generations: used only if their life depends on it; otherwise convenience will prevail [00:53:17].
Conclusion
Arvind Narayanan’s work emphasizes a grounded perspective on AI, advocating for a focus on concrete applications and their societal impacts rather than generalized hype [00:55:50]. He argues that AI’s impact will unfold over decades, much like the internet, rather than in the next few years [00:56:03]. He encourages a “balanced look at both the positives and the negatives of AI” [00:56:38].