From: hu-po

AI Agents are a significant area of research, with recent advancements focusing on Large Language Models (LLMs) that utilize tools and operate in iterative loops [02:53:00]. While the field has seen considerable hype, including startup demos such as Devin, current benchmarks indicate that the technology is “not quite there yet” [04:39:00].

Current State and Challenges

Recent controversy has centered on highly publicized AI agent demonstrations, such as Devin’s, which have been criticized as “faked” or misleading [04:15:00]. Issues identified include:

  • The problem solved in the demo not matching stated requirements [04:17:00].
  • Editing non-existent files or fixing nonsensical errors [04:24:24].
  • Tasks appearing quick in the demo but stretching over many hours or even days in reality [04:35:00].

Benchmark results tell a similar story. On OSWorld, humans achieve over 72% success on tasks while the best AI models achieve only 12%, struggling with GUI grounding and operational knowledge [05:57:00]. Similarly, the WebArena benchmark reported that the best GPT-4 agent achieved only a 14% end-to-end success rate compared to human performance of 78% [05:25:00]. This indicates that agents are often “oversold” [06:07:00].

Core Components of LLM-based Agents

AI agent architectures, especially those discussed in survey papers like “A Survey on Large Language Model-based Game Agents,” share common conceptual components [08:48:00] (a minimal code sketch of the resulting loop follows the list):

  • Perception/Observation State: This input can include images, text, or structured information about the environment [11:11:00]. For operating system tasks, agents might use accessibility tools designed for humans with disabilities to perceive screen elements [21:19:00].
  • Memory: Agents often incorporate a “memory bank” or “knowledge bank” for retrieval augmented generation (RAG). This involves storing and retrieving information from a “cold storage” to improve performance on specific tasks [11:17:00].
  • Thinking/Reasoning: This involves using multiple auto-regressive steps to generate internal context or “Chain of Thought” reasoning before producing a final answer [11:31:00].
  • Role-Playing: Agents can be given different “identities” or system prompts, which can lead to varied behaviors and facilitate multi-agent cooperation [12:11:00].
  • Action: This refers to the agent’s output, which can be direct commands (like mouse movements or keyboard input) or more abstract actions [12:53:00]. Some benchmarks like OSWorld emphasize “free-form raw keyboard and mouse control” for more human-like interaction [22:11:00].
  • Learning: This can involve updating memory banks, building successful/unsuccessful trajectories, or using reinforcement learning paradigms [12:57:00].
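To make the division of labor concrete, the sketch below wires these components into a single loop. Everything here is a placeholder assumption for illustration (the `llm`, `env`, and `memory` objects and the `ACTION:` line convention are not any particular framework’s API); it only shows the perceive → retrieve → reason → act → learn cycle described above.

```python
# Hypothetical agent loop tying together perception, memory, reasoning,
# role-playing, action, and learning. All interfaces here are placeholders.

def parse_action(response: str) -> str:
    # Take the last line beginning with "ACTION:" as the chosen action.
    for line in reversed(response.splitlines()):
        if line.startswith("ACTION:"):
            return line[len("ACTION:"):].strip()
    return response.strip()

def agent_loop(llm, env, memory, system_prompt: str, max_steps: int = 10):
    observation = env.reset()                          # Perception: text/image/structured state
    for _ in range(max_steps):
        recalled = memory.retrieve(observation, k=3)   # Memory: RAG over a knowledge bank
        prompt = (
            f"{system_prompt}\n"                       # Role-playing: identity via system prompt
            f"Relevant past experience:\n{recalled}\n"
            f"Current observation:\n{observation}\n"
            "Think step by step, then end with a line starting with ACTION:."
        )
        response = llm(prompt)                         # Thinking: chain-of-thought before acting
        action = parse_action(response)                # Action: command for the environment
        observation, done = env.step(action)
        memory.store(observation, action)              # Learning: grow the trajectory store
        if done:
            break
```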

Reinforcement Learning Influence

Many modern AI agent frameworks are heavily influenced by traditional reinforcement learning (RL) concepts [29:35:00].

Markov Decision Processes (MDPs)

Agent tasks are often formalized as Partially Observable Markov Decision Processes (POMDPs), which involve a state space (S), observation space (O), action space (A), transition function (T), and reward function (R) [29:45:00].

  • State: The current configuration of the environment (e.g., the layout of a chess board) [30:38:00].
  • Action: An agent’s choice that transitions the environment from one state to another [30:48:00].
  • Partially Observable: The agent does not have perfect knowledge of the true state, only an observation of it [32:05:00].
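In common RL notation (a sketch matching the five elements listed above), the POMDP can be written as the tuple

$$
\mathcal{M} = (S, O, A, T, R), \qquad
T(s' \mid s, a) = \Pr\left(s_{t+1} = s' \mid s_t = s,\ a_t = a\right), \qquad
R : S \times A \to \mathbb{R},
$$

where the agent receives only an observation $o \in O$ of the underlying state $s \in S$, rather than the state itself.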

Natural Language as State and Action Space

A key shift in modern agents is the use of natural language for state, observation, and action spaces, rather than discrete numerical vectors [37:51:00]. This allows for a “fuzzier” and potentially more natural representation, making RL concepts more adaptable to complex, real-world tasks [38:48:00].
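A hedged sketch of what natural-language state and action spaces look like in practice: the observation is a free-form string (for example, a serialized accessibility tree), the action is a free-form string, and the “policy” is a single LLM call. The `llm` callable and the prompt wording are assumptions for illustration.

```python
from typing import Callable

# Hypothetical natural-language "policy": observation in, action out,
# both as plain strings rather than numeric vectors.
def nl_policy(llm: Callable[[str], str], observation: str) -> str:
    prompt = (
        "You are an agent controlling a web browser.\n"
        f"Observation (serialized accessibility tree):\n{observation}\n"
        "Reply with one natural-language action, e.g. "
        "'click the Submit button' or 'type hello into the search box'."
    )
    return llm(prompt)  # e.g. "click the 'Sign in' link"
```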

Reward Functions and Actor-Critic Frameworks

Concepts like reward functions (which evaluate whether an action or sequence of actions was good or bad) and actor-critic frameworks (where an “actor” chooses actions and a “critic” evaluates their usefulness) are prevalent [47:51:00]. For example, some models use a “demonstration ranker” to predict the success of an action sequence, which functions similarly to an RL reward model [47:25:00]. OpenAI’s Q* concept is hypothesized to be a Q-function that uses a language model to evaluate state-action pairs in natural language [49:23:00].
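To make the actor-critic analogy concrete, here is a hypothetical sketch in which an “actor” LLM proposes candidate plans and a “critic” LLM scores them, playing the role of a reward model or demonstration ranker. The prompts and the 0-to-1 scoring convention are assumptions for illustration, not the method of any specific paper.

```python
from typing import Callable

LLM = Callable[[str], str]  # placeholder type for a text-in, text-out model

def propose_plans(actor: LLM, observation: str, n: int = 3) -> list[str]:
    # Actor: sample several candidate plans for the same observation.
    prompt = f"Observation:\n{observation}\nPropose one short action plan."
    return [actor(prompt) for _ in range(n)]

def score_plan(critic: LLM, observation: str, plan: str) -> float:
    # Critic / reward model: rate how likely the plan is to succeed.
    prompt = (
        f"Observation:\n{observation}\nProposed plan:\n{plan}\n"
        "On a scale from 0 to 1, how likely is this plan to achieve the task? "
        "Answer with only the number."
    )
    try:
        return float(critic(prompt).strip())
    except ValueError:
        return 0.0  # Unparseable critiques count as a low score.

def act(actor: LLM, critic: LLM, observation: str) -> str:
    plans = propose_plans(actor, observation)
    return max(plans, key=lambda p: score_plan(critic, observation, p))
```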

Task-Specific Agent Loops

Currently, most successful agent loops are task-specific. Examples include:

  • Web Agents (e.g., Wilbur): Designed for navigating and interacting with websites [11:15:00]. They use retrieval augmented generation to select relevant past demonstrations (both successful and unsuccessful) to inform current actions [01:06:58] (a sketch of this retrieval step follows the list).
  • Research Agents: Focused on generating and refining research ideas from scientific literature [01:09:27]. They utilize an “entity-centric knowledge store” for memory [01:10:16].
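The demonstration-retrieval step mentioned for Wilbur above can be sketched as plain embedding-similarity lookup over stored trajectories, each labeled as a success or failure. The `embed` function, the data layout of `store`, and the `top_k` parameter are hypothetical stand-ins, not the paper’s actual implementation.

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    # Plain cosine similarity; a real system would likely use a vector database.
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / (norm + 1e-8)

def retrieve_demonstrations(embed, store, task_description: str, top_k: int = 4) -> list[str]:
    # `store` holds (embedding, trajectory_text, succeeded) triples from past runs.
    query = embed(task_description)
    scored = sorted(store, key=lambda item: cosine(query, item[0]), reverse=True)
    # Keep both successful and unsuccessful examples so the agent can imitate
    # what worked and steer away from what did not.
    return [
        ("GOOD EXAMPLE:\n" if succeeded else "FAILED ATTEMPT:\n") + text
        for _, text, succeeded in scored[:top_k]
    ]
```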

The common underlying principle in these task-specific agent loops is intelligent prompt engineering and retrieval augmented generation. This involves strategically filtering and selecting information from a “knowledge bank” to populate the LLM’s context window, thereby influencing its behavior without traditional fine-tuning [01:41:51].
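A minimal sketch of that principle: retrieve from the knowledge bank, then pack as much of the retrieved material as fits into a token budget before appending the task. The 4-characters-per-token heuristic and the prompt layout are rough assumptions, not a specific paper’s recipe.

```python
def build_context(retrieved: list[str], task: str, system_prompt: str,
                  token_budget: int = 8000) -> str:
    # Rough heuristic: ~4 characters per token (an assumption, not exact).
    char_budget = token_budget * 4
    parts = [system_prompt]
    used = len(system_prompt)
    for item in retrieved:   # assumed sorted most-relevant first
        if used + len(item) > char_budget:
            break
        parts.append(item)
        used += len(item)
    parts.append(f"Task:\n{task}")
    return "\n\n".join(parts)
```

The agent’s behavior is then steered entirely by what `build_context` chooses to include, which is the “prompt engineering instead of fine-tuning” idea in miniature.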

Future Outlook: Context Length and Fine-tuning

The increasing context length of LLMs (e.g., Gemini 1.5 with 1 million tokens [01:14:50]) could significantly change how agents are developed and deployed. It’s hypothesized that if context windows become large enough (e.g., 10 million or a billion tokens), fine-tuning might become obsolete [01:15:01]. Instead, an entire “fine-tuning dataset” could be provided directly within the model’s context [01:15:10].
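As a toy illustration of “the fine-tuning dataset goes in the context window,” the sketch below serializes a whole dataset of (input, output) pairs as few-shot examples inside one very long prompt. The dataset, instruction text, and `llm` callable are placeholders; whether this actually matches fine-tuning quality at scale is exactly the open question raised here.

```python
def in_context_dataset_prompt(dataset, query: str, instruction: str) -> str:
    # Instead of fine-tuning on `dataset`, serialize all of it as few-shot
    # examples for a long-context model (e.g. ~1M tokens for Gemini 1.5).
    examples = "\n\n".join(f"Input: {x}\nOutput: {y}" for x, y in dataset)
    return f"{instruction}\n\n{examples}\n\nInput: {query}\nOutput:"

# Hypothetical usage:
# prompt = in_context_dataset_prompt(train_pairs, new_example, "Label the sentiment.")
# answer = llm(prompt)
```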

This shift would simplify development by eliminating the need for complex fine-tuning pipelines and for managing numerous specialized models [01:15:30]. Instead, a single, general “foundation agent,” potentially a large Vision Language Model (VLM), could adapt to various tasks solely through in-context learning [01:59:53]. While this would increase inference costs due to larger prompts, ongoing advancements in hardware and efficiency (e.g., quantization, Mixture of Experts) might mitigate this [01:50:24].

Ultimately, the goal is to move from task-specific agent loops towards a more generic, “foundation agent” that can generalize across a wide range of tasks, mirroring the generality seen in current LLMs themselves [01:13:09].