From: hu-po
AI Agents are a significant area of research, with recent advances focusing on Large Language Models (LLMs) that use tools and operate in iterative loops [02:53:00]. While the field has seen considerable hype, including demonstrations of products like Devin, current benchmarks indicate that the technology is “not quite there yet” [04:39:00].
Current State and Challenges
Some highly publicized demonstrations of AI agents, such as Devin’s, have recently been criticized as “faked” or misleading [04:15:00]. Issues identified include:
- The problem solved in the demo not matching stated requirements [04:17:00].
- Editing non-existent files or fixing nonsensical errors [04:24:24].
- Tasks appearing quick in the demo but stretching over many hours or even days in reality [04:35:00].
Benchmark papers tell a similar story. On the OS World benchmark, humans achieve over 72% success on its tasks, while the best AI models achieve only 12%, struggling with GUI grounding and operational knowledge [05:57:00]. Similarly, the WebArena benchmark reported that the best GPT-4-based agent achieved only a 14% end-to-end success rate, compared to human performance of 78% [05:25:00]. This indicates that agents are often “oversold” [06:07:00].
Core Components of LLM-based Agents
AI agent architectures, especially those discussed in survey papers like “A Survey on Large Language Model-based Game Agents,” share common conceptual components (a minimal loop combining them is sketched after this list) [08:48:00]:
- Perception/Observation State: This input can include images, text, or structured information about the environment [11:11:00]. For operating system tasks, agents might use accessibility tools designed for humans with disabilities to perceive screen elements [21:19:00].
- Memory: Agents often incorporate a “memory bank” or “knowledge bank” for retrieval augmented generation (RAG). This involves storing and retrieving information from a “cold storage” to improve performance on specific tasks [11:17:00].
- Thinking/Reasoning: This involves using multiple auto-regressive steps to generate internal context or “Chain of Thought” reasoning before producing a final answer [11:31:00].
- Role-Playing: Agents can be given different “identities” or system prompts, which can lead to varied behaviors and facilitate multi-agent cooperation [12:11:00].
- Action: This refers to the agent’s output, which can be direct commands (like mouse movements or keyboard input) or more abstract actions [12:53:00]. Some benchmarks like OS World emphasize “free-form raw keyboard and mouse control” for a more human-like interaction [22:11:00].
- Learning: This can involve updating memory banks, building successful/unsuccessful trajectories, or using reinforcement learning paradigms [12:57:00].
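As a rough illustration of how these components fit together, here is a minimal agent-loop sketch in Python. Every helper in it (perceive, retrieve_demonstrations, llm, execute) is a hypothetical stub for illustration, not any real framework’s API.

```python
# Minimal sketch of an LLM-based agent loop combining the components above.
# All helpers are hypothetical stubs, not a real framework's API.

def perceive(env: dict) -> str:
    """Perception: return a text observation (e.g. an accessibility-tree dump)."""
    return env.get("observation", "")

def retrieve_demonstrations(memory: list[str], task: str, k: int = 3) -> str:
    """Memory: naive retrieval of past trajectories that share words with the task."""
    words = set(task.lower().split())
    hits = [m for m in memory if words & set(m.lower().split())]
    return "\n".join(hits[:k])

def llm(prompt: str) -> str:
    """Thinking: stand-in for a call to a language model (chain of thought + action)."""
    return "The task looks complete.\nDONE"

def execute(env: dict, action: str) -> bool:
    """Action: apply the chosen action to the environment; return True when finished."""
    return action.strip() == "DONE"

def agent_loop(task: str, env: dict, memory: list[str], max_steps: int = 20) -> None:
    system_prompt = "You are a careful assistant operating a computer."  # role-playing
    for _ in range(max_steps):
        observation = perceive(env)                           # perception/observation state
        demos = retrieve_demonstrations(memory, task)         # memory via RAG
        prompt = (
            f"{system_prompt}\nTask: {task}\n"
            f"Relevant past examples:\n{demos}\n"
            f"Current observation:\n{observation}\n"
            "Think step by step, then output one action on the last line."
        )
        thought_and_action = llm(prompt)                      # thinking/reasoning
        action = thought_and_action.splitlines()[-1]          # last line = chosen action
        done = execute(env, action)                           # action
        memory.append(f"{task} | {observation} -> {action}")  # learning: store trajectory
        if done:
            break

agent_loop("open the settings menu", {"observation": "Desktop with a Settings icon."}, memory=[])
```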
Reinforcement Learning Influence
Many modern AI agent frameworks are heavily influenced by traditional reinforcement learning (RL) concepts [29:35:00].
Markov Decision Processes (MDPs)
Agent tasks are often formalized as Partially Observable Markov Decision Processes (POMDPs), which involve a state space (S), observation space (O), action space (A), transition function (T), and reward function (R) [29:45:00].
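In standard notation (the survey’s exact symbols may differ slightly), this tuple can be written as:

```latex
\mathcal{M} = (\mathcal{S}, \mathcal{O}, \mathcal{A}, T, R),
\qquad s_{t+1} \sim T(\,\cdot \mid s_t, a_t\,),
\qquad r_t = R(s_t, a_t)
```

where at each step the agent receives only an observation o_t from O, derived from the true state s_t in S, rather than the state itself.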
- State: The current configuration of the environment (e.g., the layout of a chess board) [30:38:00].
- Action: An agent’s choice that transitions the environment from one state to another [30:48:00].
- Partially Observable: The agent does not have perfect knowledge of the true state, only an observation of it [32:05:00].
Natural Language as State and Action Space
A key shift in modern agents is the use of natural language for state, observation, and action spaces, rather than discrete numerical vectors [37:51:00]. This allows for a “fuzzier” and potentially more natural representation, making RL concepts more adaptable to complex, real-world tasks [38:48:00].
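To make the contrast concrete, here is a hypothetical side-by-side example; the specific values and strings are made up for illustration.

```python
# Classic RL: numeric state vector, discrete action index.
classic_state = [0.12, -3.4, 7.0, 1.0]   # e.g. positions and velocities
classic_action = 2                        # index into a fixed action set

# Language-based agent: both observation and action are free-form text.
language_observation = (
    "Browser window: a login form with fields 'Email' and 'Password', "
    "and a blue 'Sign in' button."
)
language_action = "Type the user's email into the 'Email' field, then click 'Sign in'."
```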
Reward Functions and Actor-Critic Frameworks
Concepts like reward functions (which evaluate whether an action or sequence of actions was good or bad) and actor-critic frameworks (where an “actor” chooses actions and a “critic” evaluates their usefulness) are prevalent [47:51:00]. For example, some models use a “demonstration ranker” to predict the success of an action sequence, which functions similarly to an RL reward model [47:25:00]. OpenAI’s Q* concept is hypothesized to be a Q-function that uses a language model to evaluate state-action pairs in natural language [49:23:00].
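A minimal sketch of that hypothesized idea in Python, with a stubbed llm() call standing in as the critic; the 0–10 scoring scheme and prompt wording are illustrative assumptions, not a documented OpenAI method.

```python
# Sketch of a language-model "critic" that scores a state-action pair,
# loosely mirroring a Q-function Q(s, a). The llm() stub and the 0-10 scale
# are assumptions for illustration only.

def llm(prompt: str) -> str:
    """Stand-in for a call to a language model; returns a short numeric reply."""
    return "7"

def q_value(state: str, action: str) -> float:
    prompt = (
        "You are evaluating an agent's plan.\n"
        f"Current situation: {state}\n"
        f"Proposed action: {action}\n"
        "On a scale of 0 to 10, how likely is this action to make progress "
        "toward completing the task? Reply with a single number."
    )
    reply = llm(prompt)
    try:
        return float(reply.strip())
    except ValueError:
        return 0.0  # unparseable critique counts as no evidence of progress

def choose_action(state: str, candidates: list[str]) -> str:
    """Actor-critic style selection: the actor proposes candidates, the critic picks one."""
    return max(candidates, key=lambda a: q_value(state, a))

best = choose_action(
    "A half-filled web form asking for shipping details.",
    ["Submit the form now.", "Fill in the missing address field first."],
)
```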
Task-Specific Agent Loops
Currently, most successful agent loops are task-specific. Examples include:
- Web Agents (e.g., Wilbur): Designed for navigating and interacting with websites [11:15:00]. They use retrieval augmented generation to select relevant past demonstrations (both successful and unsuccessful) to inform current actions [01:06:58].
- Research Agents: Focused on generating and refining research ideas from scientific literature [01:09:27]. They utilize an “entity-centric knowledge store” for memory [01:10:16].
The common underlying principle in these task-specific agent loops is intelligent prompt engineering and retrieval augmented generation. This involves strategically filtering and selecting information from a “knowledge bank” to populate the LLM’s context window, thereby influencing its behavior without traditional fine-tuning [01:41:51].
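A minimal sketch of that shared pattern, with a toy keyword-overlap scorer standing in for whatever embedding-based retrieval a real system would use; all names and demonstrations here are hypothetical.

```python
# Sketch of the shared pattern: retrieve a few relevant entries from a
# "knowledge bank" of past demonstrations and splice them into the prompt.
# Keyword overlap is a toy stand-in for embedding similarity.

def score(entry: str, task: str) -> int:
    return len(set(task.lower().split()) & set(entry.lower().split()))

def build_prompt(task: str, observation: str, knowledge_bank: list[str], k: int = 3) -> str:
    retrieved = sorted(knowledge_bank, key=lambda e: score(e, task), reverse=True)[:k]
    demos = "\n".join(f"- {d}" for d in retrieved)
    return (
        f"Task: {task}\n"
        f"Past demonstrations (successful and unsuccessful):\n{demos}\n"
        f"Current observation:\n{observation}\n"
        "Decide the next action."
    )

knowledge_bank = [
    "SUCCESS: To book a flight, open the date picker before typing the city.",
    "FAILURE: Clicking 'Search' before filling the destination returns an error.",
    "SUCCESS: Use the site's own search bar rather than scrolling the whole page.",
]
print(build_prompt("book a flight to Berlin", "Homepage of a travel site.", knowledge_bank))
```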
Future Outlook: Context Length and Fine-tuning
The increasing context length of LLMs (e.g., Gemini 1.5 with 1 million tokens [01:14:50]) could significantly change how agents are developed and deployed. It’s hypothesized that if context windows become large enough (e.g., 10 million or a billion tokens), fine-tuning might become obsolete [01:15:01]. Instead, an entire “fine-tuning dataset” could be provided directly within the model’s context [01:15:10].
This shift would simplify development by eliminating the need for complex fine-tuning processes and for managing numerous specialized models [01:15:30]. Instead, a single, general “foundation agent” (potentially a large Vision Language Model, or VLM) could adapt to various tasks solely through in-context learning [01:59:53]. While this would increase inference costs due to larger prompts, ongoing advances in hardware and efficiency (e.g., quantization, Mixture of Experts) might mitigate this [01:50:24].
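A rough sketch of what providing the “fine-tuning dataset” in context could look like, assuming a long-context model; the token budget, the crude word-count proxy for tokens, and the toy dataset are illustrative assumptions.

```python
# Sketch of in-context "fine-tuning": instead of updating weights, the entire
# task dataset is serialized into the prompt of a long-context model.
# The 1M-token budget and the dataset are illustrative assumptions.

CONTEXT_BUDGET_TOKENS = 1_000_000  # e.g. a Gemini-1.5-class context window

def rough_token_count(text: str) -> int:
    return len(text.split())  # crude proxy; a real tokenizer would differ

def build_in_context_dataset_prompt(dataset: list[tuple[str, str]], query: str) -> str:
    examples = "\n\n".join(f"Input: {x}\nOutput: {y}" for x, y in dataset)
    prompt = (
        "Learn the task from the examples below, then answer the final input.\n\n"
        f"{examples}\n\nInput: {query}\nOutput:"
    )
    assert rough_token_count(prompt) < CONTEXT_BUDGET_TOKENS, "dataset too large for context"
    return prompt

dataset = [("2+2", "4"), ("3+5", "8")]  # stand-in for a full fine-tuning set
print(build_in_context_dataset_prompt(dataset, "7+6"))
```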
Ultimately, the goal is to move from task-specific agent loops towards a more generic, “foundation agent” that can generalize across a wide range of tasks, mirroring the generality seen in current LLMs themselves [01:13:09].