From: aidotengineer
AI agents are a significant area of progress in artificial intelligence, with figures like Bill Gates calling them the “biggest revolution in computing” [00:00:29] and Sam Altman predicting 2025 as the “year of agents” [00:00:39]. While some skepticism remains about their capabilities, particularly whether they can plan beyond being simple Large Language Model (LLM) wrappers [00:00:46], recent advances show promising developments in their self-improvement and reasoning abilities [00:00:58].
What Are AI Agents?
At their core, AI agents are systems that interact with an environment by perceiving it and taking actions [00:02:43]. The concept is not new, but it has been significantly amplified by the power of large language models [00:01:11].
The process of an AI agent typically involves four main steps (a minimal sketch follows the list):
- Perception: The agent understands its environment by sensing information from various modalities like text, image, audio, video, or touch [00:01:24].
- Reasoning: After perceiving information, the agent processes it to understand how to complete tasks, break them down into individual steps, and decide on appropriate tools or actions [00:01:37]. This inner planning process is often referred to as Chain of Thought reasoning, powered by LLMs [00:02:02].
- Reflection (Meta-reasoning): Agents can perform meta-reasoning steps, asking themselves if previous actions were correct and if they need to adjust their approach [00:02:10].
- Actions: The agent performs actions, which can range from talking to a human to moving physically or executing digital commands [00:02:25].
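The four steps above can be wired into a simple loop. The sketch below is illustrative only: it assumes a generic `llm(prompt) -> str` text-generation callable, and none of the function or variable names come from the talk.

```python
from typing import Callable

# Assumed interface: any text-generation backend exposed as prompt -> text.
LLM = Callable[[str], str]

def agent_step(llm: LLM, goal: str, observation: str, history: list[str]) -> str:
    """One perception -> reasoning -> reflection -> action cycle."""
    # Perception: fold the new observation into the agent's running context.
    history.append(f"Observation: {observation}")
    context = "\n".join(history)

    # Reasoning: ask the model to break the task down and propose the next action.
    plan = llm(f"Goal: {goal}\n{context}\nThink step by step and propose the next action.")

    # Reflection (meta-reasoning): check the proposed action before committing to it.
    critique = llm(f"Proposed action: {plan}\nIs this the right step toward the goal "
                   f"'{goal}'? Answer OK or suggest a revision.")
    if "OK" not in critique.upper():
        plan = llm(f"Revise the action using this feedback:\n{critique}\n{context}")

    # Action: hand the (possibly revised) action to the environment.
    history.append(f"Action: {plan}")
    return plan
```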
Levels of Autonomy in AI Agents
The deployment of AI agents can be understood through an analogy with levels of autonomy in self-driving cars [00:03:02]:
- Level 1 (Chatbot): Simple information retrieval, like a chatbot from 2017 [00:03:12].
- Level 2 (Agent Assist): LLMs generate suggested responses for human agents (e.g., customer service), requiring human approval before sending [00:03:20].
- Level 3 (Agent as a Service): LLMs automate AI workflows for specific tasks like meeting bookings or writing job descriptions [00:03:35].
- Level 4 (Autonomous Agents): An AI can delegate and perform multiple, interconnected tasks, sharing knowledge and resources [00:03:51].
- Level 5 (Jarvis-like Agents): Full trust in the agent, delegating all security measures (e.g., keys) to perform actions on behalf of the user [00:04:16].
While self-driving cars are high-risk agents, AI agents can be deployed for low-risk tasks (e.g., filing reimbursements with human supervision) and gradually move towards high-risk, customer-facing tasks as trust is built [00:05:08].
Improving Reasoning and Reflection in AI Agents
Self-Refinement with Large Language Models
For mathematical reasoning tasks, the reasoning abilities of LLMs can be improved in two main ways (illustrated after the list):
- Few-shot prompting: Providing examples of similar problems and their answers as context [00:06:40].
- Chain of Thought reasoning: Instructing the model to “think step by step” to reach an answer [00:07:01].
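To make these two baselines concrete, a prompt might be assembled as below. The helper function and the example problems are invented for illustration, not taken from the talk.

```python
def build_math_prompt(question: str, examples: list[tuple[str, str]]) -> str:
    """Combine few-shot examples with a Chain of Thought instruction."""
    shots = "\n\n".join(f"Q: {q}\nA: {a}" for q, a in examples)
    return (f"{shots}\n\n"
            f"Q: {question}\n"
            "A: Let's think step by step.")  # the Chain of Thought trigger phrase

prompt = build_math_prompt(
    "If 3 pens cost $4.50, how much do 7 pens cost?",
    examples=[("What is 12 * 3?", "12 * 3 = 36. The answer is 36.")],
)
```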
A more recent method combines these approaches by incorporating self-reflection or self-improvement [00:07:23]. The model generates an initial answer, then prompts itself to generate feedback on its own answer (e.g., “the part blah blah is incorrect”) [00:07:57]. This feedback, combined with the original question and answer, is used to prompt the model again to update its answer and internal processes [00:08:18]. This “self-refined” process can be iterated multiple times until the correct answer is reached [00:08:43].
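A minimal sketch of this self-refinement loop, again assuming a generic `llm(prompt) -> str` callable (the prompt wording and stopping criterion are illustrative):

```python
from typing import Callable

LLM = Callable[[str], str]  # assumed generic text-generation interface

def self_refine(llm: LLM, question: str, max_iters: int = 3) -> str:
    """Answer, self-critique, and revise until the model accepts its own answer."""
    answer = llm(f"Q: {question}\nA: Let's think step by step.")
    for _ in range(max_iters):
        feedback = llm(f"Question: {question}\nProposed answer: {answer}\n"
                       "Point out any incorrect step, or reply CORRECT.")
        if "CORRECT" in feedback.upper():
            break  # the model judges its own answer to be right
        answer = llm(f"Question: {question}\nPrevious answer: {answer}\n"
                     f"Feedback: {feedback}\nWrite a corrected answer.")
    return answer
```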
Challenges with Smaller LLMs
While effective with larger LLMs, this self-improvement process can be problematic with smaller models (e.g., a 7-billion-parameter Llama model) [00:08:58].
- Noise Propagation: Feedback generated by smaller models can contain “noise” that propagates to correction steps, leading to worse results [00:09:16].
- Incompatibility: Internal logics and demonstrations from larger LLMs may not be compatible with smaller models, making the feedback useless [00:09:56].
Proposed Solution: Distillation from Larger Models
To enable self-improvement in smaller models, a method called “Wiass” proposes:
- Using a smaller model to generate an initial answer and self-feedback [00:11:42].
- Employing a larger LLM or external tools (like Python scripts for math tasks) to edit the smaller model’s feedback, tailoring it to the smaller model’s internal logic [00:11:48].
- Using this corrected feedback to update the answer iteratively until the problem is solved [00:12:01].
This process generates “traces” of trial and error, which can be filtered and used to train existing smaller models to perform self-improvement with guidance [00:12:29]. On-policy supervised training (generating feedback in real-time as the model changes) showed significant improvement (up to 48% after three iterations) compared to simple supervised fine-tuning [00:13:52].
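A rough sketch of this guided loop, including the trace collection used for later fine-tuning. It assumes `small` and `editor` are generic text-generation callables and `verify` is an external checker such as a Python script for math answers; none of these names come from the talk.

```python
from typing import Callable

LLM = Callable[[str], str]        # assumed interfaces for the small and large models
Verifier = Callable[[str], bool]  # e.g. a script that checks the math result

def guided_self_improve(small: LLM, editor: LLM, verify: Verifier,
                        question: str, max_iters: int = 3):
    """Small model answers and critiques itself; a stronger editor fixes the critique.

    Returns the final answer plus the trial-and-error trace that can later be
    filtered and used to fine-tune the small model.
    """
    trace = []
    answer = small(f"Q: {question}\nA: Let's think step by step.")
    for _ in range(max_iters):
        if verify(answer):
            break
        raw_feedback = small(f"Critique this answer to '{question}':\n{answer}")
        # The larger model (or tool) rewrites the noisy critique so it is correct
        # and phrased in terms the small model can actually act on.
        edited = editor(f"Rewrite this critique so it is accurate and concrete:\n{raw_feedback}")
        answer = small(f"Question: {question}\nAnswer: {answer}\n"
                       f"Feedback: {edited}\nWrite a corrected answer.")
        trace.append({"answer": answer, "feedback": edited})
    return answer, trace
```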
Takeaway
Models’ reflection and self-learning abilities can be improved using synthetic data generation policies, reducing the reliance on explicit human supervision [00:15:01]. However, the quality of improvement is capped by the ceiling effect of the larger models or verifiers used for editing [00:15:28].
Eliciting Stronger Model Behavior Through Test-Time Scaling
Beyond pre-training, test-time scaling offers a way to elicit better performance from existing pre-trained LLMs by giving them more steps or computational budget during inference [00:17:29]. This includes methods like “think step by step” or applying reflection [00:17:41].
Sequential Decision-Making in Dialogue Tasks
Dialogue tasks, like a donation persuasion scenario, involve sequential decision-making where an agent must strategize and plan its conversational moves ahead of time [00:18:38]. This is analogous to games like chess, where players plan moves by simulating opponent responses [00:19:44].
Monte Carlo Tree Search (MCTS) for Dialogue
Tree search algorithms, common in games like AlphaGo, can be applied to conversational settings to improve decision-making [00:20:20].
- Zero-training MCTS: A model is prompted to act as a policy, propose actions, simulate outcomes, and evaluate action quality, updating each action’s quality over time [00:21:12].
- User Simulation: Another LLM can be used to simulate user (opponent) behavior, generating diverse responses to proposed strategies [00:22:08].
- Open-Loop MCTS: Unlike the closed-loop MCTS used in deterministic games, human responses introduce variance, necessitating an open-loop MCTS approach that stochastically re-samples simulated user responses rather than caching fixed child states (see the sketch after this list) [00:23:03].
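The open-loop search described above can be sketched roughly as follows. This is a simplified, single-level illustration rather than the actual GDP-Zero implementation: `policy`, `user_sim`, and `evaluator` are assumed LLM callables, and parsing the evaluator's reply into a number is glossed over.

```python
import math
from typing import Callable

LLM = Callable[[str], str]  # assumed generic text-generation interface

def open_loop_mcts(policy: LLM, user_sim: LLM, evaluator: LLM,
                   dialogue: str, strategies: list[str],
                   n_sims: int = 20, c: float = 1.4) -> str:
    """Choose the next persuasion strategy by repeatedly simulating user replies."""
    q = {s: 0.0 for s in strategies}  # accumulated value per strategy
    n = {s: 0 for s in strategies}    # visit counts
    total = 0

    def ucb(s: str) -> float:         # upper-confidence bound used for selection
        if n[s] == 0:
            return float("inf")
        return q[s] / n[s] + c * math.sqrt(math.log(total + 1) / n[s])

    for _ in range(n_sims):
        s = max(strategies, key=ucb)
        # Open loop: re-sample the simulated user's reply on every rollout instead
        # of caching a fixed child state, since real user responses vary.
        utterance = policy(f"Dialogue so far:\n{dialogue}\nUse strategy: {s}\nPersuader:")
        reply = user_sim(f"Dialogue so far:\n{dialogue}\nPersuader: {utterance}\nUser:")
        score = float(evaluator("On a scale from 0 to 1, how likely is this user to "
                                f"donate? Reply with a number only.\nUser: {reply}"))
        q[s] += score
        n[s] += 1
        total += 1

    return max(strategies, key=lambda s: q[s] / max(n[s], 1))
```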
In donation persuasion tasks, this MCTS approach (termed GDP-Zero) improved donation rates and led to more convincing, natural, and coherent conversations [00:24:49]. The models learned to self-discover effective strategies, such as avoiding early donation asks and diversifying persuasion tactics (emotional and logical appeals) [00:25:08].
Expanding to Broader AI Agent Tasks
The principles of MCTS and self-improvement can be extended to broader AI agent tasks involving tool use and manipulations, beyond just conversation [00:26:24]. This requires teaching LLMs to perceive the world visually.
Action-Based Visual Language Models
Traditional visual language models (VLMs) are often trained for Visual Question Answering (VQA) tasks [00:27:11]. However, AI agent tasks require VLMs to process computer screenshots and perform actions like clicking buttons to clear a shopping cart [00:27:25]. Standard GPT-4V, without specific planning or agentic training, performs poorly (16% success) on benchmarks like Visual Web Arena [00:28:18], highlighting a lack of agent-environment interaction data.
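One way to picture the required interface: the agent feeds a screenshot and the task to a VLM and expects a structured action back. The sketch below is purely illustrative and does not reflect any actual GPT-4V or benchmark API; the `VLM` signature and the JSON schema are assumptions.

```python
import json
from typing import Callable

# Assumed interface: a vision-language model that takes a screenshot plus a text
# prompt and returns text.
VLM = Callable[[bytes, str], str]

def next_browser_action(vlm: VLM, screenshot: bytes, task: str) -> dict:
    """Ask the VLM for the next UI action as structured JSON the agent can execute."""
    raw = vlm(screenshot, (
        f"Task: {task}\n"
        "Look at the screenshot and reply with JSON such as "
        '{"action": "click", "element": "Clear cart button"} or '
        '{"action": "stop", "reason": "task complete"}.'
    ))
    return json.loads(raw)  # downstream code turns this into a real click or keystroke
```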
R-MCTS: Reflective Monte Carlo Tree Search
A new algorithm, R-MCTS, is introduced to elicit better performance through test-time compute [00:28:48]. It extends MCTS with two key modifications:
- Contrastive Reflection with Memory: The system includes a memory module where agents learn from past interactions. After completing a task, contrastive reflection helps the model internalize successes or errors, saving this experience to a vector database [00:29:40]. For future tasks, relevant reflections are retrieved to enhance decision-making [00:30:05].
- Multi-Agent Debate for State Evaluation: Instead of single prompts for evaluating progress, a “debate” format is used. Models are asked to argue why an action is good or bad, leading to more robust and balanced evaluations and counteracting biases [00:30:38].
These modifications allow R-MCTS to build a search tree on the fly, improve decisions, and continuously learn from experience through reflection [00:31:07].
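The two modifications can be sketched in isolation as below. This is a toy illustration under stated assumptions: string similarity stands in for embedding-based retrieval from a vector database, and `llm` is an assumed text-generation callable.

```python
import difflib
from typing import Callable

LLM = Callable[[str], str]  # assumed generic text-generation interface

class ReflectionMemory:
    """Toy stand-in for the vector database of contrastive reflections."""
    def __init__(self) -> None:
        self.entries: list[tuple[str, str]] = []  # (task description, reflection)

    def add(self, task: str, reflection: str) -> None:
        self.entries.append((task, reflection))

    def retrieve(self, task: str, k: int = 2) -> list[str]:
        # String similarity stands in for embedding-based nearest-neighbour search.
        ranked = sorted(self.entries, reverse=True,
                        key=lambda e: difflib.SequenceMatcher(None, e[0], task).ratio())
        return [reflection for _, reflection in ranked[:k]]

def debate_value(llm: LLM, state: str) -> float:
    """Debate-style state evaluation: argue both sides, then ask for a verdict."""
    pro = llm(f"Argue why this agent state shows good progress:\n{state}")
    con = llm(f"Argue why this agent state shows poor progress:\n{state}")
    verdict = llm("Given both arguments, score the progress from 0 to 1. "
                  f"Reply with a number only.\nPRO: {pro}\nCON: {con}")
    return float(verdict)
```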
Evaluation and Benchmarks
R-MCTS was evaluated on two popular AI agent benchmarks:
- Visual Web Arena: Evaluates agents on browsing tasks (e.g., Reddit, shopping) [00:31:54].
- OSWorld: Consists of computer tasks like navigating file systems or using apps (VS Code, Excel) in a Linux environment [00:32:06].
R-MCTS consistently outperformed other search algorithms (breadth-first, depth-first, vanilla MCTS) and non-search methods [00:32:21]. It achieved top rankings in both the Visual Web Arena leaderboard and as the best non-trained method in OS World [00:33:08], demonstrating significant performance gains without additional human supervision or fine-tuning of original models [00:32:50].
Transferring Knowledge through Exploratory Learning
The knowledge gained through these search processes at test time can be transferred to the base LLM itself during training [00:33:32].
- Exploratory Learning: Unlike imitation learning (training directly on the best action found in the tree), exploratory learning treats the entire tree-search process as a single trajectory [00:33:55]. This teaches the model to linearize the search-tree traversal, motivating it to learn how to explore, backtrack, and evaluate its actions (see the sketch after this list) [00:34:03].
- This approach helps models improve their decision processes by learning to evaluate and backtrack independently [00:34:23].
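One way to picture exploratory learning is to flatten a small search tree into a single training trajectory that keeps the exploration and backtracking steps rather than only the final best action. The sketch below is a toy illustration of that idea; the tree format, node names, and scores are made up.

```python
def linearize_search(tree: dict, root: str) -> list[str]:
    """Flatten a search-tree traversal into one trajectory, keeping backtracking.

    `tree` maps a node to (score, children). The trajectory records explored
    nodes, their evaluations, and explicit backtracks, so a model trained on it
    sees how to explore and evaluate rather than only the final answer.
    """
    trajectory = []

    def visit(node: str) -> None:
        score, children = tree[node]
        trajectory.append(f"explore {node} (value {score:.2f})")
        for child in children:
            visit(child)
            trajectory.append(f"backtrack to {node}")

    visit(root)
    best_leaf = max((node for node, (_, children) in tree.items() if not children),
                    key=lambda node: tree[node][0])
    trajectory.append(f"commit to {best_leaf}")
    return trajectory

# Tiny example: a root with two candidate actions, one clearly better than the other.
demo_tree = {"root": (0.5, ["click_search", "open_menu"]),
             "click_search": (0.9, []),
             "open_menu": (0.2, [])}
print(linearize_search(demo_tree, "root"))
```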
Future Directions and Frameworks
Continuous improvement in AI agents, especially without reliance on human supervision, is a key area of research [00:35:47]. Future work focuses on:
- Reducing Reliance on Search Trees: Developing better methods to minimize the need for computationally expensive search trees [00:35:57].
- Model Predictive Control: Implementing methods to reduce the cost of environment setup and interaction [00:36:02].
- Improved Control and Autonomous Exploration: Enhancing the agent’s ability to control its actions and explore autonomously within its orchestration layer [00:36:29].
ArXlex Open-Source Agent Framework
The ArXlex open-source agent framework integrates these research advancements, offering features like continuous learning and task decomposition for developers [00:36:40].
Interdisciplinary Challenges
Designing AI agents for practical, real-world deployment requires an interdisciplinary approach, combining machine learning expertise with systems, human-computer interaction (HCI), and security expertise [00:37:11].
Challenges in creating personal AI agents include:
- Multi-Agent, Multi-User Scenarios: Current benchmarks often focus on a single agent performing a single task [00:37:27]. Real-world applications demand multiple agents interacting with multiple humans and tasks simultaneously [00:38:04].
- System-Level Problems: Issues like scheduling, database interaction to avoid side effects, and robust security measures (e.g., human handover points) become critical [00:37:48].
- Realistic Benchmarks: There is a need to establish more realistic benchmarks that consider not only task completion but also efficiency, security, and potential adversarial settings [00:38:25].
Addressing these challenges in AI agent evaluation will provide the foundation for future AI agent applications [00:38:41].