From: aidotengineer
AI agents are a significant area of progress in artificial intelligence, with figures like Bill Gates calling them the “biggest revolution in computing” [00:00:29] and Sam Altman predicting 2025 as the “year of agents” [00:00:39]. While some skepticism remains about their capabilities, particularly whether they can plan beyond being simple Large Language Model (LLM) wrappers [00:00:46], recent advances show promising developments in their self-improvement and reasoning abilities [00:00:58].
What Are AI Agents?
At their core, AI agents are systems that interact with an environment by perceiving it and taking actions [00:02:43]. The concept is not new, but it has been significantly amplified by the power of large language models [00:01:11].
The process of an AI agent typically involves four main steps (a minimal sketch follows the list):
- Perception: The agent understands its environment by sensing information from various modalities like text, image, audio, video, or touch [00:01:24].
- Reasoning: After perceiving information, the agent processes it to understand how to complete tasks, break them down into individual steps, and decide on appropriate tools or actions [00:01:37]. This inner planning process is often referred to as Chain of Thought reasoning, powered by LLMs [00:02:02].
- Reflection (Meta-reasoning): Agents can perform meta-reasoning steps, asking themselves if previous actions were correct and if they need to adjust their approach [00:02:10].
- Actions: The agent performs actions, which can range from talking to a human to moving physically or executing digital commands [00:02:25].
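The four steps above can be wired into a simple loop. The sketch below is illustrative only: it assumes a generic `llm(prompt) -> str` text-generation callable, and none of the function or variable names come from the talk.

```python
from typing import Callable

# Assumed interface: any text-generation backend exposed as prompt -> text.
LLM = Callable[[str], str]

def agent_step(llm: LLM, goal: str, observation: str, history: list[str]) -> str:
    """One perception -> reasoning -> reflection -> action cycle."""
    # Perception: fold the new observation into the agent's running context.
    history.append(f"Observation: {observation}")
    context = "\n".join(history)

    # Reasoning: ask the model to break the task down and propose the next action.
    plan = llm(f"Goal: {goal}\n{context}\nThink step by step and propose the next action.")

    # Reflection (meta-reasoning): check the proposed action before committing to it.
    critique = llm(f"Proposed action: {plan}\nIs this the right step toward the goal "
                   f"'{goal}'? Answer OK or suggest a revision.")
    if "OK" not in critique.upper():
        plan = llm(f"Revise the action using this feedback:\n{critique}\n{context}")

    # Action: hand the (possibly revised) action to the environment.
    history.append(f"Action: {plan}")
    return plan
```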
Levels of Autonomy in AI Agents
The deployment of AI agents can be understood through an analogy with levels of autonomy in self-driving cars [00:03:02]:
- Level 1 (Chatbot): Simple information retrieval, like a chatbot from 2017 [00:03:12].
- Level 2 (Agent Assist): LLMs generate suggested responses for human agents (e.g., customer service), requiring human approval before sending [00:03:20].
- Level 3 (Agent as a Service): LLMs automate AI workflows for specific tasks like meeting bookings or writing job descriptions [00:03:35].
- Level 4 (Autonomous Agents): An AI can delegate and perform multiple, interconnected tasks, sharing knowledge and resources [00:03:51].
- Level 5 (Jarvis-like Agents): Full trust in the agent, delegating all security measures (e.g., keys) to perform actions on behalf of the user [00:04:16].
While self-driving cars are high-risk agents, AI agents can be deployed for low-risk tasks (e.g., filing reimbursements with human supervision) and gradually move towards high-risk, customer-facing tasks as trust is built [00:05:08].
Improving Reasoning and Reflection in AI Agents
Self-Refinement with Large Language Models
For mathematical reasoning tasks, the reasoning abilities of LLMs can be improved in two main ways (illustrated after the list):
- Few-shot prompting: Providing examples of similar problems and their answers as context [00:06:40].
- Chain of Thought reasoning: Instructing the model to “think step by step” to reach an answer [00:07:01].
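To make these two baselines concrete, a prompt might be assembled as below. The helper function and the example problems are invented for illustration, not taken from the talk.

```python
def build_math_prompt(question: str, examples: list[tuple[str, str]]) -> str:
    """Combine few-shot examples with a Chain of Thought instruction."""
    shots = "\n\n".join(f"Q: {q}\nA: {a}" for q, a in examples)
    return (f"{shots}\n\n"
            f"Q: {question}\n"
            "A: Let's think step by step.")  # the Chain of Thought trigger phrase

prompt = build_math_prompt(
    "If 3 pens cost $4.50, how much do 7 pens cost?",
    examples=[("What is 12 * 3?", "12 * 3 = 36. The answer is 36.")],
)
```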
A more recent method combines these approaches by incorporating self-reflection or self-improvement [00:07:23]. The model generates an initial answer, then prompts itself to generate feedback on its own answer (e.g., “the part blah blah is incorrect”) [00:07:57]. This feedback, combined with the original question and answer, is used to prompt the model again to update its answer and internal processes [00:08:18]. This “self-refined” process can be iterated multiple times until the correct answer is reached [00:08:43].
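A minimal sketch of this self-refinement loop, again assuming a generic `llm(prompt) -> str` callable (the prompt wording and stopping criterion are illustrative):

```python
from typing import Callable

LLM = Callable[[str], str]  # assumed generic text-generation interface

def self_refine(llm: LLM, question: str, max_iters: int = 3) -> str:
    """Answer, self-critique, and revise until the model accepts its own answer."""
    answer = llm(f"Q: {question}\nA: Let's think step by step.")
    for _ in range(max_iters):
        feedback = llm(f"Question: {question}\nProposed answer: {answer}\n"
                       "Point out any incorrect step, or reply CORRECT.")
        if "CORRECT" in feedback.upper():
            break  # the model judges its own answer to be right
        answer = llm(f"Question: {question}\nPrevious answer: {answer}\n"
                     f"Feedback: {feedback}\nWrite a corrected answer.")
    return answer
```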
Challenges with Smaller LLMs
While effective with larger LLMs, this self-improvement process can be problematic with smaller models (e.g., a 7-billion-parameter Llama model) [00:08:58].
- Noise Propagation: Feedback generated by smaller models can contain “noise” that propagates to correction steps, leading to worse results [00:09:16].
- Incompatibility: Internal logics and demonstrations from larger LLMs may not be compatible with smaller models, making the feedback useless [00:09:56].
Proposed Solution: Distillation from Larger Models
To enable self-improvement in smaller models, a method called “Wiass” proposes:
- Using a smaller model to generate an initial answer and self-feedback [00:11:42].
- Employing a larger LLM or external tools (like Python scripts for math tasks) to edit the smaller model’s feedback, tailoring it to the smaller model’s internal logic [00:11:48].
- Using this corrected feedback to update the answer iteratively until the problem is solved [00:12:01].
This process generates “traces” of trial and error, which can be filtered and used to train existing smaller models to perform self-improvement with guidance [00:12:29]. On-policy supervised training (generating feedback in real-time as the model changes) showed significant improvement (up to 48% after three iterations) compared to simple supervised fine-tuning [00:13:52].
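A rough sketch of this guided loop, including the trace collection used for later fine-tuning. It assumes `small` and `editor` are generic text-generation callables and `verify` is an external checker such as a Python script for math answers; none of these names come from the talk.

```python
from typing import Callable

LLM = Callable[[str], str]        # assumed interfaces for the small and large models
Verifier = Callable[[str], bool]  # e.g. a script that checks the math result

def guided_self_improve(small: LLM, editor: LLM, verify: Verifier,
                        question: str, max_iters: int = 3):
    """Small model answers and critiques itself; a stronger editor fixes the critique.

    Returns the final answer plus the trial-and-error trace that can later be
    filtered and used to fine-tune the small model.
    """
    trace = []
    answer = small(f"Q: {question}\nA: Let's think step by step.")
    for _ in range(max_iters):
        if verify(answer):
            break
        raw_feedback = small(f"Critique this answer to '{question}':\n{answer}")
        # The larger model (or tool) rewrites the noisy critique so it is correct
        # and phrased in terms the small model can actually act on.
        edited = editor(f"Rewrite this critique so it is accurate and concrete:\n{raw_feedback}")
        answer = small(f"Question: {question}\nAnswer: {answer}\n"
                       f"Feedback: {edited}\nWrite a corrected answer.")
        trace.append({"answer": answer, "feedback": edited})
    return answer, trace
```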
Takeaway
Models’ reflection and self-learning abilities can be improved using synthetic data generation policies, reducing the reliance on explicit human supervision [00:15:01]. However, the quality of improvement is capped by the ceiling effect of the larger models or verifiers used for editing [00:15:28].
Eliciting Stronger Model Behavior Through Test-Time Scaling
Beyond pre-training, test-time scaling offers a way to elicit better performance from existing pre-trained LLMs by giving them more steps or computational budget during inference [00:17:29]. This includes methods like “think step by step” or applying reflection [00:17:41].
Sequential Decision-Making in Dialogue Tasks
Dialogue tasks, like a donation persuasion scenario, involve sequential decision-making where an agent must strategize and plan its conversational moves ahead of time [00:18:38]. This is analogous to games like chess, where players plan moves by simulating opponent responses [00:19:44].
Monte Carlo Tree Search (MCTS) for Dialogue
Tree search algorithms, common in games like AlphaGo, can be applied to conversational settings to improve decision-making [00:20:20].
- Zero-training MCTS: A model is prompted to act as a policy, propose actions, simulate outcomes, and evaluate action quality, updating each action’s quality over time [00:21:12].
- User Simulation: Another LLM can be used to simulate user (opponent) behavior, generating diverse responses to proposed strategies [00:22:08].
- Open-Loop MCTS: Unlike the closed-loop MCTS used in deterministic games, human responses introduce variance, necessitating an open-loop MCTS approach that stochastically re-samples simulated user responses rather than caching fixed child states (see the sketch after this list) [00:23:03].
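The open-loop search described above can be sketched roughly as follows. This is a simplified, single-level illustration rather than the actual GDP-Zero implementation: `policy`, `user_sim`, and `evaluator` are assumed LLM callables, and parsing the evaluator's reply into a number is glossed over.

```python
import math
from typing import Callable

LLM = Callable[[str], str]  # assumed generic text-generation interface

def open_loop_mcts(policy: LLM, user_sim: LLM, evaluator: LLM,
                   dialogue: str, strategies: list[str],
                   n_sims: int = 20, c: float = 1.4) -> str:
    """Choose the next persuasion strategy by repeatedly simulating user replies."""
    q = {s: 0.0 for s in strategies}  # accumulated value per strategy
    n = {s: 0 for s in strategies}    # visit counts
    total = 0

    def ucb(s: str) -> float:         # upper-confidence bound used for selection
        if n[s] == 0:
            return float("inf")
        return q[s] / n[s] + c * math.sqrt(math.log(total + 1) / n[s])

    for _ in range(n_sims):
        s = max(strategies, key=ucb)
        # Open loop: re-sample the simulated user's reply on every rollout instead
        # of caching a fixed child state, since real user responses vary.
        utterance = policy(f"Dialogue so far:\n{dialogue}\nUse strategy: {s}\nPersuader:")
        reply = user_sim(f"Dialogue so far:\n{dialogue}\nPersuader: {utterance}\nUser:")
        score = float(evaluator("On a scale from 0 to 1, how likely is this user to "
                                f"donate? Reply with a number only.\nUser: {reply}"))
        q[s] += score
        n[s] += 1
        total += 1

    return max(strategies, key=lambda s: q[s] / max(n[s], 1))
```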
In donation persuasion tasks, this MCTS approach (termed GDP-Zero) improved donation rates and led to more convincing, natural, and coherent conversations [00:24:49]. The models learned to self-discover effective strategies, such as avoiding early donation asks and diversifying persuasion tactics (emotional and logical appeals) [00:25:08].
Expanding to Broader AI Agent Tasks
The principles of MCTS and self-improvement can be extended to broader AI agent tasks involving tool use and manipulations, beyond just conversation [00:26:24]. This requires teaching LLMs to perceive the world visually.
Action-Based Visual Language Models
Traditional visual language models (VLMs) are often trained for Visual Question Answering (VQA) tasks [00:27:11]. However, AI agent tasks require VLMs to process computer screenshots and perform actions like clicking buttons to clear a shopping cart [00:27:25]. Standard GPT-4V, without specific planning or agentic training, performs poorly (16% success) on benchmarks like Visual Web Arena [00:28:18], highlighting a lack of agent-environment interaction data.
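One way to picture the required interface: the agent feeds a screenshot and the task to a VLM and expects a structured action back. The sketch below is purely illustrative and does not reflect any actual GPT-4V or benchmark API; the `VLM` signature and the JSON schema are assumptions.

```python
import json
from typing import Callable

# Assumed interface: a vision-language model that takes a screenshot plus a text
# prompt and returns text.
VLM = Callable[[bytes, str], str]

def next_browser_action(vlm: VLM, screenshot: bytes, task: str) -> dict:
    """Ask the VLM for the next UI action as structured JSON the agent can execute."""
    raw = vlm(screenshot, (
        f"Task: {task}\n"
        "Look at the screenshot and reply with JSON such as "
        '{"action": "click", "element": "Clear cart button"} or '
        '{"action": "stop", "reason": "task complete"}.'
    ))
    return json.loads(raw)  # downstream code turns this into a real click or keystroke
```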
R-MCTS: Reflective Monte Carlo Tree Search
A new algorithm, R-MCTS, is introduced to elicit better performance through test-time compute [00:28:48]. It extends MCTS with two key modifications:
- Contrastive Reflection with Memory: The system includes a memory module where agents learn from past interactions. After completing a task, contrastive reflection helps the model internalize successes or errors, saving this experience to a vector database [00:29:40]. For future tasks, relevant reflections are retrieved to enhance decision-making [00:30:05].
- Multi-Agent Debate for State Evaluation: Instead of single prompts for evaluating progress, a “debate” format is used. Models are asked to argue why an action is good or bad, leading to more robust and balanced evaluations and counteracting biases [00:30:38].
These modifications allow R-MCTS to build a search tree on the fly, improve decisions, and continuously learn from experience through reflection [00:31:07].
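The two modifications can be sketched in isolation as below. This is a toy illustration under stated assumptions: string similarity stands in for embedding-based retrieval from a vector database, and `llm` is an assumed text-generation callable.

```python
import difflib
from typing import Callable

LLM = Callable[[str], str]  # assumed generic text-generation interface

class ReflectionMemory:
    """Toy stand-in for the vector database of contrastive reflections."""
    def __init__(self) -> None:
        self.entries: list[tuple[str, str]] = []  # (task description, reflection)

    def add(self, task: str, reflection: str) -> None:
        self.entries.append((task, reflection))

    def retrieve(self, task: str, k: int = 2) -> list[str]:
        # String similarity stands in for embedding-based nearest-neighbour search.
        ranked = sorted(self.entries, reverse=True,
                        key=lambda e: difflib.SequenceMatcher(None, e[0], task).ratio())
        return [reflection for _, reflection in ranked[:k]]

def debate_value(llm: LLM, state: str) -> float:
    """Debate-style state evaluation: argue both sides, then ask for a verdict."""
    pro = llm(f"Argue why this agent state shows good progress:\n{state}")
    con = llm(f"Argue why this agent state shows poor progress:\n{state}")
    verdict = llm("Given both arguments, score the progress from 0 to 1. "
                  f"Reply with a number only.\nPRO: {pro}\nCON: {con}")
    return float(verdict)
```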
Evaluation and Benchmarks
R-MCTS was evaluated on two popular AI agent benchmarks:
- Visual Web Arena: Evaluates agents on browsing tasks (e.g., Reddit, shopping) [00:31:54].
- OSWorld: Consists of computer tasks like navigating file systems or using apps (VS Code, Excel) in a Linux environment [00:32:06].
R-MCTS consistently outperformed other search algorithms (breadth-first, depth-first, vanilla MCTS) and non-search methods [00:32:21]. It achieved top rankings in both the Visual Web Arena leaderboard and as the best non-trained method in OS World [00:33:08], demonstrating significant performance gains without additional human supervision or fine-tuning of original models [00:32:50].
Transferring Knowledge through Exploratory Learning
The knowledge gained through these search processes at test time can be transferred to the base LLM itself during training [00:33:32].
- Exploratory Learning: Unlike imitation learning (training directly on the best action found in the tree), exploratory learning treats the entire tree-search process as a single trajectory [00:33:55]. This teaches the model to linearize the search-tree traversal, motivating it to learn how to explore, backtrack, and evaluate its actions (see the sketch after this list) [00:34:03].
- This approach helps models improve their decision processes by learning to evaluate and backtrack independently [00:34:23].
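One way to picture exploratory learning is to flatten a small search tree into a single training trajectory that keeps the exploration and backtracking steps rather than only the final best action. The sketch below is a toy illustration of that idea; the tree format, node names, and scores are made up.

```python
def linearize_search(tree: dict, root: str) -> list[str]:
    """Flatten a search-tree traversal into one trajectory, keeping backtracking.

    `tree` maps a node to (score, children). The trajectory records explored
    nodes, their evaluations, and explicit backtracks, so a model trained on it
    sees how to explore and evaluate rather than only the final answer.
    """
    trajectory = []

    def visit(node: str) -> None:
        score, children = tree[node]
        trajectory.append(f"explore {node} (value {score:.2f})")
        for child in children:
            visit(child)
            trajectory.append(f"backtrack to {node}")

    visit(root)
    best_leaf = max((node for node, (_, children) in tree.items() if not children),
                    key=lambda node: tree[node][0])
    trajectory.append(f"commit to {best_leaf}")
    return trajectory

# Tiny example: a root with two candidate actions, one clearly better than the other.
demo_tree = {"root": (0.5, ["click_search", "open_menu"]),
             "click_search": (0.9, []),
             "open_menu": (0.2, [])}
print(linearize_search(demo_tree, "root"))
```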
Future Directions and Frameworks
Continuous improvement in AI agents, especially without reliance on human supervision, is a key area of research [00:35:47]. Future work focuses on:
- Reducing Reliance on Search Trees: Developing better methods to minimize the need for computationally expensive search trees [00:35:57].
- Model Predictive Control: Implementing methods to reduce the cost of environment setup and interaction [00:36:02].
- Improved Control and Autonomous Exploration: Enhancing the agent’s ability to control its actions and explore autonomously within its orchestration layer [00:36:29].
ArXlex Open-Source Agent Framework
The ArXlex open-source agent framework integrates these research advancements, offering features like continuous learning and task decomposition for developers [00:36:40].
Interdisciplinary Challenges
Designing AI agents for practical, real-world deployment requires an interdisciplinary approach, combining machine learning expertise with systems, human-computer interaction (HCI), and security expertise [00:37:11].
Challenges in creating personal AI agents include:
- Multi-Agent, Multi-User Scenarios: Current benchmarks often focus on a single agent performing a single task [00:37:27]. Real-world applications demand multiple agents interacting with multiple humans and tasks simultaneously [00:38:04].
- System-Level Problems: Issues like scheduling, database interaction to avoid side effects, and robust security measures (e.g., human handover points) become critical [00:37:48].
- Realistic Benchmarks: There is a need to establish more realistic benchmarks that consider not only task completion but also efficiency, security, and potential adversarial settings [00:38:25].
Addressing these challenges in AI agent evaluation will provide the foundation for future AI agent applications [00:38:41].