From: aidotengineer
The field of AI agents, which extends well beyond simple chatbots like ChatGPT, is poised for significant progress: figures such as Bill Gates and Andrew Ng have highlighted its potential to drive a computing revolution [00:00:26], and Sam Altman has predicted that 2025 will be "the year of agents" [00:00:39]. Despite skepticism about the planning capabilities of large language models (LLMs) and the practical limitations of early attempts like AutoGPT [00:00:43], the core concept of LLM-enabled AI agents is gaining traction.
What are AI Agents?
AI agents are not a new concept [00:01:08]. They are defined by a process of interacting with an environment through actions [00:02:41]. The process typically involves several steps (a minimal sketch of this loop follows the list):
- Perception: Agents must understand their environment by sensing information from various modalities like text, images, audio, video, and touch [00:01:20].
- Reasoning/Planning: After perceiving information, agents process it to understand how to complete tasks, break them down into individual steps, and decide on appropriate tools or actions [00:01:37]. This often involves “chain of thought” reasoning, powered by LLMs [00:02:02].
- Reflection (Meta-reasoning): Agents can perform meta-reasoning steps by reflecting on executed actions, asking if the right choice was made, and adjusting if necessary [00:02:10].
- Action: Any behavior that affects the environment, such as talking to humans or moving objects [00:02:25].
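Taken together, these steps form a simple loop. The sketch below is a minimal illustration of that loop, assuming a hypothetical `call_llm` helper and a toy `Environment` class; it is not code from the talk or any particular framework.

```python
# Minimal perceive -> reason/plan -> act -> reflect loop (illustrative only).

def call_llm(prompt: str) -> str:
    """Placeholder for a real LLM API call."""
    return "proposed action"

class Environment:
    """Toy environment the agent can observe and act on."""
    def __init__(self, steps: int = 3):
        self.steps_left = steps

    def observe(self) -> str:
        return f"{self.steps_left} steps remaining"

    def execute(self, action: str) -> str:
        self.steps_left -= 1
        return f"executed: {action}"

    def done(self) -> bool:
        return self.steps_left <= 0

def agent_loop(env: Environment, task: str, max_steps: int = 10) -> None:
    for _ in range(max_steps):
        observation = env.observe()                      # Perception
        action = call_llm(                               # Reasoning/Planning
            f"Task: {task}\nObservation: {observation}\n"
            "Think step by step and propose the next action."
        )
        result = env.execute(action)                     # Action
        call_llm(                                        # Reflection
            f"Action: {action}\nResult: {result}\nWas this the right choice?"
        )
        if env.done():
            break

agent_loop(Environment(), task="book a meeting")
```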
Levels of Autonomy
The deployment of agents can be understood through different levels of autonomy, similar to self-driving cars [00:03:02]:
- Level 1 (Chatbot): Simple information retrieval (e.g., 2017 chatbots) [00:03:12].
- Level 2 (Agent Assist): LLMs generate suggested responses for human approval (e.g., customer service) [00:03:20].
- Level 3 (Agent as a Service): LLMs automate AI workflows for specific services (e.g., meeting bookings, writing job descriptions) [00:03:35].
- Level 4 (Autonomous Agents): Agents can delegate and perform multiple, interconnected tasks, sharing knowledge and resources [00:03:51].
- Level 5 (Jarvis-like): Full trust in agents to perform all tasks, including security measures, on behalf of users [00:04:16].
While self-driving cars are high-risk agents where errors are critical [00:04:51], AI agents can be categorized by risk level. Low-risk tasks like filing reimbursements can start with human supervision and gradually build trust, while customer-facing tasks are typically higher risk [00:05:08].
Improving LLM Reasoning and Reflection
Key directions for adapting LLMs to agent tasks include enhancing their reasoning and reflection abilities, eliciting better behaviors from existing models, and learning from past examples to optimize the models further [00:05:41].
Self-Refinement via Reflection
For mathematical reasoning tasks, two common methods using LLMs are:
- Few-shot prompting: Providing examples of similar problems and their answers as context for the LLM [00:06:40].
- Chain of thought: Instructing the model to “think step by step,” allowing it to reason over tokens to reach a correct answer [00:07:01].
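As a concrete illustration, the two baselines differ only in how the prompt is built around the problem. The templates below are hypothetical (the worked example is invented for demonstration), but they capture the distinction before it is combined with reflection in the method described next.

```python
# Illustrative prompt templates for the two baselines above (not the talk's
# exact prompts; the worked example is made up).

def few_shot_prompt(problem: str) -> str:
    examples = (
        "Q: A pen costs $2 and a notebook costs $3. "
        "How much do 2 pens and 1 notebook cost?\nA: 7\n\n"
    )
    return examples + f"Q: {problem}\nA:"

def chain_of_thought_prompt(problem: str) -> str:
    return f"Q: {problem}\nLet's think step by step."
```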
A more recent method combines the two: the model is given a problem and asked to solve it step by step, and its initial answer is then fed back to the LLM with a prompt to generate feedback (reflection) on its correctness [00:07:23]. In this "self-refine" or "self-improvement" process, the feedback is combined with the original question to prompt the model again, allowing it to update its answer and its internal reasoning [00:08:12]. The loop can be repeated until a correct answer is reached [00:08:46].
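A minimal sketch of this loop is shown below, assuming a hypothetical `call_llm` helper and a simple "CORRECT" marker in the feedback; the real prompts and stopping criterion in the talk may differ.

```python
# Self-refinement loop (illustrative): solve, critique, revise, repeat.

def call_llm(prompt: str) -> str:
    return "CORRECT"  # placeholder for a real model call

def self_refine(problem: str, max_rounds: int = 3) -> str:
    answer = call_llm(f"{problem}\nSolve this step by step.")
    for _ in range(max_rounds):
        feedback = call_llm(
            f"Problem: {problem}\nProposed solution: {answer}\n"
            "Review the solution. Reply 'CORRECT' or explain the mistake."
        )
        if "CORRECT" in feedback:
            break  # reflection judges the answer correct
        answer = call_llm(
            f"Problem: {problem}\nPrevious attempt: {answer}\n"
            f"Feedback: {feedback}\nRevise the solution step by step."
        )
    return answer
```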
Challenges with Smaller LLMs
While effective for larger models (e.g., those above 7 or 13 billion parameters), this self-improvement process poses challenges for smaller LLMs such as LLaMA-7B [00:08:58]. The feedback generated by smaller models often contains noise, which propagates through the correction steps and degrades results ("the blind leading the blind") [00:09:16].
Furthermore, feeding the internal logic or demonstrations of larger models (verifiers) directly to smaller models can be incompatible with how the smaller model reasons, rendering the feedback useless [00:09:52]. The feedback needs to be "dumbed down" to match the smaller model's internal logic [00:10:37].
Wiass: Distilling Self-Improvement for Smaller Models
The “Wiass” method addresses these challenges by:
- Reformulating the problem so the model is retrained to take an attempted solution as input and then generate feedback and an updated answer [00:10:57].
- Using a larger LLM to edit the smaller model’s feedback, tailoring it to the smaller model [00:11:10]. External tools like Python scripts can also provide more accurate feedback for specific tasks [00:11:29].
In this approach, the smaller model generates an answer and self-reflection, which is then edited by a larger model. This corrected feedback is used as input for the smaller model to generate an updated answer [00:11:37]. This iterative process generates “traces” or “trajectories” of trial and error until the problem is solved correctly [00:12:01]. These filtered trajectories are then used to train the smaller models to perform self-improvement with the guidance of a larger model or other tools [00:12:26].
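A rough sketch of how such trial-and-error traces might be collected is given below; `small_llm`, `large_llm`, and `is_correct` are hypothetical stand-ins for the smaller model, the larger editor/verifier, and an external checker (e.g., a Python script for math), and this is not the released implementation.

```python
# Hypothetical trace collection: the small model attempts and reflects, the
# larger model rewrites the reflection at the small model's level, and only
# trajectories that end in a correct answer are kept as fine-tuning data.

def small_llm(prompt: str) -> str:
    return "attempt"          # placeholder for the smaller model (e.g., 7B)

def large_llm(prompt: str) -> str:
    return "edited feedback"  # placeholder for the larger editor/verifier

def is_correct(problem: str, answer: str) -> bool:
    return False              # placeholder external check (e.g., a script)

def collect_trace(problem: str, max_rounds: int = 3):
    trace = []
    answer = small_llm(f"{problem}\nSolve step by step.")
    for _ in range(max_rounds):
        if is_correct(problem, answer):
            return trace  # keep: this trajectory reaches a correct answer
        reflection = small_llm(
            f"Problem: {problem}\nAttempt: {answer}\nCritique this attempt."
        )
        edited = large_llm(
            "Rewrite this critique so a smaller model can act on it:\n" + reflection
        )
        trace.append({"attempt": answer, "feedback": edited})
        answer = small_llm(
            f"Problem: {problem}\nAttempt: {answer}\nFeedback: {edited}\nRevise."
        )
    return None  # discard: never reached a correct answer
```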
Results show that this "on-policy self-supervision" is effective. After three iterations, the method achieved 48% accuracy on mathematical problems, outperforming training on the existing supervised data [00:13:37]. This indicates that fine-tuning on such self-improvement data genuinely helps models learn to self-improve [00:13:56], with gains ranging from 7% to 18% across tasks [00:14:21].
“We can actually improve these models’ reflection or self-learning abilities without explicit human supervision data. We can use synthetic, on-policy data generation to help the models improve.” [00:15:01]
A limitation remains: the effectiveness of this approach is capped by the ability of the larger LLM (the verifier/editor) to provide correct feedback [00:15:18].
Eliciting Stronger Model Behavior via Test-Time Scaling
Traditional LLM training relies on compute, data size, and parameter size, following scaling laws [00:16:11]. However, pre-training is resource-intensive, making it difficult for smaller companies or academia [00:17:10].
A new direction, test-time scaling, focuses on eliciting better behaviors from existing pre-trained models by providing them with more steps or budgets during inference [00:17:27]. Techniques like “think step by step” or “do reflection” fall into this category, leading to improved results [00:17:41].
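One widely used way to spend extra inference budget, shown here purely as an illustration of test-time scaling rather than as the specific technique from the talk, is to sample several step-by-step solutions and take a majority vote over the final answers (`call_llm` is a hypothetical sampling helper):

```python
# Illustrative test-time scaling: sample-and-vote over chain-of-thought answers.

from collections import Counter
import random

def call_llm(prompt: str, temperature: float = 0.8) -> str:
    return random.choice(["42", "42", "41"])  # placeholder sampled completion

def majority_vote_answer(problem: str, n_samples: int = 8) -> str:
    # More samples = more test-time compute; the answer is the most common one.
    answers = [
        call_llm(f"{problem}\nThink step by step, then give only the final answer.")
        for _ in range(n_samples)
    ]
    return Counter(answers).most_common(1)[0][0]
```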
Tree Search for Sequential Decision Making
Complex sequential decision-making tasks, like chess or dialogue, require planning ahead [00:19:39]. Monte Carlo Tree Search (MCTS) is an algorithm that simulates moves and evaluates outcomes multiple times to find stronger actions [00:20:22].
For conversational settings with a clear goal, like a donation persuasion task [00:19:02], a “zero training” MCTS model can be designed by prompting an LLM (e.g., ChatGPT) to:
- Search potential actions: Act as the policy to propose promising next actions [00:21:26].
- Simulate action outcomes: Predict outcomes if a persuasion strategy is used [00:21:33].
- Evaluate action quality: Assess the quality of actions after interacting with the environment [00:21:43].
- Simulate opponent behavior: Use another LLM to simulate user responses, given conversational history [00:22:03].
Unlike closed-loop MCTS, open-loop MCTS is used for conversational tasks to account for the stochastic nature and variance of human responses [00:23:03].
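A compact sketch of how these prompted components can be wired into an open-loop search is given below. The prompts, helpers, and single-level search are hypothetical simplifications for illustration, not the actual implementation.

```python
# Open-loop MCTS over persuasion strategies (illustrative simplification).

import math
import random

def llm(prompt: str) -> str:
    return "..."  # placeholder for a ChatGPT-style API call

def propose_actions(history: list[str]) -> list[str]:
    # Policy: ask the LLM for promising next persuasion strategies.
    return ["emotional appeal", "logical appeal", "credibility appeal"]

def simulate_user(history: list[str], action: str) -> str:
    # Opponent model: another LLM call predicts the user's reply.
    return llm(f"History: {history}\nPersuader uses: {action}\nUser replies:")

def evaluate(history: list[str]) -> float:
    # Value: score how likely a donation is after this dialogue state.
    return random.random()  # placeholder score in [0, 1]

def uct(total: float, visits: int, parent_visits: int, c: float = 1.4) -> float:
    if visits == 0:
        return float("inf")
    return total / visits + c * math.sqrt(math.log(parent_visits) / visits)

def open_loop_mcts(history: list[str], n_sims: int = 20) -> str:
    actions = propose_actions(history)
    stats = {a: [0.0, 0] for a in actions}  # action -> [total value, visits]
    for i in range(1, n_sims + 1):
        action = max(actions, key=lambda a: uct(stats[a][0], stats[a][1], i))
        # Open loop: the stochastic user reply is re-simulated on every visit
        # instead of being cached as a fixed child state.
        reply = simulate_user(history, action)
        value = evaluate(history + [action, reply])
        stats[action][0] += value
        stats[action][1] += 1
    return max(actions, key=lambda a: stats[a][1])  # most-visited action
```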
“GDP-Zero,” the proposed method, has been shown to generate more competitive results without any explicit training [00:23:57]. Evaluations by LLMs (e.g., ChatGPT) and human studies (Mechanical Turk) confirm that models using this planning algorithm achieve higher donation rates and are perceived as more convincing, natural, and coherent [00:24:31]. Analysis also reveals that these models can self-discover task knowledge, such as avoiding early “big asks” and diversifying persuasion strategies (emotional vs. logical appeals) [00:25:08].
Transferring Policy and Exploratory Learning
The principles of MCTS can be extended to broader AI agent spaces that involve tool use and manipulations beyond just talking [00:26:21].
RMCTS: Reflective Monte Carlo Tree Search
To enable LLMs to perform a wide range of agent tasks, they first need to perceive the world, starting with visual language models (VLMs) that can process images and text [00:26:46]. Traditional VLMs trained on visual question answering (VQA) differ from the action-based VLMs needed for agent tasks, which must execute actions such as clicking buttons in a screenshot [00:27:11]. On benchmarks like Visual Web Arena, humans achieve 88% success on tasks such as “clear my shopping cart,” while a plain GPT-4V achieves only 16% [00:28:02], highlighting the gap in agent-environment interaction data and training [00:28:24].
RMCTS (Reflective Monte Carlo Tree Search) is an algorithm that augments VLMs with a search procedure to improve decision-making at test time [00:28:48]. It extends simple MCTS by incorporating:
- Contrastive Reflection: Allows agents to learn from past interactions and dynamically improve search efficiency, saving experiences to a vector database for future retrieval [00:29:11]. This creates a “memory module” for the system [00:29:39].
- Multi-agent Debate: Improves the robustness of state evaluation by having models debate the quality of actions (e.g., “why is this a good/bad action?”), counterbalancing biases from single prompts [00:29:21].
RMCTS builds a search tree dynamically and uses multi-agent debate for reliable state estimation [00:31:07]. After each task, it performs contrastive self-reflection to improve future execution [00:31:26].
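The sketch below illustrates (without claiming to reproduce the released RMCTS code) how these two additions might look: a reflection memory that stores and retrieves past experiences, and a two-sided debate prompt for more robust state evaluation. The embedding function, vector store, and `llm` helper are hypothetical placeholders.

```python
# Illustrative memory module and multi-agent-debate evaluator.

def llm(prompt: str) -> str:
    return "0.5"  # placeholder LLM call

def embed(text: str) -> list[float]:
    return [0.0]  # placeholder embedding; a real system would use a vector DB

class ReflectionMemory:
    """Stores contrastive reflections from past tasks for later retrieval."""
    def __init__(self) -> None:
        self.entries: list[tuple[list[float], str]] = []

    def add(self, task: str, reflection: str) -> None:
        self.entries.append((embed(task), reflection))

    def retrieve(self, task: str, k: int = 3) -> list[str]:
        # Placeholder nearest-neighbor lookup over stored embeddings.
        return [reflection for _, reflection in self.entries[:k]]

def debate_value(state: str) -> float:
    # Two prompts argue opposite sides; a third call scores the state,
    # counterbalancing the bias of a single evaluation prompt.
    pro = llm(f"State: {state}\nArgue why the last action was a good choice.")
    con = llm(f"State: {state}\nArgue why the last action was a bad choice.")
    verdict = llm(
        f"Arguments for: {pro}\nArguments against: {con}\n"
        "Give a single score from 0 (bad) to 1 (good)."
    )
    try:
        return float(verdict)
    except ValueError:
        return 0.5  # fall back to a neutral score if the reply is not numeric
```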
Evaluated on benchmarks like Visual Web Arena (browsing tasks) and OS World (computer desktop tasks in Linux) [00:31:43], RMCTS outperforms other search algorithms and non-search methods [00:32:18]. It leads the Visual Web Arena leaderboard and is the best non-trained method in OS World [00:33:05], demonstrating significant performance gains without additional human supervision [00:32:50].
Exploratory Learning
The knowledge obtained through these search processes can be transferred into the training of the base LLM [00:33:33]. Instead of imitation learning (directly training on the best action found), exploratory learning treats the tree search process as a single trajectory [00:33:48]. The model learns how to linearize search tree traversal to understand how to explore, backtrack, and evaluate [00:34:03]. This teaches the model to improve its decision processes autonomously [00:34:21].
Exploratory learning outperforms imitation learning, especially with constrained test-time budgets, by enabling the agent to learn from scenarios where an action didn’t lead to the desired state, prompting it to backtrack and try other actions [00:34:51].
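The difference can be made concrete with a small sketch: imitation learning keeps only the best action found by the search, while exploratory learning linearizes the entire traversal, including dead ends and backtracking, into one training trajectory. The node structure and formatting below are hypothetical.

```python
# Imitation target vs. linearized exploratory trajectory (illustrative).

from dataclasses import dataclass, field

@dataclass
class Node:
    action: str
    value: float
    children: list["Node"] = field(default_factory=list)

def imitation_target(root: Node) -> str:
    # Imitation learning: train only on the single best action at the root.
    return max(root.children, key=lambda c: c.value).action

def exploratory_trajectory(node: Node, trace: list[str] | None = None) -> list[str]:
    # Exploratory learning: record the whole traversal, so the model sees
    # how to try an action, evaluate it, backtrack, and try another.
    trace = [] if trace is None else trace
    for child in node.children:
        trace.append(f"try {child.action} -> value {child.value:.2f}")
        exploratory_trajectory(child, trace)
        trace.append(f"backtrack from {child.action}")
    return trace
```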
Future Directions and Frameworks
Reflective MCTS models and exploratory learning demonstrate that significant improvements in agent performance can be achieved through test-time compute scaling and retraining from refined or backtracked exploratory traces, without relying on human supervised information [00:35:27].
Future work includes developing better RL methods to reduce reliance on the search tree and model predictive control to lessen expensive environment setups [00:35:56]. Ongoing efforts focus on improving control abilities and autonomous exploration within agent orchestration layers [00:36:29].
The Arklex AI open-source agent framework offers features like continuous learning and task decomposition, providing developers flexibility [00:36:40]. Advancing AI agents requires a multidisciplinary approach, combining expertise from machine learning, systems, human-computer interaction (HCI), and security [00:37:06].
Current benchmarks typically focus on single agents performing single tasks [00:37:25]. Future challenges include:
- Multi-tasking: How one human can assign multiple tasks to a model on the same computer, leading to system-level problems like scheduling [00:37:40].
- Database interaction: Avoiding side effects and ensuring better security, including knowing when to trigger human handovers or supervision requests [00:37:51].
- Multi-user, Multi-agent planning: Complex scenarios involving multiple humans interacting with different agents, requiring realistic benchmarks that consider efficiency and security in addition to task completion [00:38:04].