From: aidotengineer

Introduction to AI Agents

AI agents are a significant area of progress in artificial intelligence. Prominent figures like Bill Gates view them as the “biggest revolution in computing,” while Andrew Ng calls them “massive AI progress.” Sam Altman of OpenAI expects 2025 to be “the year of agents” [00:00:24].

Despite this optimism, there are also critical voices, with some arguing that current AI agents are merely “simple wrappers of large language models” that “really can’t plan” [00:00:43]. Concerns are also raised about tools like AutoGPT not providing practical solutions [00:00:54].

What are AI Agents?

AI agents are not a new concept, but the emergence of large language models (LLMs) has significantly enhanced their capabilities [00:01:08]. An AI agent generally operates through a cyclical process:

  1. Perception: Agents sense information from their environment through various modalities like text, image, audio, video, and touch, much like humans understand the world [00:01:17].
  2. Reasoning: After perceiving information, agents process it to understand how to complete a task, break it down into individual steps, and identify appropriate tools or actions. This inner planning process is often referred to as “chain-of-thought” reasoning, and it is powered by large language models [00:01:37].
  3. Reflection: Agents can perform meta-reasoning steps, asking themselves if their actions were correct and if adjustments are needed. This self-correction process is known as “reflection” or “self-improvement” [00:02:10].
  4. Actions: The final step is performing actions, which can range from talking to humans and moving physically to interacting with a digital environment [00:02:23]. Agents act on their environment through these actuations [00:02:41].
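This perception-reasoning-reflection-action cycle can be summarized as a short control loop. The sketch below is a minimal illustration only; `llm`, `environment.perceive`, and `environment.execute` are hypothetical placeholders rather than part of any framework mentioned here.

```python
# Minimal sketch of the perceive-reason-reflect-act cycle.
# `llm`, `environment.perceive`, and `environment.execute` are hypothetical placeholders.

def run_agent(task, llm, environment, max_steps=10):
    history = []
    for _ in range(max_steps):
        # 1. Perception: gather observations (text, images, audio, ...).
        observation = environment.perceive()

        # 2. Reasoning: plan the next step with chain-of-thought prompting.
        plan = llm(
            f"Task: {task}\nObservation: {observation}\nHistory: {history}\n"
            "Think step by step, then propose one action."
        )

        # 3. Reflection: ask the model to critique its own plan before acting.
        critique = llm(
            f"Proposed action: {plan}\nIs this a correct step for the task? "
            "Answer 'ok' or suggest a fix."
        )
        if critique.strip().lower() != "ok":
            plan = llm(f"Revise the action.\nOriginal: {plan}\nFeedback: {critique}")

        # 4. Action: actuate in the environment (speak, click, move, call an API).
        result, done = environment.execute(plan)
        history.append((plan, result))
        if done:
            break
    return history
```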

Levels of Autonomy

The deployment of AI agents can be understood through an analogy with the levels of autonomy in self-driving cars [00:03:02]:

  • Level 1: Chatbot (2017): The basic level, where an agent can only retrieve information [00:03:12].
  • Level 2: Agent Assist: An LLM generates suggested responses for tasks, but a human must approve the final message, such as in customer service [00:03:20].
  • Level 3: Agent as a Service: Large language models automate AI workflows and are used as a service, for example, for booking meetings or writing job descriptions [00:03:35].
  • Level 4: Autonomous Agents: An AI can delegate and perform multiple tasks simultaneously, sharing components, knowledge, and resources among them [00:03:51].
  • Level 5: J.A.R.V.I.S. (Iron Man-like): Full trust is placed in the agent; the user delegates even security measures to it and allows it to perform tasks entirely on their behalf [00:04:16].

Tasks for AI agents can be categorized by risk. Low-risk tasks, such as filing reimbursements in a back office, allow for human supervision and can be automated over time as trust is built. High-risk, customer-facing tasks require more caution, with a progression expected from back-office to front-office deployments over time [00:05:06].

Improving Large Language Models for AI Agent Tasks

To advance AI agents, several key areas of improvement for large language models are being explored [00:05:41]:

  1. Enhancing Reasoning and Reflection: Focus on making LLMs better at internal thought processes and self-correction [00:05:47].
  2. Eliciting Better Behaviors: Optimizing existing LLMs to perform better on AI agent tasks without extensive retraining [00:05:52].
  3. Learning from Traces: Utilizing generated examples and trajectories to further refine LLMs for specific agent tasks [00:06:00].

Self-Improvement and Reflection

One method involves using a process of “reflection” or “self-improvement” [00:08:12]. In a mathematical reasoning task, an LLM can generate an initial answer, then generate feedback on its own answer (reflection), and finally use this feedback to refine the answer. This iterative process can be repeated until a correct answer is reached [00:07:41].
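As a rough illustration, and assuming a generic text-in/text-out model callable `llm` (the prompts below are simplified stand-ins, not the exact ones used in the work), the generate-reflect-refine loop might look like this:

```python
# Illustrative self-improvement loop: generate, reflect, refine.
# `llm` is a hypothetical text-completion callable; prompts are simplified.

def self_refine(problem, llm, max_rounds=3):
    answer = llm(f"Solve step by step:\n{problem}")
    for _ in range(max_rounds):
        # Reflection: the model critiques its own answer.
        feedback = llm(
            f"Problem: {problem}\nProposed solution: {answer}\n"
            "Point out any mistakes, or reply 'correct'."
        )
        if "correct" in feedback.lower():
            break
        # Refinement: the feedback is folded back into a revised answer.
        answer = llm(
            f"Problem: {problem}\nPrevious solution: {answer}\n"
            f"Feedback: {feedback}\nWrite a corrected solution."
        )
    return answer
```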

However, smaller LLMs (e.g., a 7B-parameter Llama) can struggle with this, generating “noise” in their feedback that propagates into the corrections and leads to worse results [00:08:58]. Feedback from larger models can also follow internal logic that is incompatible with the smaller model, rendering it unusable [00:09:52].

To address this, a proposed method called “Wiass” involves:

  • Using a smaller model to generate an initial answer and self-feedback [00:11:40].
  • Employing a larger LLM or external tools (like Python scripts for math tasks) to edit the smaller model’s feedback, making it tailored and more accurate [00:11:18].
  • Using this corrected feedback to update the answer, iterating until the problem is solved [00:11:55].
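A minimal sketch of this feedback-editing loop is given below, assuming `small_llm` and `large_llm` are simple text-in/text-out callables; in practice the editor could also be an external tool, such as a Python checker for math problems. The function and variable names are hypothetical.

```python
# Sketch of the feedback-editing loop described above.
# `small_llm` and `large_llm` are hypothetical callables; the editor could
# instead be an external tool (e.g., a Python script that checks the math).

def guided_self_improve(problem, small_llm, large_llm, max_rounds=3):
    trajectory = []
    answer = small_llm(f"Solve step by step:\n{problem}")
    for _ in range(max_rounds):
        raw_feedback = small_llm(f"Critique this solution:\n{answer}")
        # The larger model edits the (often noisy) feedback so that it is
        # tailored to the smaller model's actual mistake.
        edited_feedback = large_llm(
            f"Problem: {problem}\nSolution: {answer}\n"
            f"Draft critique: {raw_feedback}\n"
            "Rewrite the critique so it is specific and accurate."
        )
        trajectory.append((answer, raw_feedback, edited_feedback))
        if "no error" in edited_feedback.lower():
            break
        answer = small_llm(
            f"Problem: {problem}\nSolution: {answer}\n"
            f"Feedback: {edited_feedback}\nRevise the solution."
        )
    # Successful trajectories can later be filtered and used to fine-tune the
    # smaller model (the on-policy self-supervision described below).
    return answer, trajectory
```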

By collecting and filtering these trial-and-error trajectories, smaller models can be trained to perform self-improvement with guidance from larger models or tools [00:12:26]. This approach, known as “on-policy self-supervision,” has shown significant improvements in mathematical reasoning tasks [00:13:50].

Eliciting Stronger Model Behavior

Beyond pre-training, test-time scaling allows LLMs to achieve better results by providing more steps or computational budget during inference [00:17:27]. This includes methods like “thinking step-by-step” or performing “reflection” [00:17:39].
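In the simplest case, test-time scaling amounts to calling the same model with a larger inference budget, for example by allowing more reflection rounds in the self-refinement loop sketched earlier (again, `problem` and `llm` are hypothetical placeholders):

```python
# Test-time scaling sketch: the same model, given a larger inference budget
# (here, more reflection rounds), can often reach a better answer.
# Reuses the hypothetical `self_refine` loop sketched earlier.

quick_answer  = self_refine(problem, llm, max_rounds=0)  # single pass, no reflection
scaled_answer = self_refine(problem, llm, max_rounds=8)  # more compute at inference time
```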

Tree Search for Dialog Tasks

For sequential decision-making tasks such as dialogue, principles from game-playing AI (e.g., AlphaGo) can be applied through “tree search” [00:19:44].

In a conversational setting, a “zero-training” Monte Carlo Tree Search (MCTS) model can be built by prompting an LLM (e.g., ChatGPT) to act as the policy, to simulate action outcomes, and to evaluate action quality [00:21:15]. To account for the variability of human responses, an “open-loop MCTS” is used, in which the model stochastically samples simulated conversational strategies [00:23:10].
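Very roughly, prompted LLM calls can fill the standard MCTS roles of policy prior, world model, and value function. The sketch below illustrates one expansion step under that assumption; the prompts, strategy list, and helper names are hypothetical, and the full open-loop bookkeeping (visit counts, UCT selection) is omitted.

```python
# Rough sketch of a "zero-training" MCTS expansion step for dialogue planning.
# `llm` is a hypothetical chat-completion callable; prompts are simplified.

CANDIDATE_STRATEGIES = ["emotional appeal", "logical appeal", "credibility appeal", "small ask"]

def propose_actions(dialogue, llm, k=2):
    # LLM as policy prior: ask which strategies look promising for the next turn.
    suggestion = llm(
        f"Dialogue so far:\n{dialogue}\n"
        f"Which of these strategies best fit the next turn? {CANDIDATE_STRATEGIES}"
    )
    picked = [s for s in CANDIDATE_STRATEGIES if s in suggestion.lower()]
    return picked[:k] or CANDIDATE_STRATEGIES[:k]

def simulate_reply(dialogue, action, llm):
    # LLM as world model: stochastically sample a plausible user reply; sampling
    # varied replies is what makes the search "open-loop" over human behavior.
    return llm(
        f"Dialogue:\n{dialogue}\nThe agent uses the strategy '{action}'. "
        "Sample one plausible user reply."
    )

def evaluate_state(dialogue, llm):
    # LLM as value function: score how likely the conversation is to succeed.
    score = llm(
        f"Dialogue:\n{dialogue}\n"
        "On a scale of 0 to 1, how likely is the persuasion goal (e.g., a donation) "
        "to succeed? Answer with a single number."
    )
    try:
        return float(score.strip())
    except ValueError:
        return 0.5

def one_expansion_step(dialogue, llm):
    # Expand each candidate action once and keep the highest-value strategy.
    best_action, best_value = None, -1.0
    for action in propose_actions(dialogue, llm):
        reply = simulate_reply(dialogue, action, llm)
        value = evaluate_state(dialogue + "\n" + reply, llm)
        if value > best_value:
            best_action, best_value = action, value
    return best_action
```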

This approach, exemplified by “GDP-Zero,” has been shown to generate more persuasive, natural, and coherent dialogues without human training data, leading to better donation conversion rates in persuasion tasks [00:24:02]. It allows models to self-discover task knowledge, such as delaying the “big ask” and diversifying persuasion strategies [00:25:08].

Reflective Monte Carlo Tree Search (RMCTS) for Diverse Agent Tasks

To extend these capabilities beyond dialogue to broader AI agent tasks, such as tool use and manipulation, a method called RMCTS (Reflective Monte Carlo Tree Search) was introduced [00:28:48]. This search algorithm explores vast action spaces and improves decision-making by incrementally building a search tree [00:28:54].

Key enhancements of RMCTS include:

  • Contrastive Reflection: Agents learn from past interactions (successes and errors) and dynamically improve search efficiency. This experience is saved to a vector database, caching learned knowledge for future tasks [00:29:11].
  • Multi-Agent Debate for State Evaluation: Instead of a single prompt, two LLMs debate the quality of an action (why it’s good or bad) to provide a more robust and holistic evaluation, counterbalancing biases [00:29:21].
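The two enhancements might be sketched as follows; `llm`, `embed`, and `vector_db` are hypothetical placeholders, not the actual RMCTS implementation.

```python
# Illustrative sketches of the two RMCTS enhancements described above.
# `llm`, `embed`, and `vector_db` are hypothetical placeholders.

def contrastive_reflection(task, trajectory, outcome, llm, embed, vector_db):
    # After a task, distill what worked versus what failed into a reusable
    # lesson and cache it in a vector database keyed by the task description.
    lesson = llm(
        f"Task: {task}\nTrajectory: {trajectory}\nOutcome: {outcome}\n"
        "Contrast what succeeded with what failed and state one lesson."
    )
    vector_db.add(key=embed(task), value=lesson)

def retrieve_lessons(task, embed, vector_db, k=3):
    # Before a new search, retrieve lessons from similar past tasks.
    return vector_db.search(embed(task), top_k=k)

def debate_value(state, llm):
    # Multi-agent debate for state evaluation: one prompt argues the action
    # was good, another argues it was bad, and a judge scores the state.
    pro = llm(f"State: {state}\nArgue why the last action moves the task forward.")
    con = llm(f"State: {state}\nArgue why the last action is a mistake.")
    verdict = llm(
        f"Arguments for: {pro}\nArguments against: {con}\n"
        "Give a value between 0 and 1 for this state."
    )
    try:
        return float(verdict.strip())
    except ValueError:
        return 0.5
```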

RMCTS has been evaluated on benchmarks like VisualWebArena (browser-based tasks such as shopping or Reddit) and OSWorld (computer desktop tasks such as navigating file systems or using VS Code) [00:31:43]. It significantly outperforms other search algorithms and non-search methods, demonstrating that augmenting visual large language models with search algorithms can improve performance without additional human supervision [00:32:21].

Exploratory Learning

The knowledge obtained through these search processes can be transferred into the training process of the base large language model through “exploratory learning” [00:33:31]. Unlike imitation learning (direct transfer of best actions), exploratory learning treats the tree search process as a single trajectory. The model learns how to linearize the search tree traversal, motivating it to explore, backtrack, and evaluate its actions [00:33:55]. This approach teaches the model to improve its decision processes autonomously [00:34:16].

For example, given a task like “find the most recent coffee maker with a touchscreen, then comment ‘great item’,” the agent learns to perform an action, realize it did not lead to the desired state, then backtrack and try other actions [00:34:56].
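One way to picture the difference: imitation learning keeps only the best path, whereas exploratory learning keeps the whole traversal, including dead ends and backtracking, as a single training sequence. The linearization below is a simplified, hypothetical illustration.

```python
# Simplified sketch: linearizing a search-tree traversal into one training
# trajectory so the model sees exploration, evaluation, and backtracking.
# The node structure and tags are hypothetical.

def linearize_traversal(visits):
    """`visits` is the ordered list of (action, value, led_to_goal) steps the search made."""
    tokens = []
    for action, value, led_to_goal in visits:
        tokens.append(f"<try> {action}")
        tokens.append(f"<evaluate> {value:.2f}")
        if not led_to_goal:
            tokens.append("<backtrack>")
    tokens.append("<answer reached>")
    return "\n".join(tokens)

# Imitation learning, by contrast, would train only on the final best path:
#   [best_action_1, best_action_2, ...] -> answer
```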

Future Directions

Current research focuses on:

  • Improving RL methods to reduce reliance on the search tree [00:35:57].
  • Developing model predictive control methods to reduce the cost of environment setup and interaction [00:36:02].
  • Enhancing control abilities and autonomous exploration for agent orchestration layers [00:36:29].

The open-source ArCollex AI agent framework is being developed to integrate these research advancements, offering features like continuous learning and task decomposition [00:36:40]. Advancing AI agents requires a multidisciplinary approach, combining machine learning expertise with expertise in systems, HCI, and security [00:37:06].

Future challenges include:

  • Multi-task, Single-Agent Scenarios: How one agent can perform multiple tasks on the same computer efficiently, addressing scheduling, database interaction, and human handover [00:37:40].
  • Multi-User, Multi-Agent Planning: The complexity of multiple humans interacting with different agents, assigning tasks, and managing potential adversarial settings [00:38:04].

Establishing more realistic benchmarks with system integrations and algorithms that consider not just task completion but also efficiency and security will be crucial for future applications [00:38:25].