From: aidotengineer

Introduction to AI Agents

The concept of AI agents has gained significant traction, with figures like Bill Gates hailing it as a major revolution in computing and Andrew Ng describing it as massive AI progress [00:00:26]. Sam Altman of OpenAI even projected 2025 as the “year of agents” [00:00:36]. While some critics argue that these are merely “thin wrappers” around Large Language Models (LLMs) with limited planning capabilities, the core idea of AI agents is not new; it has simply been greatly empowered by LLMs [00:00:43].

Components of AI Agents

AI agents, like humans, operate through several key steps:

  • Perception: Understanding the environment by sensing information such as text, image, audio, video, or touch [00:01:20].
  • Reasoning: Processing information to understand tasks, break them into individual steps, and identify appropriate tools or actions. This often involves an “inner plan” or “chain of thought” process, powered by LLMs [00:01:37].
  • Reflection (Meta-reasoning): A step where the agent evaluates its executed actions and decides whether to adjust or go back if a choice was incorrect [00:02:10].
  • Actions: Anything the agent performs, such as talking to humans, moving between points, or using tools [00:02:23].

Generally, agents interact with their environment by actuating these actions [00:02:41]. A minimal loop combining the four steps is sketched below.
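To make these steps concrete, the following sketch wires perception, reasoning, reflection, and action into a single loop. It is an illustration only: llm stands in for any language-model API call, and env is an assumed environment object with hypothetical observe/act methods; none of these names come from the talk.

    from dataclasses import dataclass

    @dataclass
    class Step:
        thought: str
        action: str
        observation: str

    def llm(prompt: str) -> str:
        raise NotImplementedError  # placeholder for any LLM API call

    def run_agent(env, task: str, max_steps: int = 10) -> list[Step]:
        history: list[Step] = []
        observation = env.observe()              # perception: sense the environment
        for _ in range(max_steps):
            thought = llm(                       # reasoning: plan the next step
                f"Task: {task}\nObservation: {observation}\nPlan the next step:"
            )
            action = llm(f"Thought: {thought}\nName one concrete action to take:")
            observation = env.act(action)        # action: actuate in the environment
            reflection = llm(                    # reflection: evaluate the executed action
                f"Action: {action}\nResult: {observation}\nWas this correct, and is the task done?"
            )
            history.append(Step(thought, action, observation))
            if "done" in reflection.lower():     # naive stopping check
                break
        return history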

Levels of Autonomy

The deployment of AI agents can be understood through an analogy to the levels of autonomy in self-driving cars [00:03:02]:

  • Level 1 (Chatbot): Basic information retrieval, akin to early chatbots from 2017 [00:03:12].
  • Level 2 (Agent Assist): LLMs generate suggested responses, but a human must still approve them, as in customer-service settings [00:03:20].
  • Level 3 (Agent as a Service): LLMs automate specific workflows, acting as a service (e.g., meeting bookings, writing job descriptions) [00:03:35].
  • Level 4 (Autonomous Agents): The AI agent can delegate and perform multiple, inter-connected tasks, sharing knowledge and resources [00:03:51].
  • Level 5 (Jarvis/Iron Man): Full trust in the agent, handing over even security credentials such as keys so the agent can act entirely on the user’s behalf [00:04:16].

While self-driving cars represent a high-risk application of agents (due to potential for loss of life), AI agents can be separated into low-risk and high-risk tasks [00:05:06]. Low-risk tasks, such as back-office operations like filing reimbursements, can benefit from human supervision initially and gain trust over time for full automation [00:05:12]. Customer-facing tasks are generally considered higher risk [00:05:28].

Improving LLM Reasoning and Reflection for AI Agents

To enhance AI agents, research focuses on improving LLMs’ reasoning and reflection capabilities [00:05:41].

Self-Refinement through Prompting

Mathematical reasoning tasks often employ two main prompting methods [00:06:26]:

  1. Few-shot prompting: Providing examples of similar problems and their answers as context to the LLM [00:06:40].
  2. Chain of Thought (CoT): Instructing the model to “think step by step,” allowing it to reason over tokens to reach a correct answer [00:07:01].
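As an illustration, the two prompting styles above might look like the templates below. The example problems and exact wording are assumptions for demonstration, not taken from the talk.

    # Few-shot prompting: similar solved problems are provided as context.
    few_shot_prompt = (
        "Q: A pen costs $2. How much do 3 pens cost?\nA: $6\n"
        "Q: A book costs $5 and a pen costs $2. What is the total?\nA: $7\n"
        "Q: {question}\nA:"
    )

    # Chain of Thought: the model is told to reason step by step.
    chain_of_thought_prompt = "Q: {question}\nA: Let's think step by step."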

More recent methods combine these, providing the LLM with a question and its initial answer, then prompting it to generate feedback (reflection) on its own answer [00:07:23]. This feedback, combined with the original question and answer, is used to re-prompt the model, allowing it to update its answer and internal processes. This iterative process is called self-refinement or self-improvement [00:08:12].
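A minimal sketch of this self-refinement loop is shown below, assuming a generic llm helper for whichever model API is in use; the prompts and the naive stopping check are illustrative assumptions.

    def llm(prompt: str) -> str:
        raise NotImplementedError  # placeholder for any LLM API call

    def self_refine(question: str, max_rounds: int = 3) -> str:
        # Initial answer, elicited with a chain-of-thought instruction.
        answer = llm(f"Q: {question}\nA: Let's think step by step.")
        for _ in range(max_rounds):
            # Reflection: the model critiques its own answer.
            feedback = llm(
                f"Review this answer and point out any errors.\n"
                f"Question: {question}\nAnswer: {answer}\nFeedback:"
            )
            if "no error" in feedback.lower():   # naive stopping check
                break
            # Re-prompt with the question, previous answer, and feedback.
            answer = llm(
                f"Revise the answer using the feedback.\n"
                f"Question: {question}\nPrevious answer: {answer}\n"
                f"Feedback: {feedback}\nRevised answer:"
            )
        return answer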

Challenges with Smaller LLMs

While effective, self-improvement can run into problems with smaller LLMs (e.g., a 7B-parameter Llama). Their generated feedback may contain “noise” that propagates through the correction steps, leading to worse results (“the blind leading the blind”) [00:08:58]. Additionally, the internal logic or “demonstrations” of larger LLMs (e.g., GPT-4) may be incompatible with smaller models, rendering such feedback useless to them [00:09:52].

Distillation for Smaller Models

To address these challenges and enable smaller LLMs to self-improve, a method involves:

  • Reformulating the problem: Retraining the model so that it takes its own attempt as input and then generates feedback and an updated answer [00:10:57].
  • Using larger LLMs as editors/verifiers: Instead of having smaller models generate their own feedback, a larger LLM or external tools (like Python scripts for math tasks) can edit the smaller model’s feedback, tailoring it to the smaller model’s internal logic [00:11:10]. This corrected feedback guides the smaller model’s updated answer [00:11:55].
  • Iterative Correction: This correction process can be iterated until the problem is solved correctly, especially for mathematical problems with a known ground truth [00:12:04] (see the sketch after this list).
  • On-policy training: Generating these feedback traces and filtering successful trajectories to fine-tune the smaller models. This on-policy self-supervision proves more useful than simply fine-tuning on gold answers [00:12:50].
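The sketch below shows one way this loop could fit together: a small model answers, an external verifier checks the attempt against ground truth, a larger editor model rewrites the critique, and only successful trajectories are kept as fine-tuning data. The function names (small_llm, editor_llm, verify) and prompts are hypothetical placeholders, not the actual implementation.

    def small_llm(prompt: str) -> str: ...   # placeholder: the smaller model being improved
    def editor_llm(prompt: str) -> str: ...  # placeholder: the larger editor/verifier model

    def verify(answer: str, gold: str) -> bool:
        # For math tasks this could instead execute a Python check of the result.
        return answer.strip() == gold.strip()

    def collect_trace(question: str, gold: str, max_iters: int = 3):
        answer = small_llm(question)
        trace = [answer]
        for _ in range(max_iters):
            if verify(answer, gold):
                return trace                 # successful trajectory: keep for fine-tuning
            raw = small_llm(f"Critique this answer.\nQ: {question}\nA: {answer}")
            feedback = editor_llm(           # larger model tailors the critique to the small model
                f"Edit this critique so it is correct and easy for a small model to follow:\n{raw}"
            )
            answer = small_llm(
                f"Q: {question}\nPrevious answer: {answer}\n"
                f"Feedback: {feedback}\nRevised answer:"
            )
            trace += [feedback, answer]
        return None                          # failed trajectories are filtered out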

This approach demonstrates that models’ reflection and self-learning abilities can be improved with on-policy synthetic data, reducing the reliance on explicit human supervision [00:15:01].

Eliciting Stronger Model Behavior through Test-Time Scaling

Beyond pre-training, which is resource-intensive, a new direction called test-time scaling allows eliciting better behaviors from existing pre-trained LLMs [00:17:23]. This involves providing the model with more steps or computational budget during inference, such as instructing it to “think step by step” or perform reflection [00:17:33].

Tree Search for Sequential Decision-Making

Sequential decision-making tasks, such as dialogue or persuasion, benefit significantly from look-ahead planning [00:18:38]. The analogy to chess is strong: thinking multiple moves ahead and simulating opponent responses to determine the best strategy [00:19:44].

Algorithms like Monte Carlo Tree Search (MCTS), popularized by AlphaGo, apply this principle [00:20:20]. The MCTS process involves:

  1. Proposing a move/action: Selecting a promising action [00:20:27].
  2. Simulating outcomes: Projecting the changes and values after taking the action [00:20:30].
  3. Evaluating action quality: Assessing the outcomes of the action after interacting with the environment [00:20:36].
  4. Updating quality: Iteratively refining the action quality over time [00:21:55].

In conversational settings, MCTS can be applied without task-specific training data (a zero-training approach) by prompting LLMs to act as the different components of the search [00:21:12]: an LLM can be prompted to propose actions (the policy), to simulate action outcomes, and to evaluate action quality [00:21:26]. For self-play, another LLM can simulate the opponent’s behavior [00:22:03].
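The skeleton below compresses this idea: prompted LLMs play the roles of policy (proposing utterances), simulator (playing the other speaker), and evaluator (scoring states). The prompts, the uniform action selection, and the donation-persuasion scoring criterion are illustrative assumptions; a full implementation would track visit counts and use a UCT-style selection rule.

    import random

    def llm(prompt: str) -> str:
        raise NotImplementedError  # placeholder for any LLM API call

    def propose_actions(state: str, k: int = 3) -> list[str]:
        # Policy: the LLM proposes candidate next utterances.
        return [llm(f"Dialogue so far:\n{state}\nPropose the next persuader utterance:") for _ in range(k)]

    def simulate(state: str, action: str) -> str:
        # Self-play: another LLM call acts as the user / opponent.
        reply = llm(f"Dialogue so far:\n{state}\nAct as the user and respond to: {action}")
        return f"{state}\nAgent: {action}\nUser: {reply}"

    def evaluate(state: str) -> float:
        # Evaluator: the LLM scores how promising the resulting state is.
        score = llm(f"On a scale of 0 to 1, how likely is this conversation to end in a donation?\n{state}")
        try:
            return float(score)
        except ValueError:
            return 0.0

    def choose_action(state: str, rollouts: int = 8) -> str:
        actions = propose_actions(state)
        values = {a: [] for a in actions}
        for _ in range(rollouts):
            action = random.choice(actions)              # selection (uniform here for brevity)
            next_state = simulate(state, action)         # simulation / expansion
            values[action].append(evaluate(next_state))  # evaluation
        # Backup: return the action with the best average simulated value.
        return max(actions, key=lambda a: sum(values[a]) / len(values[a]) if values[a] else 0.0)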

For tasks with human responses, which introduce variance, an Open-Loop MCTS is used. This stochastically samples different possible simulated paths, allowing the agent to anticipate and strategize against varied human reactions [00:23:08]. Evaluations for conversational tasks like donation persuasion have shown that models utilizing MCTS produce more convincing, natural, and coherent interactions, leading to better outcomes [00:24:51]. These models can even self-discover effective strategies, such as avoiding early donation asks and diversifying persuasion tactics [00:25:08].

Transferring Policy and Improving Base Models for Broader AI Agent Tasks

While Monte Carlo Tree Search is effective for specific tasks like dialogue, the goal is to extend it to larger AI agent spaces that involve tool use and manipulation [00:26:24].

Adapting Visual Language Models for Action-Based Tasks

Traditional vision-language models (VLMs) are often trained for Visual Question Answering (VQA), e.g., “What is he doing?” [00:27:08]. AI agent tasks, however, require action-based understanding, such as interpreting a computer screenshot and carrying out an instruction like “clear my shopping cart” by clicking the right buttons [00:27:22]. There is a significant gap between VQA and agentic tasks in terms of environment-interaction data and training [00:28:23].
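The contrast can be made concrete with two toy records: a VQA example maps an image and question to a textual answer, while an agentic example maps a screenshot and instruction to a sequence of executable actions. The action schema here is a made-up illustration, not the format used by any particular benchmark.

    # VQA-style supervision: image + question -> free-text answer.
    vqa_example = {
        "input": {"image": "screenshot.png", "question": "What is in the cart?"},
        "output": "Two items: a lamp and a phone case.",
    }

    # Agentic supervision: screenshot + instruction -> executable actions.
    agent_example = {
        "input": {"image": "screenshot.png", "instruction": "Clear my shopping cart."},
        "output": [
            {"action": "click", "element": "cart_icon"},
            {"action": "click", "element": "remove_item_button"},
            {"action": "click", "element": "confirm_button"},
        ],
    }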

R-MCTS for Test-Time Compute

To address this gap without retraining large models like GPT-4, a new algorithm called Reflective Monte Carlo Tree Search (R-MCTS) was introduced [00:28:44]. R-MCTS is a search algorithm that:

  • Explores vast action spaces and improves decision-making by incrementally constructing a search tree [00:29:01].
  • Incorporates contrastive reflection, allowing agents to learn from past interactions and dynamically improve search efficiency [00:29:11].
  • Utilizes a memory module (a vector database) to cache learned experiences (successes and errors) from completed tasks, which are then retrieved for future tasks to enhance decision-making (see the sketch below) [00:29:39].
  • Employs multi-agent debate for more robust and reliable state evaluation, counteracting biases from a single model’s prompt [00:30:21].

Through these modifications, R-MCTS (a test-time compute scaling method) significantly improves performance on popular AI agent benchmarks such as VisualWebArena (web-browsing tasks) and OSWorld (Linux desktop tasks) without additional human supervision [00:31:40]. It outperforms other search algorithms as well as non-search methods [00:32:21].
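A sketch of the memory-module idea follows: reflections from completed tasks are embedded and cached, and the most similar ones are retrieved for a new task to inform the search. The embed function, the in-memory store, and the field names are assumptions for illustration; the actual system uses a vector database.

    from dataclasses import dataclass

    @dataclass
    class Reflection:
        task: str
        lesson: str   # e.g., "confirm the cart page loaded before clicking remove"

    def embed(text: str) -> list[float]:
        raise NotImplementedError  # placeholder for any sentence-embedding model

    class ReflectionMemory:
        def __init__(self) -> None:
            self.items: list[tuple[list[float], Reflection]] = []

        def add(self, reflection: Reflection) -> None:
            # Cache the experience, keyed by an embedding of the task description.
            self.items.append((embed(reflection.task), reflection))

        def retrieve(self, task: str, k: int = 3) -> list[Reflection]:
            # Return the k reflections whose tasks are most similar to the new one.
            def cosine(a: list[float], b: list[float]) -> float:
                dot = sum(x * y for x, y in zip(a, b))
                na = sum(x * x for x in a) ** 0.5
                nb = sum(y * y for y in b) ** 0.5
                return dot / (na * nb + 1e-9)

            query = embed(task)
            ranked = sorted(self.items, key=lambda item: cosine(query, item[0]), reverse=True)
            return [r for _, r in ranked[:k]]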

Exploratory Learning for Base Model Improvement

The knowledge obtained through search algorithms like R-MCTS can also be transferred into the training of base LLMs [00:33:30]. Instead of simple imitation learning (directly transferring only the best action found), exploratory learning treats the entire tree-search process as a single trajectory [00:33:55]. The model is trained on a linearized version of the search-tree traversal, which motivates it to learn how to explore, backtrack, and evaluate its own actions [00:34:03]. This enables the model to improve its decision processes autonomously [00:34:21].
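The following toy sketch shows what “linearizing” a search tree might look like: a depth-first traversal is serialized into a single trajectory that records exploration, backtracking, and the final choice, and that sequence becomes the fine-tuning target. The node structure and serialization format are assumptions, not the paper’s actual representation.

    from dataclasses import dataclass, field

    @dataclass
    class Node:
        action: str
        value: float
        children: list["Node"] = field(default_factory=list)

    def linearize(node: Node) -> list[str]:
        # Depth-first traversal recording exploration, backtracking, and commitment.
        trace = [f"explore: {node.action} (value={node.value:.2f})"]
        for child in node.children:
            trace += linearize(child)
            trace.append(f"backtrack to: {node.action}")
        if node.children:
            best = max(node.children, key=lambda c: c.value)
            trace.append(f"commit: {best.action}")
        return trace

    # The flattened trace (joined into text) is used as a training trajectory,
    # so the base model learns to explore, evaluate, and backtrack on its own.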

Future Directions and Challenges

The reflective MCTS models relying on test-time compute scaling can significantly enhance AI agent performance [00:35:27]. Future work includes:

  • Reducing reliance on search trees: Developing better methods that don’t depend entirely on extensive search [00:35:57].
  • Model Predictive Control: Minimizing expensive environment setup and interaction costs [00:36:02].
  • Improving Control and Autonomous Exploration: Integrating these capabilities within an agent orchestration layer [00:36:29].

Addressing these advancements requires combining expertise from machine learning, systems, human-computer interaction (HCI), and security [00:37:06]. Current benchmarks primarily focus on single agents performing single tasks [00:37:25]. Future challenges and research directions include:

  • System-level problems: Handling multiple tasks concurrently on the same computer, scheduling, database interactions, and security measures (e.g., human handover, supervision requests) [00:37:48].
  • Multi-user, Multi-agent planning: Managing scenarios where multiple humans interact with multiple agents and assign different tasks, leading to complex, diverse settings [00:38:04].

Establishing more realistic benchmarks with system integrations and algorithms that consider not only task completion but also efficiency and security will form the foundation for future AI agent applications [00:38:25].