From: aidotengineer
AI agents represent a significant evolution in computing, moving beyond the capabilities of simple chatbots to perform complex tasks autonomously [00:00:08]. The field has garnered considerable attention, with figures like Bill Gates describing it as “the biggest revolution in computing” [00:00:26] and Andrew Ng calling it “massive AI progress” [00:00:30]. Sam Altman of OpenAI has even suggested that 2025 could be “the year of agents” [00:00:36].
Despite the enthusiasm, there are also criticisms, with some viewing agents as mere wrappers around large language models (LLMs) that struggle with planning and practical solutions, as early attempts like AutoGPT showed [00:00:43]. However, the core concept of agents is not new; it is the added power of modern LLMs that has made them far more capable today [00:01:08].
What is an AI Agent?
An AI agent is a system that interacts with its environment through a cyclical process involving:
- Perception: Agents sense information from their environment through various modalities like text, images, audio, video, and touch [00:01:20].
- Reasoning: This involves processing perceived information, breaking down tasks into individual steps, utilizing environmental inputs, and determining appropriate tools or actions. This inner planning process is often referred to as “chain of thought” reasoning, powered by LLMs [00:01:37]. Agents can also perform “meta-reasoning” steps, such as “reflection,” where they evaluate past actions and adjust their approach if necessary [00:02:10].
- Action: Agents perform actions based on their reasoning, which can include talking to humans, moving between locations, or interacting with digital interfaces [00:02:25].
In essence, AI agents interact with their environment through a continuous perceive-reason-act loop, turning their decisions into concrete actions [00:02:41]; a minimal sketch of this loop follows.
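The sketch below illustrates the loop in Python. The `Environment` interface, the `call_llm` helper, and the prompt wording are hypothetical placeholders for illustration, not a specific framework or API.

```python
def call_llm(prompt: str) -> str:
    """Placeholder for any chat-completion API call."""
    raise NotImplementedError

def run_agent(environment, goal: str, max_steps: int = 10) -> None:
    history = []
    for _ in range(max_steps):
        observation = environment.observe()            # perception: text, screenshots, audio, ...
        plan = call_llm(
            f"Goal: {goal}\nObservation: {observation}\nHistory: {history}\n"
            "Think step by step, then state the next action on the last line."
        )                                              # reasoning: chain-of-thought planning
        action = plan.splitlines()[-1]                 # the chosen action
        result = environment.act(action)               # action: click, speak, move, call a tool
        history.append((observation, action, result))  # kept for reflection on later steps
        if environment.done():
            break
```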
Levels of Autonomy in AI Agents
The deployment complexity and autonomy of AI agents can be understood through an analogy with self-driving cars, which also exhibit levels of autonomy [00:03:02].
- Level 1: Chatbot (2017 onwards)
- Primarily focused on retrieving information [00:03:12].
- Level 2: Agent Assist
- An agent generates suggested responses, but a human must approve the final message (e.g., customer service) [00:03:20].
- Level 3: Agent as a Service
- LLMs automate AI workflows for specific tasks (e.g., meeting bookings, writing job descriptions) [00:03:35].
- Level 4: Autonomous Agents
- An agent can delegate and perform multiple, interrelated tasks that share components, knowledge, and resources [00:03:51].
- Level 5: Fully Autonomous (Jarvis-like)
- Agents are fully trusted, entrusted with sensitive credentials such as security keys, and perform actions entirely on behalf of the user [00:04:16]. This level is comparable to a Jarvis-like AI agent from Iron Man [00:04:18].
Unlike high-risk applications like self-driving cars where errors can be catastrophic, AI agents for general tasks can be segmented into low-risk and high-risk categories [00:05:06]. Low-risk tasks (e.g., filing reimbursements) can initially have human supervision and gradually build trust for automation [00:05:12]. Customer-facing tasks are generally considered higher risk [00:05:27].
Improving AI Agent Performance
Research focuses on enhancing LLMs for building effective AI agents by improving their reasoning and reflection capabilities, eliciting better behaviors, and learning from past interactions [00:05:41].
Enhancing Reflection and Self-Improvement
LLMs can solve reasoning tasks using methods like:
- Few-shot prompting: Providing examples of similar problems and their answers as context [00:06:40].
- Chain of thought: Instructing the model to “think step by step,” allowing it to reason over tokens to reach an answer [00:07:01].
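For instance, a few-shot chain-of-thought prompt might look like the sketch below; the problems and wording are illustrative assumptions, not taken from the talk.

```python
# Illustrative prompt construction only.
few_shot_cot_prompt = """\
Q: A store had 23 apples and sold 9. How many are left?
A: Let's think step by step. 23 - 9 = 14. The answer is 14.

Q: A class has 3 rows of 8 chairs. How many chairs are there in total?
A: Let's think step by step."""
# The model completes the reasoning tokens and then the final answer,
# conditioning on the worked example given as context (few-shot + chain of thought).
```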
A more advanced technique combines these methods: self-refinement or self-improvement [00:07:23]. This involves:
- Asking the LLM to solve a problem step-by-step [00:07:41].
- Prompting the LLM to generate feedback on its initial answer (reflection) [00:07:52].
- Combining the reflection with the original question and answer to prompt the model again, allowing it to update its answer and internal processes [00:08:18]. This process can be iterated until a correct answer is reached [00:08:43].
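A minimal sketch of this answer-reflect-revise loop, assuming a generic `call_llm` chat-completion placeholder and illustrative prompts:

```python
def call_llm(prompt: str) -> str: ...  # placeholder for any chat-completion call

def self_refine(question: str, max_iters: int = 3) -> str:
    """Iterative self-refinement: answer -> reflect -> revise until judged correct."""
    answer = call_llm(f"{question}\nThink step by step and give a final answer.")
    for _ in range(max_iters):
        feedback = call_llm(
            f"Question: {question}\nProposed answer: {answer}\n"
            "Point out any mistakes in the reasoning, or reply 'CORRECT'."
        )
        if "CORRECT" in feedback:
            break
        answer = call_llm(
            f"Question: {question}\nPrevious answer: {answer}\n"
            f"Feedback: {feedback}\nUse the feedback to produce a revised answer."
        )
    return answer
```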
Challenges with Smaller LLMs
While effective for larger LLMs, this self-improvement process can be problematic for smaller, cost-efficient models (e.g., LLaMA 7B) [00:08:58]. The feedback generated by smaller models often contains “noise” that can propagate and degrade results, a phenomenon described as “the blind leading the blind” [00:09:16]. Additionally, the internal logic or “demonstrations” of large LLMs may be incompatible with smaller models, rendering their feedback useless [00:09:52]. Just as one simplifies explanations for a child, feedback for smaller models must be “dumbed down” to match their internal logic [00:10:36].
The Wiass Method
To address these challenges, the Wiass method proposes helping smaller models acquire self-improvement capabilities by distilling information from larger LLMs [00:10:44]. Instead of smaller models generating their own feedback, a large LLM or external tool (like Python scripts for math tasks) is used to edit and tailor the smaller model’s feedback [00:11:10]. This corrected feedback then guides the smaller model’s iterative updates [00:11:55].
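A rough sketch of this idea follows: the small model still drives the loop, but its raw critique is rewritten by a stronger model before being reused. The function names, prompts, and loop structure are assumptions for illustration, not the actual Wiass implementation.

```python
def improve_with_edited_feedback(question: str, small_llm, large_llm, max_iters: int = 3) -> str:
    """The small model drives the loop, but its critique is corrected by a larger model
    (or an external tool such as a Python checker) before being fed back to it."""
    answer = small_llm(f"{question}\nSolve step by step.")
    for _ in range(max_iters):
        raw_feedback = small_llm(
            f"Question: {question}\nAnswer: {answer}\nCritique this answer."
        )
        edited_feedback = large_llm(  # the stronger model tailors the critique to the small model
            "Rewrite this critique so it is correct and simple enough for a smaller model:\n"
            f"Question: {question}\nAnswer: {answer}\nCritique: {raw_feedback}"
        )
        answer = small_llm(
            f"Question: {question}\nPrevious answer: {answer}\n"
            f"Feedback: {edited_feedback}\nRevise your answer."
        )
    return answer
```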
By collecting traces of successful trial-and-error processes, these models can be fine-tuned using on-policy data [00:12:21]. This approach has shown significant improvements (up to 48% accuracy after three iterations) in mathematical reasoning problems, outperforming supervised training on static data [00:13:37]. The key takeaway is that models can learn self-reflection without explicit human supervision by using synthetic data generation [00:15:01].
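A sketch of how such successful traces might be collected into fine-tuning data; the field names, filtering criterion, and JSONL format are illustrative assumptions:

```python
import json

def collect_traces(problems, run_trial, is_correct):
    """Keep only the trial-and-error trajectories that end in a correct answer."""
    traces = []
    for problem in problems:
        trace = run_trial(problem)  # full attempt -> feedback -> revision sequence
        if is_correct(trace["final_answer"], problem["gold_answer"]):
            traces.append({"prompt": problem["question"],
                           "completion": trace["full_trajectory"]})
    return traces

def save_for_finetuning(traces, path="self_improvement_traces.jsonl"):
    with open(path, "w") as f:
        for t in traces:
            f.write(json.dumps(t) + "\n")
```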
Eliciting Stronger Model Behavior (Test-Time Scaling)
While pre-training LLMs is often limited by compute, data size, and parameter size [00:16:11], a promising direction is “test-time scaling” [00:17:27]. This involves taking an existing pre-trained model and giving it more steps or budget during inference to achieve better results [00:17:31]. Examples include instructing the model to “think step by step” or perform reflection [00:17:41].
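One simple and common instance of test-time scaling is sketched below: spending a larger inference budget by sampling several step-by-step solutions and majority-voting over the final answers (self-consistency). This is an illustrative example rather than the specific method from the talk; `call_llm` is a placeholder.

```python
from collections import Counter

def call_llm(prompt: str) -> str: ...  # placeholder for any chat-completion call

def answer_with_budget(question: str, n_samples: int = 8) -> str:
    """Spend more inference compute: sample several step-by-step solutions
    and majority-vote over the final answers."""
    answers = []
    for _ in range(n_samples):
        solution = call_llm(
            f"{question}\nThink step by step, then give the final answer on the last line."
        )
        answers.append(solution.splitlines()[-1].strip())
    return Counter(answers).most_common(1)[0][0]
```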
Tree Search for Sequential Decision-Making
For sequential decision-making tasks (e.g., dialogue), AI agents need to strategize and plan ahead [00:18:37]. This is analogous to playing chess, where anticipating opponent moves is crucial [00:19:44]. Tree search algorithms, like those used in AlphaGo, enable an agent to:
- Propose potential moves [00:20:27].
- Simulate outcomes of those moves [00:20:30].
- Evaluate the quality of the outcomes [00:20:36].
- Repeat this process multiple times to find the strongest move [00:20:40].
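A bare-bones propose-simulate-evaluate loop capturing the four steps above (without AlphaGo's full selection and backpropagation machinery) might look like this sketch; `propose`, `simulate`, and `evaluate` are caller-supplied placeholders:

```python
def tree_search(state, propose, simulate, evaluate, n_rounds: int = 20, k: int = 3):
    """Propose -> simulate -> evaluate, repeated over several rounds,
    then return the move with the best average score."""
    totals, counts = {}, {}
    for _ in range(n_rounds):
        for move in propose(state, k):        # propose candidate moves
            outcome = simulate(state, move)   # roll the move forward
            score = evaluate(outcome)         # judge the resulting state
            totals[move] = totals.get(move, 0.0) + score
            counts[move] = counts.get(move, 0) + 1
    return max(totals, key=lambda m: totals[m] / counts[m])
```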
In conversational settings, a method called GDP-Zero applies Monte Carlo Tree Search (MCTS) without requiring training data [00:20:53]. It leverages LLMs to:
- Search for promising actions (policy) [00:21:26].
- Simulate action outcomes [00:21:33].
- Evaluate action quality [00:21:43].
- Simulate opponent behavior (e.g., user responses) [00:22:04].
Unlike closed-loop MCTS, conversational tasks require “open-loop MCTS” to account for the variance in human responses by stochastically sampling possible simulated paths [00:23:01]. For tasks with clear, objective goals (e.g., persuading someone to donate to charity), GDP-Zero can generate more competitive results without specific training [00:23:57]. Evaluations with LLMs and human studies show that agents using this planning algorithm are more persuasive, natural, and coherent [00:24:49]. These models also self-discover task information, like the importance of not asking for donations too early [00:25:08], and learn to diversify strategies [00:25:30].
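A sketch of the open-loop flavor of this search: each candidate system utterance is scored against several stochastically sampled user replies rather than a single deterministic rollout. The prompts, scoring scheme, and `call_llm` helper are assumptions, not GDP-Zero's actual implementation.

```python
def call_llm(prompt: str, temperature: float = 1.0) -> str: ...  # placeholder

def plan_next_utterance(dialogue: str, goal: str, n_candidates: int = 3, n_rollouts: int = 5) -> str:
    """Score each candidate system turn against several sampled user replies,
    since real users respond with high variance (open-loop simulation)."""
    best_utterance, best_score = "", float("-inf")
    for _ in range(n_candidates):
        utterance = call_llm(
            f"Goal: {goal}\nDialogue so far:\n{dialogue}\nPropose the next system turn."
        )
        total = 0.0
        for _ in range(n_rollouts):
            user_reply = call_llm(f"{dialogue}\nSystem: {utterance}\nUser:", temperature=1.0)
            total += float(call_llm(
                f"Rate from 0 to 1 how much this exchange advances the goal '{goal}':\n"
                f"System: {utterance}\nUser: {user_reply}\nScore:"
            ))
        if total / n_rollouts > best_score:
            best_utterance, best_score = utterance, total / n_rollouts
    return best_utterance
```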
Expanding to General AI Agent Tasks
To enable LLMs to perform various AI agent tasks beyond conversation, they need to perceive the world visually, processing both images and text [00:26:21]. Traditional Visual Question Answering (VQA) models are not sufficient for agent tasks that involve computer screenshots and action sequences (e.g., “clear my shopping carts” requires clicking buttons) [00:27:19]. While humans can achieve 88% success on benchmarks like VisualWebArena (navigating browsers) and OSWorld (interacting with a Linux environment), models like GPT-4V without planning achieve only 16% [00:28:02].
The R-MCTS (Reflective Monte Carlo Tree Search) algorithm has been introduced to address this [00:28:48]. It extends MCTS by incorporating:
- Contrastive Reflection: Allows agents to learn from past interactions and dynamically improve search efficiency [00:29:11]. This is achieved through a memory module that caches learned experiences (successes or errors) into a vector database for future retrieval and decision enhancement [00:29:39].
- Multi-Agent Debate: Improves the reliability of state evaluation. Instead of a single LLM judging progress on its own, two or more agents debate why an action is good or bad, counterbalancing individual biases [00:29:21].
R-MCTS builds a search tree on the fly, providing reliable state estimates through multi-agent debate and performing contrastive self-reflection at the end of each task to learn from experience [00:31:07]. This approach outperforms other search algorithms and non-search methods on benchmarks like VisualWebArena and OSWorld, demonstrating improved performance without additional human supervision [00:32:18].
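The two components above might be sketched as follows; the memory layout, prompts, and `call_llm`/`embed` placeholders are assumptions rather than the paper's actual implementation (the vector database is simplified to an in-memory list).

```python
def call_llm(prompt: str) -> str: ...       # placeholder chat-completion call
def embed(text: str) -> list[float]: ...    # placeholder embedding call

class ReflectionMemory:
    """Caches lessons from past successes and failures, retrieved by similarity."""
    def __init__(self):
        self.entries = []  # (task embedding, lesson)

    def add(self, task: str, trajectory: str, succeeded: bool):
        lesson = call_llm(
            f"Task: {task}\nTrajectory: {trajectory}\n"
            f"Outcome: {'success' if succeeded else 'failure'}\n"
            "State one reusable lesson for similar tasks."
        )
        self.entries.append((embed(task), lesson))

    def retrieve(self, task: str, k: int = 3):
        q = embed(task)
        dot = lambda v: sum(a * b for a, b in zip(q, v))
        return [lesson for _, lesson in sorted(self.entries, key=lambda e: -dot(e[0]))[:k]]

def debate_value(state_description: str) -> float:
    """Two agents argue for and against progress; a judge scores the state,
    counterbalancing the bias of a single evaluator."""
    pro = call_llm(f"Argue why this state is close to completing the task:\n{state_description}")
    con = call_llm(f"Argue why this state is far from completing the task:\n{state_description}")
    return float(call_llm(
        f"Given these arguments, score progress from 0 to 1.\nFor: {pro}\nAgainst: {con}\nScore:"
    ))
```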
Improving Base LLM with Generated Data (Exploratory Learning)
The knowledge gained through these advanced search and self-improvement processes can be transferred back to train the base LLM [00:33:30].
Instead of traditional “imitation learning” (direct training on the best action found in the tree), “exploratory learning” treats the entire tree search process as a single trajectory [00:33:55]. The model learns how to linearize the search tree traversal, motivating it to learn how to explore, backtrack, and evaluate [00:34:03]. This teaches the model to improve its decision processes by itself, showing significant gains over imitation learning, especially with constrained test-time budgets [00:34:21].
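A sketch of how a search episode could be linearized into a single training trajectory for exploratory learning, as opposed to keeping only the best action for imitation learning; the node fields and step format are illustrative assumptions.

```python
def linearize_search_tree(root) -> str:
    """Turn a whole search episode into one training trajectory (exploratory learning),
    instead of keeping only the final best action (imitation learning)."""
    steps = []

    def visit(node):
        steps.append(f"EXPLORE action={node.action} value={node.value:.2f}")
        for child in node.children:
            visit(child)
            steps.append(f"BACKTRACK to action={node.action}")  # explicit backtracking step
        if node.is_best:
            steps.append(f"COMMIT action={node.action}")

    visit(root)
    return "\n".join(steps)

# The resulting string becomes the target completion for fine-tuning, so the base model
# learns to explore, evaluate, and backtrack on its own under a test-time budget.
```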
Future Directions and Challenges
The field of AI agent ecosystems is rapidly evolving [00:36:09]. Key areas of ongoing research and development include:
- Reducing Reliance on Search Trees: Developing better reinforcement learning (RL) methods or model predictive control to make the expensive environment setup and interaction more efficient [00:35:55].
- Improved Control and Autonomous Exploration: Enhancing the agent orchestration layer, as seen in frameworks like ARCollex, which offers features like continuous learning and task decomposition [00:36:32].
- Interdisciplinary Approaches: Combining machine learning expertise with systems, human-computer interaction (HCI), and security expertise to advance systems in a deeper, more practical way [00:37:06].
- Multi-Agent and Multi-User Planning: Moving beyond single-agent, single-task benchmarks to address challenges in scenarios where multiple humans interact with multiple agents simultaneously on the same computer [00:37:37]. This introduces complex system-level problems like scheduling, database interaction (to avoid side effects), security (human handover points), and quantifying adversarial settings [00:37:48].
- Realistic Benchmarks: Establishing benchmarks that consider not only task completion but also efficiency and security for future applications [00:38:25].
The development of AI agents, with all its challenges and benefits, is significant, pushing the boundaries of what LLMs can achieve autonomously [00:00:12].