From: aidotengineer
AI agents are generating significant excitement in the computing world, with figures like Bill Gates calling them the “biggest revolution in computing” [00:00:29], Andrew Ng noting “massive AI progress” [00:00:32], and Sam Altman predicting 2025 will be “the year of agents” [00:00:39]. Despite some negative voices dismissing them as mere wrappers around large language models (LLMs) or criticizing examples like AutoGPT for failing to solve practical problems [00:00:43], the advancements in AI agents are substantial. Agents are not a new concept, but the power of large language models has significantly enhanced their capabilities [00:01:08].
What Are AI Agents?
At their core, AI agents interact with an environment by perceiving it and acting on it [00:02:41]. This process generally involves four key steps (a minimal sketch of the loop follows the list below):
- Perception: Agents need to understand their environment by sensing information from various sources like text, images, audio, video, and touch [00:01:20].
- Reasoning: Once information is perceived, the agent processes it to work out how to complete the task, breaks the task into individual steps, and uses environmental inputs to choose appropriate tools or actions. This inner planning process is often referred to as “chain-of-thought reasoning,” frequently powered by large language models [00:01:37].
- Reflection (Meta-reasoning): Agents can perform meta-reasoning steps, asking themselves if a chosen action was correct and course-correcting if needed [00:02:10].
- Actions: These are the physical or digital performances an agent undertakes, such as talking to humans, moving from one point to another, or manipulating interfaces [00:02:25].
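As referenced above, a minimal sketch of this perceive-reason-reflect-act loop might look like the following. The `env` object, `llm` callable, and prompts are illustrative assumptions, not anything specified in the talk.

```python
# A minimal perceive-reason-reflect-act loop. `env` and `llm` are hypothetical
# stand-ins (any object with observe()/step(), any text-completion callable);
# they are not APIs from the talk.

def run_agent(env, llm, max_steps: int = 10) -> None:
    for _ in range(max_steps):
        obs = env.observe()  # perception: text, screenshot, audio, ...
        plan = llm(
            f"Observation: {obs}\n"
            "Think step by step and propose one next action."
        )  # reasoning (chain of thought)
        action = llm(
            f"Observation: {obs}\nProposed action: {plan}\n"
            "If this action is wrong, output a corrected action; "
            "otherwise repeat it verbatim."
        )  # reflection (meta-reasoning)
        if env.step(action):  # actuation; assume step() returns True when the task is done
            return
```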
Levels of AI Agent Autonomy
The deployment of AI agents can be understood through an analogy to the levels of autonomy in self-driving cars [00:03:02]:
- Level 1: Chatbot (2017): Simple information retrieval [00:03:12].
- Level 2: Agent Assist: An agent uses an LLM to generate suggested responses, but a human must approve the message before sending (e.g., customer service) [00:03:20].
- Level 3: Agent as a Service: LLMs automate AI workflows for specific tasks, acting as a service (e.g., meeting bookings, writing job descriptions) [00:03:35].
- Level 4: Autonomous Agents: An agent can delegate and perform multiple tasks that have shared components, knowledge, and resources [00:03:51].
- Level 5: Jarvis/Iron Man-level Agents: Full trust and delegation, including security-critical operations, allowing agents to act on behalf of the user 100% autonomously [00:04:16].
Self-driving cars are an example of an agent performing perception, reasoning, planning, and execution, but they are high-risk because errors can affect lives [00:04:37]. AI agent tasks can likewise be separated into low-risk back-office tasks (e.g., filing reimbursements with human supervision) and high-risk customer-facing tasks. Over time, the trend is to move from back-office automation to front-office deployment [00:05:06].
Progress in Improving AI Agents
Recent advancements in AI agents focus on three key areas:
- Improving LLMs for better reasoning and reflection [00:05:41].
- Eliciting better behaviors from existing LLMs for AI agent tasks [00:05:52].
- Learning from examples and traces to optimize LLMs for AI agent tasks [00:06:00].
Improving Reasoning and Reflection
Traditional methods for mathematical reasoning with LLMs include:
- Few-shot prompting: Providing examples of similar problems and their answers as context [00:06:40].
- Chain of thought prompting: Instructing the model to “think step by step,” allowing it to reason over tokens to reach an answer [00:07:01].
More recently, a combined approach uses a “reflection” or “self-improvement” process. The model generates an initial answer, then is prompted to generate feedback on its own answer (e.g., “in step two, blah blah blah is incorrect…”). This feedback is then combined with the original question and answer to prompt the model again, allowing it to update and refine its answer [00:07:23]. This process can be iterated until a correct answer is reached [00:08:43].
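A rough sketch of that generate-feedback-refine loop is shown below; the prompts and the `llm` text-completion callable are assumptions for illustration, not the exact setup from the talk.

```python
# Sketch of the generate -> self-feedback -> refine loop. `llm` is any
# text-in/text-out completion function; the prompts are illustrative.

def self_refine(llm, question: str, max_rounds: int = 3) -> str:
    answer = llm(f"Question: {question}\nThink step by step and give a final answer.")
    for _ in range(max_rounds):
        feedback = llm(
            f"Question: {question}\nAnswer: {answer}\n"
            "Check each step. If something is wrong, say which step and why; "
            "if everything is correct, reply exactly 'CORRECT'."
        )
        if feedback.strip() == "CORRECT":
            break
        answer = llm(
            f"Question: {question}\nPrevious answer: {answer}\n"
            f"Feedback: {feedback}\nRewrite the answer, fixing the issues above."
        )
    return answer
```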
However, challenges arise with smaller LLMs (e.g., Llama 7B), where the generated feedback can contain noise that propagates into the correction steps, leading to worse results (“the blind leading the blind”) [00:08:58]. In addition, feedback generated by larger models may not match a smaller model’s internal logic, making it unhelpful [00:09:52].
To address this, the Wiass method was proposed [00:11:37]; a schematic sketch follows the list below:
- A smaller model generates an initial answer and self-feedback.
- A larger LLM (or external tool like Python scripts) edits this feedback to be more tailored for the smaller model’s internal logic [00:11:18].
- This corrected feedback is used to update the answer, iterating until the problem is solved [00:12:01].
- The trajectories of these trial-and-error processes are filtered into a balanced set to train smaller models for self-improvement under the guidance of larger models or tools [00:12:26].
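As noted above, the loop can be sketched schematically: the small model drafts answers and self-feedback, an editor (a larger LLM or an external tool such as a Python checker) rewrites the feedback to match the small model’s reasoning, and the trajectories are collected for later training. All names below are placeholders, not the paper’s code.

```python
# Placeholder sketch of small-model self-improvement guided by an editor
# (a larger LLM or an external tool). Nothing here is the released implementation.

def guided_self_improve(small_llm, editor, verify, question, max_iters=4):
    answer = small_llm(f"Question: {question}\nSolve this step by step.")
    trajectory = [answer]  # trial-and-error trace, later filtered into training data
    for _ in range(max_iters):
        if verify(question, answer):  # e.g. an exact-match or unit-test check
            break
        raw_feedback = small_llm(
            f"Question: {question}\nAnswer: {answer}\nCritique this answer step by step."
        )
        feedback = editor(question, answer, raw_feedback)  # larger LLM / tool edits the critique
        answer = small_llm(
            f"Question: {question}\nPrevious answer: {answer}\n"
            f"Feedback: {feedback}\nGive a corrected answer."
        )
        trajectory.append((feedback, answer))
    return answer, trajectory
```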
This approach shows improved performance, particularly on mathematical reasoning tasks, demonstrating that this kind of on-policy self-improvement signal is useful and that models can learn to self-improve without explicit human supervision [00:13:48]. The remaining limitation is the ceiling imposed by the verifier (the larger LLM or tool): if it cannot edit the feedback correctly, the smaller model cannot distill useful information from it [00:15:18].
Eliciting Better Behaviors and Planning
While pre-training large language models is resource-intensive and often limited by compute, data size, and parameter size (scaling laws) [00:16:11], a promising direction is test-time scaling [00:17:29]. This involves taking an existing pre-trained model and giving it more steps or budget during inference (e.g., “think step by step,” reflection) to achieve better results [00:17:33].
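One common way to spend that extra inference budget is self-consistency: sample several chain-of-thought answers and take a majority vote over their final answers. The sketch below is an illustrative instance of test-time scaling in general, not the specific recipe described in the talk.

```python
# One common instance of test-time scaling: self-consistency. Sample several
# chain-of-thought completions and majority-vote on the final answers.
# `llm` is a sampling text-completion callable; this is illustrative only.

from collections import Counter

def self_consistency(llm, question: str, n_samples: int = 8) -> str:
    votes = []
    for _ in range(n_samples):
        completion = llm(
            f"Question: {question}\nThink step by step, then end with 'Answer: <value>'."
        )
        votes.append(completion.split("Answer:")[-1].strip())
    return Counter(votes).most_common(1)[0][0]
```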
For sequential decision-making tasks, like dialogue or games such as chess, agents need to plan ahead and simulate future moves, much as chess grandmasters plan many moves ahead [00:19:49]. Algorithms like AlphaGo use Monte Carlo Tree Search (MCTS), which proposes moves, simulates outcomes, evaluates their value, and repeats to find stronger moves [00:20:22].
The GDP-Zero (zero-training MCTS) work adapted these ideas to conversational settings, specifically donation-persuasion tasks [00:20:54]. It prompts a large language model for:
- Searching for potential promising actions [00:21:26].
- Simulating action outcomes [00:21:33].
- Evaluating action quality after interacting with the environment [00:21:43].
- Simulating the opponent’s (user’s) behavior [00:22:05].
Unlike closed-loop MCTS, this approach uses open-loop MCTS to account for the stochastic variance in human responses [00:23:10]. Evaluations showed that agents using this planning algorithm achieved better donation rates and were perceived as more convincing, natural, and coherent. The models also self-discovered task knowledge, such as not asking for donations too early and diversifying persuasion strategies (emotional vs. logical appeals) [00:25:08].
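A simplified sketch of the prompting-based planning idea, reduced to a one-step lookahead rather than a full search tree: the same LLM proposes candidate utterances, simulates the user’s reply (resampled each rollout, in the open-loop spirit), and scores the outcome. Prompts and function names are illustrative assumptions.

```python
# Simplified one-step lookahead in the spirit of prompting-based MCTS for dialogue:
# propose candidates, simulate user replies, and score each candidate.
# Prompts and names are illustrative assumptions, not the paper's implementation.

def plan_next_utterance(llm, dialogue: str, n_candidates: int = 3, n_rollouts: int = 4) -> str:
    candidates = [
        llm(f"Dialogue so far:\n{dialogue}\nPropose the persuader's next message.")
        for _ in range(n_candidates)
    ]
    scores = []
    for cand in candidates:
        total = 0.0
        for _ in range(n_rollouts):  # open loop: resample the user's reaction every rollout
            reply = llm(
                f"Dialogue:\n{dialogue}\nPersuader: {cand}\nSimulate the user's reply."
            )
            verdict = llm(
                f"Dialogue:\n{dialogue}\nPersuader: {cand}\nUser: {reply}\n"
                "How likely is a donation now? Answer with a number between 0 and 1."
            )
            try:
                total += float(verdict.strip())
            except ValueError:
                pass  # ignore unparsable evaluations
        scores.append(total / n_rollouts)
    return candidates[scores.index(max(scores))]
```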
Transferring Policy to Other Tasks / Visual Agents
The concept of MCTS can be extended to broader AI agent settings that involve tool use and manipulation beyond just talking [00:26:24]. The ExACT work focuses on teaching large language models to perceive the world visually, building on vision-language models (VLMs) that process both images and text [00:26:41].
Traditional VLMs are trained for tasks like visual question answering, but AI agent tasks require action-based interaction (e.g., clearing a shopping cart based on a screenshot) [00:27:19]. Standard models like GPT-4V, without specific planning, perform poorly on such tasks compared to humans (16% vs. 88% success on Visual Web Arena) [00:28:13].
To improve performance without retraining expensive models, the R-MCTS (Reflective Monte Carlo Tree Search) algorithm was introduced [00:28:48]. It extends MCTS by incorporating:
- Contrastive reflection: A memory module allows agents to learn from past interactions, internalize successes or errors, and save experiences to a vector database. This knowledge is then retrieved for future tasks to enhance decision-making [00:29:11].
- Multi-agent debate: Instead of single prompts, the progress evaluation is made more robust by having models debate whether an action is good or bad and why, counterbalancing biases [00:30:21].
R-MCTS builds a search tree on the fly, uses multi-agent debate for reliable state estimates, and performs contrastive self-reflection at the end of each task to improve future execution [00:31:07].
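The two additions can be sketched as a reflection memory queried before planning and a debate-style value estimate for candidate actions. The vector store, embedding function, and prompts below are hypothetical placeholders, not the released implementation.

```python
# Sketch of the two components above. `vector_db`, `embed`, and the prompts are
# hypothetical placeholders.

def retrieve_lessons(vector_db, embed, task_description: str, k: int = 3):
    # Contrastive reflections from past tasks are stored with embeddings;
    # fetch the k most similar ones to condition the current search.
    return vector_db.search(embed(task_description), top_k=k)

def debate_value(llm, state: str, action: str) -> float:
    # Multi-agent debate: argue both sides, then ask for a weighed score,
    # which counterbalances single-prompt biases.
    pro = llm(f"State: {state}\nAction: {action}\nArgue why this action makes progress.")
    con = llm(f"State: {state}\nAction: {action}\nArgue why this action is a mistake.")
    verdict = llm(
        f"Arguments for:\n{pro}\nArguments against:\n{con}\n"
        "Weigh both sides and output a progress score between 0 and 1."
    )
    try:
        return float(verdict.strip())
    except ValueError:
        return 0.5  # fall back to a neutral estimate if the verdict is unparsable
```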
Evaluations on benchmarks like Visual Web Arena (browsing tasks) and OS World (computer desktop tasks in Linux) show that R-MCTS outperforms other search algorithms and non-search methods. Augmenting vision-language models with search algorithms significantly improves performance without additional human supervision [00:32:16]. This method ranks #1 on the Visual Web Arena leaderboard and is the best non-trained method on OS World [00:33:02].
Learning from Generated Data
To transfer the knowledge obtained through search processes into the training of base large language models, Exploratory Learning was developed [00:33:43]. Unlike imitation learning (direct transfer of best actions), exploratory learning treats the entire tree search process as a single trajectory. Models learn to linearize the search tree traversal, motivating them to learn how to explore, backtrack, and evaluate actions themselves [00:33:55]. This teaches models to improve their decision processes by themselves [00:34:23].
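One way to picture the linearization is a depth-first walk over the search tree that writes out explicit explore, evaluate, and backtrack steps, so the resulting sequence teaches the model the search behaviour itself rather than only the final best path. The `Node` structure and the step labels below are assumptions for illustration, not the paper’s data format.

```python
# Illustrative linearization: a depth-first walk over a search tree that records
# explicit explore / evaluate / backtrack steps as one flat trajectory.
# The Node structure and the step labels are assumptions, not the paper's format.

from dataclasses import dataclass, field

@dataclass
class Node:
    action: str
    value: float
    children: list["Node"] = field(default_factory=list)

def linearize(node: Node, trace: list[str] | None = None) -> list[str]:
    trace = [] if trace is None else trace
    trace.append(f"explore: {node.action}")
    trace.append(f"evaluate: value={node.value:.2f}")
    for child in node.children:
        linearize(child, trace)
        trace.append(f"backtrack to: {node.action}")
    return trace  # joined into one sequence, this becomes a single training example
```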
Future Prospects and Challenges
Many advancements in AI agents, particularly those relying on test-time compute scaling and on training with refined exploratory traces, do not require human-supervised data, making them accessible in academic or smaller-company settings [00:35:27].
Future work focuses on:
- Reducing reliance on search trees and using model predictive control to cut down expensive environment setup and interaction [00:35:57].
- Improving control abilities and autonomous exploration for agent orchestration layers [00:36:29].
An open-source agent framework, ARX, has been released, offering features like continuous learning and task decomposition to provide developers more flexibility [00:36:40].
The development of AI agents requires combining expertise from machine learning, systems, human-computer interaction (HCI), and security [00:37:06]. Current benchmarks primarily focus on single agents performing single tasks [00:37:25]. However, future challenges include:
- Multi-tasking on a single computer: Addressing system-level problems like scheduling, database interaction, security, and knowing when to request human supervision or handover [00:37:40].
- Multi-user, multi-agent planning: More complex scenarios where multiple humans interact with various agents, leading to complicated and potentially adversarial settings [00:38:04].
The goal is to establish more realistic benchmarks that integrate system considerations and algorithms that account for not only task completion but also efficiency and security [00:38:25].