From: aidotengineer

The development of AI agents has been a significant topic in computing, with figures like Bill Gates, Andrew Ng, and Sam Altman highlighting their potential impact [00:00:26]. While some skepticism remains about their planning abilities and practical usefulness, a fundamental understanding of AI agents is crucial [00:00:43]. Recent advances in large language models (LLMs) have significantly increased the power of AI agents [00:01:11].

What are AI Agents?

At their core, AI agents operate through a cyclical process involving:

  • Perception: Like humans, AI agents must understand their environment by sensing information from text, images, audio, video, and touch [00:01:20].
  • Reasoning: After gathering information, AI agents process it to understand tasks, break them down into steps, and determine appropriate tools or actions [00:01:37]. This “inner planning process” is often powered by LLMs and is referred to as “chain-of-thought reasoning” [00:01:59].
  • Reflection (Meta-reasoning): AI agents can perform meta-reasoning by asking if a chosen action was correct and if they should revisit previous steps [00:02:10].
  • Actions: This involves any output or behavior an AI agent performs, such as talking to humans or moving in an environment [00:02:24].

Essentially, AI agents interact with their environment through actions [00:02:41].
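
Putting these pieces together, a minimal sketch of the perceive–reason–reflect–act loop might look like the following; the `llm` stub and `Environment` interface are placeholders for illustration, not a specific framework from the talk:

```python
# Hypothetical agent loop illustrating the perceive -> reason -> reflect -> act cycle.
# `llm` and `Environment` are placeholder stand-ins, not a particular library.

def llm(prompt: str) -> str:
    """Stub for a completion call to any LLM provider."""
    raise NotImplementedError

class Environment:
    def observe(self) -> str: ...        # perception: environment state rendered as text
    def step(self, action: str) -> str: ...  # action: execute and return the new observation

def agent_loop(env: Environment, task: str, max_steps: int = 10) -> None:
    for _ in range(max_steps):
        observation = env.observe()                                     # 1. Perception
        plan = llm(f"Task: {task}\nObservation: {observation}\n"
                   "Think step by step and propose the next action.")   # 2. Reasoning
        critique = llm(f"Plan: {plan}\nWas this action choice correct? "
                       "Should any previous step be revisited?")        # 3. Reflection
        action = llm(f"Plan: {plan}\nCritique: {critique}\n"
                     "Output the single next action to execute.")       # 4. Action
        env.step(action)
```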

Levels of AI Agent Autonomy

Similar to the levels of autonomy in self-driving cars, AI agents can be categorized by their degree of independence and trust [00:03:02]:

  • Level 1 (Chatbot): Basic information retrieval, as seen in chatbots from 2017 [00:03:12].
  • Level 2 (Agent Assist): LLMs suggest responses for human agents (e.g., customer service), requiring human approval before sending [00:03:20].
  • Level 3 (Agent as a Service): LLMs automate AI workflows for specific tasks like meeting bookings or writing job descriptions [00:03:35].
  • Level 4 (Autonomous Agents): AI agents can delegate and perform multiple tasks simultaneously, sharing components, knowledge, and resources [00:03:51].
  • Level 5 (Jarvis-like AI): Full trust is placed in AI agents, delegating all security measures and allowing them to act on behalf of the user [00:04:16].

While self-driving cars are a high-risk example of AI agents [00:04:51], general AI agents can be applied to both low-risk (e.g., filing reimbursements with human supervision) and high-risk (e.g., customer-facing tasks) operations [00:05:06].

Improving AI Agent Task Execution

Improving AI agent task execution involves enhancing LLMs’ reasoning and reflection capabilities, eliciting better behaviors, and learning from past interactions [00:05:41].

1. Enhancing Reasoning and Reflection (Self-Improvement)

LLMs can improve their reasoning through various prompting methods [00:06:37]:

  • Few-shot prompting: Providing examples of problems and their answers as context [00:06:40].
  • Chain of thought (CoT): Instructing the model to “think step by step” to reach a correct answer [00:07:01].
  • Self-refinement/Reflection: A more recent method combines CoT with feedback generation [00:07:23]. The model generates an initial answer, then produces feedback on its own solution, and finally updates the answer based on this reflection [00:07:48]. This iterative process can be repeated until a correct answer is reached [00:08:43].
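
A compact sketch of this self-refinement loop, again using a placeholder `llm` callable; the textual stopping check is an assumption, since the talk only states that iteration continues until a correct answer is reached:

```python
def self_refine(problem: str, llm, max_rounds: int = 3) -> str:
    """Generate an answer, critique it, and update it iteratively (self-refinement)."""
    answer = llm(f"{problem}\nLet's think step by step.")              # initial CoT answer
    for _ in range(max_rounds):
        feedback = llm(f"Problem: {problem}\nAnswer: {answer}\n"
                       "Point out any mistakes in this solution.")      # self-generated feedback
        if "no mistakes" in feedback.lower():                           # assumed stopping check
            break
        answer = llm(f"Problem: {problem}\nAnswer: {answer}\n"
                     f"Feedback: {feedback}\nRevise the answer.")       # reflection-based update
    return answer
```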

Challenges with Smaller LLMs

Smaller LLMs (e.g., LLaMA 7B) struggle with self-improvement because the feedback they generate can contain “noise” that propagates through the correction steps, leading to worse results [00:09:01]. In addition, feedback produced by larger models (like GPT-4) may not match a smaller model’s internal logic, making that feedback unhelpful [00:09:52].

Distillation from Larger Models (Wiass)

To address this, a method called Wiass distills self-improvement capabilities from larger models to smaller ones [00:10:44]:

  1. A smaller model attempts to solve a problem and generates initial feedback [00:11:40].
  2. A larger LLM or external tool (like Python scripts for math) edits this feedback to make it more tailored and useful for the smaller model [00:11:47].
  3. This corrected feedback is then used to update the smaller model’s answer, in an iterative process until the problem is solved [00:11:55].
  4. The successful “traces” (trial and error trajectories) are filtered and used to fine-tune the smaller models, enabling them to perform self-improvement with guidance [00:12:26].
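
A sketch of this trace-collection recipe, under the assumption that `small_llm`, `large_llm`, and a `verify` checker (e.g., a Python script that checks a math answer) are available as callables; the prompts and filtering criteria here are illustrative, not the method’s exact ones:

```python
def collect_traces(problems, small_llm, large_llm, verify, max_rounds=3):
    """Collect successful self-improvement trajectories to fine-tune the smaller model."""
    traces = []
    for problem in problems:
        trajectory = []
        answer = small_llm(f"{problem}\nLet's think step by step.")
        for _ in range(max_rounds):
            raw_feedback = small_llm(f"Problem: {problem}\nAnswer: {answer}\n"
                                     "Critique this solution.")              # 1. small model's own feedback
            feedback = large_llm(f"Problem: {problem}\nAnswer: {answer}\n"
                                 f"Draft feedback: {raw_feedback}\n"
                                 "Rewrite this feedback so it is correct and "
                                 "useful for a smaller model.")              # 2. larger model edits the feedback
            answer = small_llm(f"Problem: {problem}\nAnswer: {answer}\n"
                               f"Feedback: {feedback}\nRevise the answer.")  # 3. update the answer
            trajectory.append((raw_feedback, feedback, answer))
            if verify(problem, answer):                                      # 4. keep only successful traces
                traces.append({"problem": problem, "trajectory": trajectory})
                break
    return traces  # used as synthetic fine-tuning data for the smaller model
```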

This “on-policy self-supervision” approach, using synthetically generated data, has shown significant improvements on mathematical reasoning tasks, even outperforming training on supervised (human-labeled) data [00:13:50]. It enables models to learn self-improvement without explicit human supervision [00:15:01].

2. Eliciting Better Behaviors with Test-Time Compute

Beyond pre-training, performance can be enhanced at “test time” by allowing more steps or a larger compute “budget” during inference [00:17:29]. This is achieved through more complex chain-of-thought processing, particularly tree search [00:18:06].

Monte Carlo Tree Search (MCTS) for Dialogues

Inspired by games like chess and AlphaGo, Monte Carlo Tree Search (MCTS) can be applied to sequential decision-making tasks, such as dialogue [00:19:44]. The process involves:

  • Proposing a move/action: An LLM acts as the policy to suggest a promising action [00:21:26].
  • Simulating outcomes: An LLM simulates what would happen after the action [00:21:33].
  • Evaluating action quality: An LLM evaluates the outcome of the action after interacting with the environment [00:21:43].
  • Updating quality: Action quality is updated over time, leading to stronger moves through simulation [00:21:55].
  • Opponent simulation: Another LLM simulates the opponent’s (e.g., user’s) behavior [00:22:03].
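
A simplified sketch of one planning step in this setting, with every role played by a placeholder LLM callable (`policy_llm` proposes candidate utterances, `user_llm` returns possible simulated user replies, `eval_llm` scores an outcome); the UCT selection and running-average value update are standard MCTS bookkeeping, and sampling the simulated reply reflects the open-loop variant discussed next:

```python
import math
import random
from collections import defaultdict

def mcts_dialogue_step(state, policy_llm, user_llm, eval_llm, n_sims=20, c=1.0):
    """One planning step: run simulations, then return the highest-value utterance."""
    Q, N = defaultdict(float), defaultdict(int)        # action values and visit counts

    def uct(action):
        if N[action] == 0:
            return float("inf")                         # explore unvisited actions first
        return Q[action] + c * math.sqrt(math.log(sum(N.values())) / N[action])

    candidates = policy_llm(state)                      # LLM-as-policy proposes promising moves
    for _ in range(n_sims):
        action = max(candidates, key=uct)               # select a move to explore
        reply = random.choice(user_llm(state, action))  # open-loop: sample a simulated user reply
        value = eval_llm(state, action, reply)          # LLM evaluates the resulting outcome
        N[action] += 1
        Q[action] += (value - Q[action]) / N[action]    # running-average quality update
    return max(candidates, key=lambda a: Q[a])
```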

For conversational tasks, an “open-loop MCTS” is used to account for the variance in human responses, stochastically sampling possible simulated user replies [00:23:08]. This approach, called GDP-Zero, improves persuasion tasks without explicit training data, leading to higher donation rates and more convincing, natural, and coherent interactions [00:23:57]. Models also self-discover effective strategies, like delaying the “big ask” and diversifying persuasion tactics [00:25:08].

R-MCTS for Broader AI Agent Tasks

To extend these planning capabilities to broader AI agent tasks beyond dialogue, a new algorithm called Reflective MCTS (R-MCTS) was introduced [00:28:48]. This algorithm, applied to vision-language models (VLMs) for action-based tasks like clearing a shopping cart from a computer screenshot [00:27:19], incorporates:

  • Contrastive Reflection: Allows agents to learn from past interactions and dynamically improve search efficiency by internalizing successes or errors [00:29:11]. These experiences are saved in a vector database for future retrieval [00:29:58].
  • Multi-agent Debate: Improves state evaluation by using multiple agents to debate the quality of an action (why it’s good or bad), counterbalancing biases from a single model [00:29:21].
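
A rough sketch of how these two components could plug into state evaluation, assuming a placeholder `vector_db` with a `search` method and VLM agents exposed as callables; the prompts and the way the debate is aggregated are illustrative assumptions:

```python
def retrieve_reflections(vector_db, state, k=3):
    """Contrastive reflection: pull reflections on similar past successes/errors."""
    return vector_db.search(state, top_k=k)   # assumed embedding-store interface

def evaluate_state(state, agents, vector_db):
    """Multi-agent debate over state quality, conditioned on retrieved reflections."""
    reflections = retrieve_reflections(vector_db, state)
    arguments = []
    for stance in ("argue why this state is good", "argue why this state is bad"):
        for agent in agents:                   # each agent argues one side of the debate
            arguments.append(agent(f"Past reflections: {reflections}\n"
                                   f"State: {state}\n{stance}"))
    judge = agents[0]                          # one agent aggregates the debate (assumption)
    score = judge(f"Arguments: {arguments}\n"
                  "Given the debate, rate this state's value from 0 to 1.")
    return float(score)                        # counterbalances single-model evaluation bias
```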

R-MCTS outperforms other search algorithms on benchmarks like VisualWebArena (browsing tasks) and OSWorld (Linux computer tasks like file-system navigation or using apps) [00:31:18]. It shows that augmenting VLMs with search algorithms can improve performance without additional human supervision [00:32:46].

3. Learning from Examples and Traces (Exploratory Learning)

The knowledge gained through tree search can be transferred back into the base LLM through a training process [00:33:30].

  • Exploratory Learning: Unlike imitation learning (which transfers only the best action found), exploratory learning treats the entire tree search process as a single trajectory [00:33:55]. The model learns a linearized traversal of the search tree, which motivates it to explore, backtrack, and evaluate [00:34:03]. This helps models internalize and improve their own decision processes [00:34:21].

Under a constrained test-time computation budget, exploratory learning leads to better performance than imitation learning [00:34:32].
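
The difference between the two training signals can be sketched as follows; the tree and node fields (`traversal_order`, `backtracked`, `best_action`) and the serialization format are illustrative assumptions:

```python
def imitation_example(tree, task):
    """Imitation learning: train only on the best action found by the search."""
    best = max(tree.root.children, key=lambda n: n.value)
    return {"prompt": task, "target": best.action}

def exploratory_example(tree, task):
    """Exploratory learning: linearize the whole search traversal (explore, evaluate, backtrack)."""
    steps = []
    for node in tree.traversal_order:          # nodes in the order the search visited them
        steps.append(f"try: {node.action}")
        steps.append(f"evaluate: value={node.value:.2f}")
        if node.backtracked:                   # the search abandoned this branch
            steps.append("backtrack")
    steps.append(f"final answer: {tree.best_action()}")
    return {"prompt": task, "target": "\n".join(steps)}
```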

Future Directions and Challenges

Current benchmarks often focus on a single AI agent performing a single task [00:37:25]. Future work aims to address more complex scenarios:

  • Multi-task, Multi-agent Environments: How can a single human assign multiple tasks to a model on the same computer, and how should multiple humans interact with multiple agents concurrently [00:37:40]?
  • System-level Problems: This includes scheduling, database interaction that avoids side effects, and enhanced security measures, such as determining when to hand over to a human or request supervision [00:37:48].
  • Realistic Benchmarks: There is a need for more realistic benchmarks that integrate system considerations, focusing not only on task completion but also on efficiency and security [00:38:25].

The open-source Arcollex AI framework is being developed to integrate these research findings, providing features like continuous learning and task decomposition [00:36:40]. This involves combining machine learning expertise with systems, HCI, and security knowledge to advance AI agent systems in a deeper and more practical way [00:37:06].