From: aidotengineer

The concept of AI agents is not new, but large language models have significantly enhanced their capabilities [01:08:00]. Bill Gates is bullish on agents, calling them the “biggest revolution in computing,” and Andrew Ng describes them as “massive AI progress” [00:26:00]. Sam Altman of OpenAI predicts that 2025 will be the “year of agents” [00:36:00].

AI agents generally operate through a process involving perception, reasoning, reflection, and action [01:17:00]. Like humans, agents perceive their environment through sensory information (text, image, audio, video, touch) [01:20:00]. This information then goes through a reasoning process to understand tasks, break them down into steps, and identify appropriate tools or actions [01:37:00]. This inner planning process is often referred to as “chain-of-thought reasoning,” powered by large language models [01:59:00]. Agents can also perform “meta-reasoning,” or “reflection,” evaluating whether an action was correct and deciding to backtrack if not [02:10:00]. Finally, actions are any operations the agent performs, such as talking to humans or moving physically [02:25:00]; it is through these actuations that agents interact with their environment [02:41:00].
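The perceive–reason–reflect–act cycle above can be sketched as a minimal loop. This is an illustrative skeleton, not any specific framework; the `Agent` class and its method names are hypothetical, and the reasoning step is a stub where a real system would call a large language model.

```python
from dataclasses import dataclass, field

@dataclass
class Agent:
    """Minimal perceive-reason-reflect-act loop (hypothetical interface)."""
    history: list = field(default_factory=list)

    def perceive(self, observation: str) -> str:
        # Sensory input: text, image, audio, etc., reduced here to a string.
        self.history.append(("observation", observation))
        return observation

    def reason(self, observation: str) -> str:
        # In practice an LLM would produce a chain-of-thought plan here.
        plan = f"plan for: {observation}"
        self.history.append(("plan", plan))
        return plan

    def act(self, plan: str) -> str:
        # Actuation: tool call, message to a human, physical motion, etc.
        action = f"execute: {plan}"
        self.history.append(("action", action))
        return action

    def reflect(self, action: str, outcome: bool) -> bool:
        # Meta-reasoning: judge the action and decide whether to backtrack.
        self.history.append(("reflection", outcome))
        return outcome

agent = Agent()
obs = agent.perceive("book a meeting for Tuesday")
action = agent.act(agent.reason(obs))
ok = agent.reflect(action, outcome=True)
```

Each stage appends to a shared history, which is what later sections exploit for reflection and memory.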

To understand the ease of deploying agents, an analogy with the different levels of autonomy in self-driving cars can be used [03:02:00].

Levels of AI Agent Autonomy

Level 1: Chatbot (2017)

At this foundational level, AI agents primarily retrieve information [03:12:00].

Level 2: Agent Assist

In this stage, agents, such as customer service AI, use large language models to generate suggested responses [03:20:00]. However, a human still needs to approve the final message before it’s sent [03:28:00].

Level 3: Agent as a Service

This current level involves using large language models to automate AI workflows, functioning as a service [03:35:00]. Examples include automated meeting bookings or writing job descriptions [03:46:00].

Level 4: Autonomous Agents

At this level, an AI agent can delegate and perform multiple tasks simultaneously [03:51:00]. These tasks often have interconnections, sharing components, knowledge, and resources [03:59:00].

Level 5: Jarvis / Iron Man

This is the highest level of autonomy, where users trust agents 100% and hand over all security measures, such as keys, so the agent can perform actions on their behalf [04:16:00].

Risk Levels in AI Agent Tasks

While a self-driving car is an example of an agent performing perception, reasoning, planning, and executing driving actions, it’s considered a very high-risk application where errors could be life-threatening [04:37:00].

AI agent tasks can be separated into:

  • Low-risk tasks [05:08:00]: These include back-office tasks like filing reimbursements, where human supervision can be maintained initially and trust can be built over time for full automation [05:12:00].
  • High-risk tasks [05:28:00]: Customer-facing tasks are typically considered higher-risk [05:28:00]. The progression is often from automating back-office tasks to front-office tasks over time [05:30:00].

Improving AI Agent Performance

Research aims to improve large language models for better reasoning and reflection, elicit better behaviors for AI agent tasks, and learn from past examples to optimize models for agent tasks [05:41:00].

One method prompts a model to generate feedback on its own answer and then iteratively refine it, a process called self-refinement or self-improvement [08:12:00]. This can involve using a larger language model to edit the feedback generated by a smaller model, making it more tailored to the smaller model’s internal logic [11:18:00].
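The refinement loop can be sketched in a few lines. Here `generate` and `critique` are hypothetical stand-ins for LLM calls (the toy versions below just grow a string), and the stopping rule, returning `None` when the critic is satisfied, is an assumption for illustration.

```python
def self_refine(task, generate, critique, max_rounds=3):
    """Iteratively refine an answer using model-generated feedback.

    `generate` and `critique` stand in for LLM calls (hypothetical).
    """
    answer = generate(task, feedback=None)
    for _ in range(max_rounds):
        feedback = critique(task, answer)
        if feedback is None:  # critic is satisfied; stop refining
            break
        answer = generate(task, feedback=feedback)
    return answer

# Toy stand-ins: the "model" keeps amending the answer until the
# "critic" judges it long enough.
def generate(task, feedback):
    return task if feedback is None else feedback

def critique(task, answer):
    return answer + "!" if len(answer) < 8 else None

result = self_refine("draft", generate, critique)
```

The feedback-editing idea from the talk would slot in between `critique` and `generate`: a larger model rewrites `feedback` before the smaller model consumes it.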

Another approach focuses on “test-time scaling,” where existing pre-trained models are given more steps or a larger budget during inference to achieve better results [17:27:00]. This can involve more elaborate tree-search procedures such as Monte Carlo Tree Search (MCTS), which can significantly improve performance [18:11:00]. For example, in dialogue tasks, MCTS can simulate possible conversational strategies to drive real-world decision-making [25:50:00].
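A compact MCTS over a toy problem shows the four phases (selection, expansion, rollout, backpropagation). This is a generic illustrative implementation, not the talk’s exact algorithm; the toy environment (pick "L" or "R" three times, reward = number of "R"s) and all function names are assumptions.

```python
import math
import random

def mcts(root, actions, step, reward, is_terminal, n_sims=500, c=1.4):
    """Toy Monte Carlo Tree Search with UCB1 selection (illustrative)."""
    N, W, children = {}, {}, {}  # visit counts, total reward, expanded nodes

    def ucb(parent, child):
        if N.get(child, 0) == 0:
            return float("inf")  # always try unvisited children first
        return W[child] / N[child] + c * math.sqrt(math.log(N[parent]) / N[child])

    for _ in range(n_sims):
        path, state = [root], root
        # Selection: descend through expanded nodes via UCB1.
        while state in children and not is_terminal(state):
            state = max(children[state], key=lambda ch: ucb(path[-1], ch))
            path.append(state)
        # Expansion: add this node's children, pick one at random.
        if not is_terminal(state):
            children[state] = [step(state, a) for a in actions(state)]
            state = random.choice(children[state])
            path.append(state)
        # Rollout: play randomly to a terminal state.
        rollout = state
        while not is_terminal(rollout):
            rollout = step(rollout, random.choice(actions(rollout)))
        r = reward(rollout)
        # Backpropagation: credit every node on the path.
        for node in path:
            N[node] = N.get(node, 0) + 1
            W[node] = W.get(node, 0.0) + r

    return max(children[root], key=lambda ch: N.get(ch, 0))

random.seed(0)  # deterministic toy run
best = mcts(
    root="",
    actions=lambda s: ["L", "R"],
    step=lambda s, a: s + a,
    reward=lambda s: s.count("R"),
    is_terminal=lambda s: len(s) == 3,
)
```

In a dialogue setting, `step` would simulate the other party’s reply and `reward` would score the resulting conversation, which is where most of the inference-time budget goes.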

New algorithms like R-MCTS incorporate contrastive reflection and multi-agent debate to improve decision-making [28:48:00]. By equipping the system with a memory module, it can learn from past interactions and dynamically improve search efficiency across different tasks [29:39:00]. This method has shown significant improvements on benchmarks such as VisualWebArena and OSWorld [32:18:00].
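The memory-module idea can be sketched as a store of past reflections retrieved by task similarity. This is a hypothetical interface, not the R-MCTS implementation: a real system would use embedding similarity rather than the crude string matching below, and the example tasks and reflections are invented.

```python
from difflib import SequenceMatcher

class ReflectionMemory:
    """Toy memory module: store reflections from past tasks and retrieve
    the most similar ones to guide a new search (hypothetical interface)."""

    def __init__(self):
        self.entries = []  # (task_description, reflection) pairs

    def add(self, task: str, reflection: str) -> None:
        self.entries.append((task, reflection))

    def retrieve(self, task: str, k: int = 2) -> list:
        # Rank stored reflections by string similarity to the new task;
        # a production system would use embeddings instead.
        scored = sorted(
            self.entries,
            key=lambda e: SequenceMatcher(None, e[0], task).ratio(),
            reverse=True,
        )
        return [reflection for _, reflection in scored[:k]]

memory = ReflectionMemory()
memory.add("buy shoes on shopping site", "filter by size before sorting by price")
memory.add("post comment on forum", "log in before opening the thread")
hints = memory.retrieve("buy boots on shopping site", k=1)
```

Retrieved reflections would be injected into the search prompt, so experience on one task transfers to similar ones.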

Furthermore, “exploratory learning” can teach models how to explore, backtrack, and evaluate during a search process, leading to improved decision-making [33:55:00].
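One way to read exploratory learning is as training on traces that keep the dead ends and backtracking steps, not just the final successful path. The sketch below linearizes a depth-first search into such a trace; the tree, node names, and trace format are all invented for illustration.

```python
def exploration_trace(tree, goal, node="root", trace=None):
    """Linearize a depth-first search, recording dead ends and backtracking,
    so a model can be trained on *how* to explore (illustrative sketch)."""
    if trace is None:
        trace = []
    trace.append(f"visit {node}")
    if node == goal:
        trace.append("success")
        return True
    for child in tree.get(node, []):
        if exploration_trace(tree, goal, child, trace):
            return True
        trace.append(f"backtrack to {node}")  # dead end: record the retreat
    return False

# Toy search space: the left branch is a dead end, the right one succeeds.
tree = {"root": ["a", "b"], "a": ["a1"], "b": ["goal"]}
trace = []
found = exploration_trace(tree, "goal", trace=trace)
```

Fine-tuning on traces like this, rather than only on the winning path, is what exposes the model to exploration and recovery behavior.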

Future Directions and Best Practices for Building AI Agents

Ongoing work focuses on improving control abilities and autonomous exploration for the agent orchestration layer [36:27:00]. This involves integrating machine learning expertise with systems, HCI, and security expertise for more practical advancements [37:06:00].

Current benchmarks often focus on single agents performing single tasks [37:25:00]. Future challenges include scaling to scenarios where a single human wants a model to perform multiple tasks on the same computer, leading to system-level problems like scheduling, database interaction, security, and human-handover points [37:37:00]. Even more complex are multi-user, multi-agent planning scenarios [38:15:00]. Establishing realistic benchmarks that account for system integration, efficiency, and security, beyond just task completion, is a key area of focus [38:25:00].