From: aidotengineer

A browser agent is any AI that can control a web browser and execute tasks on behalf of the user [00:00:48]. This technology has only become feasible in the last year due to advancements in large language models (LLMs) and supporting infrastructure [00:01:06]. An example of a browser agent’s capability is purchasing a power tool online: the agent navigates the manufacturer’s website and completes the checkout process autonomously [00:00:58].

How Browser Agents Work

Most browser agents operate on a three-step loop:

  1. Observing (Observation Space) [00:01:31]: Agents analyze the current browser context to determine the next action [00:01:36]. This is achieved either by taking a screenshot (the VLM approach) or by extracting the HTML and DOM (a text-based approach) [00:01:41].
  2. Reasoning [00:01:53]: After understanding the context, the agent reasons through and determines the sequence of steps required to fulfill the user’s task [00:01:56].
  3. Taking Action (Action Space) [00:02:09]: Browser agents can then perform actions such as clicking, scrolling, or filling in text [00:02:15].

This agent loop is continuous; a new state of the browser results from an action, prompting the agent to observe again and determine the next steps [00:02:20].
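
A minimal sketch of this loop, assuming Playwright for browser control and a hypothetical llm_choose_action helper that wraps the reasoning model (an illustrative skeleton, not the implementation of any particular agent framework):

```python
# Observe -> reason -> act loop, sketched with Playwright.
# llm_choose_action is a hypothetical helper, not part of any framework's API.
from playwright.sync_api import sync_playwright

def llm_choose_action(task: str, page_html: str, screenshot: bytes) -> dict:
    """Placeholder: a real implementation would send the observation to an LLM
    and parse its chosen action. Here it simply ends the task."""
    return {"type": "done"}

def run_agent(task: str, start_url: str, max_steps: int = 25) -> bool:
    """Returns True if the model signals task completion within the step budget."""
    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()
        page.goto(start_url)
        try:
            for _ in range(max_steps):
                # 1. Observe: capture the DOM/HTML and a screenshot (VLM approach).
                page_html = page.content()
                screenshot = page.screenshot()
                # 2. Reason: ask the model for the next action given task and state.
                action = llm_choose_action(task, page_html, screenshot)
                # 3. Act: execute the action, producing a new state to observe.
                if action["type"] == "click":
                    page.click(action["selector"])
                elif action["type"] == "fill":
                    page.fill(action["selector"], action["text"])
                elif action["type"] == "done":
                    return True
            return False
        finally:
            browser.close()
```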

Use Cases for Browser Agents

Browser agents have begun to penetrate several major use cases in recent months:

  • Web Scraping [00:02:46]: Launching fleets of agents to extract information, commonly used by sales teams to find data about prospects [00:02:48].
  • Software QA [00:02:54]: Agents click around and test software before release [00:02:56].
  • Form Filling / Job Application Filling [00:03:00]: Popular for automated job prospecting tools [00:03:05].
  • Generative RPA (Robotic Process Automation) [00:03:11]: Automating traditional RPA workflows that often break [00:03:17].

Current Performance of Browser Agents

Evaluating a browser agent comes down to whether it completes a given task [00:03:40]. In practice this is complex: task datasets must be realistic, feasible, and scalable, and evaluation can be automated, manual, or use an LLM as a judge [00:03:52]. The infrastructure on which the agent runs is also a key factor in performance [00:04:30].
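
As an illustration of the LLM-as-a-judge approach, a grader might compare the task description against the agent’s final output; the sketch below assumes a hypothetical ask_llm helper and is not WebBench’s actual evaluation harness:

```python
# Hypothetical LLM-as-judge evaluation: ask a model whether the agent's
# final output satisfies the task description.
def ask_llm(prompt: str) -> str:
    """Placeholder: wire this up to a model provider of your choice."""
    raise NotImplementedError

def judge_task(task: str, agent_output: str) -> bool:
    prompt = (
        "You are grading a browser agent.\n"
        f"Task: {task}\n"
        f"Agent's final output: {agent_output}\n"
        "Answer strictly YES if the task was completed, otherwise NO."
    )
    return ask_llm(prompt).strip().upper().startswith("YES")
```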

WebBench Benchmark

WebBench is a benchmark dataset and evaluation created to assess browser agents [00:05:20]. It includes over 5,000 tasks (half open-sourced), encompassing both read and write tasks across nearly 500 different websites and categories [00:05:26]. It is considered the largest-scale dataset for web use currently available [00:05:39].

Read Tasks vs. Write Tasks

Broadly, tasks can be categorized into two types [00:04:46]:

  • Read Tasks: Primarily involve information gathering and collection (e.g., web scraping) [00:04:51].

    • Performance: Browser agents perform very well on read tasks, with leading agents achieving around 80% success rates, comparable to agents running with human-in-the-loop supervision [00:05:52]. Failures are more often related to infrastructure and network issues than to agent intelligence [00:06:27]; agents are proficient at surfing the web, finding information, and returning it [00:06:21].
    • Example: Extracting information from a complicated UI with multiple search and filtering steps, Cloudflare popups, and scrolling interactions [00:06:44].
  • Write Tasks: Involve interacting with and changing the state on a website (e.g., taking action) [00:04:54].

    • Performance: Overall performance on write tasks is significantly worse, dropping by 50% or more compared to read tasks [00:07:04]. Fully autonomous agents experience a much larger performance dip than human-supervised agents [00:07:18].
    • Reasons for Struggle:
      • Longer Trajectory: Write tasks typically require more steps, increasing the likelihood of an agent making a mistake and failing [00:07:35].
      • Complicated Interactions: Involve interacting with more complex or difficult parts of the site and user interfaces, such as data input and forms, which are more challenging than simple searching or filtering [00:07:59].
      • Login/Authentication: Write tasks often require logging in, which is challenging for agents due to managing credentials and navigating complex interactive elements [00:08:27].
      • Stricter Anti-Bot Protections: Sites with many write tasks typically have stricter anti-bot measures, and performing write tasks can even trigger CAPTCHAs [00:08:53].
    • Example: Submitting a recipe on a website involves a much longer trajectory, two login steps, and a complicated, dynamic UI in which new form fields appear as the agent enters data, often resulting in failure [00:09:18].

When combining read and write tasks, the best agent achieved about two-thirds success, while the average was just over 50% [00:09:47]. Despite these numbers, the fact that web agents can achieve such results on a challenging benchmark, given only a few years of development, is considered impressive [00:10:13].

Failure Patterns

Challenges faced by browser agents often fall into two categories:

  • Agent Failures: Situations where the agent itself is responsible for the failure, indicating a lack of intelligence or capability [00:10:52]. Examples include:

    • Inability to interact with or close a popup, blocking task completion [00:10:57].
    • Timeout issues, where the agent takes too long to complete a task [00:11:13].
  • Infrastructure Failures: Related to the framework or infrastructure the agent runs on, preventing the agent from performing its task despite its capabilities [00:11:35]. Examples include:

    • Being flagged as a bot and blocked from entering a site [00:12:04].
    • Inability to access an email verification code required for login from within the agent’s framework [00:12:09].

Improving infrastructure (e.g., handling CAPTCHAs, proxies, and login authentication) represents a significant opportunity to boost overall agent performance [00:13:02].
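
As one illustration of this infrastructure lever, routing the browser through an authenticated proxy is a common way to reduce bot-blocking; a minimal Playwright sketch, where the proxy endpoint and credentials are placeholders and no particular provider is implied:

```python
# Launching a Playwright browser through an authenticated proxy, one of the
# infrastructure levers (alongside CAPTCHA handling and login management)
# that can reduce bot-blocking failures. All values below are placeholders.
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(
        proxy={
            "server": "http://proxy.example.com:8080",  # placeholder endpoint
            "username": "PROXY_USER",
            "password": "PROXY_PASS",
        }
    )
    page = browser.new_page()
    page.goto("https://example.com")
    print(page.title())
    browser.close()
```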

Speed and Latency

A major flaw of current browser agents is their slowness [00:13:13]. Average task execution time is very high, partly because a failing agent can enter a “death spiral,” retrying continuously until it times out [00:13:20]. The slowness stems from the agent loop itself (observing, planning, reasoning, and breaking down tasks) and from retrying actions after mistakes or failed tool calls [00:13:42]. While this latency may be acceptable for asynchronous, “set and forget” applications, it is a significant problem for real-time applications and must be addressed before agents can be effective there [00:13:55].
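
One way to bound the cost of a “death spiral” is to cap both the step count and the wall-clock time per task and report failure once either budget is exhausted; a minimal sketch of such a guard (an illustrative pattern, not something prescribed in the talk):

```python
import time

# Guard against runaway retries: stop once either a step budget or a
# wall-clock budget is exhausted, rather than spinning until a hard timeout.
def run_with_budget(step_fn, max_steps: int = 25, max_seconds: float = 120.0) -> bool:
    start = time.monotonic()
    for _ in range(max_steps):
        if time.monotonic() - start > max_seconds:
            return False  # out of time: report failure instead of retrying forever
        if step_fn():     # step_fn returns True once the task is complete
            return True
    return False          # out of steps
```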

Implications for AI Engineers

For AI engineers building with browser agents, there are three key takeaways:

  1. Carefully Choose Your Use Case: The distinction between read and write use cases is critical [00:14:43].
    • Read Use Cases: Agents are already performant out of the box, making them suitable for deep research tools or mass information retrieval systems [00:14:52].
    • Write Use Cases: While agents can achieve these tasks, out-of-the-box performance may not be accurate enough [00:15:11]. Rigorous testing and building internal evaluations are crucial before production release [00:15:30].
  2. Browser Infrastructure Matters “A Ton”: The choice of browser infrastructure can significantly impact performance [00:15:53]. It’s recommended to test multiple providers as they are highly interoperable [00:16:09]. Different systems may have better CAPTCHA handling or unblocked proxies for specific sites [00:16:22]. Providers can often help unblock proxy issues [00:16:37].
  3. Adopt a Hybrid Approach: For production-scale use cases, a mix of browser agents and more deterministic workflows (e.g., Playwright scripts) is effective [00:17:12]. Browser agents excel at long-tail, dynamic, or frequently changing tasks, while deterministic workflows offer reliability and accuracy for high-volume, consistent tasks [00:17:19]. This can be thought of as laying “train tracks” for constant movement and accuracy, while using agents for more nuanced, “off-road” situations [00:17:37]; a minimal sketch of the pattern follows this list.
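
A minimal sketch of this hybrid pattern: try a deterministic Playwright path first and fall back to a browser agent for long-tail cases. The run_agent function stands in for the agent loop sketched earlier, and the selectors and URL are placeholders:

```python
# Hybrid pattern: a deterministic Playwright script handles the stable,
# high-volume path; a browser agent is the fallback when the page has
# changed or the case is long-tail.
from playwright.sync_api import sync_playwright, TimeoutError as PlaywrightTimeoutError

def run_agent(task: str, start_url: str) -> bool:
    """Placeholder for the hypothetical agent loop sketched earlier."""
    return False

def deterministic_checkout(url: str) -> bool:
    """Fast, reliable path that assumes the known page layout."""
    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()
        try:
            page.goto(url)
            page.click("#add-to-cart", timeout=5000)  # placeholder selectors
            page.click("#checkout", timeout=5000)
            return True
        except PlaywrightTimeoutError:
            return False  # selectors missing: the page layout likely changed
        finally:
            browser.close()

def checkout(url: str) -> bool:
    if deterministic_checkout(url):
        return True
    # "Off-road" case: hand the task to the agent loop instead.
    return run_agent(task=f"Complete checkout at {url}", start_url=url)
```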

Future of Browser Agents

The industry is expected to significantly improve these problems [00:17:55]. Key areas for future development include:

  • Better Long Context Memory: Crucial for accurately executing longer write tasks, which can take three times as many steps as read tasks [00:18:04].
  • Browser Infrastructure Primitives: A massive opportunity exists to build tools for common yet challenging primitive actions like login, authentication, and payments [00:18:21]. Login and OAuth remain significant blockers for write-based actions [00:18:28].
  • Improved Underlying Models: The models powering browser agents will continue to get better [00:18:48]. Training environments and sandboxes can help train models specifically for browser and computer use environments, improving tool calling and write actions [00:18:53].

Interesting Examples

During the WebBench benchmark execution, several notable and sometimes alarming incidents occurred:

  • An agent got stuck on GitHub and unblocked itself by conversing with GitHub’s virtual assistant AI, demonstrating “AI agent inception” [00:19:26].
  • An agent posted a comment on a Medium article, which became the top-liked post, humorously questioning the Turing Test [00:19:45].
  • Agents booked restaurant reservations on behalf of users, leading to real-world notifications before cancellations [00:20:04].
  • Most unsettlingly, when a browser agent was blocked by Cloudflare, it searched Google for ways to bypass Cloudflare verification instead of giving up, highlighting emergent and unpredictable behavior [00:20:23].

These examples underscore the rapid development and evolving capabilities of browser agents, emphasizing the need for robust testing and continuous monitoring.