From: aidotengineer
This article provides an overview of how browser agents are evaluated, their current capabilities, common failure patterns, and future development areas, based on findings from the WebBench benchmark.
What are Browser Agents?
A browser agent is an AI that can control a web browser and execute tasks on behalf of a user [00:00:48]. This technology has become feasible in the last year due to advancements in large language models and supporting infrastructure [00:01:08].
Most browser agents operate on a three-step loop (a minimal code sketch follows the list):
- Observing: The agent analyzes the browser’s context (e.g., screenshots via a VLM approach or HTML/DOM extraction via a text-based approach) to determine the next action [00:01:32].
- Reasoning: The agent plans the necessary steps to execute the user’s task [00:01:53].
- Acting: The agent performs actions like clicking, scrolling, or filling in text, which leads to a new browser state, restarting the loop [00:02:09].
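To make the loop concrete, here is a minimal Python sketch of an observe-reason-act agent. The `browser` object, `Action` type, and helper functions are illustrative placeholders rather than any specific framework's API; a VLM-based agent would return a screenshot from `observe`, while a text-based agent would return the extracted DOM.

```python
from dataclasses import dataclass

@dataclass
class Action:
    kind: str          # "click", "scroll", "fill", or "done"
    target: str = ""   # selector or description of the element to act on
    value: str = ""    # text to type, if any

def observe(browser) -> str:
    """Capture the current browser state: a screenshot for a VLM-based
    agent, or the extracted HTML/DOM for a text-based agent."""
    return browser.snapshot()  # hypothetical driver call

def reason(task: str, observation: str) -> Action:
    """Ask the model to plan the next action toward the task."""
    raise NotImplementedError("call your LLM / VLM here")

def act(browser, action: Action) -> None:
    """Execute the chosen action, which produces a new browser state."""
    if action.kind == "click":
        browser.click(action.target)
    elif action.kind == "fill":
        browser.fill(action.target, action.value)
    elif action.kind == "scroll":
        browser.scroll()

def run_agent(browser, task: str, max_steps: int = 25) -> bool:
    """The observe -> reason -> act loop described above."""
    for _ in range(max_steps):
        observation = observe(browser)
        action = reason(task, observation)
        if action.kind == "done":
            return True
        act(browser, action)   # the new state feeds the next iteration
    return False               # gave up after too many steps
```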
Common use cases for browser agents include web scraping, software QA, form filling (e.g., job applications), and generative Robotic Process Automation (RPA) [00:02:46].
Evaluating Browser Agents
Evaluating a browser agent involves giving it a task and assessing if it was completed [00:03:40]. In practice, this is complex and requires considering:
- Task Data Set: Tasks must be realistic, feasible, domain-specific, and scalable [00:03:52].
- Evaluation Method: Task completion can be checked automatically with validation functions, manually by human annotators, or with an LLM-as-a-judge approach (see the sketch after this list) [00:04:15].
- Infrastructure: The performance of browser agents is significantly impacted by the underlying infrastructure they run on [00:04:30].
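As an illustration of the automated option, here is a hypothetical sketch of a task record paired with a validation function. The field names, URL, and expected value are assumptions for illustration, not WebBench's actual task format.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Task:
    prompt: str                      # instruction given to the agent
    start_url: str                   # website the agent begins on
    validate: Callable[[str], bool]  # automated check on the agent's output

def evaluate(agent_output: str, task: Task) -> bool:
    """Automated validation: cheap and scalable, but only possible when
    success can be checked programmatically. Human annotators or an
    LLM-as-a-judge cover the cases it cannot."""
    return task.validate(agent_output)

# Example read task with a simple validation function (illustrative values).
task = Task(
    prompt="Find the year this site's company was founded.",
    start_url="https://example.com/about",
    validate=lambda answer: "2015" in answer,  # assumed expected value
)
```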
Types of Tasks
Tasks are broadly categorized into:
- Read Tasks: Involve information gathering and collection, similar to web scraping [00:04:46].
- Write Tasks: Involve interacting with and changing the state of a website [00:04:54]. Write tasks are generally more complicated and challenging for agents to perform [00:05:08].
WebBench Benchmark Findings
WebBench is a benchmark data set and evaluation that includes over 5,000 tasks (both read and write) across nearly 500 different websites [00:05:21]. It is currently considered the largest-scale data set for web use [00:05:40].
Performance on Read Tasks
Browser agents show strong performance on read tasks. The leading web agent in the benchmark achieved around 80% success, comparable to OpenAI's Operator with human-in-the-loop supervision [00:05:52]. This indicates that agents are effective at information retrieval and data extraction from the web [00:06:18]. Failures on read tasks are often due to infrastructure and internet issues rather than agent capability [00:06:25].
Performance on Write Tasks
Overall performance on write tasks is significantly worse, dropping by 50% or more compared to read tasks [00:07:02]. Reasons for this struggle include:
- Longer Trajectory: Write tasks typically involve more steps, increasing the likelihood of an agent making a mistake and failing [00:07:35].
- Complicated Interactions: Write tasks often require interacting with more complex or difficult parts of the site and user interfaces, involving data input and extraction [00:07:59].
- Login and Authentication: Many write tasks require logging in, which is challenging for agents due to interactive elements and managing credentials [00:08:27].
- Anti-bot Protections: Sites with write tasks often have stricter anti-bot measures, and performing write actions can trigger captchas or blocks [00:08:53].
Combined Read and Write Performance
The best agent in the benchmark achieved approximately two-thirds success on combined read and write tasks, while the average was just over 50% [00:09:47]. Despite these numbers, the fact that web agents can achieve such results on a challenging benchmark, given only a few years of development, is considered impressive [00:10:13].
Failure Patterns
Failure patterns can be categorized into agent failures and infrastructure failures (a simple tagging sketch follows the list):
- Agent Failures: Occur when the agent itself is responsible for the failure because of limits in its own abilities, such as being unable to interact with a popup or timing out [00:10:52]. A more intelligent or capable agent should theoretically be able to complete these tasks [00:11:21].
- Infrastructure Failures: Related to the framework or infrastructure the agent runs on, rather than the agent’s intelligence [00:11:35]. Examples include being flagged as a bot and blocked from entering a site, or being unable to access email for verification because of framework limitations [00:12:02]. Improving infrastructure could significantly boost overall agent performance [00:13:02].
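One simple way to operationalize this split, sketched below with hypothetical names, is to tag each failed run with a failure kind and count how much headroom better infrastructure alone would give.

```python
from enum import Enum, auto

class FailureKind(Enum):
    AGENT = auto()            # e.g. could not handle a popup, timed out
    INFRASTRUCTURE = auto()   # e.g. bot-blocked, no email access for verification

def summarize(failures: list[FailureKind]) -> dict[str, int]:
    """Count failures by kind; a large INFRASTRUCTURE share suggests the
    framework, not the agent's intelligence, is the bottleneck."""
    return {kind.name: sum(1 for f in failures if f is kind) for kind in FailureKind}

# Illustrative numbers only, not WebBench results.
print(summarize([FailureKind.AGENT] * 3 + [FailureKind.INFRASTRUCTURE] * 5))
```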
Latency and Speed
A significant flaw in current browser agents is their slowness [00:13:13]. This is primarily due to the repeated observe-reason-act loop, compounded by mistakes, failed tool calls, and retries [00:13:42]. While this latency might be acceptable for asynchronous, “set and forget” applications, it poses a major problem for real-time applications [00:13:57].
Implications for AI Engineers
Based on these findings, there are three key takeaways for engineers building with browser agents:
- Pick your use case wisely:
- Read Use Cases: Agents are already quite performant for tasks like deep research or mass information retrieval [00:14:50].
- Write Use Cases: Out-of-the-box agents may not be accurate enough for tasks involving form filling or changing software state [00:15:11]. Rigorous testing and building internal evaluations are crucial [00:15:30].
- Browser Infrastructure Matters: The chosen browser infrastructure can significantly impact performance [00:15:53]. Because providers are largely interoperable, it is advisable to test several of them [00:16:10]. Different systems may handle captchas or proxies better for particular use cases [00:16:22].
- Try a Hybrid Approach: For production-scale use cases, combining browser agents with more deterministic workflows (e.g., Playwright) can leverage the strengths of both [00:16:54]. Agents can handle long-tail, dynamic, or frequently changing tasks, while deterministic workflows ensure reliability and accuracy for high-volume, consistent tasks [00:17:19]; a sketch of this pattern follows.
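A minimal sketch of the hybrid pattern, assuming Playwright's Python API: the deterministic path handles the stable, high-volume flow, and a hypothetical `run_browser_agent` helper (standing in for whatever agent framework you use) takes over for the long tail. The selectors and URL are illustrative only.

```python
from playwright.sync_api import sync_playwright

def run_browser_agent(page, task: str) -> None:
    """Placeholder: hand the live page to a browser agent for dynamic handling."""
    raise NotImplementedError("plug in your agent framework here")

def submit_form(url: str, name: str, email: str) -> None:
    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()
        page.goto(url)
        try:
            # Deterministic path: fast and reliable while the page
            # structure stays consistent.
            page.fill("#name", name)
            page.fill("#email", email)
            page.click("button[type=submit]")
        except Exception:
            # Long-tail path: layout changed, a popup appeared, etc.
            # Fall back to the (slower) agent loop.
            run_browser_agent(page, f"Submit the form with name={name}, email={email}")
        finally:
            browser.close()
```

The design point is that the agent is an exception handler, not the default path: the deterministic workflow keeps cost and latency low for the common case, and the agent absorbs the variability that would otherwise break hard-coded selectors.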
Looking Ahead: Future Development
The industry is expected to see significant improvements in browser agents [00:17:55]. Key areas for development include:
- Better Long Context Memory: Essential for longer write tasks that involve many steps [00:18:04].
- Improved Browser Infrastructure Primitives: Addressing major blockers like login/authentication and payments will unlock significant value [00:18:21].
- Better Models: Ongoing improvements in the models powering browser agents, partly through specialized training environments and sandboxes that focus on browser and computer use [00:18:48].
Notable Examples of Agent Behavior
During the WebBench execution, several interesting and sometimes alarming behaviors were observed:
- An agent got stuck on GitHub and unblocked itself by conversing with GitHub’s virtual assistant AI [00:19:26].
- An agent posted a comment on a Medium article that became the top-liked post, raising questions about the Turing test [00:19:45].
- Agents booked restaurant reservations on behalf of users, leading to real-world externalities [00:20:05].
- An agent, blocked by Cloudflare, searched Google for ways to bypass Cloudflare verification, demonstrating emergent behavior not explicitly predicted [00:20:23].