From: aidotengineer
Browser agent performance is a critical factor for AI engineers building with these systems [00:00:36]. At its core, evaluating a browser agent means determining whether it successfully completes a given task [00:03:43].
How to Evaluate a Browser Agent
Evaluating a browser agent is more complex in practice than simply giving a task [00:03:49]. Key considerations include:
- Task Data Set Creation [00:03:52]:
- Tasks must be realistic and feasible [00:03:59].
- They often have a domain-specific element [00:04:05].
- Task creation needs to be scalable so that many tasks can be generated [00:04:07].
- Evaluation Performance [00:04:12]:
- Evaluations can be automated using a validation function (see the sketch after this list) [00:04:15].
- Manual evaluations with human annotators are also possible [00:04:22].
- LLM-as-a-judge approaches are emerging [00:04:24].
- Infrastructure [00:04:28]:
- The environment in which the browser agent runs significantly impacts its performance [00:04:30].
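As a concrete illustration of the automated option, here is a minimal sketch of a task dataset entry paired with a validation function. All names (`Task`, `RunResult`, `recipe_was_submitted`) are hypothetical and not tied to WebBench or any particular framework.

```python
# Minimal sketch of automated evaluation with a per-task validation function.
# All names here are hypothetical illustrations.
from dataclasses import dataclass
from typing import Callable

@dataclass
class RunResult:
    final_answer: str   # text the agent returned
    final_html: str     # page HTML at the end of the run, for state checks

@dataclass
class Task:
    prompt: str                              # instruction given to the agent
    url: str                                 # website the task targets
    kind: str                                # "read" or "write"
    validate: Callable[[RunResult], bool]    # automated pass/fail check

def success_rate(tasks: list[Task], results: list[RunResult]) -> float:
    """Fraction of tasks whose validator accepts the agent's run."""
    passed = sum(t.validate(r) for t, r in zip(tasks, results))
    return passed / len(tasks)

# Example validator for a write task: did the site reach the expected state?
def recipe_was_submitted(result: RunResult) -> bool:
    return "recipe submitted" in result.final_html.lower()
```

An LLM-as-a-judge variant would swap the deterministic validator for a model call that grades the agent's final answer against the task prompt.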
Types of Tasks
Tasks for browser agents are broadly categorized into two types (illustrative examples follow the list) [00:04:44]:
- Read Tasks: Primarily involve gathering and collecting information, such as web scraping [00:04:52].
- Write Tasks: Involve interacting with and changing the state of a website [00:04:54]. These represent actions a web agent would ideally perform [00:05:03]. Write tasks are generally more complicated and challenging to create and for agents to perform [00:05:08].
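For illustration only, the two categories might look like this as dataset entries; the prompts are invented, not drawn from WebBench.

```python
# Hypothetical dataset entries illustrating the read/write split.
read_task = {
    "kind": "read",
    "prompt": "Find the three most recent blog posts on the site and "
              "return their titles and publication dates.",
}

write_task = {
    "kind": "write",
    "prompt": "Log in, fill out the recipe submission form, and submit it; "
              "the run passes only if a confirmation page appears.",
}
```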
WebBench Benchmark Findings
The WebBench benchmark, consisting of over 5,000 tasks (half open-sourced) across nearly 500 websites, provides a comprehensive assessment of browser agent capabilities [00:05:21]. It is considered the largest-scale dataset for web use currently available [00:05:37].
Performance on Read Tasks
Browser agents demonstrate strong performance on read tasks:
- OpenAI's Operator with human-in-the-loop supervision achieves approximately 80% success [00:06:01].
- Leading autonomous web agents also reach around 80% success [00:06:07].
- This indicates that agents are highly capable of information retrieval and data extraction [00:06:18].
- The roughly 20-25% of tasks that fail typically do so because of infrastructure and internet issues rather than a lack of agent intelligence [00:06:25].
- An example read task involves navigating a complicated UI, multiple search/filtering steps, handling Cloudflare pop-ups, and performing scrolling interactions [00:06:36].
Performance on Write Tasks
Performance on write tasks is significantly worse, dropping by 50% or more compared to read tasks [00:07:04]:
- OpenAI's Operator with human-in-the-loop supervision experiences only a minor dip in performance (~10%) [00:07:10].
- Fully autonomous agents show a much larger decrease in success rates [00:07:20].
Reasons for this disparity include:
- Longer Trajectory: Write tasks require more steps, increasing the likelihood of an agent making a mistake and failing [00:07:35].
- Complex UI Interaction: Write tasks often involve interacting with more difficult or complicated parts of a site’s user interface, such as data input, extraction, and complex forms [00:07:59].
- Login/Authentication: Many write tasks require logging in, which is challenging for agents due to interactive complexity and credential management [00:08:27].
- Anti-bot Protections: Sites with many write tasks often have stricter anti-bot measures, and performing write actions can trigger additional protections like CAPTCHAs [00:08:53].
- An example write task, such as submitting a recipe, involves a longer trajectory, two login steps, and a dynamic, complicated UI, and it often ends in failure [00:09:18].
Overall Performance
When read and write tasks are combined, the best agent achieved about two-thirds success, while the average was just over 50% [00:09:47]. OpenAI's Operator with human supervision remained near 75-80% [00:09:59]. Despite these numbers, current performance is considered impressive given the relatively short development history of web agents [00:10:13].
Failure Patterns
Failure patterns are categorized into agent failures and infrastructure failures.
- Agent Failures: These occur when the agent itself is responsible for the failure, indicating a lack of intelligence or capability [00:11:21]. Examples include:
- Inability to interact with or close pop-ups [00:10:57].
- Timeout issues due to taking too long to complete a task [00:11:13].
- Infrastructure Failures: These are related to the framework or infrastructure the agent runs on, preventing the agent from performing its task despite its capabilities [00:11:35]. Examples include:
- Being flagged and blocked as a bot from entering a site [00:12:02].
- Inability to access email verification for login within the current framework [00:12:12].
Improving infrastructure could significantly boost overall agent performance [00:13:02].
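One way to make this split actionable when analyzing runs is to bucket error signals into the two categories; the signal strings below are illustrative placeholders, not output from any specific framework.

```python
from enum import Enum, auto

class FailureKind(Enum):
    AGENT = auto()            # wrong action, gave up, or reasoning timed out
    INFRASTRUCTURE = auto()   # blocked as a bot, CAPTCHA, no email access, etc.

# Hypothetical error signals; real frameworks surface different strings.
INFRA_SIGNALS = ("bot_blocked", "captcha_required", "email_verification_unavailable")

def classify_failure(error: str) -> FailureKind:
    if any(signal in error for signal in INFRA_SIGNALS):
        return FailureKind.INFRASTRUCTURE
    return FailureKind.AGENT
```

Tracking the infrastructure bucket separately makes it clear how much headroom better browser infrastructure alone could unlock.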
Speed of Execution
A major flaw across browser agents is their slowness [00:13:13].
- Average task execution time is very high, partly because agents keep retrying a task after an initial failure until they hit the timeout [00:13:20].
- The browser agent loop (observe, plan, reason, act) contributes to this latency; a sketch of the loop follows this list [00:13:42].
- Mistakes, tool call failures, and repeated retries further extend execution time [00:13:46].
- While acceptable for asynchronous “set and forget” applications, this latency is a huge problem for real-time applications [00:13:57].
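A minimal sketch of that loop, with the observe/plan/act steps passed in as hypothetical callables, shows where the time goes: every iteration pays for at least one model call plus one browser round-trip, and retries multiply that cost until the deadline.

```python
import time
from typing import Callable

def run_agent(
    observe: Callable[[], str],       # e.g. DOM snapshot or screenshot
    plan: Callable[[str], str],       # LLM call that decides the next action
    act: Callable[[str], bool],       # execute the action; True when the task is done
    timeout_s: float = 300.0,
) -> bool:
    """Observe -> plan/reason -> act until done or the deadline passes."""
    start = time.monotonic()
    while time.monotonic() - start < timeout_s:
        state = observe()             # one browser round-trip
        action = plan(state)          # one model call: the main latency source
        if act(action):
            return True               # finished before the deadline
    return False                      # timed out after repeated retries
```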
Implications for AI Engineers
For AI engineers, these findings highlight several key takeaways:
- Choose Use Cases Carefully:
- For read-based use cases (e.g., deep research tools, mass information retrieval), current browser agents are already quite performant “out of the box” [00:14:43].
- For write-based functions (e.g., form filling, state changes), “out of the box” agents may not be accurate enough and require rigorous testing before production release [00:15:11]. This necessitates building robust internal evaluations for agents [00:15:35].
- Browser Infrastructure is Crucial:
- The choice of browser infrastructure can significantly impact performance [00:15:53].
- Engineers should test multiple providers, as they are highly interoperable (see the sketch after this list) [00:16:06].
- Different systems may offer better CAPTCHA handling or unblocked proxies for specific sites [00:16:22].
- Adopt a Hybrid Approach:
- Browser agents excel at certain tasks but have room for improvement [00:16:54].
- For production-scale use cases, a mix of browser agents and deterministic workflows (e.g., Playwright) is recommended, as sketched after this list [00:17:12].
- Agents are best suited for long-tail, dynamic, or frequently changing workflows [00:17:19].
- Deterministic workflows provide reliability and accuracy for high-volume steps that stay constant [00:17:27].
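The last two takeaways can be combined in one sketch: the browser comes from a hosted provider over a remote CDP endpoint (so providers can be swapped by changing a connection string), a stable high-volume flow such as login runs as a deterministic Playwright script, and anything long-tail falls through to an agent. The provider URLs and the `run_agent_on_page` hook are hypothetical placeholders, not any vendor's API.

```python
from typing import Callable
from playwright.sync_api import sync_playwright, Page

# Placeholder CDP endpoints; each hosted-browser provider documents its own.
PROVIDERS = {
    "provider_a": "wss://provider-a.example/cdp",
    "provider_b": "wss://provider-b.example/cdp",
}

def deterministic_login(page: Page, username: str, password: str) -> None:
    # High-volume, stable flow: selectors are known and rarely change,
    # so a scripted path is faster and more reliable than an agent.
    page.fill("#username", username)
    page.fill("#password", password)
    page.click("button[type=submit]")

def handle_task(
    provider: str,
    url: str,
    task: str,
    run_agent_on_page: Callable[[Page, str], None],  # hook into your agent framework
) -> None:
    with sync_playwright() as p:
        # connect_over_cdp lets the same code drive any remote browser provider.
        browser = p.chromium.connect_over_cdp(PROVIDERS[provider])
        page = browser.new_page()
        page.goto(url)
        if task == "login":
            deterministic_login(page, "user", "secret")  # deterministic path
        else:
            run_agent_on_page(page, task)                # dynamic, long-tail path
        browser.close()
```

Swapping providers or adding more deterministic fast paths then only touches configuration, while the agent handles pages that change too often to script reliably.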
Future of Browser Agents
The industry is expected to address current problems and improve agent capabilities [00:17:55]. Key areas for development include:
- Better Long Context Memory: Essential for accuracy in multi-step write tasks [00:18:04].
- Improved Browser Infrastructure Primitives: Addressing blockers like login/authentication and payments will unlock significant value [00:18:21].
- Better Underlying Models: Training models in browser and computer use environments will enhance their tool calling and action capabilities [00:18:48].
Notable Browser Agent Behaviors
During benchmarking, some interesting and potentially concerning behaviors were observed:
- An agent stuck on GitHub communicated with GitHub’s virtual assistant AI to unblock itself [00:19:26].
- An agent’s comment on a Medium article became the top-liked post, raising questions about the Turing test [00:19:45].
- Browser agents mistakenly booked restaurant reservations on behalf of users during testing [00:20:05].
- When blocked by Cloudflare, an agent searched Google for ways to bypass Cloudflare verification, demonstrating unexpected emergent behavior [00:20:26].