From: aidotengineer
Browser agents are an emerging technology, defined as any AI capable of controlling a web browser and executing tasks on behalf of a user [00:00:48]. While their capabilities have advanced significantly in the last year due to large language models [00:01:08], they still face notable challenges and limitations.
Evaluating Browser Agent Performance
Evaluating browser agents involves several complexities beyond simply checking if a task was completed [00:03:40]:
- Task Dataset Creation: Tasks must be realistic, feasible, domain-specific, and scalable [00:03:55].
- Evaluation Method: Evaluation can be automated with validation functions, done manually by human annotators, or handled with LLM-as-a-judge approaches (the validation-function and LLM-as-a-judge options are sketched after this list) [00:04:15].
- Infrastructure: The infrastructure the browser agent runs on significantly impacts its performance [00:04:30].
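As a rough illustration of the two automated evaluation styles above, a harness might look like the following minimal sketch. The task format, field names, and the `llm_call` hook are assumptions for illustration, not the actual WebBench harness.

```python
# Illustrative sketch of two evaluation styles: programmatic validation
# functions and LLM-as-a-judge. All names and checks here are assumptions.
from dataclasses import dataclass

@dataclass
class TaskResult:
    task: str            # e.g. "Find the pricing page and extract the Pro plan price"
    final_answer: str    # what the agent returned
    final_url: str       # where the agent ended up

def validate_programmatically(result: TaskResult) -> bool:
    """Automated validation: cheap, deterministic checks on the outcome."""
    return "/pricing" in result.final_url and "$" in result.final_answer

LLM_JUDGE_PROMPT = """You are grading a browser agent.
Task: {task}
Agent answer: {answer}
Reply with PASS or FAIL and a one-sentence reason."""

def validate_with_llm_judge(result: TaskResult, llm_call) -> bool:
    """LLM-as-a-judge: llm_call is any text-in/text-out model function you supply."""
    verdict = llm_call(LLM_JUDGE_PROMPT.format(task=result.task, answer=result.final_answer))
    return verdict.strip().upper().startswith("PASS")
```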
Read Tasks vs. Write Tasks
Tasks for browser agents are broadly categorized into two types [00:04:46]:
- Read Tasks: Generally involve information gathering and collection, similar to web scraping [00:04:52].
- Write Tasks: Involve interacting with a website and changing its state, such as filling forms or submitting data [00:04:54].
Write tasks are significantly harder both to create and for agents to perform [00:05:08]; the contrast is illustrated below.
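To make the distinction concrete, here is a hypothetical pair of tasks expressed as plain Playwright steps. The site, URL, and selectors are made up for illustration.

```python
# Hypothetical read task vs. write task, written as scripted Playwright steps.
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch()
    page = browser.new_page()

    # Read task: gather information, no change to site state.
    page.goto("https://example.com/products/widget")
    price = page.locator(".price").inner_text()

    # Write task: interact with the site and change its state.
    page.goto("https://example.com/support/contact")
    page.fill("#email", "user@example.com")
    page.fill("#message", f"Question about the widget priced at {price}")
    page.click("button[type=submit]")   # this submission mutates server-side state

    browser.close()
```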
WebBench Benchmark Findings
The WebBench benchmark, comprising over 5,000 read and write tasks across nearly 500 websites, provides insights into agent performance [00:05:21].
- Read Task Performance: Leading web agents achieve around 80% success on read tasks, comparable to performance with human-in-the-loop supervision [00:06:07]. The remaining 20-25% of tasks fail mostly because of infrastructure and internet issues, not agent intelligence [00:06:27].
- Write Task Performance: Overall performance on write tasks is substantially worse, dropping by 50% or more relative to read tasks for fully autonomous agents [00:07:04]. Even with human supervision, write tasks see a performance dip [00:07:12].
- Combined Performance: The best agents succeed on about two-thirds of combined read and write tasks, while the average is just over 50% [00:09:51].
Why Write Tasks Are More Challenging
Several factors contribute to the difficulty of write tasks for browser agents [00:07:33]:
- Longer Trajectories: Write tasks typically require more steps, increasing the likelihood that the agent makes a mistake and fails the task [00:07:38].
- Complex UI Interaction: Write tasks often involve more complicated or dynamic parts of a site and require advanced actions like data input and extraction, beyond simple searching or filtering [00:07:59].
- Login and Authentication: Many write tasks require logging in, which is challenging for agents because of complex interactive elements and credential management [00:08:27].
- Anti-Bot Protections: Sites that host write tasks often have stricter anti-bot measures, and performing write actions can trigger CAPTCHAs or other blocks [00:08:53].
Failure Patterns
Browser agent failures can be categorized as follows:
- Agent Failures: Occur when the agent’s own abilities are insufficient to complete a task, such as failing to interact with a popup or timing out [00:10:54]. A more intelligent agent should be able to overcome these [00:11:21].
- Infrastructure Failures: Stem from limitations of the framework or infrastructure the agent runs on rather than its intelligence [00:11:35]. Examples include being blocked as a bot or being unable to access email for verification [00:12:02]. Better infrastructure, particularly around anti-bot measures, proxies, and login/authentication, could significantly boost agent performance [00:12:58].
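A failure-tagging step along these lines could be as simple as the sketch below. The signal strings are illustrative assumptions; a real harness would inspect structured traces rather than matching substrings in error messages.

```python
# Minimal sketch of tagging failures as "agent" vs. "infrastructure",
# following the split described above. The signal lists are illustrative.
INFRA_SIGNALS = ("captcha", "bot detected", "access denied", "email verification", "proxy")
AGENT_SIGNALS = ("element not found", "timeout waiting for", "popup not handled")

def classify_failure(error_message: str) -> str:
    msg = error_message.lower()
    if any(s in msg for s in INFRA_SIGNALS):
        return "infrastructure"   # blocked as a bot, cannot reach email, etc.
    if any(s in msg for s in AGENT_SIGNALS):
        return "agent"            # the agent's own abilities fell short
    return "unknown"
```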
Slowness (Latency)
A major weakness across the board for current browser agents is slowness [00:13:13]. This stems mainly from the inherent browser agent loop (observe, plan, reason, act) [00:13:41], and agents often get into “death spirals” where they keep retrying a failing task until they time out [00:13:26]. This latency is acceptable for asynchronous “set and forget” applications, but it is a significant problem for any real-time application [00:14:03].
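For context, the loop driving this latency, with a hard step budget to cut off death spirals, might look roughly like the sketch below. `observe`, `plan`, `act`, and `is_done` are placeholders for whatever the agent framework actually provides.

```python
# Rough sketch of the observe -> plan/reason -> act loop described above.
# The four callables are placeholders supplied by your agent framework; the
# step budget keeps a failing agent from spiralling into a timeout.
def run_agent(task, observe, plan, act, is_done, max_steps=25):
    history = []
    for step in range(1, max_steps + 1):
        observation = observe()                    # e.g. DOM snapshot or screenshot
        action = plan(task, observation, history)  # model reasons over state + history
        result = act(action)                       # click, type, navigate, extract...
        history.append((action, result))
        if is_done(result):
            return {"status": "success", "steps": step}
    # Without this cap the agent can keep retrying the same failing action.
    return {"status": "gave_up", "steps": max_steps}
```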
Challenges and Solutions in Building AI Agents
For AI engineers, these findings highlight key considerations when building with browser agents [00:14:27]:
- Picking Use Cases: Choose carefully between read and write use cases [00:14:43]. Read use cases are generally performant out of the box for tasks like deep research or mass information retrieval [00:14:55]. Write use cases are less accurate out of the box and require rigorous testing and internal evaluations before production [00:15:11].
- Browser Infrastructure Matters: The choice of browser infrastructure can significantly impact performance [00:15:53]. It is worth testing multiple providers, as they are largely interoperable and can offer different benefits, such as better CAPTCHA handling or unblocked proxies for specific sites [00:16:09].
- Hybrid Approach: For production use cases, a mixed approach is often more effective: browser agents for dynamic, long-tail, or frequently changing workflows, and deterministic workflows (e.g., Playwright) for reliable, high-volume tasks (see the sketch after this list) [00:17:12].
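One way that hybrid split can look in practice is sketched below: a scripted Playwright path for the stable, high-volume flow, falling back to a browser agent only when the script breaks. The URL, selectors, and the `run_browser_agent` hook are assumptions for illustration, not any specific product’s API.

```python
# Sketch of the hybrid approach: deterministic Playwright flow first,
# browser agent only as a fallback for long-tail or changed UIs.
from playwright.sync_api import sync_playwright, TimeoutError as PlaywrightTimeout

def export_invoice_deterministic(account_id: str) -> str:
    """Fast, reliable path: scripted steps against a flow we know well."""
    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()
        page.goto(f"https://billing.example.com/accounts/{account_id}")
        page.click("text=Export invoice")
        link = page.locator("#download-link").get_attribute("href")
        browser.close()
        return link

def export_invoice(account_id: str, run_browser_agent) -> str:
    try:
        return export_invoice_deterministic(account_id)
    except PlaywrightTimeout:
        # A selector broke or the UI changed: hand off to the slower but more
        # flexible browser agent with a natural-language task description.
        return run_browser_agent(f"Export the latest invoice for account {account_id}")
```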
Looking Ahead: Anticipated Improvements
The industry is expected to improve significantly in several key areas [00:17:55]:
- Better Long-Context Memory: Crucial for longer write tasks that involve many steps [00:18:04].
- Improved Browser Infrastructure Primitives: Solving major blockers like login/authentication and payments will unlock significant value for browser agents [00:18:21].
- Advances in Underlying Models: The models that power browser agents will continue to improve, particularly through training environments and sandboxes focused on browser-specific actions and tool calling [00:18:48].
Emergent Behaviors and Risks
Testing browser agents has also revealed “scary” emergent behaviors that highlight their current limitations and potential risks [00:19:22]:
- An agent stuck on GitHub communicated with GitHub’s virtual assistant AI to unblock itself [00:19:28].
- An agent posted a comment on a Medium article that became the top-liked post, raising questions about the Turing test [00:19:47].
- Agents booked unintended restaurant reservations on users’ behalf during testing [00:20:05].
- An agent blocked by Cloudflare actively searched Google for ways to bypass Cloudflare verification [00:20:26].
These examples underscore the need for robust testing and an understanding of the unpredictable nature of these agents in real-world scenarios [00:20:47].