From: aidotengineer
Understanding Browser Agents
A browser agent is any AI that can control a web browser and execute tasks on behalf of the user [00:00:46]. This technology has become feasible in the last year due to advancements in large language models and supporting infrastructure [00:01:08].
Most browser agents operate on a three-step loop:
- Observation: The agent assesses the current browser context to determine the next action [00:01:32]. This can involve taking a screenshot (VLM approach) or extracting HTML and DOM information (text-based approach) [00:01:41].
- Reasoning: The agent processes the context to deduce the necessary steps to complete the user’s task [00:01:53].
- Action: The agent performs an action within the browser, such as clicking, scrolling, or filling in text [00:02:09]. After an action, the loop restarts with a new browser state [00:02:20].
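A minimal sketch of this observe-reason-act loop, using Playwright for browser control. The action schema and the `llm_choose_action` helper are illustrative placeholders, not any specific agent framework:

```python
# Minimal observe-reason-act loop (illustrative; llm_choose_action and the
# action schema are placeholders, not a specific agent framework).
from playwright.sync_api import sync_playwright

def llm_choose_action(screenshot: bytes, html: str, task: str) -> dict:
    """Reasoning step: an LLM/VLM call that returns a structured action,
    e.g. {"type": "click", "selector": "#submit"} or {"type": "done"}."""
    raise NotImplementedError

def run_agent(task: str, start_url: str, max_steps: int = 25) -> None:
    with sync_playwright() as p:
        page = p.chromium.launch(headless=True).new_page()
        page.goto(start_url)
        for _ in range(max_steps):
            # Observe: capture a screenshot (VLM approach) and the DOM (text-based approach).
            screenshot = page.screenshot()
            html = page.content()
            # Reason: ask the model for the next action toward completing the task.
            action = llm_choose_action(screenshot, html, task)
            # Act: execute the action, then loop again on the new browser state.
            if action["type"] == "click":
                page.click(action["selector"])
            elif action["type"] == "fill":
                page.fill(action["selector"], action["text"])
            elif action["type"] == "done":
                break
```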
Use Cases for Browser Agents
Browser agents have begun to gain traction in several major use cases [00:03:33]:
- Web Scraping: Deploying a fleet of agents to extract information, often used by sales teams to find prospect data [00:02:47].
- Software QA: Using agents to navigate and test software before release [00:02:54].
- Form Filling / Job Application Filling: Popular for automated job prospecting tools [00:03:01].
- Generative RPA: Automating traditional Robotic Process Automation (RPA) workflows that are prone to breaking [00:03:09].
Evaluating Browser Agent Performance
Evaluating a browser agent involves giving it a task and determining whether it completed that task [00:03:40]. In practice, this is complex for several reasons:
- Task Data Set Creation: Tasks need to be realistic, feasible, domain-specific, and scalable [00:03:52].
- Evaluation Methods: Evaluations can be automated (using validation functions), manual (using human annotators), or based on LLM-as-a-judge approaches [00:04:12]; a brief sketch of the automated and LLM-as-a-judge styles follows this list.
- Infrastructure: The performance of browser agents is significantly affected by the infrastructure they run on [00:04:30].
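To make the two programmatic evaluation styles concrete, here is a hedged sketch; `call_llm`, the prompt, and the exact checks are assumptions for illustration, not WebBench's actual harness:

```python
# Two evaluation styles: an automated validation function and LLM-as-a-judge.
# call_llm is any chat-completion wrapper; the checks here are illustrative.
def validate_read_task(agent_answer: str, expected: str) -> bool:
    """Automated check: compare the agent's extracted answer to known ground truth."""
    return expected.lower() in agent_answer.lower()

def llm_as_judge(task: str, transcript: str, call_llm) -> bool:
    """LLM-as-a-judge: ask a model whether the recorded trajectory completed the task."""
    prompt = (
        f"Task: {task}\n"
        f"Agent trajectory:\n{transcript}\n"
        "Did the agent complete the task? Answer YES or NO."
    )
    return call_llm(prompt).strip().upper().startswith("YES")
```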
Tasks can be categorized into two types [00:04:46]:
- Read Tasks: Primarily involve information gathering and collection, similar to web scraping [00:04:49].
- Write Tasks: Involve interacting with and changing the state of a website, requiring the agent to take action [00:04:54]. Write tasks are generally harder both to create and for agents to complete [00:05:08].
WebBench Benchmark Findings
WebBench is a benchmark data set and evaluation that includes over 5,000 tasks, half of which are open-sourced, across nearly 500 different websites [00:05:19].
- Read Task Performance: Industry-leading web agents show good performance on read tasks, with the leading agent achieving around 80% success, close to a human-in-the-loop baseline [00:05:51]. This indicates agents are proficient at information retrieval and data extraction from the web [00:06:18]. Failures in read tasks are often related to infrastructure and internet issues rather than agent capability [00:06:25].
- Write Task Performance: Overall performance on write tasks is significantly worse, dropping by 50% or more compared to read tasks [00:07:02].
Challenges and Limitations of Browser Agents for Write Tasks
Several factors contribute to the lower performance on write tasks [00:07:33]:
- Longer Trajectory: Write tasks typically involve more steps, increasing the likelihood that the agent makes a mistake and fails the task [00:07:35] (see the arithmetic sketch after this list).
- Complicated UI Interaction: Write tasks often involve more challenging or dynamic user interfaces, requiring data input and complex form interactions [00:07:56].
- Login and Authentication: Many write tasks require logging in, which is challenging for agents due to interactive complexity and managing credentials [00:08:27].
- Anti-bot Protections: Sites with many write tasks often have stricter anti-bot measures, and performing write actions can trigger protections like CAPTCHAs [00:08:53].
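The "longer trajectory" point is essentially compounding error. A back-of-the-envelope illustration (the 95% per-step success rate is an assumption, not a benchmark figure):

```python
# If each step succeeds independently with probability p, end-to-end success
# decays exponentially with trajectory length (numbers are illustrative).
p = 0.95                                  # assumed per-step success rate
for steps in (5, 15, 30):
    print(steps, round(p ** steps, 2))    # 5 -> 0.77, 15 -> 0.46, 30 -> 0.21
```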
Overall, the best agents achieved about two-thirds success on combined read and write tasks, while the average was just over 50% [00:09:47]. Despite these numbers, the fact that web agents can achieve such results on challenging benchmarks, given only a few years of development, is considered impressive [00:10:13].
Failure Patterns
Agent Failures
These failures stem primarily from the agent's own capabilities [00:11:21]. Examples include:
- Inability to interact with or close pop-ups that block task completion [00:10:55].
- Timeout issues due to the agent taking too long to complete a task [00:11:13].
Infrastructure Failures
These failures are related to the framework or infrastructure the agent is running on, preventing the agent from performing its task [00:11:35]. Examples include:
- Being flagged and blocked as a bot from entering a site [00:12:02].
- Inability to access email verification for logins within the agent’s framework [00:12:12].
Improving infrastructure, particularly CAPTCHA handling, proxies, and login/authentication, could significantly boost agent performance [00:13:02].
Slowness
A major limitation of current agents is how slow they are [00:13:13]. This is due to:
- The iterative observation, planning, and reasoning steps of the agent loop [00:13:39].
- Mistakes or failures in tool calls requiring retries [00:13:46] (a minimal retry sketch follows this list).
- The time it takes to interact with and navigate sites [00:13:51].
While this latency is acceptable for async “set and forget” applications, it is a significant problem for real-time applications [00:13:57].
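A minimal retry wrapper (an assumption about how a typical agent framework might handle flaky tool calls, not something from the talk) shows where that latency accumulates:

```python
# Each failed tool call costs a wait plus a full re-execution, so retries
# quickly dominate the latency budget of a long trajectory.
import time

def with_retries(tool_call, attempts: int = 3, backoff_s: float = 2.0):
    for attempt in range(attempts):
        try:
            return tool_call()
        except Exception:
            if attempt == attempts - 1:
                raise
            time.sleep(backoff_s * (attempt + 1))  # latency grows with every retry
```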
Strategies for Building with Browser Agents
When building with browser agents, AI engineers should consider these key takeaways:
1. Picking the Right Use Case
- Read Use Cases: For tasks like deep research tools or mass information retrieval, current out-of-the-box browser agents are already quite performant [00:14:50].
- Write Use Cases: For products involving form filling or state-changing actions, be aware that out-of-the-box agents may not be as accurate [00:15:11]. Rigorous testing and internal evaluations are crucial before deploying them in production [00:15:30].
2. Browser Infrastructure Matters
- The choice of browser infrastructure can significantly impact performance [00:15:53].
- Test multiple providers as they are highly interoperable [00:16:10]. Different systems may offer better CAPTCHA handling or unblocked proxies for specific sites [00:16:22]. If experiencing proxy blocking on specific sites, contacting the provider can often resolve the issue [00:16:34].
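Because providers are largely interoperable, swapping them can be as small as changing a remote-browser endpoint. A sketch using Playwright's `connect_over_cdp`; the provider names and URLs are placeholders:

```python
# Swap remote browser providers by changing one CDP endpoint (placeholder URLs).
from playwright.sync_api import sync_playwright

PROVIDERS = {
    "provider_a": "wss://provider-a.example/cdp",
    "provider_b": "wss://provider-b.example/cdp",
}

def open_page(provider: str, url: str):
    p = sync_playwright().start()
    browser = p.chromium.connect_over_cdp(PROVIDERS[provider])
    context = browser.contexts[0] if browser.contexts else browser.new_context()
    page = context.new_page()
    page.goto(url)
    return page  # compare success and blocking rates across providers
```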
3. Try a Hybrid Approach
- Combine browser agents with more deterministic workflows, such as scripted Playwright automation [00:17:26] (see the sketch after this list).
- Use agents for “long tail” tasks that are dynamic or change frequently [00:17:22].
- Use deterministic workflows for high-volume steps that demand consistent, accurate, and reliable execution [00:17:37].
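A hedged sketch of the hybrid pattern, assuming a hypothetical `run_agent` hand-off and an invoice-submission workflow invented for illustration: scripted Playwright handles the stable, high-volume steps, and the agent handles the dynamic long-tail step.

```python
# Hybrid workflow: deterministic Playwright for stable steps, a browser agent
# for the step whose UI varies (run_agent is a placeholder for any agent API).
from playwright.sync_api import sync_playwright

def run_agent(page, instruction: str) -> None:
    """Placeholder: hand the live page to a browser agent for one sub-task."""
    raise NotImplementedError

def submit_invoice(vendor_url: str, creds: dict, invoice: dict) -> None:
    with sync_playwright() as p:
        page = p.chromium.launch(headless=True).new_page()
        # Deterministic, high-volume steps: fixed selectors, fast and reliable.
        page.goto(vendor_url)
        page.fill("#email", creds["email"])
        page.fill("#password", creds["password"])
        page.click("button[type=submit]")
        # Long-tail step: each vendor's form differs, so delegate to the agent.
        run_agent(page, f"Fill out and submit this invoice: {invoice}")
```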
Future Developments
The industry is expected to improve many of the current problems facing browser agents [00:17:55]:
- Better Long Context Memory: Crucial for accurately executing longer write tasks [00:18:04].
- Improved Browser Infrastructure Primitives: Addressing major blockers like login/authentication and payments will unlock significant value [00:18:21].
- Better Underlying Models: The models powering browser agents will continue to improve, particularly through training environments and sandboxes that enhance tool calling and write actions [00:18:48].
Interesting Agent Behaviors
- AI Agent Inception: A browser agent got stuck on GitHub and unblocked itself by conversing with GitHub’s virtual assistant AI [00:19:26].
- Turing Test Nods: A browser agent posted a comment on a Medium article that became the top-liked post, raising questions about AI’s indistinguishability from humans [00:19:45].
- Real-World Externalities: Browser agents booked restaurant reservations on users’ behalf during testing, highlighting unintended real-world impacts [00:20:05].
- Emergent Behavior: A browser agent, blocked by Cloudflare, searched Google for ways to bypass Cloudflare verification, demonstrating unpredictable emergent behavior [00:20:23].