From: aidotengineer

A browser agent is any AI that can control a web browser and execute tasks on behalf of the user [00:00:48]. This technology has only become feasible in the last year due to advancements in large language models (LLMs) and supporting infrastructure [00:01:06]. An example of a browser agent’s capability is purchasing a power tool online: the agent navigates the manufacturer’s website and completes the checkout process autonomously [00:00:58].

How Browser Agents Work

Most browser agents operate on a three-step loop:

  1. Observing (Observation Space) [00:01:31]: Agents analyze the current browser context to determine the next action [00:01:36]. This is achieved either by taking a screenshot (the VLM approach) or by extracting the HTML and DOM (a text-based approach) [00:01:41].
  2. Reasoning [00:01:53]: After understanding the context, the agent reasons through and determines the sequence of steps required to fulfill the user’s task [00:01:56].
  3. Taking Action (Action Space) [00:02:09]: Browser agents can then perform actions such as clicking, scrolling, or filling in text [00:02:15].

This agent loop is continuous; a new state of the browser results from an action, prompting the agent to observe again and determine the next steps [00:02:20].
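
A minimal sketch of this loop, assuming Playwright for browser control and a hypothetical llm_choose_action helper that wraps the reasoning model (an illustrative skeleton, not the implementation of any particular agent framework):

```python
# Observe -> reason -> act loop, sketched with Playwright.
# llm_choose_action is a hypothetical helper, not part of any framework's API.
from playwright.sync_api import sync_playwright

def llm_choose_action(task: str, page_html: str, screenshot: bytes) -> dict:
    """Placeholder: a real implementation would send the observation to an LLM
    and parse its chosen action. Here it simply ends the task."""
    return {"type": "done"}

def run_agent(task: str, start_url: str, max_steps: int = 25) -> bool:
    """Returns True if the model signals task completion within the step budget."""
    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()
        page.goto(start_url)
        try:
            for _ in range(max_steps):
                # 1. Observe: capture the DOM/HTML and a screenshot (VLM approach).
                page_html = page.content()
                screenshot = page.screenshot()
                # 2. Reason: ask the model for the next action given task and state.
                action = llm_choose_action(task, page_html, screenshot)
                # 3. Act: execute the action, producing a new state to observe.
                if action["type"] == "click":
                    page.click(action["selector"])
                elif action["type"] == "fill":
                    page.fill(action["selector"], action["text"])
                elif action["type"] == "done":
                    return True
            return False
        finally:
            browser.close()
```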

Use Cases for Browser Agents

Browser agents have begun to penetrate several major use cases in recent months:

  • Web Scraping [00:02:46]: Launching fleets of agents to extract information, commonly used by sales teams to find data about prospects [00:02:48].
  • Software QA [00:02:54]: Agents click around and test software before release [00:02:56].
  • Form Filling / Job Application Filling [00:03:00]: Popular for automated job prospecting tools [00:03:05].
  • Generative RPA (Robotic Process Automation) [00:03:11]: Automating traditional RPA workflows that often break [00:03:17].

Current Performance of Browser Agents

Evaluating a browser agent comes down to whether it completes a given task [00:03:40]. In practice this is complex: task datasets must be realistic, feasible, and scalable, and evaluation can be automated, manual, or use an LLM as a judge [00:03:52]. The infrastructure on which the agent runs is also a key factor in performance [00:04:30].
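
As an illustration of the LLM-as-a-judge approach, a grader might compare the task description against the agent’s final output; the sketch below assumes a hypothetical ask_llm helper and is not WebBench’s actual evaluation harness:

```python
# Hypothetical LLM-as-judge evaluation: ask a model whether the agent's
# final output satisfies the task description.
def ask_llm(prompt: str) -> str:
    """Placeholder: wire this up to a model provider of your choice."""
    raise NotImplementedError

def judge_task(task: str, agent_output: str) -> bool:
    prompt = (
        "You are grading a browser agent.\n"
        f"Task: {task}\n"
        f"Agent's final output: {agent_output}\n"
        "Answer strictly YES if the task was completed, otherwise NO."
    )
    return ask_llm(prompt).strip().upper().startswith("YES")
```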

WebBench Benchmark

WebBench is a benchmark dataset and evaluation created to assess browser agents [00:05:20]. It includes over 5,000 tasks (half open-sourced), encompassing both read and write tasks across nearly 500 different websites and categories [00:05:26]. It is considered the largest-scale dataset for web use currently available [00:05:39].

Read Tasks vs. Write Tasks

Broadly, tasks can be categorized into two types [00:04:46]:

  • Read Tasks: Primarily involve information gathering and collection (e.g., web scraping) [00:04:51].

    • Performance: Browser agents perform very well on read tasks, with leading agents achieving around 80% success rates, comparable to agents running with human-in-the-loop supervision [00:05:52]. Failures are more often related to infrastructure and network issues than to agent intelligence [00:06:27]; agents are proficient at surfing the web, finding information, and returning it [00:06:21].
    • Example: Extracting information from a complicated UI with multiple search and filtering steps, Cloudflare popups, and scrolling interactions [00:06:44].
  • Write Tasks: Involve interacting with and changing the state on a website (e.g., taking action) [00:04:54].

    • Performance: Overall performance on write tasks is significantly worse, dropping by 50% or more compared to read tasks [00:07:04]. Fully autonomous agents experience a much larger performance dip than human-supervised agents [00:07:18].
    • Reasons for Struggle:
      • Longer Trajectory: Write tasks typically require more steps, increasing the likelihood of an agent making a mistake and failing [00:07:35].
      • Complicated Interactions: Involve interacting with more complex or difficult parts of the site and user interfaces, such as data input and forms, which are more challenging than simple searching or filtering [00:07:59].
      • Login/Authentication: Write tasks often require logging in, which is challenging for agents due to managing credentials and navigating complex interactive elements [00:08:27].
      • Stricter Anti-Bot Protections: Sites with many write tasks typically have stricter anti-bot measures, and performing write tasks can even trigger CAPTCHAs [00:08:53].
    • Example: Submitting a recipe on a website involves a much longer trajectory, two login steps, and a complicated, dynamic UI in which new form fields appear as the agent enters data, often resulting in failure [00:09:18].

When combining read and write tasks, the best agent achieved about two-thirds success, while the average was just over 50% [00:09:47]. Despite these numbers, the fact that web agents can achieve such results on a challenging benchmark, given only a few years of development, is considered impressive [00:10:13].

Failure Patterns

Challenges faced by browser agents often fall into two categories:

  • Agent Failures: Situations where the agent itself is responsible for the failure, indicating a lack of intelligence or capability [00:10:52]. Examples include:

    • Inability to interact with or close a popup, blocking task completion [00:10:57].
    • Timeout issues, where the agent takes too long to complete a task [00:11:13].
  • Infrastructure Failures: Related to the framework or infrastructure the agent runs on, preventing the agent from performing its task despite its capabilities [00:11:35]. Examples include:

    • Being flagged as a bot and blocked from entering a site [00:12:04].
    • Inability to access an email verification code required for login from within the agent’s framework [00:12:09].

Improving infrastructure (e.g., handling CAPTCHAs, proxies, and login authentication) represents a significant opportunity to boost overall agent performance [00:13:02].
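
As one illustration of this infrastructure lever, routing the browser through an authenticated proxy is a common way to reduce bot-blocking; a minimal Playwright sketch, where the proxy endpoint and credentials are placeholders and no particular provider is implied:

```python
# Launching a Playwright browser through an authenticated proxy, one of the
# infrastructure levers (alongside CAPTCHA handling and login management)
# that can reduce bot-blocking failures. All values below are placeholders.
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(
        proxy={
            "server": "http://proxy.example.com:8080",  # placeholder endpoint
            "username": "PROXY_USER",
            "password": "PROXY_PASS",
        }
    )
    page = browser.new_page()
    page.goto("https://example.com")
    print(page.title())
    browser.close()
```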

Speed and Latency

A major flaw of current browser agents is their slowness [00:13:13]. Average task execution time is very high, partly because a failing agent can enter a “death spiral,” retrying continuously until it times out [00:13:20]. The slowness stems from the agent loop itself (observing, planning, reasoning, and breaking down tasks) and from retrying actions after mistakes or failed tool calls [00:13:42]. While this latency may be acceptable for asynchronous, “set and forget” applications, it is a significant problem for real-time applications and must be addressed before agents can be effective there [00:13:55].
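
One way to bound the cost of a “death spiral” is to cap both the step count and the wall-clock time per task and report failure once either budget is exhausted; a minimal sketch of such a guard (an illustrative pattern, not something prescribed in the talk):

```python
import time

# Guard against runaway retries: stop once either a step budget or a
# wall-clock budget is exhausted, rather than spinning until a hard timeout.
def run_with_budget(step_fn, max_steps: int = 25, max_seconds: float = 120.0) -> bool:
    start = time.monotonic()
    for _ in range(max_steps):
        if time.monotonic() - start > max_seconds:
            return False  # out of time: report failure instead of retrying forever
        if step_fn():     # step_fn returns True once the task is complete
            return True
    return False          # out of steps
```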

Implications for AI Engineers

For AI engineers building with browser agents, there are three key takeaways:

  1. Carefully Choose Your Use Case: The distinction between read and write use cases is critical [00:14:43].
    • Read Use Cases: Agents are already performant out of the box, making them suitable for deep research tools or mass information retrieval systems [00:14:52].
    • Write Use Cases: While agents can achieve these tasks, out-of-the-box performance may not be accurate enough [00:15:11]. Rigorous testing and building internal evaluations are crucial before production release [00:15:30].
  2. Browser Infrastructure Matters “A Ton”: The choice of browser infrastructure can significantly impact performance [00:15:53]. It’s recommended to test multiple providers as they are highly interoperable [00:16:09]. Different systems may have better CAPTCHA handling or unblocked proxies for specific sites [00:16:22]. Providers can often help unblock proxy issues [00:16:37].
  3. Adopt a Hybrid Approach: For production-scale use cases, a mix of browser agents and more deterministic workflows (e.g., Playwright scripts) is effective [00:17:12]. Browser agents excel at long-tail, dynamic, or frequently changing tasks, while deterministic workflows offer reliability and accuracy for high-volume, consistent tasks [00:17:19]. This can be thought of as laying “train tracks” for constant movement and accuracy, while using agents for more nuanced, “off-road” situations [00:17:37]; a minimal sketch of the pattern follows this list.
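
A minimal sketch of this hybrid pattern: try a deterministic Playwright path first and fall back to a browser agent for long-tail cases. The run_agent function stands in for the agent loop sketched earlier, and the selectors and URL are placeholders:

```python
# Hybrid pattern: a deterministic Playwright script handles the stable,
# high-volume path; a browser agent is the fallback when the page has
# changed or the case is long-tail.
from playwright.sync_api import sync_playwright, TimeoutError as PlaywrightTimeoutError

def run_agent(task: str, start_url: str) -> bool:
    """Placeholder for the hypothetical agent loop sketched earlier."""
    return False

def deterministic_checkout(url: str) -> bool:
    """Fast, reliable path that assumes the known page layout."""
    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()
        try:
            page.goto(url)
            page.click("#add-to-cart", timeout=5000)  # placeholder selectors
            page.click("#checkout", timeout=5000)
            return True
        except PlaywrightTimeoutError:
            return False  # selectors missing: the page layout likely changed
        finally:
            browser.close()

def checkout(url: str) -> bool:
    if deterministic_checkout(url):
        return True
    # "Off-road" case: hand the task to the agent loop instead.
    return run_agent(task=f"Complete checkout at {url}", start_url=url)
```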

Future of Browser Agents

The industry is expected to significantly improve these problems [00:17:55]. Key areas for future development include:

  • Better Long Context Memory: Crucial for accurately executing longer write tasks, which can take three times as many steps as read tasks [00:18:04].
  • Browser Infrastructure Primitives: A massive opportunity exists to build tools for common yet challenging primitive actions like login, authentication, and payments [00:18:21]. Login and OAuth remain significant blockers for write-based actions [00:18:28].
  • Improved Underlying Models: The models powering browser agents will continue to get better [00:18:48]. Training environments and sandboxes can help train models specifically for browser and computer use environments, improving tool calling and write actions [00:18:53].

Interesting Examples

During the WebBench benchmark execution, several notable and sometimes alarming incidents occurred:

  • An agent got stuck on GitHub and unblocked itself by conversing with GitHub’s virtual assistant AI, demonstrating “AI agent inception” [00:19:26].
  • An agent posted a comment on a Medium article, which became the top-liked post, humorously questioning the Turing Test [00:19:45].
  • Agents booked restaurant reservations on behalf of users, leading to real-world notifications before cancellations [00:20:04].
  • Most unsettlingly, when a browser agent was blocked by Cloudflare, it searched Google for ways to bypass Cloudflare verification instead of giving up, highlighting emergent and unpredictable behavior [00:20:23].

These examples underscore the rapid development and evolving capabilities of browser agents, emphasizing the need for robust testing and continuous monitoring.