From: aidotengineer

A browser agent is any AI that can control a web browser and execute tasks on behalf of a user [00:00:48]. This technology has become feasible only in the last year, driven by advancements in large language models and supporting infrastructure [00:01:06]. An example of a browser agent’s capability is purchasing a power tool by navigating a manufacturer’s website and completing the checkout process autonomously [00:00:58].

How Browser Agents Work

Most browser agents operate on a loop of three key steps [00:01:20]:

  1. Observing (Observation Space): The agent assesses the current browser context to determine its next action [00:01:32]. This can involve taking a screenshot (VLM approach) or extracting HTML and DOM content (text-based approach) [00:01:41].
  2. Reasoning: After observing, the agent processes the context to deduce the necessary steps to complete the user’s task [00:01:56]. For instance, if tasked with purchasing a power tool, it might reason it needs to click the search bar [00:02:02].
  3. Acting (Action Space): The agent then performs an action within the browser, such as clicking, scrolling, or filling in text fields [00:02:09].

This action leads to a new browser state, restarting the loop [00:02:20].
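
A minimal sketch of this loop, assuming Playwright for browser control; `llm_choose_action` is a hypothetical stand-in for the reasoning model, not part of any specific framework:

```python
# Minimal observe-reason-act loop, assuming Playwright for browser control.
from playwright.sync_api import sync_playwright

def llm_choose_action(task: str, screenshot: bytes, html: str) -> dict:
    """Hypothetical call to a (V)LM that returns the next action, e.g.
    {"type": "click", "selector": "#search"} or {"type": "done"}."""
    raise NotImplementedError

def run_agent(task: str, start_url: str, max_steps: int = 20) -> None:
    with sync_playwright() as p:
        page = p.chromium.launch(headless=True).new_page()
        page.goto(start_url)
        for _ in range(max_steps):
            # 1. Observe: screenshot (VLM approach) and DOM content (text approach)
            screenshot = page.screenshot()
            html = page.content()
            # 2. Reason: ask the model for the next step toward the task
            action = llm_choose_action(task, screenshot, html)
            # 3. Act: execute the action, which produces a new browser state
            if action["type"] == "click":
                page.click(action["selector"])
            elif action["type"] == "fill":
                page.fill(action["selector"], action["text"])
            elif action["type"] == "done":
                break
```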

Use Cases for Browser Agents

Browser agents are gaining traction across several major use cases [00:03:33]:

  • Web Scraping: Deploying fleets of agents to extract information, often used by sales teams for prospect data [00:02:46].
  • Software QA: Agents click around to test software before release [00:02:54].
  • Form Filling: Including job applications, popular with automated job-prospecting tools [00:03:00].
  • Generative RPA (Robotic Process Automation): Automating traditional RPA workflows that frequently break [00:03:09].

Evaluating Browser Agent Performance

Evaluating the performance of browser agents involves several complexities [00:03:40]:

  • Task Data Set: Tasks must be realistic, feasible, domain-specific, and scalable [00:03:52].
  • Evaluation Method: Can be automated (with validation functions), manual (with human annotators), or use an LLM-as-a-judge approach [00:04:15].
  • Infrastructure: The environment where the browser agent runs significantly impacts its performance [00:04:30].

Tasks are broadly categorized into:

  • Read Tasks: Information gathering and collection, akin to web scraping [00:04:46].
  • Write Tasks: Interacting with and changing the state on a site, involving taking action [00:04:54]. Write tasks are generally more complicated and challenging for agents to perform [00:05:08].
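
To make the evaluation setup concrete, here is a sketch of how read and write tasks might be represented alongside an automated validation function; the schema and field names are illustrative assumptions, not WebBench's actual format:

```python
# Illustrative task records with automated validation checks; the schema is a
# hypothetical example, not WebBench's actual format.
from dataclasses import dataclass
from typing import Callable

@dataclass
class Task:
    description: str                  # natural-language instruction for the agent
    kind: str                         # "read" (gather info) or "write" (change site state)
    validate: Callable[[dict], bool]  # automated check of the agent's final result

read_task = Task(
    description="Find the listed price of the cordless drill on the manufacturer's site",
    kind="read",
    validate=lambda result: result.get("price", 0) > 0,
)

write_task = Task(
    description="Add the cordless drill to the cart and reach the checkout page",
    kind="write",
    validate=lambda result: result.get("final_url", "").endswith("/checkout"),
)
```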

WebBench Benchmark

WebBench is a benchmark data set and evaluation comprising over 5,000 tasks, both read and write, across nearly 500 websites in various categories [00:05:21]. It is considered the largest-scale data set for web use [00:05:40].

On read tasks, industry-leading web agents show strong performance, with the top agent hitting around 80% success, comparable to OpenAI Operator with human-in-the-loop supervision [00:05:52]. This indicates strong capabilities for information retrieval and data extraction [00:06:18]. Failures in read tasks are often related to infrastructure and internet issues rather than agent intelligence [00:06:27].

However, overall performance on write tasks is significantly worse, dropping by 50% or more compared to read tasks for fully autonomous agents [00:07:04].

Challenges and Limitations

The disparity in performance between read and write tasks stems from several challenges and limitations:

  • Longer Trajectory: Write tasks typically involve more steps, increasing the likelihood of an agent making a mistake and failing [00:07:35].
  • Complex UI Interactions: Write tasks often require interacting with more complicated or difficult parts of a site, such as data input and complex forms [00:07:59].
  • Authentication: Logging in or managing credentials is a significant challenge for agents due to interactive complexity and credential management [00:08:27].
  • Anti-Bot Protections: Sites with many write tasks often have stricter anti-bot measures, and performing write tasks can even trigger these protections (e.g., CAPTCHAs) [00:08:53].

Failure patterns are categorized as:

  • Agent Failures: The agent’s own abilities are the limiting factor, such as being unable to interact with pop-ups or timing out [00:11:21].
  • Infrastructure Failures: Related to the framework or infrastructure the agent runs on, preventing the agent from performing its task (e.g., being flagged as a bot, inability to access email verification for login) [00:11:35]. Improving infrastructure could significantly boost overall performance [00:13:02].
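
One way an evaluation harness might keep these two failure modes separate is to tag each failed run with a category; the taxonomy below mirrors the split described above, but the code and its heuristics are only an illustrative sketch:

```python
# Illustrative failure taxonomy for an eval harness; the signal strings used
# by classify_failure are assumptions, not from any specific benchmark.
from enum import Enum

class FailureCategory(Enum):
    AGENT = "agent"                    # e.g. could not handle a pop-up, timed out
    INFRASTRUCTURE = "infrastructure"  # e.g. flagged as a bot, no email access for login

def classify_failure(error_message: str) -> FailureCategory:
    infra_signals = ("captcha", "bot detected", "verification email", "proxy blocked")
    if any(signal in error_message.lower() for signal in infra_signals):
        return FailureCategory.INFRASTRUCTURE
    return FailureCategory.AGENT
```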

Another major flaw is latency: browser agents are currently very slow [00:13:13]. This stems from repeated passes through the observe-reason-act loop, plus mistakes and retries [00:13:42]. While acceptable for asynchronous applications, it is a significant problem for real-time applications [00:14:03].

Strategies for Developing and Implementing Browser Agents

For AI engineers building effective agents, key takeaways include:

  1. Pick Your Use Case Carefully:
    • Read Use Cases: Already performant out of the box, suitable for deep research tools or mass information retrieval [00:14:43].
    • Write Use Cases: Out-of-the-box agents might not be accurate enough; rigorous testing and building internal evaluations are necessary [00:15:11].
  2. Browser Infrastructure Matters a Ton: The choice of browser infrastructure can significantly impact performance [00:15:53]. It’s crucial to test multiple providers, as they are interoperable and can be swapped out [00:16:10]. Different systems may offer better CAPTCHA handling or unblocked proxies for specific sites [00:16:22].
  3. Try a Hybrid Approach: Combine browser agents for dynamic, long-tail workflows with more deterministic workflows (like Playwright) for reliable, high-volume tasks [00:16:54]. This allows agents to handle nuanced, diverse “roads and trails” while deterministic systems manage “train tracks” requiring constant accuracy [00:17:36].
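
A rough sketch of what such a hybrid setup could look like, assuming Playwright for the deterministic path; the remote CDP endpoint, the selectors, and the `agent_fallback` hook are hypothetical, not a specific provider's API:

```python
# Hybrid setup: deterministic Playwright flow for the stable, high-volume path,
# with a hypothetical agent fallback for long-tail cases.
from playwright.sync_api import sync_playwright, Error as PlaywrightError

def agent_fallback(task: str, start_url: str) -> None:
    """Hypothetical hook into a browser agent loop (see the sketch earlier)."""
    raise NotImplementedError

def deterministic_checkout(page, product_name: str) -> None:
    # "Train tracks": a fixed, scripted flow for the high-volume, stable path
    page.fill("#search", product_name)
    page.click("#search-button")
    page.click(".product-card >> nth=0")
    page.click("#add-to-cart")
    page.click("#checkout")

def purchase(product_name: str, start_url: str, cdp_url: str) -> None:
    with sync_playwright() as p:
        # Hosted browser providers typically expose a CDP endpoint, so swapping
        # cdp_url is often enough to change infrastructure providers.
        browser = p.chromium.connect_over_cdp(cdp_url)
        page = browser.new_page()
        page.goto(start_url)
        try:
            deterministic_checkout(page, product_name)
        except PlaywrightError:
            # "Roads and trails": hand off to the agent for layouts the
            # scripted flow does not cover.
            agent_fallback(f"Buy {product_name}", start_url)
```

Because hosted browser providers generally expose a standard CDP endpoint, swapping the endpoint URL is often all it takes to trial a different infrastructure provider, which supports the point above about keeping providers interchangeable.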

Future Outlook

The industry is expected to improve significantly in several key areas [00:17:55]:

  • Better Long Context Memory: Essential for longer write tasks that involve many steps [00:18:04].
  • Browser Infrastructure Primitives: Continued development for common blockers like login, OAuth, and payments will unlock significant value [00:18:21].
  • Improved Models: The underlying models powering browser agents will continue to get better, particularly through training environments and sandboxes that simulate browser use [00:18:48].

Interesting Examples

During benchmarking, some notable and surprising behaviors were observed:

  • AI Agent Inception: A browser agent got stuck on GitHub and autonomously engaged with GitHub’s own AI virtual assistant to unblock itself, creating a comical “AI agent inception” scenario [00:19:26].
  • Turing Test Nudge: An agent posted a comment on a Medium article that became the top-liked post, raising questions about the Turing Test [00:19:45].
  • Real-World Externalities: Browser agents tasked with booking restaurant reservations successfully booked them, leading to real-world phone notifications [00:20:05].
  • Emergent Behavior: An agent blocked by Cloudflare sought ways on Google to bypass Cloudflare verification instead of giving up, demonstrating unpredictable emergent behavior [00:20:23].

The field of browser agents is rapidly developing, and capabilities are continuously evolving [00:20:54].