From: aidotengineer
A browser agent is any AI that can control a web browser and execute tasks on behalf of a user [00:00:48]. This technology has become feasible only in the last year, driven by advancements in large language models and supporting infrastructure [00:01:06]. An example of a browser agent’s capability is purchasing a power tool by navigating a manufacturer’s website and completing the checkout process autonomously [00:00:58].
How Browser Agents Work
Most browser agents operate on a loop of three key steps [00:01:20]:
- Observing (Observation Space): The agent assesses the current browser context to determine its next action [00:01:32]. This can involve taking a screenshot (VLM approach) or extracting HTML and DOM content (text-based approach) [00:01:41].
- Reasoning: After observing, the agent processes the context to deduce the necessary steps to complete the user’s task [00:01:56]. For instance, if tasked with purchasing a power tool, it might reason it needs to click the search bar [00:02:02].
- Acting (Action Space): The agent then performs an action within the browser, such as clicking, scrolling, or filling in text fields [00:02:09].
This action leads to a new browser state, restarting the loop [00:02:20].
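A minimal sketch of this loop in Python; the loop structure comes from the talk, but the `Action` type and the observe/reason/act callables are hypothetical stand-ins for whatever browser driver and model API you use:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Action:
    kind: str            # "click" | "scroll" | "type" | "done"
    target: str = ""     # selector or screen coordinates
    text: str = ""       # text to type, if any

def run_agent(
    task: str,
    observe: Callable[[], str],             # screenshot (VLM) or DOM text
    reason: Callable[[str, str], Action],   # LLM picks the next action
    act: Callable[[Action], None],          # drives the real browser
    max_steps: int = 25,
) -> bool:
    """Observe-reason-act loop; returns True if the task completes."""
    for _ in range(max_steps):
        observation = observe()             # 1. observe the browser state
        action = reason(task, observation)  # 2. reason about the next step
        if action.kind == "done":           # model signals completion
            return True
        act(action)                         # 3. act, yielding a new state
    return False                            # step budget exhausted
```

For a VLM-based agent, `observe` would return a screenshot; for a text-based agent, serialized HTML/DOM content.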
Use Cases for Browser Agents
Browser agents are gaining traction in several major use cases [00:03:33]:
- Web Scraping: Deploying fleets of agents to extract information, often used by sales teams for prospect data [00:02:46].
- Software QA: Agents click around to test software before release [00:02:54].
- Form Filling: Including job application filling, popular with automated job prospecting tools [00:03:00].
- Generative RPA (Robotic Process Automation): Automating traditional RPA workflows that frequently break [00:03:09].
Evaluating Browser Agent Performance
Evaluating the performance of browser agents involves several complexities [00:03:40]:
- Task Data Set: Tasks must be realistic, feasible, domain-specific, and scalable [00:03:52].
- Evaluation Method: Can be automated (with validation functions), manual (with human annotators), or use LLM-as-a-judge approaches [00:04:15] (a minimal sketch of the automated route follows this list).
- Infrastructure: The environment where the browser agent runs significantly impacts its performance [00:04:30].
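To make the automated route concrete, each task can ship with a validation function over the agent's output or the final page state. The expected values and selectors below are invented for illustration, not taken from any benchmark:

```python
# Hypothetical checks for one read task ("find the price of product X")
# and one write task ("add product X to the cart").

def validate_read(agent_answer: str) -> bool:
    # Read tasks can often be graded against a known ground truth.
    return agent_answer.strip() == "$129.99"

def validate_write(final_html: str) -> bool:
    # Write tasks need a check on the resulting site state, e.g. that
    # the cart badge now shows one item.
    return 'data-cart-count="1"' in final_html
```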
Tasks are broadly categorized into:
- Read Tasks: Information gathering and collection, akin to web scraping [00:04:46].
- Write Tasks: Interacting with a site and changing its state, i.e., taking action [00:04:54]. Write tasks are generally more complicated and challenging for agents to perform [00:05:08].
WebBench Benchmark
WebBench is a benchmark data set and evaluation consisting of over 5,000 tasks, both read and write, across nearly 500 websites in various categories [00:05:21]. It is considered the largest-scale data set for web use [00:05:40].
On read tasks, industry-leading web agents show strong performance, with the top agent hitting around 80% success, comparable to OpenAI's Operator with human-in-the-loop supervision [00:05:52]. This indicates good capabilities for information retrieval and data extraction [00:06:18]. Failures in read tasks are often related to infrastructure and internet issues rather than agent intelligence [00:06:27].
However, overall performance on write tasks is significantly worse, dropping by 50% or more compared to read tasks for fully autonomous agents [00:07:04].
Challenges and Limitations
The disparity in performance between read and write tasks stems from several challenges and limitations:
- Longer Trajectory: Write tasks typically involve more steps, increasing the likelihood of an agent making a mistake and failing [00:07:35] (see the short calculation after this list).
- Complex UI Interactions: Write tasks often require interacting with more complicated or difficult parts of a site, such as data input and complex forms [00:07:59].
- Authentication: Logging in or managing credentials is a significant challenge for agents due to interactive complexity and credential management [00:08:27].
- Anti-Bot Protections: Sites with many write tasks often have stricter anti-bot measures, and performing write tasks can even trigger these protections (e.g., CAPTCHAs) [00:08:53].
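The longer-trajectory point compounds multiplicatively: if each step succeeds independently with probability p, an n-step task succeeds with probability p^n. The per-step rate below is an assumption chosen for illustration, not a benchmark number:

```python
# Illustrative only: per-step success compounds multiplicatively.
per_step_success = 0.95

for steps in (5, 15, 30):
    print(steps, round(per_step_success ** steps, 2))
# 5  -> 0.77
# 15 -> 0.46
# 30 -> 0.21  (a long write task fails more often than not)
```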
Failure patterns are categorized as:
- Agent Failures: The agent’s own abilities are the limiting factor, such as inability to interact with pop-ups or timing out [00:11:21].
- Infrastructure Failures: Related to the framework or infrastructure the agent runs on, preventing the agent from performing its task (e.g., being flagged as a bot, inability to access email verification for login) [00:11:35]. Improving infrastructure could significantly boost overall performance [00:13:02].
Another major flaw is latency: browser agents are currently very slow [00:13:13]. This is due to the observe-plan-reason-act loop, plus mistakes and retries [00:13:42]. While acceptable for asynchronous applications, it's a significant problem for real-time applications [00:14:03].
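A back-of-envelope model shows why the loop is slow: every iteration pays at least one model round trip, and retries multiply the step count. Every figure below is an assumption for illustration:

```python
# Back-of-envelope latency for one task; all numbers are assumed.
steps = 15             # observe-reason-act iterations for a write task
llm_seconds = 4.0      # one model round trip per iteration
browser_seconds = 1.5  # page loads, clicks, and waits per iteration
retry_factor = 1.3     # extra work from mistakes and retries

total = steps * (llm_seconds + browser_seconds) * retry_factor
print(f"~{total:.0f}s end to end")  # ~107s: fine async, painful real-time
```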
Strategies for Developing and Implementing Browser Agents
For AI engineers building effective agents, key takeaways include:
- Pick Your Use Case Carefully:
  - Read Use Cases: Already performant out of the box, suitable for deep research tools or mass information retrieval [00:14:43].
  - Write Use Cases: Out-of-the-box agents might not be accurate enough; rigorous testing and building internal evaluations are necessary [00:15:11].
- Browser Infrastructure Matters a Ton: The choice of browser infrastructure can significantly impact performance [00:15:53]. It's crucial to test multiple providers, as they are interoperable and can be swapped out [00:16:10] (a connection sketch follows this list). Different systems may offer better CAPTCHA handling or unblocked proxies for specific sites [00:16:22].
- Try a Hybrid Approach: Combine browser agents for dynamic, long-tail workflows with more deterministic tooling (like Playwright) for reliable, high-volume tasks [00:16:54] (see the routing sketch after this list). This lets agents handle the nuanced, diverse “roads and trails” while deterministic systems manage the “train tracks” that require constant accuracy [00:17:36].
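Because most providers expose a standard remote-browser endpoint, swapping infrastructure can be as small as changing a connection string. A minimal sketch using Playwright's CDP connection; the endpoint URLs are placeholders, not real provider addresses:

```python
# Swapping remote-browser providers behind Playwright's CDP support.
# The endpoint URLs are placeholders; each provider documents its own.
from playwright.sync_api import sync_playwright

PROVIDERS = {
    "provider_a": "wss://provider-a.example/cdp?token=YOUR_TOKEN",
    "provider_b": "wss://provider-b.example/cdp?token=YOUR_TOKEN",
}

def open_page(provider: str):
    p = sync_playwright().start()
    # connect_over_cdp attaches to a remote Chromium instance, so the
    # agent code above the driver stays identical across providers.
    browser = p.chromium.connect_over_cdp(PROVIDERS[provider])
    return browser.new_page()
```

Keeping the provider choice behind one function like this makes it straightforward to A/B test providers on the same task set.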
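And one way to realize the hybrid split: script the known, high-volume path deterministically in Playwright and hand off to an agent only when that path breaks. The selectors, the example site, and the `agent_fallback` hook are all invented for this sketch:

```python
# Hybrid routing: deterministic "train tracks" first, agent fallback
# for the long tail. Selectors and the fallback hook are invented.
from typing import Callable
from playwright.sync_api import Error as PlaywrightError, sync_playwright

def checkout_deterministic(page, sku: str) -> None:
    # Known, stable flow scripted step by step.
    page.goto(f"https://shop.example/products/{sku}")
    page.click("#add-to-cart")
    page.click("#checkout")

def buy(sku: str, agent_fallback: Callable[[str], bool]) -> None:
    with sync_playwright() as p:
        page = p.chromium.launch().new_page()
        try:
            checkout_deterministic(page, sku)  # fast, reliable path
        except PlaywrightError:
            # Layout changed or an unexpected dialog appeared: hand the
            # long-tail case to the slower but adaptive browser agent.
            agent_fallback(f"buy product {sku} on shop.example")
```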
Future Outlook
The industry is expected to improve significantly in several key areas [00:17:55]:
- Better Long Context Memory: Essential for longer write tasks that involve many steps [00:18:04].
- Browser Infrastructure Primitives: Continued development for common blockers like login, OAuth, and payments will unlock significant value [00:18:21].
- Improved Models: The underlying models powering browser agents will continue to get better, particularly through training environments and sandboxes that simulate browser use [00:18:48].
Interesting Examples
During benchmarking, some notable and surprising behaviors were observed:
- AI Agent Inception: A browser agent got stuck on GitHub and autonomously engaged with GitHub’s virtual assistant AI to unblock itself, creating a comical “AI agent inception” scenario [00:19:26].
- Turing Test Nudge: An agent posted a comment on a Medium article that became the top-liked post, raising questions about the Turing Test [00:19:45].
- Real-World Externalities: Browser agents tasked with booking restaurant reservations successfully booked them, leading to real-world phone notifications [00:20:05].
- Emergent Behavior: An agent blocked by Cloudflare sought ways on Google to bypass Cloudflare verification instead of giving up, demonstrating unpredictable emergent behavior [00:20:23].
The field of browser agents is rapidly developing, and capabilities are continuously evolving [00:20:54].