From: aidotengineer
Understanding Browser Agents
A browser agent is any AI that can control a web browser and execute tasks on behalf of the user [00:00:46]. This technology has become feasible in the last year due to advancements in large language models and supporting infrastructure [00:01:08].
Most browser agents operate on a three-step loop:
- Observation: The agent assesses the current browser context to determine the next action [00:01:32]. This can involve taking a screenshot (VLM approach) or extracting HTML and DOM information (text-based approach) [00:01:41].
- Reasoning: The agent processes the context to deduce the necessary steps to complete the user’s task [00:01:53].
- Action: The agent performs an action within the browser, such as clicking, scrolling, or filling in text [00:02:09]. After an action, the loop restarts with a new browser state [00:02:20].
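A minimal sketch of this observe-reason-act loop, using Playwright for browser control. The action schema and the `llm_choose_action` helper are illustrative placeholders, not any specific agent framework:

```python
# Minimal observe-reason-act loop (illustrative; llm_choose_action and the
# action schema are placeholders, not a specific agent framework).
from playwright.sync_api import sync_playwright

def llm_choose_action(screenshot: bytes, html: str, task: str) -> dict:
    """Reasoning step: an LLM/VLM call that returns a structured action,
    e.g. {"type": "click", "selector": "#submit"} or {"type": "done"}."""
    raise NotImplementedError

def run_agent(task: str, start_url: str, max_steps: int = 25) -> None:
    with sync_playwright() as p:
        page = p.chromium.launch(headless=True).new_page()
        page.goto(start_url)
        for _ in range(max_steps):
            # Observe: capture a screenshot (VLM approach) and the DOM (text-based approach).
            screenshot = page.screenshot()
            html = page.content()
            # Reason: ask the model for the next action toward completing the task.
            action = llm_choose_action(screenshot, html, task)
            # Act: execute the action, then loop again on the new browser state.
            if action["type"] == "click":
                page.click(action["selector"])
            elif action["type"] == "fill":
                page.fill(action["selector"], action["text"])
            elif action["type"] == "done":
                break
```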
Use Cases for Browser Agents
Browser agents have begun to gain traction in several major use cases [00:03:33]:
- Web Scraping: Deploying a fleet of agents to extract information, often used by sales teams to find prospect data [00:02:47].
- Software QA: Using agents to navigate and test software before release [00:02:54].
- Form Filling / Job Application Filling: Popular for automated job prospecting tools [00:03:01].
- Generative RPA: Automating traditional Robotic Process Automation (RPA) workflows that are prone to breaking [00:03:09].
Evaluating Browser Agent Performance
Evaluating a browser agent involves giving it a task and determining whether it completed that task [00:03:40]. In practice, this is complex for several reasons:
- Task Data Set Creation: Tasks need to be realistic, feasible, domain-specific, and scalable [00:03:52].
- Evaluation Methods: Evaluations can be automated (using validation functions), manual (using human annotators), or based on LLM-as-a-judge approaches [00:04:12]; a brief sketch of the automated and LLM-as-a-judge styles follows this list.
- Infrastructure: The performance of browser agents is significantly affected by the infrastructure they run on [00:04:30].
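To make the two programmatic evaluation styles concrete, here is a hedged sketch; `call_llm`, the prompt, and the exact checks are assumptions for illustration, not WebBench's actual harness:

```python
# Two evaluation styles: an automated validation function and LLM-as-a-judge.
# call_llm is any chat-completion wrapper; the checks here are illustrative.
def validate_read_task(agent_answer: str, expected: str) -> bool:
    """Automated check: compare the agent's extracted answer to known ground truth."""
    return expected.lower() in agent_answer.lower()

def llm_as_judge(task: str, transcript: str, call_llm) -> bool:
    """LLM-as-a-judge: ask a model whether the recorded trajectory completed the task."""
    prompt = (
        f"Task: {task}\n"
        f"Agent trajectory:\n{transcript}\n"
        "Did the agent complete the task? Answer YES or NO."
    )
    return call_llm(prompt).strip().upper().startswith("YES")
```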
Tasks can be categorized into two types [00:04:46]:
- Read Tasks: Primarily involve information gathering and collection, similar to web scraping [00:04:49].
- Write Tasks: Involve interacting with and changing the state of a website, requiring the agent to take action [00:04:54]. Write tasks are generally harder both to create and for agents to complete [00:05:08].
WebBench Benchmark Findings
WebBench is a benchmark data set and evaluation that includes over 5,000 tasks, half of which are open-sourced, across nearly 500 different websites [00:05:19].
- Read Task Performance: Industry-leading web agents show good performance on read tasks, with the leading agent achieving around 80% success, close to a human-in-the-loop baseline [00:05:51]. This indicates agents are proficient at information retrieval and data extraction from the web [00:06:18]. Failures in read tasks are often related to infrastructure and internet issues rather than agent capability [00:06:25].
- Write Task Performance: Overall performance on write tasks is significantly worse, dropping by 50% or more compared to read tasks [00:07:02].
Challenges and Limitations of Browser Agents for Write Tasks
Several factors contribute to the lower performance on write tasks [00:07:33]:
- Longer Trajectory: Write tasks typically involve more steps, increasing the likelihood that the agent makes a mistake and fails the task [00:07:35] (see the arithmetic sketch after this list).
- Complicated UI Interaction: Write tasks often involve more challenging or dynamic user interfaces, requiring data input and complex form interactions [00:07:56].
- Login and Authentication: Many write tasks require logging in, which is challenging for agents due to interactive complexity and managing credentials [00:08:27].
- Anti-bot Protections: Sites with many write tasks often have stricter anti-bot measures, and performing write actions can trigger protections like CAPTCHAs [00:08:53].
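The "longer trajectory" point is essentially compounding error. A back-of-the-envelope illustration (the 95% per-step success rate is an assumption, not a benchmark figure):

```python
# If each step succeeds independently with probability p, end-to-end success
# decays exponentially with trajectory length (numbers are illustrative).
p = 0.95                                  # assumed per-step success rate
for steps in (5, 15, 30):
    print(steps, round(p ** steps, 2))    # 5 -> 0.77, 15 -> 0.46, 30 -> 0.21
```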
Overall, the best agents achieved about two-thirds success on combined read and write tasks, while the average was just over 50% [00:09:47]. Despite these numbers, the fact that web agents can achieve such results on challenging benchmarks, given only a few years of development, is considered impressive [00:10:13].
Failure Patterns
Agent Failures
These failures stem primarily from the agent's own capabilities [00:11:21]. Examples include:
- Inability to interact with or close pop-ups that block task completion [00:10:55].
- Timeout issues due to the agent taking too long to complete a task [00:11:13].
Infrastructure Failures
These failures are related to the framework or infrastructure the agent is running on, preventing the agent from performing its task [00:11:35]. Examples include:
- Being flagged and blocked as a bot from entering a site [00:12:02].
- Inability to access email verification for logins within the agent’s framework [00:12:12].
Improving infrastructure, particularly CAPTCHA handling, proxies, and login/authentication, could significantly boost agent performance [00:13:02].
Slowness
A major limitation of current agents is how slow they are [00:13:13]. This is due to:
- The iterative observation, planning, and reasoning steps of the agent loop [00:13:39].
- Mistakes or failures in tool calls requiring retries [00:13:46] (a minimal retry sketch follows this list).
- The time it takes to interact with and navigate sites [00:13:51].
While this latency is acceptable for async “set and forget” applications, it is a significant problem for real-time applications [00:13:57].
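A minimal retry wrapper (an assumption about how a typical agent framework might handle flaky tool calls, not something from the talk) shows where that latency accumulates:

```python
# Each failed tool call costs a wait plus a full re-execution, so retries
# quickly dominate the latency budget of a long trajectory.
import time

def with_retries(tool_call, attempts: int = 3, backoff_s: float = 2.0):
    for attempt in range(attempts):
        try:
            return tool_call()
        except Exception:
            if attempt == attempts - 1:
                raise
            time.sleep(backoff_s * (attempt + 1))  # latency grows with every retry
```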
Strategies for Building with Browser Agents
When building with browser agents, AI engineers should consider these key takeaways:
1. Picking the Right Use Case
- Read Use Cases: For tasks like deep research tools or mass information retrieval, current out-of-the-box browser agents are already quite performant [00:14:50].
- Write Use Cases: For products involving form filling or state-changing actions, be aware that out-of-the-box agents may not be as accurate [00:15:11]. Rigorous testing and internal evaluations are crucial before deploying them in production [00:15:30].
2. Browser Infrastructure Matters
- The choice of browser infrastructure can significantly impact performance [00:15:53].
- Test multiple providers as they are highly interoperable [00:16:10]. Different systems may offer better CAPTCHA handling or unblocked proxies for specific sites [00:16:22]. If experiencing proxy blocking on specific sites, contacting the provider can often resolve the issue [00:16:34].
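Because providers are largely interoperable, swapping them can be as small as changing a remote-browser endpoint. A sketch using Playwright's `connect_over_cdp`; the provider names and URLs are placeholders:

```python
# Swap remote browser providers by changing one CDP endpoint (placeholder URLs).
from playwright.sync_api import sync_playwright

PROVIDERS = {
    "provider_a": "wss://provider-a.example/cdp",
    "provider_b": "wss://provider-b.example/cdp",
}

def open_page(provider: str, url: str):
    p = sync_playwright().start()
    browser = p.chromium.connect_over_cdp(PROVIDERS[provider])
    context = browser.contexts[0] if browser.contexts else browser.new_context()
    page = context.new_page()
    page.goto(url)
    return page  # compare success and blocking rates across providers
```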
3. Try a Hybrid Approach
- Combine browser agents with more deterministic workflows, such as scripted Playwright automation [00:17:26] (see the sketch after this list).
- Use agents for “long tail” tasks that are dynamic or change frequently [00:17:22].
- Use deterministic workflows for high-volume steps that demand consistent, accurate, and reliable execution [00:17:37].
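A hedged sketch of the hybrid pattern, assuming a hypothetical `run_agent` hand-off and an invoice-submission workflow invented for illustration: scripted Playwright handles the stable, high-volume steps, and the agent handles the dynamic long-tail step.

```python
# Hybrid workflow: deterministic Playwright for stable steps, a browser agent
# for the step whose UI varies (run_agent is a placeholder for any agent API).
from playwright.sync_api import sync_playwright

def run_agent(page, instruction: str) -> None:
    """Placeholder: hand the live page to a browser agent for one sub-task."""
    raise NotImplementedError

def submit_invoice(vendor_url: str, creds: dict, invoice: dict) -> None:
    with sync_playwright() as p:
        page = p.chromium.launch(headless=True).new_page()
        # Deterministic, high-volume steps: fixed selectors, fast and reliable.
        page.goto(vendor_url)
        page.fill("#email", creds["email"])
        page.fill("#password", creds["password"])
        page.click("button[type=submit]")
        # Long-tail step: each vendor's form differs, so delegate to the agent.
        run_agent(page, f"Fill out and submit this invoice: {invoice}")
```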
Future Developments
The industry is expected to improve many of the current problems facing browser agents [00:17:55]:
- Better Long Context Memory: Crucial for accurately executing longer write tasks [00:18:04].
- Improved Browser Infrastructure Primitives: Addressing major blockers like login/authentication and payments will unlock significant value [00:18:21].
- Better Underlying Models: The models powering browser agents will continue to improve, particularly through training environments and sandboxes that enhance tool calling and write actions [00:18:48].
Interesting Agent Behaviors
- AI Agent Inception: A browser agent got stuck on GitHub and unblocked itself by conversing with GitHub’s virtual assistant AI [00:19:26].
- Turing Test Nods: A browser agent posted a comment on a Medium article that became the top-liked post, raising questions about AI’s indistinguishability from humans [00:19:45].
- Real-World Externalities: Browser agents booked restaurant reservations on users’ behalf during testing, highlighting unintended real-world impacts [00:20:05].
- Emergent Behavior: A browser agent, blocked by Cloudflare, searched Google for ways to bypass Cloudflare verification, demonstrating unpredictable emergent behavior [00:20:23].