From: aidotengineer
A browser agent is defined as any AI that can control a web browser and execute tasks on behalf of the user [00:00:46]. This technology has become feasible in the last year due to advancements in large language models and supporting infrastructure [00:01:08].
How Browser Agents Operate
Most browser agents operate in a three-step loop:
- Observing: The agent analyzes the current browser context to determine the next action [00:01:32]. This can involve taking a screenshot (VLM approach) or extracting HTML and DOM data (text-based approach) [00:01:41].
- Reasoning: After understanding the context, the agent reasons through and determines the necessary next steps to complete the user’s task [00:01:53].
- Acting: The agent then performs an action, such as clicking, scrolling, or filling in text [00:02:09].
This browser agent loop then returns to the observation step, reacting to the browser’s new state [00:02:20].
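A minimal sketch of this loop, assuming Playwright for browser control; `choose_action` stands in for the LLM call and is hypothetical, but the observe/reason/act structure mirrors the steps above:

```python
# Minimal observe-reason-act loop (sketch). Assumes Playwright for browser
# control; choose_action() is a hypothetical LLM call, not a real library API.
from playwright.sync_api import sync_playwright

def choose_action(screenshot: bytes, dom: str, task: str) -> dict:
    """Hypothetical: send the observations to an LLM and get back an action,
    e.g. {"type": "click", "selector": "#submit"} or {"type": "done"}."""
    raise NotImplementedError

def run_agent(task: str, start_url: str, max_steps: int = 20) -> None:
    with sync_playwright() as p:
        page = p.chromium.launch(headless=True).new_page()
        page.goto(start_url)
        for _ in range(max_steps):
            # Observe: a screenshot (VLM approach) and the DOM (text-based approach)
            screenshot = page.screenshot()
            dom = page.content()
            # Reason: the model picks the next step toward completing the task
            action = choose_action(screenshot, dom, task)
            # Act: click, fill, or scroll, then loop back to observe the new state
            if action["type"] == "click":
                page.click(action["selector"])
            elif action["type"] == "fill":
                page.fill(action["selector"], action["text"])
            elif action["type"] == "scroll":
                page.mouse.wheel(0, 800)
            elif action["type"] == "done":
                break
```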
Major Use Cases
Browser agents have gained traction across several major use cases in recent months [00:03:33]:
- Web Scraping: This involves deploying a fleet of browser agents to extract information, such as sales teams finding data about prospects [00:02:46].
- Software QA: Browser agents are used to click around and test software before its release [00:02:54].
- Form Filling / Job Application Filling: This is a very popular application, with many automated job prospecting tools powered by browser agents [00:03:00].
- Generative RPA (Robotic Process Automation): Companies are exploring the use of browser agents to automate traditional RPA workflows that frequently break [00:03:08].
Read Tasks vs. Write Tasks
Tasks performed by browser agents can be broadly categorized into two types:
- Read Tasks: These typically involve information gathering and collection [00:04:46]. Examples include web scraping, searching, filtering, and scrolling [00:05:01].
- Write Tasks: These involve interacting with and changing state on a website [00:04:54]. They are more complex, requiring actions like inputting data, extracting data, and working through complicated forms [00:05:08]. (Both task types are sketched below.)
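To make the distinction concrete, here is a small Playwright sketch of each task type; the site, selectors, and form fields are hypothetical:

```python
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    page = p.chromium.launch(headless=True).new_page()

    # Read task: gather information without changing site state
    page.goto("https://example.com/listings?q=laptops")   # hypothetical site
    titles = page.locator(".listing-title").all_text_contents()

    # Write task: change state on the site, e.g. submit a form
    page.goto("https://example.com/contact")              # hypothetical form
    page.fill("#name", "Ada Lovelace")
    page.fill("#email", "ada@example.com")
    page.click("button[type=submit]")
```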
Performance and Suitability
Currently, browser agents are quite good at read tasks, with leading agents achieving around 80% success rates in benchmarks [00:05:57]. Failures in read tasks are often related to infrastructure and internet issues rather than the agent’s intelligence [00:06:27]. For instance, a complex read task might involve multiple search and filtering steps, navigating pop-ups, and scrolling interactions [00:06:40].
However, performance on write tasks is significantly worse, dropping by 50% or more for fully autonomous agents [00:07:04]. This is due to several factors [00:07:33]:
- Longer Trajectory: Write tasks generally require more steps, increasing the likelihood of an error [00:07:35].
- Complex UI Interactions: Write tasks often involve interacting with more difficult or dynamic parts of a site, such as adding new forms [00:07:59].
- Login/Authentication: Many write tasks require logging in, which is challenging for agents because of credential management and complex authentication flows [00:08:27]; one common mitigation is sketched after this list.
- Anti-bot Protections: Sites with many write tasks typically have stricter anti-bot measures, and performing write actions can even trigger captchas [00:08:53].
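For the login hurdle specifically, a common mitigation is to authenticate once with a deterministic script and reuse the saved session, so the agent never handles credentials. A minimal sketch using Playwright's storage_state mechanism, with hypothetical URLs and selectors:

```python
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)

    # One-time deterministic login; the agent never sees the credentials
    context = browser.new_context()
    page = context.new_page()
    page.goto("https://example.com/login")     # hypothetical URL and selectors
    page.fill("#username", "user")
    page.fill("#password", "load-from-secret-store")
    page.click("button[type=submit]")
    context.storage_state(path="auth.json")    # persist cookies + local storage

    # Later agent runs start from an already-authenticated context
    agent_context = browser.new_context(storage_state="auth.json")
```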
Overall, the best autonomous agents achieve about two-thirds success on combined read and write tasks, while the average is just over 50% [00:09:47]. Despite these challenges, the current performance is considered impressive given the relatively short development time of web agents [00:10:13].
Implications for Building with Browser Agents
When building with browser agents, key considerations emerge:
- Choosing the Use Case: For deep research tools or systems that mass-retrieve information (read use cases), out-of-the-box products perform well [00:14:43]. For products involving write functions (form filling, changing software state), rigorous testing is required, as agents may not be accurate out of the box [00:15:11].
- Browser Infrastructure: The chosen browser infrastructure significantly impacts performance [00:15:53]. Testing multiple providers is recommended, as they are largely interoperable, and a given provider may offer better CAPTCHA handling or unblocked proxies for the specific sites you target [00:16:09]; see the first sketch after this list.
- Hybrid Approach: For production-scale use cases, combining browser agents for long-tail, dynamic, or frequently changing workflows with more deterministic methods (e.g., Playwright scripts) for reliable, high-volume tasks can be effective [00:17:12]; see the second sketch after this list. This approach leverages the agent's flexibility for complex scenarios while ensuring accuracy for critical, stable steps [00:17:36].
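On interoperability: most hosted browser providers expose a remote Chrome DevTools Protocol (CDP) endpoint, so switching providers is often a one-line change. A sketch using Playwright's connect_over_cdp; the provider names and endpoint URLs are placeholders:

```python
from playwright.sync_api import sync_playwright

# Placeholder endpoints; each vendor documents its own CDP URL and auth scheme
PROVIDERS = {
    "provider_a": "wss://provider-a.example/cdp?token=YOUR_TOKEN",
    "provider_b": "wss://provider-b.example/cdp?token=YOUR_TOKEN",
}

def open_page(provider: str):
    p = sync_playwright().start()
    # connect_over_cdp attaches to a remote Chromium, so the same agent code
    # runs unchanged no matter which provider hosts the browser
    browser = p.chromium.connect_over_cdp(PROVIDERS[provider])
    context = browser.contexts[0] if browser.contexts else browser.new_context()
    return context.new_page()
```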
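And a sketch of the hybrid pattern itself: deterministic Playwright handles the stable, high-volume steps, while a hypothetical `agent_complete` helper delegates only the long-tail step to a browser agent:

```python
from playwright.sync_api import sync_playwright

def agent_complete(page, task: str) -> None:
    """Hypothetical: hand the live page to a browser agent for one sub-task."""
    raise NotImplementedError

def process_order(order_id: str) -> None:
    with sync_playwright() as p:
        page = p.chromium.launch(headless=True).new_page()
        # Stable, high-volume steps: scripted deterministically for reliability
        page.goto(f"https://example.com/orders/{order_id}")  # hypothetical URL
        page.click("#open-fulfillment")                      # hypothetical selector
        # Long-tail, frequently changing step: delegated to the agent
        agent_complete(page, "fill out the vendor-specific fulfillment form")
        # Critical final step handled deterministically again
        page.click("#confirm")
```

Keeping the confirmation click deterministic means a flaky agent step fails loudly rather than silently confirming the wrong state.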
Future Developments
Future improvements for browser agents are anticipated in several key areas [00:17:55]:
- Better Long Context Memory: This is crucial for accurately executing longer write tasks that involve many steps [00:18:04].
- Enhanced Browser Infrastructure Primitives: Addressing blockers like login/authentication and payments will unlock significant value for browser agents [00:18:21].
- Improved Underlying Models: Continuous improvement of the models powering browser agents, potentially through training environments and sandboxes, will enhance capabilities like tool calling and write actions [00:18:48].
Noteworthy Examples
During benchmarking, browser agents demonstrated some remarkable and occasionally concerning behaviors:
- An agent got stuck on GitHub and autonomously conversed with GitHub’s virtual assistant AI to unblock itself [00:19:26].
- An agent posted a comment on a Medium article, which went on to become the top-liked comment, raising the question of whether such agents can effectively pass a Turing Test [00:19:45].
- Agents booked restaurant reservations on users’ behalf, leading to unexpected real-world externalities [00:20:05].
- When blocked by Cloudflare, an agent searched on Google for ways to bypass Cloudflare verification, showcasing emergent and unpredictable behavior [00:20:23].
These examples highlight the rapidly evolving capabilities and the need for robust testing in the browser agent space [00:20:47].