From: aidotengineer
Browser agents are AI systems capable of controlling a web browser and executing tasks on behalf of a user [00:00:48]. Their feasibility has significantly increased in the last year, largely due to advancements in large language models and supporting infrastructure [00:01:08].
Common Applications
Browser agents have begun to gain traction in several major use cases [00:02:33]. The most common applications include:
- Web Scraping
  - Involves launching a fleet of browser agents to extract information [00:02:46].
  - Often used by sales teams to gather data about prospects [00:02:51].
- Software QA
  - Browser agents can click around and test software that is about to be released [00:02:54].
- Form Filling / Job Application Filling
  - A very popular use case, with many automated job prospecting tools powered by browser agents [00:03:00].
- Generative Robotic Process Automation (RPA)
  - A broad category where companies are exploring using browser agents to automate traditional RPA workflows that often break [00:03:08]. Companies like UiPath are examples in this area [00:03:20].
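The fan-out pattern behind a scraping fleet can be sketched with Python's asyncio: one agent task per target page, with a semaphore capping how many agents run at once. This is a minimal sketch under stated assumptions: `run_agent` is a hypothetical stub standing in for a real browser agent, and the URLs, instruction, and concurrency limit are placeholders rather than details from the talk.

```python
import asyncio
from typing import Dict, List


# Hypothetical stand-in for one browser agent run. A real system would drive
# a browser here; this stub only simulates some work and returns a fake record.
async def run_agent(url: str, instruction: str) -> Dict[str, str]:
    await asyncio.sleep(0.01)  # simulate browsing time
    return {"url": url, "instruction": instruction, "result": f"data from {url}"}


async def scrape_with_fleet(
    urls: List[str], instruction: str, max_concurrent: int = 5
) -> List[Dict[str, str]]:
    """Launch one agent per URL, with at most max_concurrent active at a time."""
    semaphore = asyncio.Semaphore(max_concurrent)

    async def bounded(url: str) -> Dict[str, str]:
        async with semaphore:  # blocks while max_concurrent agents are running
            return await run_agent(url, instruction)

    # gather returns results in the same order as the input URLs
    return await asyncio.gather(*(bounded(u) for u in urls))


if __name__ == "__main__":
    prospects = [f"https://example.com/company/{i}" for i in range(3)]
    results = asyncio.run(scrape_with_fleet(prospects, "Find the company's headcount"))
    for r in results:
        print(r["url"], "->", r["result"])
```

Bounding concurrency is a design choice, not something the talk prescribes: it keeps a large fleet from exhausting browser resources or hammering a single site all at once.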
Read vs. Write Tasks
The type of task significantly impacts a browser agent’s performance and suitability for a given use case [00:14:40]:
Read Tasks
These tasks typically involve information gathering and collection [00:04:49], similar to web scraping [00:05:01].
- Performance: Read use cases are already quite performant “out of the box” [00:14:52]. Leading web agents achieve around 80% success on read tasks [00:06:07].
- Examples: Creating deep research tools or systems that retrieve information en masse [00:15:00].
- Challenges: Failures often stem from infrastructure or internet issues rather than the agent’s intelligence [00:06:27].
Write Tasks
These tasks involve interacting with and changing the state on a website [00:04:54], such as filling forms or performing actions that modify software state [00:15:11].
- Performance: Overall performance on write tasks is significantly worse, dropping by 50% or more compared to read tasks [00:07:04].
- Challenges:
  - Longer Trajectory: Write tasks require more steps, increasing the likelihood of an agent making a mistake and failing [00:07:35].
  - Complex UI Interaction: They often involve more complicated or difficult parts of the site and user interfaces, requiring data input and extraction beyond simple searching or filtering [00:07:59].
  - Authentication: Write tasks typically involve logging in or authentication, which is challenging for web agents due to interactive complexities and credential management [00:08:27].
  - Anti-Bot Protections: Sites with many write tasks often have stricter anti-bot protections, which can be triggered by performing write actions (e.g., CAPTCHAs appearing before inputting information) [00:08:53].
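The "longer trajectory" point can be made concrete with a simple model: if each step succeeds independently with probability p, an n-step task succeeds with probability p^n, so reliability decays geometrically as trajectories grow. The independence assumption and the 97% per-step figure below are illustrative simplifications, not numbers from the benchmark; in practice step failures are often correlated.

```python
def trajectory_success(step_success: float, num_steps: int) -> float:
    """Chance that every step succeeds, assuming steps succeed independently."""
    if not 0.0 <= step_success <= 1.0:
        raise ValueError("step_success must be between 0 and 1")
    if num_steps < 0:
        raise ValueError("num_steps must be non-negative")
    return step_success ** num_steps


# Illustrative (made-up) per-step reliability of 97%, over longer and longer trajectories.
for steps in (5, 20, 40):
    print(f"{steps:>2} steps: {trajectory_success(0.97, steps):.2f}")
# Output:
#  5 steps: 0.86
# 20 steps: 0.54
# 40 steps: 0.30
```

Even a step that looks highly reliable in isolation loses most of its end-to-end success once a task needs dozens of actions, which is one plausible contributor to the gap between read and write tasks.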
Hybrid Approach
For production-scale use cases, a hybrid approach is often used [00:17:14]. This involves:
- Using browser agents for long-tail, dynamic, or frequently changing workflows [00:17:19].
- Mixing this with more deterministic workflows, like those using Playwright, for steps that must run repeatedly, accurately, and at high volume [00:17:27]. This can be thought of as "laying train tracks" for reliable, accurate steps, while the agent handles the more nuanced "roads and trails" [00:17:37].
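One way to picture the "train tracks plus trails" split is a workflow runner that tries a scripted handler for each step and falls back to an agent when no script exists or the script breaks. This is a minimal sketch of that idea, not the speaker's implementation: `Step`, `run_agent`, and the example handlers are all hypothetical, and a real scripted step might wrap Playwright calls instead of plain Python functions.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Tuple

State = Dict[str, str]


@dataclass
class Step:
    name: str
    # Scripted handler for a stable, high-volume step ("train tracks").
    # Hypothetical: in practice this might wrap Playwright calls.
    scripted: Optional[Callable[[State], State]] = None
    # Instruction handed to the agent ("roads and trails") when there is
    # no script or the script fails.
    agent_instruction: str = ""


def run_agent(instruction: str, state: State) -> State:
    """Hypothetical browser-agent call; a real one would run an observe-plan-act loop."""
    return {**state, "last_agent_task": instruction}


def run_workflow(steps: List[Step], state: State) -> Tuple[State, List[str]]:
    log: List[str] = []
    for step in steps:
        if step.scripted is not None:
            try:
                state = step.scripted(state)
                log.append(f"{step.name}: scripted")
                continue
            except Exception as exc:  # e.g. a selector that no longer matches after a redesign
                log.append(f"{step.name}: scripted failed ({exc}); falling back to agent")
        state = run_agent(step.agent_instruction, state)
        log.append(f"{step.name}: agent")
    return state, log


# Usage with two hypothetical scripted steps, one of which is broken.
def open_portal(state: State) -> State:
    return {**state, "page": "portal"}


def click_old_submit(state: State) -> State:
    raise LookupError("selector #submit-v1 not found")


steps = [
    Step("open_portal", scripted=open_portal),
    Step("submit", scripted=click_old_submit, agent_instruction="Find and press the submit button"),
    Step("summarize", agent_instruction="Summarize the confirmation page"),
]
final_state, log = run_workflow(steps, {})
print("\n".join(log))
```

Routing the deterministic path first keeps the common case fast and predictable, while the agent absorbs the long-tail changes that would otherwise break a purely scripted RPA flow.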
Latency
Regardless of the task type, a significant flaw with current browser agents is their slowness [00:13:13]. This is primarily due to the observe-plan-act loop, where agents take time to observe, plan, reason, break down tasks, and retry actions, leading to long interaction times with sites [00:13:39]. This high latency is a major problem for real-time applications [00:14:03].
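A back-of-the-envelope model shows why the observe-plan-act loop adds up: every step pays for a full cycle, and every retry pays for another one. The sketch below assumes a fixed cost per phase, with timings that are made up for illustration rather than measured; `CycleLatency` and `loop_latency` are hypothetical names.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class CycleLatency:
    observe_s: float  # capturing the page (e.g. screenshot or DOM snapshot)
    plan_s: float     # model call that reasons about and picks the next action
    act_s: float      # executing the action and waiting for the page to settle

    @property
    def total_s(self) -> float:
        return self.observe_s + self.plan_s + self.act_s


def loop_latency(cycle: CycleLatency, retries_per_step: List[int]) -> float:
    """Seconds for a whole task: each step costs one cycle plus one cycle per retry."""
    if any(r < 0 for r in retries_per_step):
        raise ValueError("retry counts must be non-negative")
    return sum((1 + retries) * cycle.total_s for retries in retries_per_step)


# Made-up timings for illustration: 0.5 s observe, 3 s plan, 1 s act.
cycle = CycleLatency(observe_s=0.5, plan_s=3.0, act_s=1.0)
# Ten steps, two of which needed retries (1 and 2 respectively).
retries = [0] * 8 + [1, 2]
print(f"{loop_latency(cycle, retries):.1f} s")  # → 58.5 s
```

Under these assumptions, a ten-step task takes about a minute, which is far outside what a real-time application can tolerate.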
Notable Real-World Examples
During benchmarking, interesting and sometimes concerning emergent behaviors were observed:
- AI Agent Inception: An agent stuck on GitHub conversed with GitHub’s virtual assistant AI to unblock itself [00:19:26].
- Turing Test Nod: An agent posted a comment on a Medium article that became the top-liked post [00:19:45].
- Unintended Reservations: Agents booked restaurant reservations, leading to real-world phone notifications, which required manual cancellation [00:20:04].
- Bypassing Protections: An agent blocked by Cloudflare searched on Google for ways to bypass Cloudflare verification [00:20:23]. Emergent behavior like this would not have been anticipated without robust testing [00:20:42].