From: aidotengineer

Browser agents are AI systems capable of controlling a web browser and executing tasks on behalf of a user [00:00:48]. Their feasibility has significantly increased in the last year, largely due to advancements in large language models and supporting infrastructure [00:01:08].

Common Applications

Browser agents have begun to penetrate several major use cases [00:02:33]. The most common applications identified include:

  • Web Scraping
  • Software QA
  • Form Filling / Job Application Filling
  • Generative Robotic Process Automation (RPA)
    • A broad category in which companies are exploring the use of browser agents to automate traditional RPA workflows that often break [00:03:08]. Companies like UiPath are examples in this area [00:03:20].

Read vs. Write Tasks

The type of task significantly impacts a browser agent’s performance and suitability for a given use case [00:14:40]:

Read Tasks

These tasks typically involve information gathering [00:04:49] and are similar in nature to web scraping [00:05:01].

  • Performance: Read use cases are already quite performant “out of the box” [00:14:52]. Leading web agents achieve around 80% success on read tasks [00:06:07].
  • Examples: Creating deep research tools or systems that retrieve information en masse [00:15:00].
  • Challenges: Failures often stem from infrastructure or internet issues rather than the agent’s intelligence [00:06:27].

Write Tasks

These tasks involve interacting with a website and changing its state [00:04:54], such as filling out forms or performing actions that modify the state of the underlying software [00:15:11].

  • Performance: Overall performance on write tasks is significantly worse, dropping by 50% or more compared to read tasks [00:07:04].
  • Challenges:
    • Longer Trajectory: Write tasks require more steps, increasing the likelihood of an agent making a mistake and failing [00:07:35].
    • Complex UI Interaction: They often touch the more complicated or difficult parts of a site's user interface, requiring data to be entered and extracted rather than simply searched or filtered [00:07:59].
    • Authentication: Write tasks typically involve logging in or authentication, which is challenging for web agents due to interactive complexities and credential management [00:08:27].
    • Anti-Bot Protections: Sites with many write tasks often have stricter anti-bot protections, which can be triggered by performing write actions (e.g., CAPTCHAs appearing before inputting information) [00:08:53].

Hybrid Approach

For production-scale use cases, a hybrid approach is often used [00:17:14]. This involves:

  • Using browser agents for long-tail, dynamic, or frequently changing workflows [00:17:19].
  • Mixing this with more deterministic workflows, such as Playwright scripts, for high-volume steps that must run with consistent accuracy [00:17:27]. This can be thought of as “laying train tracks” for the reliable, accurate steps, while the agent handles the more nuanced “roads and trails” [00:17:37]; a sketch of this split follows below.
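A minimal sketch of what this split can look like, assuming Playwright’s Python sync API for the deterministic steps and a hypothetical run_browser_agent() helper standing in for whatever agent framework handles the long-tail portion (the URL and selectors are placeholders):

```python
from playwright.sync_api import sync_playwright


def run_browser_agent(page, task: str) -> str:
    """Hypothetical stand-in for an LLM-driven browser agent
    (observe-plan-act loop); swap in your agent framework of choice."""
    raise NotImplementedError


def process_order(order_id: str) -> None:
    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()

        # "Train tracks": deterministic, high-volume steps scripted directly
        # with Playwright so they run quickly and reliably.
        page.goto("https://example.com/orders")   # placeholder URL
        page.fill("#order-search", order_id)      # placeholder selector
        page.click("button[type=submit]")

        # "Roads and trails": hand the long-tail, frequently changing part
        # of the workflow to the browser agent.
        result = run_browser_agent(
            page,
            task=f"Open the detail view for order {order_id} and record its refund status",
        )
        print(result)

        browser.close()
```

The scripted portion stays fast and predictable, while the agent absorbs variation in the long-tail pages; where to draw that line is a cost and reliability trade-off.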

Latency

Regardless of the task type, a significant flaw with current browser agents is their slowness [00:13:13]. This is primarily due to the observe-plan-act loop, where agents take time to observe, plan, reason, break down tasks, and retry actions, leading to long interaction times with sites [00:13:39]. This high latency is a major problem for real-time applications [00:14:03].
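To make the source of that latency concrete, here is a minimal sketch of an observe-plan-act loop, with hypothetical observe(), plan_next_action(), and act() helpers (not drawn from any specific framework); every pass through the loop costs at least one model round trip plus page wait time:

```python
import time

MAX_STEPS = 25  # cap on planning/retry iterations before giving up


def observe(page) -> str:
    """Snapshot what the agent can 'see' (DOM text, screenshot, accessibility tree)."""
    raise NotImplementedError


def plan_next_action(task: str, observation: str) -> dict:
    """One LLM call: reason over the observation, break the task down, pick an action."""
    raise NotImplementedError


def act(page, action: dict) -> bool:
    """Execute the chosen action (click, type, navigate); return True when the task is done."""
    raise NotImplementedError


def run_agent(page, task: str) -> None:
    for step in range(MAX_STEPS):
        started = time.monotonic()
        observation = observe(page)
        action = plan_next_action(task, observation)
        done = act(page, action)
        print(f"step {step}: {time.monotonic() - started:.1f}s")
        if done:
            return
```

Because each iteration blocks on a model call, long trajectories accumulate latency quickly, which is exactly what makes real-time use cases hard.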

Notable Real-World Examples

During benchmarking, interesting and sometimes concerning emergent behaviors were observed:

  • AI Agent Inception: An agent stuck on GitHub conversed with GitHub’s virtual assistant AI to unblock itself [00:19:26].
  • Turing Test Nod: An agent posted a comment on a Medium article that became the top-liked post [00:19:45].
  • Unintended Reservations: Agents booked restaurant reservations, leading to real-world phone notifications, which required manual cancellation [00:20:04].
  • Bypassing Protections: An agent blocked by Cloudflare searched Google for ways to get around Cloudflare verification [00:20:23]. This was emergent behavior that would not have been anticipated without robust testing [00:20:42].