From: aidotengineer

AI agents interact with web pages in two main ways: vision-based methods and text-based methods [12:02:00].

Vision-Based Approach

Many existing AI agents, including those from OpenAI (Operator), Anthropic (Claude), and Google (Mariner), use a vision-based or hybrid approach [12:08:00], [12:15:00], [12:19:00], [12:21:00]. This method typically takes screenshots of web pages and extracts data from the images [12:26:00], [12:27:00].

Disadvantages of Vision-Based AI

  • Higher Hallucination Risk: Vision-based models are more prone to hallucination than text-based approaches [12:33:00], [12:35:00].
  • High Cost: This approach is expensive because even a single action or page scroll requires multiple screenshots [12:42:00], [12:44:00].
  • Limited Parallel Processing: Vision-based methods cannot effectively act on multiple tabs in parallel. Background tabs do not render, so they cannot be captured in screenshots [14:53:00], [14:55:00], [14:58:00].
  • Cloud-Based Browser Issues: Many companies using vision-based agents run browsers in the cloud. Results are then not personalized, because the content displayed may differ from what the user sees in their own browser [13:01:00], [13:04:00], [13:07:00], [13:08:00]. Supporting cloud-based browsers also requires numerous proxies, which is costly [13:15:00], [13:17:00]. Furthermore, these cloud solutions may struggle to access content behind paywalls or Cloudflare website protections, or may require users to store passwords, which poses security risks [14:06:00], [14:09:00].

Text-Based Approach (Retriever.com’s method)

Retriever.com employs a text-based approach that leverages the web page’s Document Object Model (DOM) structure [12:37:00], [12:39:00].
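To make the idea concrete, here is a minimal sketch of how a page's markup could be reduced to a compact text outline that a language model reads in place of a screenshot. It is an illustration only, not Retriever.com's implementation: the `DomOutline` class, the `outline` function, and the choice of which tags count as interactive are all assumptions, and it parses static HTML with Python's standard library rather than reading a live DOM inside a browser.

```python
from html.parser import HTMLParser

# Hypothetical choices for this sketch: which tags an agent could act on,
# and which tags carry no user-visible text.
INTERACTIVE_TAGS = {"a", "button", "input", "select", "textarea"}
SKIPPED_TAGS = {"script", "style"}


class DomOutline(HTMLParser):
    """Collects visible text and numbered interactive elements."""

    def __init__(self):
        super().__init__()
        self.lines = []
        self._skip_depth = 0  # > 0 while inside <script> or <style>
        self._next_id = 0     # index an agent could use to refer to an element

    def handle_starttag(self, tag, attrs):
        if tag in SKIPPED_TAGS:
            self._skip_depth += 1
            return
        if tag in INTERACTIVE_TAGS:
            attr_map = dict(attrs)
            # Use the first available attribute as a short label.
            label = (attr_map.get("href") or attr_map.get("name")
                     or attr_map.get("type") or "")
            self.lines.append(f"[{self._next_id}] <{tag}> {label}".rstrip())
            self._next_id += 1

    def handle_endtag(self, tag):
        if tag in SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = " ".join(data.split())  # collapse whitespace
        if text and self._skip_depth == 0:
            self.lines.append(text)


def outline(html: str) -> str:
    """Return a plain-text outline of the given HTML string."""
    parser = DomOutline()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.lines)


if __name__ == "__main__":
    page = '<h1>Shop</h1><a href="/cart">Cart</a><input name="q">'
    print(outline(page))
```

The numbered entries give a model a way to name the element it wants to click or fill in, and the whole outline is ordinary text, so no image needs to be captured or sent.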

Advantages of Text-Based AI

For more information, see also: Comparison of Local and Cloud-based AI Agents.