From: aidotengineer

AI agents that extract data from web pages use one of two primary approaches: vision-based or text-based. The two differ significantly in underlying technology, performance, cost-effectiveness, and security implications [00:12:02].

Vision-Based Approach

The vision-based approach is commonly used by many existing AI agents, including OpenAI Operator, Anthropic Claude, and Google Mariner [00:12:08].

How it Works

This method involves taking screenshots of web pages and then extracting the desired data from these visual representations [00:12:26].

Disadvantages and Problems

  • Prone to Hallucination: Vision-based models are more susceptible to hallucination than text-based methods [00:12:33].
  • High Cost: The approach is expensive, because even a single action or page scroll often requires multiple screenshots [00:12:42].
  • Cloud-Based Browsers: Many companies using this approach operate browsers on the cloud [00:12:51]. This introduces several issues:
    • Non-Personalized Results: Content seen on a cloud-based browser might differ from what a user sees in their own browser [00:13:04].
    • Expensive Proxies: Supporting cloud browsers necessitates implementing numerous proxies to funnel network requests, which significantly increases costs [00:13:15].
    • Security Risks: Users often have to store or provide their passwords, leading to various security vulnerabilities [00:14:02].
    • Access Limitations: Cloud-based browsers may struggle to bypass paywalls or Cloudflare website protections [00:14:06].
  • Limited Parallel Processing: Vision-based approaches cannot take actions on multiple background tabs simultaneously because these tabs don’t get rendered [00:14:50].
  • Exponential Failure Rates: In long-horizon tasks that run many steps on a single tab, errors compound from step to step, so failure rates rise as the task gets longer [00:16:09].
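
The compounding effect behind the last point can be illustrated with a little arithmetic. This is a minimal sketch, not a figure from the talk: it assumes each step succeeds independently with the same probability, and the 95% per-step value and the function name task_success_probability are illustrative assumptions.

```python
def task_success_probability(step_success: float, steps: int) -> float:
    """Probability that all `steps` steps succeed.

    Assumes every step succeeds independently with the same probability,
    which is a simplification made for illustration only.
    """
    if not 0.0 <= step_success <= 1.0:
        raise ValueError("step_success must be between 0 and 1")
    if steps < 0:
        raise ValueError("steps must be non-negative")
    return step_success ** steps


# Illustrative per-step success rate (an assumption, not a measured value).
PER_STEP_SUCCESS = 0.95

for steps in (1, 5, 10, 20, 50):
    p = task_success_probability(PER_STEP_SUCCESS, steps)
    print(f"{steps:>2} steps: {p:.1%} success, {1 - p:.1%} failure")
```

Under these assumptions a 10-step task completes only about 60% of the time, even though each individual step is 95% reliable. Real agents will not behave this cleanly, but the sketch shows why long single-tab chains are fragile.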

Text-Based Approach (Retriever’s Method)

Retriever.com utilizes a text-based approach for its AI web agent, delivered as a Chrome extension [00:01:20].

How it Works

This method leverages the text-based structure and content of web pages directly [00:12:37].
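
The talk does not describe how Retriever extracts text, so the following is only a rough sketch of the general idea: pull the visible text out of a page's HTML and hand that text to a model, rather than a screenshot. The class VisibleTextExtractor, the function extract_text, and the sample page are hypothetical; the sketch uses only Python's standard html.parser module.

```python
from html.parser import HTMLParser


class VisibleTextExtractor(HTMLParser):
    """Collects text nodes, skipping elements whose content is never displayed."""

    SKIPPED_TAGS = {"script", "style", "noscript", "template"}

    def __init__(self) -> None:
        super().__init__()
        self._skip_depth = 0  # > 0 while inside a skipped element
        self.chunks: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self._skip_depth == 0:
            self.chunks.append(text)


def extract_text(html: str) -> str:
    """Return the page's visible text, one text node per line."""
    parser = VisibleTextExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.chunks)


# Hypothetical page used only to exercise the sketch.
sample = (
    "<html><head><style>p{color:red}</style></head>"
    "<body><h1>Pricing</h1><p>Pro plan: $20/month</p>"
    "<script>track()</script></body></html>"
)
print(extract_text(sample))
```

The resulting text is far smaller than an image of the same page and places the exact characters in the model's context, which is the property the text-based approach relies on.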

Advantages and Benefits

  • Reduced Hallucination: There is significantly less hallucination in the output because the text is directly in context for the model [00:14:31].
  • Cost-Effective: Retriever’s approach is highly cost-effective, with page extraction potentially costing less than a penny [00:03:03].
  • Multi-Tab Processing: It can process not only active tabs but also background tabs or multiple tabs simultaneously [00:13:38]. This parallel processing capability speeds up performance and allows for multi-tab contextual actions [00:15:00].
  • Client-Side Extension: Being a client-side Chrome extension, it avoids the issues associated with cloud-hosted browsers [00:13:32].
  • Enhanced Security: Retriever does not store or require users to share passwords, as it operates directly within the user’s logged-in browser session [00:13:50]. This makes it much more secure [00:14:47].
  • Access to Restricted Content: It can access login-wall sites and other login-protected content, and can get past paywalls or Cloudflare protections [00:13:46].
  • Distributed Subtasks: Instead of long-horizon tasks on a single tab, Retriever distributes subtasks as new tabs, significantly reducing failure rates [00:16:01].
  • Extensible Function Calling: Users can define and set up their own function calls for third-party integrations, making the system more extensible and scalable than fixed custom integrations [00:16:13].
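
To make the last point concrete, here is one way user-defined function calls could be wired up. This is a sketch of a common registry-and-dispatch pattern, not Retriever's actual interface, which the talk does not show. The names register, dispatch, and save_lead, and the JSON call shape, are all assumptions.

```python
import json
from typing import Any, Callable, Dict

# Maps a function name the model may emit to the user's Python callable.
_REGISTRY: Dict[str, Callable[..., Any]] = {}


def register(name: str) -> Callable[[Callable[..., Any]], Callable[..., Any]]:
    """Decorator that exposes a user-defined function under `name`."""
    def decorator(fn: Callable[..., Any]) -> Callable[..., Any]:
        _REGISTRY[name] = fn
        return fn
    return decorator


@register("save_lead")
def save_lead(name: str, email: str) -> dict:
    # Hypothetical integration: a real version would call a third-party API.
    return {"status": "saved", "name": name, "email": email}


def dispatch(call_json: str) -> Any:
    """Run the registered function named in a JSON-encoded call.

    Assumed call shape: {"name": "<function>", "arguments": {...}}.
    """
    call = json.loads(call_json)
    fn = _REGISTRY.get(call["name"])
    if fn is None:
        raise KeyError(f"unknown function: {call['name']}")
    return fn(**call.get("arguments", {}))


result = dispatch(
    '{"name": "save_lead", "arguments": {"name": "Ada", "email": "ada@example.com"}}'
)
print(result)
```

With this pattern, adding a new integration means registering one more function rather than changing the agent itself, which is the kind of extensibility the talk contrasts with fixed custom integrations.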

Conclusion

Compared with vision-based methods, the text-based approach offers significant advantages in accuracy, cost, security, and parallel processing for data extraction by AI agents [00:13:33]. Retriever aims to revolutionize data extraction with its transparent and efficient AI agent [00:16:34].