From: aidotengineer

Google’s Deep Research feature within Gemini Advanced aims to address the challenge of providing comprehensive answers to complex research and learning queries that traditional chatbots struggle with [00:01:08]. Unlike general chatbots that might offer a “blueprint” for an answer, Deep Research is designed to deliver a full, detailed report by extensively browsing the web [00:01:25]. This requires the system to operate with fewer constraints on compute and latency, taking up to five minutes to generate a thorough response [00:01:57].

Product Challenges and Solutions

Building Deep Research presented several product challenges due to Gemini’s inherently synchronous chatbot nature [00:02:17].

Challenges

  • Asynchronous Experience: Integrating a long-running, asynchronous feature into a synchronous chatbot product [00:02:28].
  • Setting User Expectations: Differentiating Deep Research for complex queries from quick requests like weather updates or jokes, where a five-minute wait is inappropriate [00:02:34].
  • Engaging with Long Outputs: Designing an interface that allows users to easily interact with reports that can be thousands of words long [00:02:47].

UX Solutions

To address these challenges, the following user experience (UX) elements were implemented:

  • Research Plan Card: When a query is submitted, Gemini first generates and presents a research plan in a card [00:03:22]. This immediately signals that this is not a standard chatbot interaction and allows users to review and even edit the plan, similar to how an analyst would present their approach [00:03:36].
  • Real-time Browsing Display: As the research progresses, the system shows the websites Gemini is browsing in real time [00:03:56]. This provides transparency and allows users to click through to the sources while they wait. An unexpected side effect was users attempting to “game” the system by pushing the number of browsed websites into the thousands [00:04:18].
  • Pinned Report (Artifact): Inspired by Anthropic’s artifacts, the final report is pinned, enabling users to ask follow-up questions about the research directly within the chat interface without needing to scroll back and forth [00:04:37]. This facilitates changing the report’s style, adding or removing sections, and continuing the conversation [00:04:52].
  • Source Citation: To build user trust and support publishers, Deep Research always displays all sources read and specifically those used in the report [00:05:03]. Even sources read but not directly used are kept in context for potential follow-up questions and are carried over as citations if the report is exported to Google Docs [00:05:13].

Technical Challenges in Building a Research Agent

Developing a web research agent like Deep Research involves several significant technical challenges:

Long-Running Nature of Tasks

Research tasks can run for multiple minutes and involve many LLM calls and service interactions [00:06:17], which makes intermittent failures effectively inevitable [00:06:23].

  • Robustness to Failures: It’s crucial to build a robust system that can handle intermediate failures of various services with differing reliabilities [00:06:36].
  • State Management and Error Recovery: Effective state management and error recovery are essential so that a single component error does not bring down the entire research task [00:06:44] (see the sketch after this list).
  • Cross-Platform Enablement: The ability to register research tasks, walk away, and receive notifications across different devices allows users to pick up where they left off [00:06:55].
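
The talk does not share implementation details, but the general pattern it points at (checkpointing intermediate state and retrying individual steps so that one flaky LLM call or service outage cannot sink a multi-minute task) can be sketched roughly as follows. Everything here, including the checkpoint file, step names, and retry counts, is a hypothetical illustration rather than Gemini’s actual architecture:

```python
import json
import time
from pathlib import Path

CHECKPOINT = Path("research_task_state.json")  # hypothetical persistence location
MAX_RETRIES = 3

def load_state() -> dict:
    """Resume from the last saved checkpoint, or start fresh."""
    if CHECKPOINT.exists():
        return json.loads(CHECKPOINT.read_text())
    return {"completed_steps": [], "notes": []}

def save_state(state: dict) -> None:
    CHECKPOINT.write_text(json.dumps(state))

def run_step_with_retries(step, state):
    """Retry a single flaky step (an LLM call, a web fetch) with backoff."""
    for attempt in range(1, MAX_RETRIES + 1):
        try:
            return step(state)
        except Exception:
            if attempt == MAX_RETRIES:
                raise
            time.sleep(2 ** attempt)  # exponential backoff between retries

def run_research_task(steps):
    """Run steps in order, skipping ones already completed in a previous run."""
    state = load_state()
    for name, step in steps:
        if name in state["completed_steps"]:
            continue  # already done before a crash or restart
        result = run_step_with_retries(step, state)
        state["notes"].append({"step": name, "result": result})
        state["completed_steps"].append(name)
        save_state(state)  # checkpoint after every step
    return state

if __name__ == "__main__":
    steps = [
        ("plan", lambda state: "research plan"),
        ("browse", lambda state: "browsed findings"),
        ("report", lambda state: "final report"),
    ]
    print(run_research_task(steps)["completed_steps"])
```

Because the state is persisted outside the chat session, a record like this could also back the cross-device notifications mentioned above.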

Iterative Planning and Compute Management

The agent’s model must plan iteratively and efficiently manage its time and computational resources [00:05:49].

  • Parallel vs. Sequential Problem Solving: The model needs to determine which sub-problems within a query can be tackled in parallel and which require sequential processing [00:07:40] (see the planning sketch after this list).
  • Handling Partial Information: The agent frequently works from partial information. It must assess everything found so far before deciding on its next steps, grounding its plans in the data it has already discovered [00:07:58]. For instance, if it finds Division 1 (D1) scholarship standards, it needs to recognize that the D2 and D3 equivalents still have to be found [00:08:06]. Similarly, if a search yields a “top 10” list that never mentions suitability for kids, the planner must recognize this ambiguity and plan to resolve it [00:08:32].
  • Information Aggregation: Information is often spread across multiple sources [00:09:10]. The model must weave together facets of information from different websites to form a complete picture, such as combining certification steps from one source with pricing from another [00:09:39].
  • Entity Resolution: The classic problem of identifying if mentions across different sources refer to the same entity is critical, requiring the model to reason about information indicators or explore further to verify [00:09:50].
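
One way to picture the parallel-versus-sequential planning described above is to model the plan as sub-tasks with explicit dependencies: independent look-ups (e.g., the D1, D2, and D3 standards) run concurrently, while a comparison step waits until all of them have finished. The task names, queries, and dependency structure below are illustrative assumptions, not the actual planner:

```python
import asyncio

# Hypothetical plan: each sub-task has a query and the sub-tasks it depends on.
SUBTASKS = {
    "d1_standards": {"query": "D1 scholarship standards", "depends_on": []},
    "d2_standards": {"query": "D2 scholarship standards", "depends_on": []},
    "d3_standards": {"query": "D3 scholarship standards", "depends_on": []},
    "compare": {
        "query": "compare standards across divisions",
        "depends_on": ["d1_standards", "d2_standards", "d3_standards"],
    },
}

async def run_subtask(name, query, findings):
    # Placeholder for a real search + browse + summarize step.
    await asyncio.sleep(0)
    findings[name] = f"findings for: {query}"

async def execute_plan(subtasks):
    findings, done = {}, set()
    while len(done) < len(subtasks):
        # Sub-tasks whose dependencies are all satisfied can run in parallel.
        ready = [
            name for name, task in subtasks.items()
            if name not in done and all(dep in done for dep in task["depends_on"])
        ]
        await asyncio.gather(
            *(run_subtask(name, subtasks[name]["query"], findings) for name in ready)
        )
        done.update(ready)
    return findings

print(asyncio.run(execute_plan(SUBTASKS)))
```

In a real agent, the dependency graph would itself be produced and revised by the model as new information arrives, rather than fixed up front.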

Interacting with a Noisy Web Environment

The web is a fragmented and noisy environment [00:10:14].

  • Robust Browsing Mechanism: Websites have varied layouts and structures. A robust browsing mechanism is necessary for effective navigation and information extraction, regardless of the website’s design [00:10:34].
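
The talk does not describe the browsing stack, but a common way to stay robust across varied page layouts is layered extraction with fallbacks: try semantic containers first, then strip obvious boilerplate and take whatever body text remains. The sketch below assumes the widely used requests and BeautifulSoup libraries; the length threshold and header value are arbitrary illustrations:

```python
import requests
from bs4 import BeautifulSoup

def extract_page_text(url: str) -> str:
    """Fetch a page and fall back through progressively cruder extraction strategies."""
    resp = requests.get(url, timeout=10, headers={"User-Agent": "research-agent-sketch"})
    resp.raise_for_status()
    soup = BeautifulSoup(resp.text, "html.parser")

    # Strategy 1: prefer semantic containers when the site provides them.
    main = soup.find("main") or soup.find("article")
    if main and len(main.get_text(strip=True)) > 200:
        return main.get_text(" ", strip=True)

    # Strategy 2: drop obvious boilerplate tags and take the remaining body text.
    for tag in soup(["script", "style", "nav", "footer", "header"]):
        tag.decompose()
    return soup.get_text(" ", strip=True)

if __name__ == "__main__":
    print(extract_page_text("https://example.com")[:300])
```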

Context Management

As the agent processes information, especially in long-running tasks, the context size can grow rapidly [00:10:46].

  • Handling Follow-up Queries: Research tasks often involve follow-up questions or new topics, adding further pressure on context management [00:11:02].
  • Long Context Models: While Gemini has access to long-context models, effective management strategies are still needed [00:11:20]. One approach applies a recency bias: the current and most recent tasks keep their full detail, while older tasks are compressed into “research notes” that can be pulled back in through a retrieval-augmented generation (RAG) system [00:11:37] (a minimal sketch follows below).
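
A rough sketch of that recency-biased context assembly might look like the following. MAX_DETAILED_TASKS, the task dictionary fields, and the keyword-overlap retriever standing in for a real RAG system are all assumptions made for illustration:

```python
MAX_DETAILED_TASKS = 2  # hypothetical: keep full detail only for the most recent tasks

def naive_retrieve(query, notes):
    """Stand-in for RAG: keep notes that share at least one keyword with the query."""
    query_terms = set(query.lower().split())
    return [note for note in notes if query_terms & set(note.lower().split())]

def build_context(task_history, current_query, retrieve_notes=naive_retrieve):
    """Assemble the model context with a recency bias.

    Recent tasks keep their full browsed content; older tasks contribute only
    compact research notes, and only when the retriever deems them relevant.
    """
    recent = task_history[-MAX_DETAILED_TASKS:]
    older = task_history[:-MAX_DETAILED_TASKS]

    context_parts = [task["full_content"] for task in recent]
    context_parts.extend(retrieve_notes(current_query, [task["notes"] for task in older]))
    context_parts.append(current_query)
    return "\n\n".join(context_parts)

history = [
    {"full_content": "Full browsed content on D1 standards...", "notes": "D1 standards summary"},
    {"full_content": "Full browsed content on D2 standards...", "notes": "D2 standards summary"},
    {"full_content": "Full browsed content on camps...", "notes": "camps summary"},
]
print(build_context(history, "compare D1 standards with camps"))
```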

Future Directions for Research Agents

The success of Deep Research suggests significant potential for future development in research agents [00:12:13]. Key directions include:

  • Expertise: Moving beyond aggregation and synthesis to provide deeper insights, implications, and novel patterns, akin to a McKinsey partner or a domain expert in scientific fields who can form hypotheses [00:12:42].
  • Personalization: Tailoring information presentation and the research process itself (browsing, framing answers, questions pursued) to the specific user and their needs [00:13:22]. For example, a due diligence report for a general user would differ from one for a Goldman Sachs banker [00:13:30].
  • Multimodality: Combining web research with other capabilities such as coding, data science, or video generation to enrich research outputs, like performing statistical analysis or building financial models to inform a business due diligence report [00:14:11].