From: aidotengineer
Deep Research, a feature available on Gemini Advanced, is designed to address the challenges of answering complex research and learning queries that traditional chatbots often struggle with [01:13:15]. It functions as a personal research agent capable of browsing the web to build comprehensive reports on behalf of the user [00:51:38].
Motivation for Deep Research
The primary motivation behind building Deep Research was to help users “get smart fast” [01:08:16]. While research queries are a top use case for Gemini, bringing hard questions to general chatbots often results in a “blueprint for an answer” rather than a comprehensive one [01:20:07]. For instance, a query about athletic scholarships for shot put might yield generic advice like “talk to coaches” or “have good grades,” instead of specific details such as grade boundaries or throwing distances [01:28:16].
The goal was to remove constraints on compute and latency, allowing Gemini to take as long as needed (up to 5 minutes) to browse the web and provide a much more comprehensive answer [01:57:02].
Product Challenges and Solutions
Building Deep Research within an inherently synchronous chatbot product like Gemini presented several product challenges [02:19:35]:
- Asynchronous Experience: Integrating a long-running research task into a real-time chat interface [02:28:01].
- Setting User Expectations: Differentiating Deep Research from quick queries like “what’s the weather” where a 5-minute wait is not appropriate [02:34:04].
- Engaging with Long Outputs: Making it easy for users to interact with reports that can be thousands of words long [02:47:01].
To overcome these, the following user experience (UX) and user interface (UI) solutions were implemented:
- Research Plan Card: Upon receiving a complex query, Gemini first presents a research plan in a card format [03:22:04]. This communicates that the experience is different from a standard chatbot and allows users to edit and steer the direction of the research [03:37:11].
- Real-Time Browsing Transparency: While Deep Research is working, it shows the websites it is browsing in real-time, providing transparency into the model’s actions [03:56:06]. Users can click through these websites while waiting [04:12:02].
- Pinned Reports (Artifacts): For long outputs, the generated report can be pinned like an artifact, allowing users to ask questions about the research while reading the material without scrolling back and forth [04:41:00]. This also facilitates changing the report’s style, adding/removing sections, or asking follow-up questions [04:53:23].
- Source Citation: Deep Research always displays all sources read and used in the report, building user trust and acknowledging publishers. These sources can be exported as citations to Google Docs [05:03:09].
Technical Challenges in Building Research Agents
Building a web research agent involves significant technical challenges [00:01:03]:
1. Long-Running Nature of Tasks
Research tasks can run for multiple minutes or even hours, making many LLM calls and service interactions [06:17:01].
- Robustness to Failures: It’s crucial to be robust to intermediate failures of various services, ensuring the entire research task isn’t dropped due to a single failure [06:36:08]. This requires effective state management and error recovery [06:44:11].
- Cross-Platform Enablement: Because tasks are long-running, users can start one, receive a notification on any of their devices when it finishes, and pick up reading the results later [06:58:08].
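The talk does not describe how state management and error recovery are implemented. As a minimal sketch of the general idea, assuming a research task can be split into named steps, the hypothetical `run_with_checkpoints` helper below retries a failing step with backoff and saves each completed step to a JSON file, so a restart resumes instead of repeating finished work. The function name, the file-based checkpoint, and the retry parameters are all illustrative assumptions, not Gemini's design.

```python
import json
import time
from pathlib import Path
from typing import Callable, Dict, List, Tuple

# A step is a (name, function) pair; the function returns a JSON-serializable result.
Step = Tuple[str, Callable[[], str]]


def run_with_checkpoints(
    steps: List[Step],
    checkpoint: Path,
    max_retries: int = 3,
    backoff_s: float = 0.5,
) -> Dict[str, str]:
    """Run steps in order, retrying failures and persisting finished results."""
    # Load results of steps that finished in an earlier run, if any.
    state: Dict[str, str] = (
        json.loads(checkpoint.read_text()) if checkpoint.exists() else {}
    )
    for name, fn in steps:
        if name in state:
            continue  # Already done in a previous run; do not repeat it.
        for attempt in range(1, max_retries + 1):
            try:
                state[name] = fn()
                break
            except Exception:
                if attempt == max_retries:
                    raise  # Earlier steps are already saved in the checkpoint.
                # Exponential backoff before retrying the same step.
                time.sleep(backoff_s * 2 ** (attempt - 1))
        # Persist after every successful step so a crash loses at most one step.
        checkpoint.write_text(json.dumps(state))
    return state
```

With this shape, one flaky service call costs a retry rather than the whole task, and a process restart picks up from the last saved step.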
2. Iterative Planning and Compute Effectiveness
The model must plan iteratively and effectively manage its time and compute resources [05:49:09].
- Parallel vs. Sequential Problem Solving: For multi-faceted queries (e.g., athletic scholarships), the model must determine which sub-problems can be tackled in parallel versus those that are inherently sequential [07:40:02].
- Handling Partial Information: The model frequently lands in states with partial information. It must assess all information found so far before deciding the next step [07:58:24]. For example, if it finds D1 division standards for shot put, it then needs to plan to find D2 and D3 equivalents [08:08:12].
- Information Disambiguation: When search results provide partial or ambiguous information (e.g., “top 10 roller coasters” without mentioning kid-friendliness), the planner must recognize this and plan further steps to resolve the ambiguity [08:44:10].
- Weaving Dispersed Information: Information for a single answer is often spread across different sources. The model must weave these facets together, like combining scuba diving certification structure from one source with pricing from another [09:10:04].
- Entity Resolution: Determining whether mentions found in different sources refer to the same entity, which requires reasoning about identifying details in each source or exploring further [09:50:09].
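One way to picture the parallel-versus-sequential decision is as a dependency graph over sub-problems: anything whose prerequisites are already done can run at the same time. The sketch below is an illustration only, not the planner described in the talk. It uses Python's standard `graphlib.TopologicalSorter`, and the `plan_waves` function and the sub-task names are hypothetical.

```python
from graphlib import TopologicalSorter
from typing import Dict, List, Set


def plan_waves(deps: Dict[str, Set[str]]) -> List[List[str]]:
    """Group sub-tasks into waves; tasks in one wave have no unmet dependencies.

    `deps` maps each sub-task to the set of sub-tasks it must wait for.
    """
    sorter = TopologicalSorter(deps)
    sorter.prepare()  # Raises graphlib.CycleError if the plan is circular.
    waves: List[List[str]] = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # Everything runnable right now.
        waves.append(ready)
        sorter.done(*ready)  # Mark the wave finished to unlock dependents.
    return waves


# Hypothetical plan for the shot put scholarship query.
plan = {
    "find_d1_standards": set(),
    "find_d2_standards": set(),
    "find_d3_standards": set(),
    "compare_divisions": {
        "find_d1_standards",
        "find_d2_standards",
        "find_d3_standards",
    },
    "write_report": {"compare_divisions"},
}
waves = plan_waves(plan)
```

Here the three lookups land in the first wave and can run in parallel, while the comparison and the report each wait for the wave before them. A real planner would also revise this graph as partial or ambiguous results come in, which a static plan like this does not capture.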
3. Interacting with a Noisy Web Environment
The web is fragmented and inconsistent [10:14:14].
- Robust Browsing: A robust browsing mechanism is essential to navigate varied website layouts and extract information effectively for research tasks [10:34:04].
4. Effective Context Management
As the model performs research and receives streams of information, its context size grows very quickly [10:49:05].
- Follow-up Queries: Research tasks typically involve follow-up questions, adding further pressure on the context [11:04:14].
- Selective Information Retention: While models like Gemini have long contexts, effective management strategies are crucial. One approach involves a recency bias, retaining more information for current and previous tasks, while selectively picking “research notes” from older tasks and putting them into a Retrieval Augmented Generation (RAG) system so the model can still access them [11:37:07].
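The recency-bias approach can be sketched as follows, under stated assumptions: the most recent tasks stay in context in full, while older tasks are reduced to their research notes and moved to an archive that is searched on demand. The class names, the `keep_recent` parameter, and the word-overlap scoring are hypothetical. The overlap scoring is a simple stand-in for the retrieval step of a real RAG system, which would typically use embeddings.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class TaskRecord:
    query: str
    full_text: str      # Everything gathered for this task (large).
    notes: List[str]    # Short research notes distilled from it (small).


@dataclass
class ResearchContext:
    keep_recent: int = 2
    recent: List[TaskRecord] = field(default_factory=list)
    archive: List[str] = field(default_factory=list)

    def add_task(self, record: TaskRecord) -> None:
        """Add a finished task; demote the oldest tasks to notes only."""
        self.recent.append(record)
        while len(self.recent) > self.keep_recent:
            old = self.recent.pop(0)
            self.archive.extend(old.notes)  # Full text is dropped here.

    def retrieve(self, question: str, top_k: int = 2) -> List[str]:
        """Return archived notes sharing the most words with the question."""
        words = set(question.lower().split())
        scored = [(len(words & set(n.lower().split())), n) for n in self.archive]
        scored = [s for s in scored if s[0] > 0]
        scored.sort(key=lambda s: s[0], reverse=True)
        return [note for _, note in scored[:top_k]]

    def build_prompt(self, question: str) -> str:
        """Combine retrieved notes, recent full text, and the new question."""
        parts = [f"[note] {n}" for n in self.retrieve(question)]
        parts += [r.full_text for r in self.recent]
        parts.append(question)
        return "\n".join(parts)
```

The design choice is that context cost grows with the number of recent tasks rather than with the whole session, while older findings remain reachable through retrieval when a follow-up question touches them.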
Future Directions for Research Agents
Future developments for research agents, including Gemini’s Deep Research, are envisioned in several key areas [13:36:06]:
- Expertise: Moving beyond aggregation and synthesis to providing deeper insights, implications, and novel hypotheses, akin to a “McKinsey partner” or “Goldman Sachs partner” [12:40:02]. This could apply to professional services or scientific domains [13:03:04].
- Personalization: Customizing the way information is browsed, framed, and presented based on the user’s specific role and needs (e.g., presenting financial due diligence differently to a general user versus a banker) [13:31:07].
- Multimodal Capabilities: Combining web research with other model abilities such as coding, data science, or video generation to enrich research outputs (e.g., building financial models or statistical analyses to inform due diligence) [14:11:04].