From: redpointai
The development and deployment of AI features at Notion, led by Linus Lee, involved overcoming several significant challenges in bringing new AI features to market [00:00:15].
Core Challenges in AI Product Development
Notion’s approach to AI product development balances understanding customer needs with exploring new AI capabilities [00:07:06]. This iterative process involves hypothesizing problem statements, rapid prototyping with new technologies like Retrieval-Augmented Generation (RAG), and then refining based on internal dogfooding and user feedback [00:07:30].
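The rapid-prototyping loop around RAG can be illustrated with a minimal retrieve-then-answer sketch. Everything here is an illustrative assumption, not Notion's implementation: retrieval is naive keyword overlap, and the "model call" is just the prompt a real pipeline would send.

```python
# Minimal retrieve-then-answer (RAG) prototype loop.
# Illustrative only: toy keyword-overlap retrieval, stubbed model call.

def retrieve(query: str, documents: list[str], top_k: int = 2) -> list[str]:
    """Rank documents by how many query words they share (toy scorer)."""
    q_words = set(query.lower().split())
    scored = sorted(
        documents,
        key=lambda doc: len(q_words & set(doc.lower().split())),
        reverse=True,
    )
    return scored[:top_k]

def answer(query: str, documents: list[str]) -> str:
    """Assemble the prompt a real pipeline would send to a language model."""
    context = "\n".join(retrieve(query, documents))
    return f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"

docs = [
    "The marketing team plans the Q3 launch this week.",
    "Engineering onboarding checklist for new hires.",
]
prompt = answer("What is the marketing team working on this week?", docs)
```

A production pipeline would swap the toy scorer for embedding search and send the assembled prompt to a model, but the retrieve/assemble/answer shape of the loop is the same.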
Key challenges include:
- Information Organization and Retrieval
- As organizations and Notion workspaces grow in complexity, a significant problem is difficulty in keeping everything organized and finding information [00:06:05]. This is especially true in collaborative environments where users might not know what information others have written down [00:06:22]. Solving this “information finding problem” was a key motivator for features like Notion Q&A [00:06:28].
- Evaluation of Language Models
- A major challenge, especially for features requiring high correctness and quality like Q&A, is the evaluation of language model outputs [00:23:07]. Unlike writing tools where there’s “wiggle room” for varied outputs, Q&A demands specific, accurate answers, leading to clear “mess-ups” if the model is incorrect or uses the wrong document [00:23:48]. This is a complex problem with many sub-components [00:25:28].
- Fuzzy Boundaries and Edge Cases: Anticipating all types of questions users might ask is difficult. Users often ask “meta questions” about Notion itself (e.g., “How can I share this page with Jack?”) or questions involving dynamic time ranges (e.g., “What is the marketing team working on this week?”) that aren’t directly answered by static documents [00:24:19], [00:24:40].
- Building Evaluation Sets: Constructing high-quality evaluation sets for these edge cases and defining criteria for grading them is a hard problem [00:25:11].
- Operational Concerns
- Key operational questions include customer needs around privacy and security, and the required scale for AI features [00:25:40]. Clear industry answers for these aspects are often not available, requiring Notion to start from first principles with customers [00:25:56].
Strategies for Overcoming Challenges in AI Product Development
Notion’s strategy for addressing these challenges in AI adoption and deployment involves:
- Rapid Iteration and Dogfooding: During the exploratory phase, quick iteration is paramount [00:09:49]. For Notion Q&A, an “annoying prototype” was constantly active internally, forcing the team to iterate quickly on output quality because everyone was “bombarded with these answers” [00:10:02], [00:10:32]. This internal use provided direct feedback on usefulness and added pressure for improvement [00:10:57].
- In-house Tooling for Evaluation: The majority of Notion’s tools for language model-powered features are built in-house [00:26:21]. This was partly due to the lack of available tools when they started and the complex, structured nature of Notion documents (rich text, tables, images, metadata) which traditional tools didn’t support [00:26:33]. In-house tools also allow for faster iteration and customization of evaluation criteria [00:27:22].
- Hybrid Evaluation Approaches: Notion uses a spectrum of evaluation methods:
- Programmatic and Model-Graded Evaluations: Deterministic, programmatic checks alongside model-graded outputs for objective criteria [00:27:58].
- Human Annotators: A team of human annotators speeds up the evaluation process [00:28:38].
- ML Engineer Review: ML engineers directly examine model outputs and datasets to understand why models fail (e.g., misunderstanding instructions, difficulty with relative dates) and identify where to intervene in the pipeline (e.g., embedding problem, ranking problem, or answering logic) [00:28:46], [00:30:37]. This costly but rewarding process helps identify root causes of errors [00:29:11].
- Full-Stack Iteration: Notion iterates across the entire AI stack, from the database layer upwards [00:29:31]. This includes experimenting with additional stages in the RAG pipeline, such as rephrasing retrieved passages to better answer questions [00:29:47].
- Strategic Partnerships for Foundational Models: Notion partners with companies like Anthropic and OpenAI for core model building and infrastructure, acknowledging the difficulty of competing at that level [00:31:13]. Notion’s role focuses on understanding specific tasks, collecting/generating synthetic data, and evaluating models for their particular use cases [00:31:40].
- Data Privacy Commitment: Notion has committed to not training models on customer data, which, while a challenge, has pushed them to understand how Notion workspaces are structured and how to generate prototypical documents synthetically [00:31:57].
- Prompt Engineering Expertise: Notion focuses on effective prompt engineering, which is downstream from clear evaluation criteria and a deep understanding of the model’s task [00:37:02]. While minor tweaks might be model-specific, the core understanding of the problem and desired output format translates across different models [00:38:16].
- Multi-Model Deployment: Notion uses different models or providers depending on feature requirements, considering factors like performance, cost, and throughput [00:33:49]. For example, batch processes like autofill might use a model supporting higher throughput [00:34:00].
- Extensive Pre-processing: User prompts are almost always wrapped in Notion’s own processing. This includes:
- Prompt Templates: For AI writer features, user requests are integrated into templates that include past dialogue history and previous model outputs [00:35:01].
- Query Rewrite: For Q&A, there’s a phase where the user’s query is rewritten to incorporate context from multi-turn conversations before searching [00:35:37].
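The two pre-processing steps above can be sketched as follows. The template text and the rewrite rule are illustrative stand-ins, not Notion's actual prompts: a real query rewrite would be performed by a model, not by string concatenation.

```python
# Sketch of prompt-template wrapping and multi-turn query rewriting.
# Template strings and rewrite logic are illustrative assumptions.

def build_writer_prompt(request: str, history: list[tuple[str, str]]) -> str:
    """Fold past (request, output) turns into a template around the new request."""
    turns = "\n".join(f"User: {req}\nAssistant: {out}" for req, out in history)
    return f"{turns}\nUser: {request}\nAssistant:"

def rewrite_query(query: str, prior_questions: list[str]) -> str:
    """Naive rewrite: carry the previous question along so a follow-up like
    'what about last week?' is self-contained when it reaches search."""
    if not prior_questions:
        return query
    return f"{prior_questions[-1]} -- follow-up: {query}"

prompt = build_writer_prompt(
    "Make it shorter",
    [("Draft a memo", "Here is a memo...")],
)
search_query = rewrite_query(
    "What about last week?",
    ["What is the marketing team working on this week?"],
)
```

The point of the rewrite step is that the retrieval index only ever sees standalone queries, so multi-turn context never has to leak into the search layer itself.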
Organizational Structure for Enterprise AI Deployment
Notion’s AI team, consisting of about a dozen engineers and a couple of designers, is divided into two main areas: model quality (correctness and coherence of outputs) and product concerns (interface and integration) [00:12:05].
Currently, a core AI team largely owns the AI surfaces (e.g., AI writer interface, Q&A chatbox) [00:13:16]. However, features like autofill, which are more integrated with existing Notion products (like databases), are owned by the respective product teams with tight collaboration from the AI team [00:14:01].
The company is still exploring how best to organize its AI efforts, considering options like a central “hub” AI team with liaisons to other teams, or eventually having AI engineers embedded in every product team [00:14:10]. There’s a belief that a central team for foundational AI technologies (like retrieval) will likely remain beneficial for monitoring, quality assurance, data management, and training tools [00:16:33].
User Interaction and Education
Notion leverages internal dogfooding and early external testers (ambassadors, partners) to understand user interaction with Notion AI [00:17:17]. While internal usage is rich, it’s not always representative of all users, which include individuals, students, and varied business users [00:17:31].
- Discovering Unanticipated Use Cases: Early testers often use AI features for unexpected purposes [00:18:09]. For example, translation emerged as a significant use case for autofill outside the US, leading to it becoming a built-in prompt [00:18:30].
- Built-in Prompts vs. Customization: Notion provides “pre-baked” prompts (templates) for common use cases alongside fully customizable options [00:18:48]. Most users utilize the pre-baked prompts like summarization, improving writing, and grammar correction [00:19:32].
- Iterative Use: A significant portion of usage comes from users iterating on model outputs using revision prompts (e.g., “make it shorter” or “more punchy”) [00:19:50]. This indicates users rely on pre-built prompts for inspiration and then refine the output to fit specific needs [00:20:07]. Power users often hand-write and reuse specific prompts for highly tailored tasks [00:20:19].
- Overcoming the Blank Canvas Problem: Providing pre-built prompts helps users overcome the “blank canvas problem” and see the possibilities of AI [00:40:08]. Future improvements might include suggesting revisions based on initial AI-generated content [00:22:11].
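The pattern of pre-baked prompts followed by revision prompts can be sketched as a simple two-stage chain. The prompt names and template strings below are assumptions for illustration, not Notion's actual built-in prompts.

```python
# Illustrative pre-baked prompts plus revision prompts for iterating on output.
# All names and template strings are assumptions.

BUILT_IN_PROMPTS = {
    "summarize": "Summarize the following text:\n{text}",
    "improve_writing": "Improve the writing of the following text:\n{text}",
    "fix_grammar": "Fix the grammar in the following text:\n{text}",
}

REVISION_PROMPTS = {
    "shorter": "Make the previous answer shorter.",
    "punchier": "Make the previous answer more punchy.",
}

def first_pass(kind: str, text: str) -> str:
    """Fill a built-in template with the user's text."""
    return BUILT_IN_PROMPTS[kind].format(text=text)

def revise(previous_output: str, revision: str) -> str:
    """Each revision turn carries the prior output plus a revision instruction."""
    return (
        f"Previous answer:\n{previous_output}\n\n"
        f"Instruction: {REVISION_PROMPTS[revision]}"
    )

p1 = first_pass("summarize", "A long meeting transcript...")
p2 = revise("A short summary.", "shorter")
```

Keeping revisions as a separate stage that always sees the prior output is what lets a user start from a generic built-in prompt and converge on something tailored.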