From: redpointai

Douwe Kiela, CEO and co-founder of Contextual AI, is known for co-authoring the first paper on Retrieval Augmented Generation (RAG) [00:00:00]. Contextual AI has raised nearly $100 million to help enterprises build customized, contextual language models for their specific use cases [00:00:05]. Previously, Kiela was the head of research at Hugging Face, spent five years at Facebook AI Research (FAIR), and is an adjunct professor at Stanford [00:00:11].

Contextual AI’s Vision and Approach

Contextual AI was founded to address the frustration enterprises faced with generative AI: while exciting, it wasn’t production-ready for their specific needs [00:03:55]. Kiela and his team knew RAG would be part of the solution, but aimed to do “much better” than the existing RAG concept [00:04:11].

Contextual AI operates on two core principles:

  1. Systems over Models: While new models like OpenAI’s o1 compress “Chain of Thought” ideas into the model itself, Contextual AI believes a model is only 10-20% of the much larger system needed to solve enterprise problems [00:04:43]. Enterprises need to buy the entire system, not just a model they then have to build around [00:05:01].
  2. Specialization over AGI: Artificial General Intelligence (AGI) is fundamentally a consumer product, because consumer needs are unknown [00:05:17]. In contrast, enterprises often know exactly what they want and do not want a generalist AI [00:05:31]. For example, a bank in the European Union would face severe sanctions for using a generalist AI system for performance reviews [00:05:44]. The right approach for enterprise AI is through specialization, not generalization [00:06:02].

Contextual AI builds integrated systems, end-to-end specializing all parts, and focuses on high-value, knowledge-intensive use cases where deep integration and specialization pay off [00:06:20]. Unlike the “layered cake” approach of combining many specific infrastructure parts (which can lead to a “Frankenstein’s RAG”), Contextual AI slices vertically through the stack, controlling retrieval, reranking, generation, post-training, alignment, and fine-tuning to deliver a unified solution [00:07:07].

Challenges in Enterprise AI Deployments

While many “compelling demos” are built with a layered approach, they often fail completely during real user testing or when scaled to larger datasets (e.g., 10,000 PDFs instead of 20) [00:07:51]. Beyond machine learning, challenges include risk, compliance, security, and operations [00:08:20]. Contextual AI exclusively focuses on production deployment [00:08:00].

Enterprises must be cautious about directly exposing AI to customers, especially for high-value use cases [00:14:18]. The focus should be on finding the optimal ratio of AI to human, keeping humans in the loop, and solving problems that are currently within reach [00:14:35]. For example, instead of an AI making investment decisions, it should provide great tools to help investors make better decisions [00:15:13].

The Origin of Retrieval Augmented Generation (RAG)

Kiela’s work on RAG stemmed from a long-standing interest in “grounding” language models [00:09:26]. His PhD focused on grounding language in perceptual information, like understanding the word “cat” by integrating pictures of cats into NLP systems [00:09:31].

At FAIR (Facebook AI Research), working with PhD student Ethan Perez, the idea emerged to ground models in Wikipedia [00:10:01]. They built an early RAG prototype, made possible by Facebook AI Similarity Search (FAISS), which served as the archetype for vector databases [00:10:11]. A key challenge was figuring out how to backpropagate into the retrieval mechanism to train the system, which many future implementations would avoid by using off-the-shelf components [00:10:34]. The collaboration with Patrick Lewis and Sebastian Riedel in London on open domain question answering solidified RAG’s application [00:11:05].
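
The retrieval half of such a prototype can be sketched minimally, with a bag-of-words vector standing in for a learned encoder and a NumPy matrix standing in for a FAISS index. The toy documents and all function names are illustrative, and unlike the original work nothing here is trained end-to-end or backpropagated into:

```python
import numpy as np

# Illustrative corpus; in a real system this would be e.g. Wikipedia passages.
documents = [
    "Cats are small domesticated felines.",
    "FAISS enables fast nearest-neighbor search over dense vectors.",
    "RAG grounds generation in retrieved passages.",
]

vocab = sorted({tok.strip(".,").lower() for doc in documents for tok in doc.split()})

def embed(text: str) -> np.ndarray:
    # Count vocabulary words, then L2-normalize so a dot product is cosine similarity.
    v = np.zeros(len(vocab))
    for tok in text.split():
        tok = tok.strip(".,").lower()
        if tok in vocab:
            v[vocab.index(tok)] += 1.0
    norm = np.linalg.norm(v)
    return v / norm if norm else v

index = np.stack([embed(d) for d in documents])  # the "vector database"

def retrieve(query: str, k: int = 1) -> list:
    scores = index @ embed(query)               # similarity of query to every document
    return [documents[i] for i in np.argsort(-scores)[:k]]

def rag_prompt(query: str) -> str:
    # Ground the generator by prepending retrieved context to the prompt.
    context = "\n".join(retrieve(query))
    return f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"

print(retrieve("fast nearest-neighbor search"))
```

A generator (a language model) would then condition on `rag_prompt(query)` instead of the bare query, which is the essential shape of the paradigm.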

At the time of publication, it wasn’t apparent that RAG would become such a standard paradigm; breakthroughs often involve much simultaneous work, similar to the Transformer paper, whose historical narrative has been “rewritten” to inflate its initial impact [00:11:39]. The success of Transformers, for instance, was largely due to their optimality for GPUs, not solely their inherent architectural superiority [00:13:17].

Alignment and Reinforcement Learning

Alignment is a critical area for making AI systems maximally useful for end-users [00:16:01]. Reinforcement Learning from Human Feedback (RLHF) was the “secret sauce” behind ChatGPT’s success, allowing models to capture human preferences at the full sequence level [00:16:12].

However, RLHF has two major problems:

  1. Expensive Reward Model: Training a reward model (needed to propagate rewards back over the full sequence) is costly, and the model is discarded afterward [00:16:43].
  2. Preference Data Requirement: Obtaining preference data (e.g., thumbs up/down feedback requiring internal or external annotation) is slow, expensive, and becomes even more so for specialized use cases [00:17:03].

Contextual AI’s research, often in collaboration with Stanford students, aims to break these dependencies:

  • DPO (Direct Preference Optimization): Achieves alignment without needing to train a separate reward model, making it more efficient [00:17:32].
  • KTO (Kahneman-Tversky Optimization): Allows direct optimization on feedback without requiring explicit preference-pair annotation [00:17:48]. It is based on utility theory and prospect theory from behavioral economics [00:18:02].
  • CLAIR (Contrastive Learning from AI Revisions): Addresses “underspecification problems” in preference data. Instead of vague rankings, CLAIR uses “revisions” in which a small difference between two options clearly highlights the desired fix, making the preference signal much tighter [00:18:22].
  • APO (Anchored Preference Optimization): Recognizes the relationship between data and model quality. If the model is better than the preference data (which is possible now), APO ensures the system learns only the ranking preference, not that the “good” answer is the absolute right one [00:19:05].
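
The reward-model savings of DPO can be seen directly in its loss, which scores a preference pair from policy and frozen-reference log-probabilities alone. A minimal sketch (the function name and signature are illustrative, not from any of the papers above):

```python
import math

def dpo_loss(policy_chosen_lp: float, policy_rejected_lp: float,
             ref_chosen_lp: float, ref_rejected_lp: float,
             beta: float = 0.1) -> float:
    """DPO loss for a single preference pair.

    Inputs are summed token log-probabilities of the chosen and rejected
    responses under the trainable policy and a frozen reference model.
    The implicit reward is beta * (log pi_policy - log pi_ref), so no
    separate reward model is ever trained or discarded.
    """
    margin = ((policy_chosen_lp - ref_chosen_lp)
              - (policy_rejected_lp - ref_rejected_lp))
    return math.log1p(math.exp(-beta * margin))  # -log sigmoid(beta * margin)

# When the policy exactly matches the reference, the loss is log 2 (~0.693);
# it falls below that once the policy prefers the chosen response more
# strongly than the reference does.
print(dpo_loss(-5.0, -10.0, -5.0, -10.0))
```

Gradient descent on this loss over a preference dataset is the whole training loop, which is what makes the method so much cheaper than RLHF with a learned reward model.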

Contextual AI uses its own data annotation team and gathers direct feedback (e.g., thumbs up/down) from customer deployments, which their algorithms can learn from without standard RLHF [00:20:24]. This alignment work is crucial for the core model during post-training, which is where “a lot of the magic happens” in AI [00:20:47]. By aligning models to specific business use cases (e.g., finance) rather than general knowledge (e.g., quantum mechanics or Shakespeare), specialization and customization are achieved, making models production-ready and delivering real ROI [00:21:10].

OLMoE and the Trend Towards Smaller Models

Contextual AI, in collaboration with the Allen Institute, released OLMoE, a powerful small Mixture of Experts (MoE) open-source model [00:22:15]. This work was inspired by the trend towards smaller models that can be deployed on edge devices [00:22:47].

OLMoE builds on previous work called GRIT (Generative Representational Instruction Tuning), which demonstrated that the same model could be used as both retriever and generator, allowing for significant compute caching [00:22:58]. The vision is to combine OLMoE with GRIT to create “super powerful RAG systems” deployable on phones [00:23:42].
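
The caching a shared retriever/generator enables can be illustrated with a toy sketch, assuming a simplified interface (the class, its methods, and the stand-in “forward pass” are hypothetical, not GRIT’s actual API):

```python
class SharedEncoderRAG:
    """Toy sketch of the GRIT idea: one model acts as both retriever and
    generator, so representations computed while retrieving documents can
    be cached and reused when generating over those same documents."""

    def __init__(self):
        self.encoder_calls = 0
        self._cache = {}

    def encode(self, text: str) -> int:
        # Stand-in for an expensive forward pass of the shared model;
        # each unique text is only ever encoded once.
        if text not in self._cache:
            self.encoder_calls += 1
            self._cache[text] = sum(ord(c) for c in text)
        return self._cache[text]

    def retrieve(self, query: str, docs: list) -> str:
        q = self.encode(query)
        # Toy similarity: pick the document with the closest representation.
        return min(docs, key=lambda d: abs(self.encode(d) - q))

    def generate(self, query: str, doc: str) -> str:
        # Reuses the cached document representation: no extra encoder call.
        _ = self.encode(doc)
        return f"Answer to {query!r} grounded in {doc!r}"

model = SharedEncoderRAG()
docs = ["doc one", "doc two"]
best = model.retrieve("a query", docs)
model.generate("a query", best)
print(model.encoder_calls)  # 3: the query plus each doc, encoded exactly once
```

With two separate models, the generation step would re-encode the retrieved document from scratch; sharing the model is what makes the cache valid.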

Academia’s Role in AI Research

Academia continues to play a vital role in AI progress [00:24:22]. While the scale of pre-training has shifted beyond what most universities can do, there’s still significant and interesting research in pre-training with smaller models [00:24:44]. The importance of post-training and alignment methods means that academics can take generously donated pre-trained models (like Meta’s Llama) and conduct amazing research on top of them [00:24:55].

Challenges and Future of Evaluation

One major challenge in enterprise AI is the lack of good off-the-shelf extraction systems, especially for complex documents like PDFs. This crucial “boring stuff” at the beginning of the pipeline is necessary for proper retrieval and contextualization [00:25:47].

Another underdeveloped area is evaluation, particularly for enterprises needing to understand deployment risk and real system accuracy [00:26:40]. Currently, many evaluations are informal, often based on small spreadsheets with high variance and no principled methodology [00:27:51]. A key problem is that many people don’t understand what they truly want from an AI system [00:27:29]. Contextual AI spends significant time with customers to define success and productionize prototypes [00:27:34].

The future of evaluation should move towards accessible frameworks for API-focused AI developers, rather than relying on traditional machine learning or data science knowledge around test set creation and statistical testing [00:29:46].
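
One accessible step in that direction is reporting variance alongside accuracy. A sketch of a bootstrap confidence interval over a small test set (the function name and defaults are illustrative):

```python
import random

def bootstrap_accuracy_ci(correct, n_boot=2000, alpha=0.05, seed=0):
    """Accuracy plus a bootstrap confidence interval for a small eval set.

    `correct` is a list of 0/1 outcomes, one per test example. Resampling
    with replacement makes the variance of a spreadsheet-sized evaluation
    explicit instead of hiding it behind a single point estimate.
    """
    rng = random.Random(seed)
    n = len(correct)
    means = sorted(
        sum(rng.choice(correct) for _ in range(n)) / n
        for _ in range(n_boot)
    )
    lo = means[int(n_boot * alpha / 2)]
    hi = means[int(n_boot * (1 - alpha / 2)) - 1]
    return sum(correct) / n, (lo, hi)

# 18 correct out of 20: the 90% point estimate hides a wide interval.
acc, (lo, hi) = bootstrap_accuracy_ci([1] * 18 + [0] * 2)
print(acc, lo, hi)
```

On 20 examples the interval is strikingly wide, which is exactly the high variance the informal spreadsheet evaluations above fail to surface.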

Shifting AI Research Beliefs

Kiela has observed several changes in his AI research beliefs:

  • Synthetic data working well: This was initially unexpected [00:30:49].
  • Agentic workflows with tool use: Considered much more possible now than a year ago, despite agents still being ill-defined [00:30:56].
  • Test-time compute and Chain of Thought: Initially viewed as a “cute gimmick” or flawed from an evaluation perspective, Chain of Thought has proven to work “really, really well” [00:31:17]. The power often comes from combining these ideas [00:31:49].
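
One common way these ideas combine is self-consistency: spend extra test-time compute sampling several chains of thought, then majority-vote on the final answer. A minimal sketch, where `sample_answer` is a hypothetical stand-in for a model sampled at nonzero temperature:

```python
from collections import Counter

def self_consistency(sample_answer, n: int = 5):
    """Sample n chain-of-thought completions and majority-vote on the
    final answer; more samples trade compute for reliability."""
    votes = Counter(sample_answer() for _ in range(n))
    return votes.most_common(1)[0][0]

# Deterministic stand-in for five sampled chains of thought:
chains = iter(["42", "41", "42", "42", "7"])
print(self_consistency(lambda: next(chains)))  # "42" wins the vote
```

Individually noisy chains of thought become useful in aggregate, which is one reason the "cute gimmick" turned out to work so well.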

Data Scarcity, Multimodality, and Reasoning

The notion that AI is “running out of tokens” for training data is misguided [00:32:53]. Society produces massive amounts of data daily; the real challenge is the lack of high-quality data [00:33:07].

Beyond text, multimodal data (especially video) is largely untapped [00:33:52]. Models currently understand the world indirectly through linguistic behaviors (text), which is a “poor proxy” [00:34:20]. Training on vast amounts of video data, for instance, could help models understand concepts like “cat” in a more human-like way [00:34:31].

Furthermore, synthetic data, if done correctly and combined with smarter algorithms like KTO and APO, can be “super powerful,” reducing the need for expensive data annotation or heavy compute [00:35:17].

Models already possess reasoning capabilities, and these can be improved with more high-quality, specialized data (e.g., from mathematics PhD students for complex reasoning tasks) [00:35:48]. Kiela points to “metalinguistic tests” involving self-reference (like his paper “I am a Strange Dataset,” inspired by Douglas Hofstadter’s “I Am a Strange Loop”) as a milestone he’d be impressed to see models master [00:36:47]. This tests the ability to reason about language itself, going beyond simple mathematics [00:37:54].

Underreported Research Areas

Much of the exciting, underreported work in AI is practical, focusing on how to make things actually work, like sophisticated retrieval mechanisms (e.g., “mixture of retrievers” instead of a single dense vector database) [00:38:26]. The industry is navigating a new product paradigm where the product is the model’s behavior, reflecting its training data, which is a complex shift from traditional SaaS development [00:38:48].

Reflections on Former Employers

Facebook AI Research (FAIR)

When Kiela joined FAIR in 2016, it was an incredibly free and academic environment [00:40:02]. While he worked on multi-agent systems and emergent communication protocols (which he believes will become relevant again soon), FAIR’s major contribution that “really changed the world” was PyTorch [00:41:03]. Kiela believes Mark Zuckerberg deserves significant credit for visionary leadership in open-sourcing projects like PyTorch, React, and Llama, which has significantly improved Meta’s public perception and aided hiring [00:41:07]. Open-sourcing Llama, in particular, is a pragmatic move to avoid being dependent on other platforms if language models become the new dominant platform [00:42:01].

Hugging Face

Hugging Face is an “amazing company” that has become the central place for publishing AI models [00:45:37]. While they benefit from others’ open-source efforts, they’ve built a “very special place for themselves” [00:46:41]. Their future role could involve expanding into model deployment and experiments around that [00:46:51].

Funding Strategy for Contextual AI

Contextual AI adopts a pragmatic funding strategy, avoiding raising more money than needed at unsustainable valuations [00:47:30]. This is crucial given the potential for the AI hype cycle to “die off a little bit in the next year or two” [00:47:50].

By not training their own large base models (leveraging open-source alternatives like Llama), Contextual AI can be more capital-efficient and allocate resources to essential areas like hiring top talent and working closely with customers to solve problems [00:48:08]. The current stage of AI products often requires “white glove” or dedicated support (like selling a “Tesla Roadster” with engineers), as it’s not yet a “turnkey thing” [00:48:40]. The goal is to build towards an “assembly line for the Model S” that will eventually be a standard, easy-to-use product [00:49:22].

Future of AI

Kiela believes we are “only just getting started” with AI [00:49:54]. The “cars coming off the assembly line” will be systems, not just models [00:50:00]. While scaling laws are important, there’s significant room for improvement through post-training, alignment, and distilling system thinking back into models, leading to “scaling in many directions” [00:50:11].

Overhyped vs. Underhyped

Agents are simultaneously overhyped (because they don’t fully work yet) and underhyped (because they are showing signs of life) [00:50:57].

Biggest Surprise in Building Contextual AI

The difficulty of building and maintaining a high-end research cluster that actually works was a major surprise [00:51:21]. Hardware failures (GPUs, entire nodes) are frequent, highlighting the fragility of the infrastructure underpinning AI [00:51:44].

Closed Source vs. Open Source Models

Kiela sees a mix dominating in the long term, with a “triangle” metaphor:

  • Top: High-end, expensive, closed-source frontier models.
  • Bottom: Accessible open-source models.
  • Middle: The most interesting part, where companies achieve the right tradeoffs between capital efficiency, customer usefulness, and price point [00:52:19]. Starting from open-source models and building strong post-training capabilities can lead to this “sweet spot” [00:52:55].

Exciting AI Startups (Outside Contextual AI’s Space)

Kiela is most impressed by AI-based entertainment companies like Suno and video generation companies, noting that progress in this space is happening “way quicker” than expected [00:53:11]. This could lead to personalized movies and infinite episodes of shows in the future [00:53:56].

Company Most Interesting to Run AI At (Outside Contextual AI)

Beyond the obvious tech giants, Kiela finds traditional enterprises like JP Morgan interesting, where the head of AI gets to solve massive problems by integrating new AI technology before competitors [00:54:53].

Preferred AI Application to Build (If Not Contextual AI)

He would likely still build something related to work itself, as he believes it’s the “most obvious place” where AI will change everything, with a “very noble mission” to improve how the world works through AI [00:55:42]. Otherwise, applications in the entertainment space are very interesting [00:55:33].

Kiela views this as an “amazing point in history” where the world is changing rapidly due to AI [00:56:51].