From: redpointai

Post-training is a crucial phase in AI model development, where pre-trained models are further refined to improve their performance on specific tasks and align them with user preferences [02:20:51]. It’s often where “a lot of the magic happens in AI,” turning a general pre-trained model into one that excels at the tasks it actually needs to be good at [02:57:57].

Systems Over Models

Contextual AI, a company co-founded by Douwe Kiela, focuses on a “systems over models” approach, viewing the model as only 10-20% of the larger system required to solve a problem [04:43:42] [04:54:00]. This approach emphasizes that enterprises need a complete system, not just a model, to avoid the complexity of building the surrounding infrastructure themselves [05:01:03] [05:05:03]. This integrated system approach allows for end-to-end specialization and is particularly effective for high-value, knowledge-intensive use cases [06:20:31] [06:29:10]. By controlling retrieval, reranking, generation, post-training, and alignment, companies can achieve a compounding effect that leads to better problem-solving [07:16:04].

Specialization for Enterprise Use Cases

Unlike Artificial General Intelligence (AGI), which is considered a consumer product because consumer needs are not known in advance [05:19:00], enterprise AI often requires specialized intelligence [05:31:00]. For example, a banking AI system should not be a generalist that could also, say, conduct employee performance reviews, as that kind of use can lead to heavy sanctions in regions like the European Union [05:44:09] [05:51:00]. The correct way to approach enterprise AI is through specialization, focusing on what the user actually wants [06:02:00].

Specialization through alignment and post-training helps make systems much more convincing for production deployment [02:11:00] [02:18:20]. For instance, an AI for finance doesn’t need knowledge of quantum mechanics or Shakespeare; it needs to be highly proficient in its specific domain [02:24:00].

Finetuning and Reinforcement Learning Techniques for Post-Training

Alignment is a critical problem area in post-training, aiming to make systems maximally useful for end-users [01:59:00].

RLHF (Reinforcement Learning from Human Feedback)

RLHF was the “secret sauce” behind ChatGPT’s success, building upon initial instruction tuning (SFT) [01:14:00]. It captures human preferences at the full sequence level, rather than just the next word [01:31:00]. However, RLHF has two main challenges:

  1. Reward Model Training: It requires training a separate, high-quality reward model to propagate rewards back over the full sequence; this is expensive, and the reward model itself is never used during actual generation (see the sketch after this list) [01:43:00].
  2. Preference Data: It relies on preference data, meaning that if a user gives a thumbs-down, additional manual annotation is needed to specify what a “thumbs-up” response would look like [01:57:00]. This process is slow, expensive, and becomes even more costly for specialized use cases [02:19:00].
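
As a concrete illustration of the reward-model step, here is a minimal sketch assuming a Bradley-Terry-style pairwise loss over full-sequence scores; the tensors and function names are illustrative placeholders, not Contextual AI’s implementation.

    import torch
    import torch.nn.functional as F

    def reward_model_loss(score_chosen: torch.Tensor,
                          score_rejected: torch.Tensor) -> torch.Tensor:
        """Bradley-Terry-style pairwise loss: push the reward of the
        human-preferred ("thumbs-up") sequence above the rejected one."""
        return -F.logsigmoid(score_chosen - score_rejected).mean()

    # Illustrative scores a reward model might assign to full sequences.
    score_chosen = torch.tensor([1.2, 0.3])    # preferred responses
    score_rejected = torch.tensor([0.1, 0.8])  # rejected responses
    loss = reward_model_loss(score_chosen, score_rejected)

This reward model is the extra component the section refers to: it must be trained separately on preference pairs, yet it is discarded at generation time.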

DPO (Direct Preference Optimization)

DPO aims to break the dependency on training a separate reward model, making the process more efficient [01:36:00].
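
A minimal sketch of the standard DPO objective, computed from per-sequence log-probabilities under the policy and a frozen reference model; the variable names are illustrative.

    import torch
    import torch.nn.functional as F

    def dpo_loss(policy_logp_chosen, policy_logp_rejected,
                 ref_logp_chosen, ref_logp_rejected, beta: float = 0.1):
        """Direct Preference Optimization: the implicit reward is the
        log-prob ratio between the policy and a frozen reference model,
        so no separate reward model needs to be trained."""
        chosen_ratio = policy_logp_chosen - ref_logp_chosen
        rejected_ratio = policy_logp_rejected - ref_logp_rejected
        return -F.logsigmoid(beta * (chosen_ratio - rejected_ratio)).mean()

    # Illustrative summed log-probabilities of full sequences.
    loss = dpo_loss(torch.tensor([-12.0]), torch.tensor([-15.0]),
                    torch.tensor([-13.0]), torch.tensor([-14.0]))

Note that DPO still requires paired preference data; removing that dependency is what motivates KTO below.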

KTO (Kahneman-Tversky Optimization)

Developed by Contextual AI with Kawin Ethayarajh (a Stanford PhD student), KTO directly optimizes on feedback without needing explicit preference pairs, thus eliminating the need for additional data annotation [01:40:00] [01:47:00]. KTO is grounded in utility theory and prospect theory from behavioral economics [01:57:00].
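
A rough sketch of the KTO idea under simplifying assumptions: each example carries only a binary desirable/undesirable label, and a prospect-theory-style value function scores the policy-versus-reference log-ratio against a reference point (passed in here, rather than estimated from a batch-level KL term as in the published method).

    import torch

    def kto_loss(policy_logp, ref_logp, desirable, ref_point,
                 beta: float = 0.1, lambda_d: float = 1.0, lambda_u: float = 1.0):
        """KTO sketch: unpaired thumbs-up / thumbs-down feedback, valued
        against a reference point instead of an annotated alternative."""
        reward = beta * (policy_logp - ref_logp)                    # implicit reward
        value_up = lambda_d * torch.sigmoid(reward - ref_point)     # desirable examples
        value_down = lambda_u * torch.sigmoid(ref_point - reward)   # undesirable examples
        value = torch.where(desirable.bool(), value_up, value_down)
        return (1.0 - value).mean()

    # Two unpaired examples: one thumbs-up, one thumbs-down.
    loss = kto_loss(policy_logp=torch.tensor([-10.0, -9.0]),
                    ref_logp=torch.tensor([-11.0, -12.0]),
                    desirable=torch.tensor([1, 0]),
                    ref_point=torch.tensor(0.05))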

CLAIR (Contrastive Learning from AI Revisions)

CLAIR, developed with Karel D’Oosterlinck, addresses the under-specification problem in alignment data [02:18:00]. Instead of just ranking independent options, CLAIR focuses on contrastive revisions: identifying a problem with an output and providing a small, specific fix, thereby making the preference signal much tighter [02:26:00].
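
To make the contrast concrete, here is a sketch of how a contrastive-revision datapoint might be assembled; the `revise` callable stands in for whatever stronger model or human editor produces the minimal fix, and all names are hypothetical.

    from typing import Callable

    def make_revision_pair(prompt: str, model_output: str,
                           revise: Callable[[str, str], str]) -> dict:
        """Build a preference pair by minimally revising the model's own
        output, so chosen and rejected differ only in what was fixed and
        the preference signal is tightly specified."""
        revision = revise(prompt, model_output)    # small, targeted fix
        return {"prompt": prompt,
                "rejected": model_output,          # original, flawed output
                "chosen": revision}                # minimally improved output

    # Hypothetical reviser that corrects only the identified problem.
    pair = make_revision_pair(
        "Summarize the Q3 report.",
        "Revenue grew 12% in Q3, driven by EMEA.",
        revise=lambda p, y: y.replace("12%", "8%"),
    )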

APO (Anchored Preference Optimization)

APO considers the relationship between the data and the model, specifically how good the model being trained is relative to the preference data [02:37:00]. If the model is already better than the preference data, it should learn only the relative ranking from that data rather than treating the preferred answers as the “right answers,” since those answers may be worse than what the model can produce [02:49:00]. APO provides greater control over how data quality impacts the trained model’s quality [02:59:00].
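
The anchoring idea can be sketched loosely as choosing between two loss variants depending on whether the model or the data is stronger; this is a paraphrase of the spirit of APO, not the exact published objectives.

    import torch
    import torch.nn.functional as F

    def anchored_preference_loss(chosen_ratio, rejected_ratio,
                                 beta: float = 0.1,
                                 model_better_than_data: bool = False):
        """Anchored-preference sketch: anchor per-response rewards instead
        of only comparing them, and pick the anchoring direction based on
        the model-vs-data quality relationship."""
        r_chosen, r_rejected = beta * chosen_ratio, beta * rejected_ratio
        if model_better_than_data:
            # Don't pull the model toward the merely-"chosen" answers;
            # keep the ranking while nudging chosen rewards toward zero.
            return -(F.logsigmoid(-r_chosen)
                     + F.logsigmoid(r_chosen - r_rejected)).mean()
        # Data is better than the model: raise chosen, lower rejected.
        return -(F.logsigmoid(r_chosen) + F.logsigmoid(-r_rejected)).mean()

    loss = anchored_preference_loss(torch.tensor([1.0]), torch.tensor([-0.5]),
                                    model_better_than_data=True)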

Data for Post-Training and Future of AI

Concerns about running out of data tokens for training AI models are largely unfounded [03:26:00]. Society produces massive amounts of data daily, far exceeding current training rates [03:33:00]. The real challenge lies in the quality of this data [03:41:00].

  • Data Quality vs. Quantity: Lower quality data requires commensurately more quantity [03:52:00]. The bottleneck is high-quality data, not overall data availability [03:57:57].
  • Multimodality: Moving to multimodal data, especially video, offers a vast, largely untapped resource [03:59:00]. Training on multimodal data can help models better understand the world, addressing a significant shortcoming of current systems [04:16:00].
  • Synthetic Data: Despite some flawed research suggesting otherwise, synthetic data, when generated correctly, is “super powerful” and can reduce reliance on data annotation and heavy computation, especially when combined with algorithms like KTO and APO [04:26:00] [04:47:00].
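
One loose sketch of how synthetic data and annotation-free objectives can combine: a judge model labels synthetic generations as desirable or undesirable, and that unpaired feedback can feed a KTO-style objective directly. The `generate` and `judge` callables are hypothetical placeholders, not a specific Contextual AI pipeline.

    from typing import Callable, Dict, List

    def synthetic_feedback(prompts: List[str],
                           generate: Callable[[str], str],
                           judge: Callable[[str, str], bool]) -> List[Dict]:
        """Create unpaired, binary-labeled training examples from synthetic
        generations: no human ranking, just a desirability label."""
        data = []
        for prompt in prompts:
            output = generate(prompt)                          # synthetic response
            data.append({"prompt": prompt,
                         "response": output,
                         "desirable": judge(prompt, output)})  # model-graded label
        return data

    # Hypothetical stand-ins for a generator and an automatic judge.
    examples = synthetic_feedback(
        ["What is the settlement date for a trade executed on Monday (T+2)?"],
        generate=lambda p: "Wednesday, assuming no market holidays.",
        judge=lambda p, y: "Wednesday" in y,
    )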

Challenges and Progress in AI Model Alignment Research

One of the biggest surprises for Douwe Kiela was how well synthetic data works [03:45:00]. Additionally, “agentic workflows with tool use” are much more feasible than previously expected [03:56:00]. The effectiveness of “Chain of Thought” reasoning, initially seen as a gimmick, has proven to be very powerful when combined with techniques like RLHF for model training [04:08:00] [04:18:00].

Underreported Research Areas

Practical work on retrieval, such as using a “mixture of retrievers” instead of a single dense vector database, is an area of significant, though perhaps underreported, interest [04:27:00]. The field is exploring how to make these systems actually work and finding the right product form factor for AI research [04:44:00].
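
As an illustration of the “mixture of retrievers” idea, here is a toy sketch that fuses scores from a sparse (BM25-style) and a dense retriever with fixed weights; real systems may instead learn the weights or route queries, and all names here are hypothetical.

    from typing import Callable, Dict, List, Tuple

    Retriever = Callable[[str], List[Tuple[str, float]]]  # query -> [(doc_id, score)]

    def mixture_retrieve(query: str,
                         retrievers: List[Tuple[float, Retriever]],
                         k: int = 5) -> List[str]:
        """Fuse scores from several retrievers (sparse, dense, structured, ...)
        instead of relying on a single dense vector index."""
        scores: Dict[str, float] = {}
        for weight, retrieve in retrievers:
            for doc_id, score in retrieve(query):
                scores[doc_id] = scores.get(doc_id, 0.0) + weight * score
        return sorted(scores, key=scores.get, reverse=True)[:k]

    # Hypothetical retrievers returning (doc_id, normalized_score) pairs.
    bm25 = lambda q: [("doc-a", 0.9), ("doc-b", 0.4)]
    dense = lambda q: [("doc-b", 0.8), ("doc-c", 0.7)]
    top_docs = mixture_retrieve("collateral requirements", [(0.5, bm25), (0.5, dense)])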

Evaluation and Deployment Challenges in Production

Despite many compelling demos, much of enterprise AI is still stuck at the demo stage rather than in real deployment [07:51:00]. These demos often fail in real user testing due to issues beyond machine learning, including deployment, risk, compliance, and security [08:00:00]. Many demos are built on small, “hacked” test sets (e.g., 20 PDFs) and break down when scaled to real-world data (e.g., 10,000 PDFs) [08:35:00].

For enterprises, the key question is whether a system can be put in front of customers [09:00:00]. As use cases become higher value, they also become riskier to expose directly to customers [09:26:00]. The goal is to find the optimal ratio of AI to human interaction, keeping humans in the loop for problems within reach [09:36:00].

The Need for Better Evaluation Frameworks

There is currently no reliable standard way to evaluate AI systems for enterprises, especially concerning deployment risk and real accuracy [02:40:00]. Many companies don’t seriously evaluate their systems, relying on unprincipled spreadsheets with high variance [02:51:00].

A critical step is to clearly define what success looks like for a customer, often through a collaborative hill-climbing process that reaches success in a prototype setting before productionizing it [02:29:00]. Future evaluation frameworks should be accessible to AI developers, who are increasingly proficient at calling APIs rather than at traditional machine learning or statistical testing [02:40:00].
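
A minimal sketch of the kind of evaluation loop such a framework might expose to API-first developers: agree on an explicit success check, score a fixed test set, and hill-climb on the resulting number instead of an ad-hoc spreadsheet. The `ask_system` and `passes` callables are placeholders for the deployed system and the agreed success criterion.

    from typing import Callable, Dict, List

    def evaluate(test_set: List[Dict],
                 ask_system: Callable[[str], str],
                 passes: Callable[[str, Dict], bool]) -> float:
        """Accuracy over an agreed-upon test set, so progress can be
        tracked with a single reproducible number."""
        results = [passes(ask_system(case["question"]), case) for case in test_set]
        return sum(results) / max(len(results), 1)

    # Illustrative test case with an explicit definition of success.
    test_set = [{"question": "What is the notice period in contract 42?",
                 "must_contain": "30 days"}]
    accuracy = evaluate(test_set,
                        ask_system=lambda q: "The notice period is 30 days.",
                        passes=lambda answer, case: case["must_contain"] in answer)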

Wishlist for External Contributions

To streamline end-to-end solutions, better off-the-shelf extraction systems are needed [02:45:00]. Properly contextualizing language models requires accurate extraction, which remains surprisingly difficult for sources such as PDFs [02:50:00].
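
For context on why this is hard, a bare-bones example using the open-source pypdf library (an assumption about tooling, not the speaker’s stack); clean digital PDFs extract reasonably, while tables, scans, and multi-column layouts are exactly where naive extraction breaks down.

    from pypdf import PdfReader  # assumes `pip install pypdf`

    def extract_pdf_text(path: str) -> str:
        """Naive text extraction: fine for clean digital PDFs, unreliable
        for tables, scanned pages, and complex layouts."""
        reader = PdfReader(path)
        return "\n".join(page.extract_text() or "" for page in reader.pages)

    # text = extract_pdf_text("quarterly_report.pdf")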

Academic Role in Post-Training Research

Academia remains crucial for the progress of AI [02:40:00]. While pre-training at the current scale is no longer feasible for universities [02:47:00], the importance of post-training offers a significant opportunity [02:57:00]. Academics can leverage pre-trained models (e.g., those generously open-sourced by Meta/Facebook) to conduct valuable research in post-training methods and better alignment techniques [03:02:00].

Conclusion

Post-training model optimization is vital for delivering production-ready AI solutions, especially in the enterprise context. It involves complex techniques like RLHF, DPO, KTO, and APO, and requires a shift towards thinking about AI as integrated systems rather than monolithic models. While challenges in data quality and evaluation persist, ongoing research and pragmatic approaches are paving the way for more specialized, useful, and scalable AI deployments.

Overhyped vs. Underhyped

Agents are both overhyped (as they don’t fully work yet) and underhyped (because they are showing signs of life and potential) [05:57:00]. The shift to smaller models deployable on edge devices is a significant trend [02:47:00].