From: redpointai
Alignment is a crucial problem area in AI, focusing on making systems maximally useful for end-users [00:16:01].
Reinforcement Learning from Human Feedback (RLHF)
RLHF was the “secret sauce” that made ChatGPT effective [00:16:16]. While instruction tuning (SFT) can capture human preferences at the next token level, RLHF is designed to capture preferences at the full sequence level [00:16:29].
However, RLHF presents two main challenges [00:16:43]:
- Reward Model Training [00:16:45]: It requires training a separate, high-quality reward model to propagate reward back over the full sequence. This is expensive, and the reward model itself is not used during actual generation [00:16:50] (a minimal loss sketch follows this list).
- Preference Data Acquisition [00:17:03]: Obtaining preference data (e.g., thumbs up/down feedback) necessitates additional data annotation, which is slow and costly, especially for more specialized use cases [00:17:12].
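To make the first challenge concrete, reward models are typically trained with a pairwise (Bradley-Terry) objective over preference data. The sketch below is illustrative only and assumes a hypothetical `reward_model` that returns a scalar score per (prompt, completion); it is not Contextual AI's implementation.

```python
# Minimal sketch of the standard pairwise reward-model objective used in RLHF.
# `reward_model` is a hypothetical callable returning one scalar score per example.
import torch
import torch.nn.functional as F

def reward_model_loss(reward_model, prompts, chosen, rejected):
    """Fit the reward model so chosen completions outscore rejected ones."""
    r_chosen = reward_model(prompts, chosen)      # shape: (batch,)
    r_rejected = reward_model(prompts, rejected)  # shape: (batch,)
    # Bradley-Terry pairwise loss: -log sigmoid(r_chosen - r_rejected)
    return -F.logsigmoid(r_chosen - r_rejected).mean()
```

This separate model is what RLHF then uses to score rollouts during policy optimization, even though it never participates in generation itself.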
Advancements in Alignment Optimization
To address the limitations of traditional RLHF, research has focused on optimizing directly from feedback without needing to train a reward model or perform extensive data annotation [00:17:29]:
- DPO (Direct Preference Optimization) [00:17:36]: Optimizes the policy directly on preference pairs, without training a separate reward model, which makes the process more efficient [00:17:34] (see the loss sketches after this list).
- KTO (Kahneman-Tversky Optimization) [00:17:56]: Developed at Contextual AI, this approach breaks the dependency on preference pairs by optimizing directly on per-example feedback, without requiring additional data annotation [00:17:42]. It is grounded in utility theory and prospect theory from behavioral economics [00:18:02].
- CLAIR (Contrastive Learning from AI Revisions) [00:19:03]: This method addresses the “under-specification” problem in preference datasets [00:18:19]. Instead of ranking two arbitrary options, CLAIR contrasts a response with a minimally revised version of it, so the small difference between the two represents a specific “fix,” making the preference signal much tighter [00:18:45].
- APO (Anchored Preference Optimization) [00:19:55]: Building on CLAIR, APO incorporates the quality of the model being trained into the optimization process [00:19:16]. If the model is already better than the preference data, it learns only the ranking information (“this one is better than that one”) rather than the specific “right answer” from the data, which may be suboptimal [00:19:32]. APO provides more control over how data quality impacts model quality after training [00:20:06].
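For reference, here is a minimal sketch of the DPO objective, which replaces the explicit reward model with an implicit reward derived from policy and reference log-probabilities. This is an illustration of the published method, not code from the interview or from Contextual AI.

```python
# Illustrative DPO loss: inputs are summed log-probabilities of each completion
# under the policy being trained and under a frozen reference model.
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Direct Preference Optimization: no separate reward model is trained."""
    chosen_ratio = policy_chosen_logps - ref_chosen_logps        # implicit reward (chosen)
    rejected_ratio = policy_rejected_logps - ref_rejected_logps  # implicit reward (rejected)
    # Maximize the margin between implicit rewards of chosen vs. rejected
    return -F.logsigmoid(beta * (chosen_ratio - rejected_ratio)).mean()
```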
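KTO-style training can be sketched in a similar way, but over unpaired examples labeled only as desirable or undesirable, which is what makes thumbs up/down feedback usable directly. The version below is heavily simplified (in particular the reference point `z0`) and is only meant to show how binary feedback replaces preference pairs, not to reproduce the published algorithm.

```python
# Rough, simplified KTO-style objective on unpaired good/bad examples.
import torch

def kto_loss(policy_logps, ref_logps, is_desirable,
             beta=0.1, lambda_d=1.0, lambda_u=1.0):
    """is_desirable is a boolean tensor: True for thumbs-up, False for thumbs-down."""
    rewards = policy_logps - ref_logps               # implicit per-example reward
    z0 = rewards.mean().clamp(min=0.0).detach()      # crude reference point (simplified)
    values = torch.where(
        is_desirable,
        lambda_d * torch.sigmoid(beta * (rewards - z0)),  # gains for desirable outputs
        lambda_u * torch.sigmoid(beta * (z0 - rewards)),  # losses for undesirable outputs
    )
    return (1.0 - values).mean()
```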
Practical Application in Enterprise AI
At Contextual AI, alignment work is often focused on the core model during post-training [00:20:47]. By leveraging algorithms like KTO and APO, Contextual AI can learn directly from customer feedback, such as thumbs up/down mechanisms in deployments, which was not feasible with standard RLHF [00:20:33].
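As an illustration of why this matters operationally, deployment feedback can be consumed with essentially no annotation step: each thumbs up/down event becomes a single unpaired training example. The sketch below uses hypothetical field names, not Contextual AI's actual schema or pipeline.

```python
# Hypothetical mapping from deployment feedback to KTO-style training examples.
from dataclasses import dataclass

@dataclass
class FeedbackEvent:
    prompt: str
    completion: str
    thumbs_up: bool  # end-user clicked thumbs up (True) or thumbs down (False)

def to_kto_examples(events):
    """Each event becomes one unpaired example: no preference pairs,
    no extra annotation pass."""
    return [
        {"prompt": e.prompt, "completion": e.completion, "desirable": e.thumbs_up}
        for e in events
    ]

# Usage: examples = to_kto_examples(feedback_log), then feed into a KTO-style trainer.
```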
This alignment process tailors models to specific business use cases, moving beyond generalist AI to specialized, customized solutions [00:21:07]. This specialization helps models pass the “production bar” for enterprise deployments, leading to measurable Return on Investment (ROI) [00:21:39]. It reflects a “systems over models” approach, where the aim is to deliver integrated systems that solve specific problems, rather than just providing general models [00:04:43].