From: redpointai
Introduction to Finetuning
Fine-tuning is the process of adapting models to specific use cases and user preferences [00:55:00], and is considered a great way to please a wider range of users [00:55:00].
Evolution of Finetuning Perception
Initially, some people questioned how helpful fine-tuning really was [00:21:22]. However, newer models have brought a “renaissance” for fine-tuning, demonstrating its real utility [00:21:26]. Michelle Pokrass, a post-training research lead at OpenAI, noted that she used to be a “fine-tuning bear” but now believes it is worth the time for specific domains where pushing the frontier is necessary [00:44:10]. This shift is partly because the RFT (Reinforcement Fine-Tuning) process is similar to the internal reinforcement learning algorithms OpenAI uses to train its models [00:44:42].
Types of Finetuning
Michelle Pokrass groups fine-tuning into two main camps, with preference fine-tuning as a third, stylistic option; illustrative API sketches for each appear after the list:
- Fine-tuning for speed and latency (SFT, Supervised Fine-Tuning) [00:21:40]. SFT is the “workhorse” of OpenAI’s fine-tuning offering, allowing models like GPT-4.1 to run at a fraction of their usual latency [00:21:49]. If a model works well but faster performance is desired, fine-tuning with SFT (and considering smaller models like Mini and Nano) is recommended [00:27:39]. It can also handle simpler tasks, such as classification where the base model gets a small percentage of cases wrong [00:24:13].
- Fine-tuning for frontier capabilities (RFT, Reinforcement Fine-Tuning) [00:22:01].
- Purpose: RFT lets users push the frontier in their specific area [00:22:08]. It is extremely data-efficient, potentially requiring only hundreds of samples [00:22:16], and is less fragile than SFT [00:23:38]. RFT was scheduled to become generally available (GA) in the week following the podcast recording [00:22:25].
- Applications:
- Teaching an agent how to pick a workflow or work through its decision process [00:22:37].
- Deep tech applications where an organization has unique, verifiable data, leading to the “absolute best results” [00:22:48]. Examples of such domains include chip design and biology (e.g., drug discovery), where exploration is needed but successes are easily verifiable [00:24:47].
- When to use: RFT should be considered when “no model in the market does what you need” [00:24:22].
- Preference Fine-Tuning: used primarily for stylistic preferences [00:24:02]; it launched somewhat recently [00:24:07].
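As a concrete illustration of the SFT path, the sketch below uploads a JSONL file of chat-formatted examples and starts a supervised fine-tuning job via OpenAI’s Python SDK. The file name and model snapshot are assumptions for illustration; substitute whichever fine-tunable model fits your latency target (e.g., a Mini- or Nano-class model).

```python
from openai import OpenAI

client = OpenAI()

# Each line of sft_data.jsonl is one chat-formatted training example, e.g.:
# {"messages": [{"role": "user", "content": "..."},
#               {"role": "assistant", "content": "..."}]}
training_file = client.files.create(
    file=open("sft_data.jsonl", "rb"),
    purpose="fine-tune",
)

# Start the supervised fine-tuning job; the snapshot name is illustrative.
job = client.fine_tuning.jobs.create(
    training_file=training_file.id,
    model="gpt-4.1-mini-2025-04-14",
)
print(job.id, job.status)
```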
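For the RFT path, the sketch below is an indicative outline only: it assumes the fine-tuning API’s `method` parameter accepts a reinforcement configuration with a grader that scores outputs against a verifiable reference (the “easily verifiable” property discussed above). The grader schema, template variables, and model snapshot shown are assumptions; consult OpenAI’s RFT documentation for the exact format.

```python
from openai import OpenAI

client = OpenAI()

# RFT is data-efficient: hundreds of samples can suffice.
training_file = client.files.create(
    file=open("rft_data.jsonl", "rb"),
    purpose="fine-tune",
)

job = client.fine_tuning.jobs.create(
    training_file=training_file.id,
    model="o4-mini-2025-04-16",  # illustrative reasoning-model snapshot
    method={
        "type": "reinforcement",
        "reinforcement": {
            # Assumed grader config: checks the model's answer against a
            # verifiable reference field in each training item.
            "grader": {
                "type": "string_check",
                "name": "exact_match",
                "input": "{{sample.output_text}}",
                "reference": "{{item.correct_answer}}",
                "operation": "eq",
            },
        },
    },
)
print(job.id, job.status)
```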
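Preference fine-tuning trains on pairs of preferred versus non-preferred responses (OpenAI’s offering is based on Direct Preference Optimization). The JSONL shape and `method` config below are a sketch under that assumption; field names and the model snapshot are illustrative, so check the current API docs.

```python
import json
from openai import OpenAI

client = OpenAI()

# One training example per JSONL line (assumed format): a prompt plus a
# preferred and a non-preferred assistant response, for steering style.
example = {
    "input": {"messages": [{"role": "user", "content": "Summarize this report."}]},
    "preferred_output": [{"role": "assistant", "content": "Concise, neutral summary..."}],
    "non_preferred_output": [{"role": "assistant", "content": "Rambling, florid summary..."}],
}
with open("prefs.jsonl", "w") as f:
    f.write(json.dumps(example) + "\n")

training_file = client.files.create(file=open("prefs.jsonl", "rb"), purpose="fine-tune")

job = client.fine_tuning.jobs.create(
    training_file=training_file.id,
    model="gpt-4.1-2025-04-14",  # illustrative snapshot
    method={"type": "dpo", "dpo": {"hyperparameters": {"beta": 0.1}}},
)
print(job.id, job.status)
```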
Strategic Considerations for Companies
Among companies building on OpenAI’s APIs, the most successful, especially given the rapid pace of model releases, are those with strong internal evals tailored to their specific use cases [00:17:57]. They can then quickly run these evals on new models as they are released [00:18:06]. Successful customers also adapt their prompts and scaffolding to particular models [00:18:15].
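A minimal sketch of that eval pattern: a fixed suite of cases with a pass/fail check, parameterized by model name so a new release can be scored the day it ships. The cases and the containment-based grading rule here are placeholders for your own task-specific suite.

```python
from openai import OpenAI

client = OpenAI()

EVAL_CASES = [
    {"prompt": "Classify this ticket: 'My invoice is wrong.'",
     "expected": "billing"},
    # ... the rest of your task-specific cases
]

def run_eval(model: str) -> float:
    """Return the pass rate of `model` on the eval suite."""
    passed = 0
    for case in EVAL_CASES:
        resp = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": case["prompt"]}],
        )
        answer = (resp.choices[0].message.content or "").strip().lower()
        passed += case["expected"] in answer  # simple containment check
    return passed / len(EVAL_CASES)

# When a new snapshot ships, rerun the same suite and compare pass rates:
for model in ["gpt-4.1", "gpt-4.1-mini"]:  # illustrative model names
    print(model, run_eval(model))
```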
A key recommendation for companies is to “build stuff which is maybe just out of reach of the current models” [00:18:25]. If a use case works only one out of ten times (10% pass rate) but can be fine-tuned to 50%, it’s likely something a future model will “crush” within a few months, making it a good candidate to work on [00:18:53]. Building scaffolding to make a product work is worthwhile for a few months of “arbitrage” until the capability becomes more easily available natively in future models [00:19:54]. However, it’s crucial to be prepared to change things and keep an eye on future trends like improving context windows, reasoning capabilities, and instruction following [00:20:16]. Connecting the model to as much information about a task as possible is also beneficial, even if current results are mediocre, as future models will likely improve [00:21:09].
Generalization vs. Specialized Models
OpenAI’s general philosophy leans into the “G” in AGI (Artificial General Intelligence), aiming to make one general model rather than purpose-built models for different groups [00:15:52]. The goal is to simplify the product offering and have one model for both chat and API use cases [00:16:06]. However, the development of GPT-4.1, which focused on developers, was an exception due to a particularly acute need and the ability to move faster by decoupling from ChatGPT [00:16:15]. This allowed for specific choices in model training, such as removing ChatGPT-specific datasets and significantly upweighting coding data [00:16:35].
Despite this targeted approach, the general trend indicates that combining everything into one model typically produces a much better result [00:26:05]. For example, learning to use one set of tools makes a model better at using other sets of tools [00:34:21]. The capabilities of models like o3 already encompass many deep research functions, offering a quicker alternative to specialized deep research agents [00:34:38]. The future challenge for models like GPT-5 is to combine diverse capabilities, such as being a delightful chitchat partner while knowing when to engage in deep reasoning [00:37:57].