From: redpointai
Jonathan Frankle, Chief AI Scientist at Databricks, advises enterprises on strategic decisions regarding AI model selection, training, fine-tuning, and prompt engineering [00:00:07]. His insights stem from working with Databricks’ 12,000 customers on AI [00:00:14].
The Iterative Journey of AI Model Development
The path to deploying AI systems is not always clear from the outset [00:07:41]. Frankle emphasizes keeping options open and starting small [00:07:56].
Starting Small: Prompting
The journey typically begins with simple prompting of existing models, such as OpenAI’s or Llama models available on Databricks [00:08:21]. This initial step serves as a litmus test for whether AI is suitable for a given use case, since predictability is low at this stage [00:08:32]. It is an experiment, data science in the literal sense [00:08:40].
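As a concrete illustration, a first prompting experiment can be as small as a single chat call against an existing serving endpoint. The sketch below assumes an OpenAI-compatible API (which Databricks model serving exposes); the workspace URL, token, and endpoint name are placeholders rather than values from the conversation.

```python
# A minimal prompting sketch. The workspace URL, token, and endpoint name are
# placeholders; Databricks serving endpoints expose an OpenAI-compatible API,
# so the standard OpenAI client can be pointed at them.
from openai import OpenAI

client = OpenAI(
    api_key="<DATABRICKS_TOKEN>",                            # personal access token (placeholder)
    base_url="https://<workspace-host>/serving-endpoints",   # workspace serving URL (placeholder)
)

response = client.chat.completions.create(
    model="databricks-meta-llama-3-1-70b-instruct",          # placeholder endpoint name
    messages=[
        {"role": "system", "content": "You answer questions about our support tickets."},
        {"role": "user", "content": "Summarize the most common complaint categories."},
    ],
    temperature=0.1,
)
print(response.choices[0].message.content)
```

Even a handful of such calls against real examples is usually enough to tell whether the use case deserves further investment.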
Incorporating Data: Retrieval Augmented Generation (RAG)
If initial prompting shows promise, the next step often involves bringing enterprise-specific data to bear using hardcore RAG [00:09:27]. Generic models won’t know about internal data, so it must be integrated [00:09:31].
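A bare-bones version of this step is just retrieval plus prompting: embed a handful of internal documents, pull the closest ones for a question, and paste them into the prompt. In the sketch below the embedding and chat endpoint names are placeholders; any OpenAI-compatible embedding and chat models would do, and a production system would use a proper vector index rather than in-memory cosine similarity.

```python
# A bare-bones RAG sketch: embed internal documents, retrieve the closest ones
# for a question, and prepend them to the prompt. Endpoint names are placeholders.
import numpy as np
from openai import OpenAI

client = OpenAI(api_key="<DATABRICKS_TOKEN>",
                base_url="https://<workspace-host>/serving-endpoints")

docs = [
    "Refund requests must be filed within 30 days of purchase.",
    "Enterprise plans include 24/7 support with a 4-hour SLA.",
]

def embed(texts):
    # Placeholder embedding endpoint; returns one vector per input text.
    out = client.embeddings.create(model="databricks-bge-large-en", input=texts)
    return np.array([d.embedding for d in out.data])

doc_vecs = embed(docs)

def answer(question, k=1):
    q_vec = embed([question])[0]
    # Cosine similarity between the question and each document.
    sims = doc_vecs @ q_vec / (np.linalg.norm(doc_vecs, axis=1) * np.linalg.norm(q_vec))
    context = "\n".join(docs[i] for i in np.argsort(sims)[-k:])
    resp = client.chat.completions.create(
        model="databricks-meta-llama-3-1-70b-instruct",  # placeholder chat endpoint
        messages=[{"role": "user",
                   "content": f"Answer using only this context:\n{context}\n\nQuestion: {question}"}],
    )
    return resp.choices[0].message.content

print(answer("What is the refund window?"))
```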
Fine-tuning
If RAG delivers value, fine-tuning becomes a consideration [00:09:50]. Fine-tuning bakes more specific knowledge into the model itself; it incurs a higher upfront cost, but can deliver comparable or better quality from a smaller model, which reduces inference costs or improves output quality [00:09:51].
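For illustration, the sketch below shows one common way to run a small supervised fine-tune with Hugging Face Transformers. This is not Databricks’ managed fine-tuning service, and the base model and toy dataset are placeholders.

```python
# A minimal supervised fine-tuning sketch using Hugging Face Transformers.
# The base model name and the tiny dataset are illustrative placeholders.
from datasets import Dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer, Trainer,
                          TrainingArguments, DataCollatorForLanguageModeling)

model_name = "meta-llama/Llama-3.1-8B"  # placeholder base model
tok = AutoTokenizer.from_pretrained(model_name)
tok.pad_token = tok.eos_token
model = AutoModelForCausalLM.from_pretrained(model_name)

# Tiny illustrative dataset of prompt/response pairs flattened into plain text.
examples = [{"text": "Q: What is our refund window?\nA: 30 days from purchase."}]
ds = Dataset.from_list(examples).map(
    lambda r: tok(r["text"], truncation=True, max_length=512),
    remove_columns=["text"])

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="ft-out", num_train_epochs=1,
                           per_device_train_batch_size=1, learning_rate=1e-5),
    train_dataset=ds,
    data_collator=DataCollatorForLanguageModeling(tok, mlm=False),
)
trainer.train()
```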
Continued Pre-training
Beyond fine-tuning, some organizations might consider continued pre-training, further training an existing model on large volumes of domain-specific text [00:10:03]. This requires an even more significant upfront investment and is justified by extensive model usage [00:18:46].
Full Pre-training from Scratch
Pre-training a model from scratch is a massive undertaking, expensive and labor-intensive, and is generally discouraged unless absolutely necessary [00:10:08]. However, for those with unique needs, it’s an option that Databricks supports [00:10:10].
Justifying the “Work Your Way Up”
The progression from prompting to full pre-training should be justified at each step by a rigorous return-on-investment (ROI) analysis [00:09:04]. Each step involves an upfront investment that pays for itself quickly if there is enough usage [00:18:44]. The decision is ultimately a cost-quality trade-off: a company can either achieve the same quality at a lower inference cost or get a higher-quality model for the same cost [00:18:10]. Once product-market fit is achieved, it becomes a matter of optimizing for quality and cost [00:19:28].
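A back-of-envelope calculation makes the trade-off concrete. The numbers below are purely illustrative (not figures from the conversation); they show how a higher upfront training cost can be amortized by cheaper inference once usage is high enough.

```python
# Back-of-envelope ROI check with purely illustrative numbers: when does a
# fine-tuned small model's lower per-request inference cost pay back the
# upfront training spend relative to prompting a larger general model?
upfront_training_cost = 50_000.0   # dollars (illustrative)
cost_per_request_large = 0.010     # dollars per request, prompting a large general model
cost_per_request_small = 0.002     # dollars per request, serving the fine-tuned small model
requests_per_day = 200_000

savings_per_day = (cost_per_request_large - cost_per_request_small) * requests_per_day
break_even_days = upfront_training_cost / savings_per_day
print(f"Daily savings: ${savings_per_day:,.0f}; break-even in {break_even_days:.1f} days")
# -> Daily savings: $1,600; break-even in 31.3 days
```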
When Company/Domain-Specific Models Make Sense
There are specific scenarios where investing in dedicated model training, even full pre-training, is logical:
- Language and Domain Gaps: When general models struggle with specific languages (e.g., Japanese, Korean, or Indian languages, owing to limited training data or tokenizer issues) [00:16:26].
- Unique Task Domains: For tasks fundamentally different from general language understanding, like protein modeling [00:17:13].
- Speed and Specificity Requirements: When a very fast and highly specific model is crucial, such as for code completion for free-tier users where cost is a major constraint [00:17:27].
- Cost Optimization for High Usage: For models with very high usage, the upfront cost of pre-training can lead to significant long-term savings or quality improvements [00:17:58].
The Role of Data and Evaluation
A common mistake is waiting for perfect data or perfect evaluation metrics before starting AI development [00:10:14]. Frankle advises an agile approach:
- Iterate Quickly: Do just enough data work to interact with a model, build the “crappiest model” possible, create a quick evaluation, and test it against the real world [00:10:44].
- Human Feedback: Connect with human testers, even just one friend at the company, for real-world feedback [00:11:32]. Databricks’ team conducts A/B testing of model outputs and RLHF-style pairwise comparisons without revealing the model source [00:11:46].
- Simple Evaluations: Start with simple evaluations, even just five examples with graded responses (e.g., on a 1-5 scale), which can be used to calibrate an LLM judge (see the sketch after this list) [00:13:01].
- Dedicated Tools: Databricks offers a new agent evaluation product to help users create meaningful evaluation sets quickly, aiming for a few dozen examples in an afternoon [00:13:32]. This tool is a key focus for Frankle, as a measuring stick is essential for any progress [00:13:37].
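As an illustration of the handful-of-graded-examples idea, the sketch below asks a judge model to score answers on a 1-5 scale and compares its scores with the human grades, which is a quick way to sanity-check the judge before relying on it. Endpoint names are placeholders, and this is not Databricks’ agent evaluation product.

```python
# A minimal LLM-as-judge sketch checked against a handful of human-graded
# examples on a 1-5 scale. Endpoint names are placeholders.
from openai import OpenAI

client = OpenAI(api_key="<DATABRICKS_TOKEN>",
                base_url="https://<workspace-host>/serving-endpoints")

# Tiny hand-graded evaluation set: question, candidate answer, human score (1-5).
graded = [
    {"q": "What is the refund window?", "a": "30 days from purchase.", "human": 5},
    {"q": "What is the refund window?", "a": "Contact support.",       "human": 2},
]

def judge(question, answer):
    prompt = ("Grade the answer to the question on a 1-5 scale. "
              f"Reply with a single digit.\nQuestion: {question}\nAnswer: {answer}")
    resp = client.chat.completions.create(
        model="databricks-meta-llama-3-1-70b-instruct",  # placeholder judge endpoint
        messages=[{"role": "user", "content": prompt}],
        temperature=0.0,
    )
    return int(resp.choices[0].message.content.strip()[0])

# Compare judge scores with human grades to sanity-check the judge.
for ex in graded:
    print(ex["q"], "| human:", ex["human"], "| judge:", judge(ex["q"], ex["a"]))
```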
Insights from Databricks’ DBRX Model Development
The development of Databricks’ DBRX language model exemplified the end-to-end platform approach [00:14:39]. The team used Spark for data ingestion and ETL, Delta tables for storage, Unity Catalog for tracking data sets, MosaicML tools for model training, and MLflow for experiment tracking [00:14:43]. This integrated use of their own products saved significant time [00:14:58].
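The experiment-tracking piece of that stack is easy to picture; the snippet below is a generic MLflow sketch with illustrative parameter and metric values, not DBRX’s actual training configuration.

```python
# A small MLflow experiment-tracking sketch of the kind used during training runs.
# The experiment path, parameters, and metric values are illustrative only.
import mlflow

mlflow.set_experiment("/Shared/llm-training-experiments")  # placeholder experiment path

with mlflow.start_run(run_name="finetune-trial-1"):
    mlflow.log_params({"base_model": "llama-3.1-8b", "lr": 1e-5, "epochs": 1})
    for step, loss in enumerate([2.1, 1.7, 1.4]):          # illustrative loss curve
        mlflow.log_metric("train_loss", loss, step=step)
    mlflow.log_metric("eval_judge_score", 4.2)              # illustrative eval result
```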
Future Outlook
Frankle believes the open-source model world is “exceedingly well covered” [00:38:30]. Therefore, Databricks’ focus shifts to other gaps to ensure customer success [00:37:07]:
- Evaluation Creation: Helping customers build their initial measuring sticks [00:37:15].
- Navigating Options: Guiding customers through the complex decisions of RAG versus prompting versus fine-tuning, providing many options without overwhelming them [00:37:25].
- Data Challenges: Assisting customers in building AI systems with imperfect or fragmented data [00:37:40].
- Compound AI Systems and Agents: Focusing on connecting different pieces of data and models to solve specific problems [00:38:05].
The field is still learning what works best in different scenarios [00:31:22]. The goal is to provide maximum choice at minimal cost, then help customers make the best selection and confidently deploy models into production [00:32:23]. Even with recent advancements, AI systems are fundamentally “fuzzy,” and reaching the fifth nine of reliability (99.999%) with current technology remains challenging [00:24:07]. This understanding helps manage expectations and keeps the focus on learning the strengths and weaknesses of the technology [00:24:23].