From: redpointai

Jonathan Frankle, Chief AI Scientist at Databricks, outlines how the company supports enterprises adopting AI, emphasizing pragmatic model selection and evaluation, along with the unusual incentive structures he uses to foster innovation on his team [00:00:00].

Enterprise AI Adoption Strategy

Databricks guides its 12,000 customers through the AI adoption journey by prioritizing flexibility and return on investment (ROI) [00:08:00]. The initial recommendation is to “start small and work your way up” [00:09:01]:

  • Prompting: Begin by prompting an existing model (e.g., OpenAI or Llama on Databricks) for a “litmus test” to see if AI is suitable for a given use case [00:08:21].
  • Retrieval-Augmented Generation (RAG): If initial prompting shows promise, integrate proprietary enterprise data via RAG to make the model more relevant, since generic models have no knowledge of internal data [00:09:27].
  • Fine-tuning: For more significant value, fine-tuning can bake knowledge into the model, leading to better quality in a smaller package and reducing inference costs [00:09:50].
  • Continued Pre-training / Pre-training from Scratch: This is the most significant undertaking, recommended only when justified by rigorous ROI and substantial usage [00:10:02]. Databricks offers the capability to pre-train models from scratch, which is “not for the faint of heart” but has made a “huge difference” for some customers [00:10:09].
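
To make the first rung concrete: Databricks model serving exposes an OpenAI-compatible API, so the “litmus test” can be a few lines of code. In the sketch below, the workspace URL, token, and endpoint name are placeholders, not details from the conversation:

```python
# Minimal "litmus test": prompt an already-served model before investing
# in RAG or fine-tuning. URL, token, and model name are placeholders.
from openai import OpenAI

client = OpenAI(
    base_url="https://<workspace-host>/serving-endpoints",  # placeholder
    api_key="<databricks-token>",                           # placeholder
)

response = client.chat.completions.create(
    model="databricks-meta-llama-3-70b-instruct",  # hypothetical endpoint name
    messages=[
        {"role": "system", "content": "You answer questions about our support tickets."},
        {"role": "user", "content": "Summarize the most common complaint this week."},
    ],
    temperature=0.1,
)
print(response.choices[0].message.content)
```

If the raw model’s answers are even roughly useful, the use case is worth the next rung; if not, little has been invested.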

A common mistake observed is waiting for “perfect data” or “perfect evaluation” before starting AI projects [00:10:14]. Instead, the advice is to be agile: do just enough data work to interact with a model, build a quick evaluation, test, and then iterate based on real-world performance [00:10:44].
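
The “quick evaluation” can be just as minimal. A sketch of the iterate-first loop follows, where the examples, the crude substring check, and the ask_model() stub are all invented for illustration:

```python
# A deliberately tiny eval: a handful of examples and a crude check,
# just enough to tell whether the idea is worth iterating on.
examples = [
    {"prompt": "Classify: 'refund took 3 weeks'", "expect": "billing"},
    {"prompt": "Classify: 'app crashes on login'", "expect": "bug"},
    {"prompt": "Classify: 'love the new dashboard'", "expect": "praise"},
]

def ask_model(prompt: str) -> str:
    # Stub: swap in a real call to a served model (see the sketch above).
    return "billing"

hits = sum(ex["expect"] in ask_model(ex["prompt"]).lower() for ex in examples)
print(f"{hits}/{len(examples)} rough passes -- iterate from here")
```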

AI Model Evaluation and Benchmarking

Evaluation methodologies are critical, and Databricks acknowledges that initial benchmarks are rarely perfect proxies for real-world scenarios [00:11:22].

Key aspects of AI evaluation and benchmarking:

  • Human-in-the-Loop Testing: Involving human testers, even a single friend at the company who isn’t on the project, provides more valuable feedback than synthetic benchmarks [00:11:30]. Jonathan’s team runs blind A/B tests of model outputs, including natural language and image generation, often without knowing whether an output came from their own model, Llama, or an OpenAI model [00:11:45].
  • Simple Eval Creation: Start with as few as five evaluation examples and rate responses on a one-to-five scale, even when there is no perfectly correct answer; those ratings can then calibrate an LLM judge [00:13:01].
  • New Evaluation Product: Databricks has released an “agent evaluation product” that assists users in creating meaningful evaluation sets, aiming to allow users to build a robust eval set of a few dozen examples in an afternoon [00:13:32]. This product is key because “until you have a measuring stick, anything else you do is kind of you’re just making things up” [00:13:39].
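
As a rough illustration of the “five examples, rated one to five” idea (this is not the agent evaluation product itself), a few human ratings can be packed into a judge prompt as few-shot calibration; the rubric, the examples, and the ask_model() stub are assumptions:

```python
# Calibrating an LLM judge with a few human-rated examples. The rubric,
# the calibration data, and ask_model() are illustrative placeholders.
calibration = [
    {"q": "What is our refund window?", "a": "30 days from purchase.", "human": 5},
    {"q": "What is our refund window?", "a": "We sell great software.", "human": 1},
]

RUBRIC = ("Rate the answer from 1 (useless) to 5 (fully correct and helpful). "
          "Reply with a single digit.\n")

def ask_model(prompt: str) -> str:
    # Stub: replace with any chat-completion call.
    return "4"

def judge(q: str, a: str) -> int:
    # Few-shot prompt: rubric, then human-rated examples, then the new
    # question/answer pair to score.
    shots = "".join(f"Q: {ex['q']}\nA: {ex['a']}\nScore: {ex['human']}\n"
                    for ex in calibration)
    reply = ask_model(f"{RUBRIC}{shots}Q: {q}\nA: {a}\nScore:")
    return int(reply.strip()[0])

print(judge("What is our refund window?", "Two weeks, I think."))
```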

Databricks Platform and Innovation

Databricks’ platform integrates various tools to support the entire AI lifecycle:

  • Data Ingestion and Storage: Utilizes Spark for ETL, Delta tables for storage, and Unity Catalog for tracking data sets [00:14:43]. Spark dramatically reduced processing times from weeks to minutes [00:15:16].
  • Model Training and Experiment Tracking: Leverages Mosaic tools for model training and MLflow for experiment tracking [00:15:30].
  • Inference: Uses Mosaic inference service [00:15:44].
  • Product Development: The philosophy is to build products that the internal team wants to use [00:15:51]. The highest endorsement of Databricks’ products is that Jonathan’s team uses all of them [00:14:29].
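
A hedged end-to-end sketch of that lifecycle; the path, table name, and logged values are placeholders, though the APIs shown (PySpark, Delta, MLflow) are the standard ones:

```python
# One pass through the lifecycle: Spark for ETL, a Delta table under a
# Unity Catalog three-part name, and MLflow for experiment tracking.
import mlflow
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# ETL: read raw records, clean them, store as a governed Delta table.
raw = spark.read.json("/Volumes/main/default/raw/tickets/")  # placeholder path
clean = raw.dropna(subset=["text"]).dropDuplicates(["ticket_id"])
clean.write.format("delta").mode("overwrite").saveAsTable("main.default.tickets_clean")

# Experiment tracking: log a hypothetical fine-tuning run's config and score.
with mlflow.start_run(run_name="ft-baseline"):
    mlflow.log_param("base_model", "llama-3-8b")  # illustrative
    mlflow.log_metric("judge_score_mean", 4.1)    # illustrative
```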

Jonathan Frankle’s Incentive System

Jonathan Frankle employs a distinctive system to motivate his team and partners:

  • Hair Dyeing: He dyed his hair blue when their open-source model (DBRX) was released, putting his “body on the line to incentivize them to do awesome work” [00:00:58].
  • Swords: He buys swords for engineers, external partners, and the legal team who provide “great acts of service” to the research team, explicitly stating “swords go to the people who provide service to our team” [00:01:15]. There is even a “Databricks approved sword vendor” [00:01:28].
  • Food: His direct team receives “cookies,” “cake,” or “Chipotle” as incentives [00:12:29].
  • Openness to new incentives: He is “always open” to new tributes from his team that would warrant redyeing his hair [00:01:40].

Views on AI Development and Challenges

  • Transformer Dominance: Frankle maintains a “long view” on his bet that Transformers will remain the dominant architecture, noting that current models are essentially the original Transformer with minor tweaks [00:02:57]. New architectures are “hard to come by,” and science tends to move in “big leaps” followed by consolidation [00:04:18].
  • Domain-Specific Models: Company-specific or domain-specific models are valuable when:
    • General models are not good at the task, especially for non-English languages [00:16:22].
    • The task is fundamentally different (e.g., protein modeling) [00:17:13].
    • A really fast, specific model is needed where cost is a major factor (e.g., code completion for free tier users) [00:17:27].
    • It’s a cost decision, where upfront investment in pre-training pays for itself quickly through better quality or reduced inference costs (see the break-even sketch after this list) [00:17:56].
  • “Fuzziness” of AI: AI’s “fuzziness” is both a superpower and a challenge [00:22:37]. While techniques like chaining models can push towards higher quality, achieving “perfection” or “five nines of quality” is hard with current technology (see the reliability sketch after this list) [00:24:07].
  • Product-Market Fit: AI has found product-market fit in two main patterns:
    • Scenarios where outputs don’t need to be perfectly “right,” such as brainstorming or creative applications [00:19:47].
    • Scenarios where AI-generated answers are costly to produce manually but quick for a human to check, like code co-pilots [00:20:14].
  • AI Infrastructure Landscape: Databricks seeks to have all tools available and working well together on one platform, while also partnering with startups offering “amazing point solutions” so customers have access to the best tools [00:33:30]. Acquisitions, like Lilac, happen when a product is amazing and aligns well with Databricks’ offerings [00:34:52].
  • Future Focus: Databricks’ priorities include evaluation creation, navigating fine-tuning versus RAG, and developing compound AI systems to connect various pieces and make raw material useful to customers [00:37:13]. He believes the open-source model world is “exceedingly well covered,” so Databricks focuses on other gaps [00:38:29].
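
On the cost decision for domain-specific models, the break-even arithmetic is easy to sketch; every number below is an invented assumption, not a Databricks figure:

```python
# Break-even for pre-training a small custom model versus paying
# per-token for a larger general one. All numbers are assumptions.
upfront_training_cost = 250_000.0  # dollars, one-time (assumed)
general_cost_per_m_tok = 10.0      # $/million tokens, general model (assumed)
custom_cost_per_m_tok = 1.0        # $/million tokens, custom model (assumed)

savings_per_m_tok = general_cost_per_m_tok - custom_cost_per_m_tok
break_even_m_tok = upfront_training_cost / savings_per_m_tok
print(f"Break-even after ~{break_even_m_tok:,.0f}M inference tokens")
# At 100M tokens/day of traffic, that is under a year -- which is why
# "substantial usage" is the prerequisite mentioned earlier.
```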
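
On the “five nines” point, a toy reliability calculation shows both why chaining helps and why it stalls. It assumes an idealized verifier that catches every bad output and fully independent retries, both optimistic simplifications:

```python
# Retry-and-verify chaining: probability that at least one of k
# attempts succeeds, given an ideal verifier and independent tries.
def chained_success(p_single: float, k: int) -> float:
    return 1.0 - (1.0 - p_single) ** k

for k in (1, 2, 4, 8):
    print(f"k={k}: {chained_success(0.90, k):.8f}")
# 0.9 -> 0.99 -> 0.9999 -> 0.99999999 on paper, but real verifiers are
# themselves fuzzy and failures correlate, so the curve flattens well
# before five nines.
```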

Policy and Trust in AI

Frankle emphasizes that AI experts “owe it to society” to participate in policy conversations, not just with self-interest, but to ensure responsible use [00:53:00]. Key policy considerations include:

  • Determining Use Cases: Society needs to decide when to allow AI systems and when not to, especially in high-stakes contexts like law enforcement, medicine, or autonomous vehicles where mistakes can have severe consequences [00:55:19].
  • Standards for AI vs. Humans: Holding AI to rigorous standards often reveals that human performance is also lacking, potentially leading to improved standards for both [00:27:31].
  • Transparency and Honesty: Building trust in the AI industry requires honesty about capabilities and limitations, clearly stating what is known and unknown [00:57:52].

Areas of Interest

  • Robotics/Embodied Systems: While not “underexplored,” Frankle is excited about the potential of robotics to perform unscalable tasks and interact with the physical world, similar to how digital technology interacts with the information world [00:58:53].
  • AI for Smell (Osmo AI): Highlighted as a “truly creative” application where AI is enabling new possibilities in unexpected domains [00:48:06].
  • Human-AI Interaction (HCI): This is a field where Databricks is seeking expertise [00:50:00].
  • Data Annotation: Essential for AI, requiring significant expertise and trust in partners like SuperAnnotate and Surge [00:51:18].
  • Experimentation with New Products: Frankle appreciates companies like Anthropic for their willingness to take risks and experiment with new AI-driven products [00:46:04].

Frankle concludes that Databricks is the “nexus” of science, policy, and society in AI, being on the ground floor to observe how AI is applied to “most economically useful tasks” across its 12,000 customers [01:01:04].