Challenges and strategies in AI deployment

From: redpointai
Jonathan Frankle, Chief AI Scientist at Databricks, focuses on helping enterprises navigate the complexities of AI adoption and deployment [00:00:07]. His work involves guiding companies on when to train their own models, fine-tune existing ones, or simply use prompt engineering [00:00:09].

Core Guidance for AI Model Deployment

Frankle advises organizations to keep their options open when embarking on their AI journey, as the ultimate destination (e.g., prompting vs. pre-training) is not always obvious at the outset [00:07:56]. Databricks aims to provide comprehensive tools so no one has to compromise on their AI development path [00:08:00].

The recommended approach is to:

Start small and work your way up [00:09:01].
Run experiments to test if AI is suitable for a given task, even by manually pulling in documents and prompting a model [00:08:43].
Justify scaling efforts with rigorous ROI [00:09:04].

Data and Evaluation Challenges

A common mistake is believing that AI can only be implemented once data or evaluations are perfect [00:10:14]. In practice it works the other way around: whether the resulting AI system is useful tells you whether the data is good enough, and real-world performance tells you whether the evaluation is valid [00:10:24].

Key advice for data and evaluation:

Be agile: Do just enough data work to interact with the model, build a quick evaluation, test it, and then iterate to refine the model, data, or even reassess expectations from AI [00:10:44].
Human input is crucial: Any evaluation is a proxy for the real world [00:11:25]. Having even one person not involved in the project act as a human tester for the process is more valuable than synthetic benchmarks [00:11:32]. Databricks teams perform A/B testing with human feedback, like RLHF (Reinforcement Learning from Human Feedback), for model outputs [00:11:46].
Start simple with evaluations: Begin by writing five examples, even without precise right/wrong answers, and assign quality scores (e.g., 1-5) [00:13:01]. This can calibrate an LLM (Large Language Model) judge [00:13:18].
Utilize tools: Databricks has released an agent evaluation product designed to help users create meaningful evaluation datasets of a few dozen examples in an afternoon, by leveraging automated tools while valuing human time [00:13:50].
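
The "five examples with 1-5 scores" idea above can be sketched in a few lines of Python. This is a minimal illustration, not the Databricks agent evaluation product: the example data, the `stub_judge` function (a placeholder for a real LLM judge call), and the agreement metrics are all assumptions made for this sketch.

```python
from statistics import mean

# Five hand-written examples with human quality scores on a 1-5 scale.
# Prompts, responses, and scores are invented for illustration only.
EXAMPLES = [
    {"prompt": "Summarize the refund policy.",
     "response": "Refunds are available within 30 days with a receipt.",
     "human_score": 5},
    {"prompt": "What is our support email?",
     "response": "Please check the website.",
     "human_score": 2},
    {"prompt": "Explain the onboarding steps.",
     "response": "Create an account, verify your email, then pick a plan.",
     "human_score": 4},
    {"prompt": "List the supported regions.",
     "response": "Many.",
     "human_score": 1},
    {"prompt": "Describe the pricing tiers.",
     "response": "There are free and paid tiers.",
     "human_score": 3},
]


def stub_judge(prompt: str, response: str) -> int:
    """Hypothetical stand-in for an LLM judge; returns a 1-5 score.

    It scores by response length only, so the sketch runs offline.
    A real judge would send the prompt and response to a model.
    """
    words = len(response.split())
    return max(1, min(5, 1 + words // 2))


def calibration_report(examples, judge):
    """Compare judge scores against human scores on the same examples."""
    judge_scores = [judge(e["prompt"], e["response"]) for e in examples]
    human_scores = [e["human_score"] for e in examples]
    diffs = [abs(j - h) for j, h in zip(judge_scores, human_scores)]
    return {
        "judge_scores": judge_scores,
        "mean_abs_error": mean(diffs),                        # average disagreement
        "within_one": sum(d <= 1 for d in diffs) / len(diffs),  # share within 1 point
    }


if __name__ == "__main__":
    print(calibration_report(EXAMPLES, stub_judge))
```

If the judge disagrees with the human scores by more than a point on most examples, that is a signal to revise the judge prompt or the scoring rubric before trusting it on a larger set.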

Scenarios for Custom Model Development and Deployment

While large, generic models are powerful, there are specific scenarios where developing domain- or company-specific models, or engaging in continued pre-training, makes sense [00:16:03]:

Model performance for specific languages: Generic models may not be well-tuned for languages like Japanese or Korean due to less available training data [00:16:26]. Companies in these regions often need to build their own models [00:16:56].
Different task domains: For tasks fundamentally different from typical language models, such as protein modeling, specialized models are necessary [00:17:13].
Need for speed and specificity: Some applications require extremely fast and highly specific models, like code completion tools that need to serve all users, including free tiers, efficiently [00:17:27].
Cost optimization: While pre-training models is expensive upfront, for high-usage models, it can lead to a better cost-quality trade-off [00:17:58]. This means either achieving the same quality at a lower inference cost or obtaining a better quality model for the same cost [00:18:10]. The upfront investment pays for itself quickly with sufficient usage [00:18:43].
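
The claim that the upfront investment pays for itself with sufficient usage can be made concrete with a break-even calculation. The sketch below assumes a simple linear cost model (a fixed upfront cost plus a constant cost per 1,000 requests); the function name and all dollar figures are hypothetical and do not come from the interview.

```python
import math


def break_even_thousands(upfront_cost: float,
                         generic_cost_per_1k: float,
                         custom_cost_per_1k: float) -> int:
    """Blocks of 1,000 requests needed before a custom model pays off.

    Assumes a linear cost model: the custom model costs `upfront_cost`
    once, then `custom_cost_per_1k` per 1,000 requests, while the generic
    model costs `generic_cost_per_1k` per 1,000 requests with no upfront
    cost. Quality is assumed to be equal for both.
    """
    savings_per_1k = generic_cost_per_1k - custom_cost_per_1k
    if savings_per_1k <= 0:
        raise ValueError("custom model must be cheaper per request to break even")
    return math.ceil(upfront_cost / savings_per_1k)


if __name__ == "__main__":
    # Illustrative numbers only: $500,000 upfront, $10 vs $2 per 1,000 requests.
    blocks = break_even_thousands(500_000, 10.0, 2.0)
    print(f"Break-even after {blocks * 1000:,} requests")
```

With these made-up numbers the custom model breaks even after 62,500 blocks of 1,000 requests; below that volume, the generic model remains cheaper overall.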

The Cost-Quality Trade-off Journey

The journey from simple prompting to fine-tuning, continued pre-training, and full pre-training represents increasing upfront investment for improved cost-quality trade-offs [00:19:03]. Organizations progress along this path as usage justifies the investment [00:19:08]. Once product-market fit is achieved, the focus shifts to optimizing quality and cost [00:19:28].

Product-Market Fit in AI and Navigating Fuzziness

Product-market fit in AI tends to occur in two patterns [00:19:43]:

Scenarios where precision isn’t paramount: Applications where there are many “right” answers, such as brainstorming, creative applications, marketing, or surfacing information (e.g., Glean), do not require perfect accuracy [00:19:50].
Scenarios with human checks: Use cases where the AI’s output is relatively costly for a human to produce but quick for a human to check [00:20:14]. Code copilots are a prime example, as checking suggested code is faster than writing it from scratch [00:21:02]. Similarly, customer support is a good fit [00:21:28].
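
The "costly to produce, quick to check" pattern can be illustrated with a rough time model. This is a sketch under stated assumptions, not something Frankle describes: it assumes a human always checks the AI output and redoes the task by hand whenever the output is rejected. The function names and the numbers in the usage example are hypothetical.

```python
def expected_minutes_per_task(produce_min: float,
                              check_min: float,
                              accept_rate: float) -> float:
    """Expected human time per task when an AI drafts and a human checks.

    Assumed model: every AI output is checked (check_min); rejected
    outputs, a fraction (1 - accept_rate), are redone by hand (produce_min).
    """
    if not 0.0 <= accept_rate <= 1.0:
        raise ValueError("accept_rate must be between 0 and 1")
    return check_min + (1.0 - accept_rate) * produce_min


def min_accept_rate_to_help(produce_min: float, check_min: float) -> float:
    """Acceptance rate above which the AI-assisted flow saves time.

    From check + (1 - a) * produce < produce, it follows that
    a > check / produce.
    """
    return check_min / produce_min


if __name__ == "__main__":
    # Illustrative: 20 minutes to write by hand, 2 minutes to review.
    print(expected_minutes_per_task(20, 2, 0.5))  # expected minutes at 50% acceptance
    print(min_accept_rate_to_help(20, 2))         # acceptance rate needed to save time
```

Under this model, the wider the gap between production time and checking time, the lower the acceptance rate at which the AI is still worth using, which is consistent with code copilots being an early fit.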

AI’s “fuzziness” is both a superpower and a challenge [00:22:37]. While chaining models and creative uses can push towards higher quality, achieving “perfection” (e.g., five nines of quality) with current technology is difficult [00:24:10].

The Long Journey of AI Maturation

Frankle emphasizes that the current state of AI is analogous to early software engineering [00:24:40]. It took decades to learn how to build structured programs, manage vulnerabilities, and handle massive code bases [00:24:56]. Even if AI technology were to freeze today, enormous creativity and advancements would still occur as we learn to better use these tools [00:25:28].

A significant challenge in high-stakes areas like healthcare and autonomous vehicles is that people lack intuition about when AI systems will fail, in contrast to the intuition they have about human errors [00:26:08]. This unpredictability makes it harder for society to make peace with AI mistakes [00:26:50]. Holding AI to rigorous standards, however, might lead to reassessing and improving human performance standards as well [00:27:22].

Databricks’ Role in AI Deployment

Databricks focuses on providing an end-to-end platform that integrates all necessary tools for AI development and deployment, from data ingestion to evaluation [00:30:09]. This includes using Spark for data ingestion, Delta tables for storage, Unity Catalog for tracking, MosaicML tools for training, and MLflow for experiment tracking [00:14:43].

Key aspects of their strategy include:

Integrated platform: Ensuring all tools work well together under one roof to avoid issues with data transiting between different places [00:30:37].
Customer-centric development: Working hand-in-hand with customers to understand their challenges and refine product offerings [00:30:26].
Addressing the “last mile”: Focusing on how to make raw AI materials useful to customers by connecting diverse data to AI system building processes, including through compound AI systems and agents [00:38:15].
Providing choice without overwhelm: Helping customers measure and choose from many options, including different fine-tuning techniques (like “soft” fine-tuning for fragmented data) [00:32:21].
Partnerships and Acquisitions: Databricks partners with many startups that build phenomenal point solutions (e.g., data annotation, eval creation) [00:33:30]. Acquisitions, like Lilac, happen when a tool is so valuable (e.g., used internally for DBRX) that integrating the team makes sense for customers [00:34:52].

Current Gaps and Future Focus in AI Deployment

Frankle identifies several gaps in the current AI landscape that Databricks is focused on:

Measurement: Databricks is still learning how best to help customers build methods to measure their AI products effectively [00:31:00].
Navigation of approaches: Understanding what combination of RAG (Retrieval Augmented Generation), prompting, and fine-tuning works best for different use cases [00:31:19].
Data challenges: Helping customers with messy or incomplete data sets (e.g., many inputs but few outputs) to build effective AI systems [00:37:40].
Production comfort: Ensuring customers feel comfortable deploying and running AI systems in production environments [00:32:29].

Frankle believes the open-source model world is “exceedingly well covered” [00:38:30], allowing Databricks to focus on these “last mile” challenges to make AI useful for its 12,000 customers [00:38:15].

Policy and Responsible AI Deployment

Frankle stresses that AI experts have a responsibility to participate in policy discussions to ensure systems are used responsibly [00:53:00]. Key policy questions include:

When to allow/disallow systems: It is acceptable to say that certain AI systems are not reliable enough for specific contexts [00:55:21].
Setting standards: Rigorous standards are needed for AI systems, particularly in high-stakes areas like law enforcement, medicine, and autonomous vehicles, where mistakes can have severe consequences [00:56:07].
Scientific honesty: Being transparent about what is known and unknown about AI capabilities, and not over-promising future advancements, helps build public trust [00:57:46].

Underexplored Areas

While AI applications are being explored everywhere, Frankle is personally excited about robotics and embodied systems, viewing them as powerful tools for interacting with the physical world, similar to how information systems interact with the digital world [00:59:19].
