From: redpointai
Lin Qiao, co-founder and CEO of Fireworks, discusses the evolving landscape of AI model training and deployment, highlighting the shift toward complex, compound AI systems and the challenges and opportunities within the industry.
Fireworks’ Core Vision: Compound AI Systems
Fireworks, a generative AI platform with a focus on inference, aims to deliver the best quality, lowest latency, and lowest cost inference solutions [01:14:00]. The company envisions future inference systems as complex networks involving logical reasoning and access to hundreds of small expert models [01:41:00]. This approach moves beyond simple “model as a service” API calls, addressing the limitations of single models [01:28:00].
Challenges in AI Model Deployment
Current models face limitations because they are probabilistic, making it difficult to deliver consistently factual results to end-users [02:17:00]. Complex business problems often require assembling multiple models across various modalities (e.g., audio, visual, text) to achieve solutions [02:40:00]. Even within the same modality, like large language models (LLMs), different expert models specialize in tasks such as classification, summarization, or tool calling [03:23:00]. Furthermore, single models have limited knowledge, constrained by their finite training data [03:43:00], with much real-world information residing behind public or proprietary APIs [03:51:00].
The solution, according to Fireworks, is the “compound AI system,” which integrates multiple models across different modalities with various APIs, databases, storage systems, and knowledge bases to deliver optimal AI results [04:05:00].
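To make the idea concrete, here is a minimal Python sketch of what a compound AI system looks like in code: a small, cheap model routes the request, an external data source is consulted, and a larger model composes the answer. The model ids, the routing rule, and the retrieval stub are illustrative assumptions, not details from the conversation; the endpoint shown is an OpenAI-compatible style of access, which Fireworks supports.

```python
# A hedged sketch of a compound AI pipeline: router model -> retrieval -> generator.
from openai import OpenAI

client = OpenAI(
    base_url="https://api.fireworks.ai/inference/v1",  # OpenAI-compatible endpoint
    api_key="YOUR_API_KEY",
)

def classify_intent(question: str) -> str:
    """Use a small expert model as the routing step."""
    resp = client.chat.completions.create(
        model="accounts/fireworks/models/llama-v3p1-8b-instruct",  # illustrative id
        messages=[{
            "role": "user",
            "content": f"Label this question as 'factual' or 'creative': {question}",
        }],
    )
    return resp.choices[0].message.content.strip().lower()

def fetch_context(question: str) -> str:
    """Placeholder for the APIs / databases / knowledge bases a real system hits."""
    return "…retrieved documents would go here…"

def answer(question: str) -> str:
    intent = classify_intent(question)
    context = fetch_context(question) if "factual" in intent else ""
    resp = client.chat.completions.create(
        model="accounts/fireworks/models/llama-v3p1-70b-instruct",  # illustrative id
        messages=[
            {"role": "system", "content": f"Context:\n{context}"},
            {"role": "user", "content": question},
        ],
    )
    return resp.choices[0].message.content

print(answer("Who founded Fireworks AI?"))
```

Even this toy version shows why a single "model as a service" call is not enough: quality depends on routing, retrieval, and generation working together.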
Approaches to Building AI Systems: Imperative vs. Declarative
When building these complex systems, two design philosophies emerge:
- Imperative: Developers have full control over the workflow, inputs, and outputs, making the system deterministic [04:49:00].
- Declarative: Developers define “what” problem the system should solve, allowing the system to figure out “how” to achieve it [05:10:00]. An example is SQL, where users define the desired data, and the database system determines the most efficient execution plan [05:35:00].
Fireworks leans towards a more declarative system with full debuggability and maintainability, aiming to deliver the simplest user experience by hiding complex details [06:11:00].
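The contrast between the two philosophies can be sketched in a few lines of Python. The stubs and the planner below are hypothetical stand-ins, analogous to the SQL example above, where a query optimizer decides the execution plan.

```python
def detect_language(doc: str) -> str:   # stub for a classifier model
    return "fr"

def translate(doc: str, src: str, dst: str) -> str:  # stub for a translation model
    return f"[{src}->{dst}] {doc}"

def summarize(text: str) -> str:        # stub for a summarization model
    return text[:40] + "..."

# Imperative: the developer wires every step; execution is deterministic.
def imperative_pipeline(doc: str) -> str:
    lang = detect_language(doc)
    english = translate(doc, lang, "en")
    return summarize(english)

# Declarative: the developer states the goal; a planner (like a SQL query
# optimizer) chooses which models to call and in what order.
class TrivialPlanner:
    def execute(self, spec: dict) -> str:
        # A real planner would search over model and step combinations here,
        # trading off quality, latency, and cost.
        return imperative_pipeline(spec["input"])

doc = "Bonjour tout le monde, ceci est un long document."
print(imperative_pipeline(doc))
print(TrivialPlanner().execute({"goal": "english_summary", "input": doc}))
```

The declarative spec hides the "how" while remaining debuggable, since the plan the system chooses can still be inspected.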
Evolution of AI Model Development and Customization
Fireworks started by offering the lowest level of abstraction: single models as a service, including hundreds of models across various modalities like LLMs, audio, vision, and embedding models [06:51:00]. However, developers faced significant challenges in assembling these pieces and controlling quality due to the constant release of new models [07:24:00]. This highlighted a usability gap, especially for enterprises [07:56:00].
The “No One Model Fits All” Observation
There is no single AI model that fits all problems [08:13:00]. Model training is an iterative process in which resources are heavily dedicated to specific problems, producing models that excel in certain areas but perform poorly in others [08:24:00]. Lin Qiao believes the future lies in “hundreds of small expert models” that push quality within narrower problem spaces [08:57:00]. This trend benefits the open-source community, enabling greater customization and specialized model development through post-training or fine-tuning [09:15:00].
Prompt Engineering vs. Fine-tuning vs. Pre-training
- Prompt Engineering: Many enterprises start with prompt engineering due to its immediate results and responsiveness, allowing quick testing of a model’s steerability [10:55:00]. However, managing thousands of lines of system prompts becomes a complex problem [11:16:00].
- Fine-tuning: When prompt engineering becomes unwieldy, fine-tuning is the next step, absorbing long system prompts into the model itself [11:59:00]. This makes the model faster, cheaper, and higher quality [12:28:00], and typically happens as a product moves from the pre-product-market-fit stage into post-product-market-fit scaling [12:22:00].
- Pre-training: While hyperscalers consolidate pre-training data, the ROI for most enterprises in pre-training models is often not justified given the expense and human resources required [12:54:00]. Post-training on strong base models offers a much stronger ROI and greater agility [13:36:00].
Fireworks believes in customization and aims to make the customization process extremely easy [10:50:00].
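As a concrete illustration of the prompt-engineering-to-fine-tuning step described above, here is a minimal sketch of folding behaviors from a long system prompt into training examples. The JSONL “messages” layout is a common chat fine-tuning format, not necessarily the exact schema any particular provider requires; the prompt and example are invented.

```python
# Sketch: converting behaviors described in a long system prompt into
# fine-tuning examples, so the deployed model no longer needs the prompt.
import json

long_system_prompt = "You are a support agent. Always answer in two sentences. ..."

examples = [
    {"messages": [
        {"role": "user", "content": "How do I reset my password?"},
        {"role": "assistant", "content": "Open Settings > Security and click "
         "'Reset password'. You'll get a confirmation email within a minute."},
    ]},
    # ...hundreds more examples demonstrating the behavior the prompt described
]

with open("train.jsonl", "w") as f:
    for ex in examples:
        f.write(json.dumps(ex) + "\n")
```

Because the behavior now lives in the weights rather than in thousands of prompt tokens, each request carries less input, which is where the speed and cost gains come from.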
Enterprise AI Adoption and Deployment Models
Successful generative AI applications tend to involve “human-in-the-loop” automation rather than full “human-out-of-the-loop” automation [14:03:00]. A generative AI system must be human-debuggable, understandable, maintainable, and operable to gain adoption [14:16:00].
Examples of successful applications include:
- Assistants for doctors (scribing), teachers, and students (educational, foreign languages) [14:52:00].
- Coding assistants (e.g., Cursor, Sourcegraph) [15:04:00].
- Medical assistants addressing nursing shortages [15:15:00].
- B2B automation, such as call center assistants that make human agents more productive [15:38:00].
In terms of model adoption, Lin Qiao observes strong convergence on the various Llama models due to their quality, strong base, excellent instruction following, and fine-tuning capabilities [16:16:00].
Evaluations (Evals) in Enterprise AI
Many enterprises initially use “vibe-based” evaluations [16:59:00]. However, they quickly recognize the need to consciously invest in building evaluation datasets to stay on top of the state of the art and evaluate quality, especially when moving into deeper prompt engineering or fine-tuning [17:23:00]. While A/B testing is the ultimate method for determining product impact, it has a longer cycle [17:47:00]. Generating good eval data helps companies understand what truly matters amidst the constant stream of new, specialized models [18:18:00].
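The move from “vibe-based” checks to a scored eval set can be as simple as the sketch below. The dataset path, grading rule, and model calls are all illustrative assumptions; real evals often use structured comparisons or an LLM-as-judge rather than substring matching.

```python
# A minimal eval harness: score a model function against a JSONL dataset
# of {"prompt": ..., "expected": ...} cases.
import json

def grade(expected: str, actual: str) -> bool:
    # Simplest possible grader: substring match. Swap in a stricter
    # comparison or an LLM judge for real workloads.
    return expected.lower() in actual.lower()

def run_eval(model_fn, path: str = "evals.jsonl") -> float:
    cases = [json.loads(line) for line in open(path)]
    passed = sum(grade(c["expected"], model_fn(c["prompt"])) for c in cases)
    return passed / len(cases)

# Usage (hypothetical call_model helper): compare a prompt-engineered
# baseline against a fine-tuned candidate on the same dataset.
# score_a = run_eval(lambda p: call_model("base", system=BIG_PROMPT, user=p))
# score_b = run_eval(lambda p: call_model("fine-tuned", user=p))
```

A harness like this is much faster to iterate on than A/B testing, which remains the ultimate arbiter but has a longer cycle.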
Fireworks’ F1 and Function Calling
Fireworks developed F1, a complex logical reasoning inference system available as an API, which comprises multiple underlying models and implements logical reasoning steps [19:37:00]. Building such a system is complex, involving challenges in inter-model communication and quality control [20:14:00].
Function Calling is a critical extension point that allows models to call other tools to enhance answer quality [21:38:00]. Key challenges include:
- Models needing to hold long contexts in multi-turn chats to influence tool selection [21:57:00].
- The ability to call into multiple tools (potentially hundreds) [22:11:00].
- Complex coordination requiring parallel and sequential tool execution planning [22:20:00].
Fireworks has invested in this area for about a year, noting that when they initially launched their function calling model, they were ahead of the adoption curve [23:24:00]. They strategically invest in these critical areas, such as orchestration, viewing each small expert model as a tool itself, to tie everything together in a compound AI system [24:10:00].
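For readers unfamiliar with the mechanics, here is a hedged sketch of one function-calling round trip over an OpenAI-compatible chat-completions API. The model id, tool definition, and tool result are illustrative assumptions rather than details from the conversation.

```python
# Function-calling round trip: model requests a tool, the application runs
# it, and the model composes the final answer from the tool's result.
import json
from openai import OpenAI

client = OpenAI(base_url="https://api.fireworks.ai/inference/v1",
                api_key="YOUR_API_KEY")

tools = [{
    "type": "function",
    "function": {
        "name": "get_order_status",
        "description": "Look up the shipping status of an order.",
        "parameters": {
            "type": "object",
            "properties": {"order_id": {"type": "string"}},
            "required": ["order_id"],
        },
    },
}]

messages = [{"role": "user", "content": "Where is order 8142?"}]
resp = client.chat.completions.create(
    model="accounts/fireworks/models/firefunction-v2",  # illustrative id
    messages=messages,
    tools=tools,
)

# The model returns a structured tool call instead of prose.
call = resp.choices[0].message.tool_calls[0]
args = json.loads(call.function.arguments)

# The application executes the tool, appends the result, and calls the
# model again so it can answer in natural language.
messages.append(resp.choices[0].message)
messages.append({"role": "tool", "tool_call_id": call.id,
                 "content": json.dumps({"status": "in transit"})})
final = client.chat.completions.create(
    model="accounts/fireworks/models/firefunction-v2",
    messages=messages, tools=tools)
print(final.choices[0].message.content)
```

The challenges listed above show up directly in this loop: the model must keep multi-turn context, choose among potentially hundreds of tool schemas, and plan parallel or sequential calls when one tool's output feeds another.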
Hardware and Infrastructure Trends
There is a scarcity of developers who understand low-level hardware optimization [28:54:00]. The pace of hardware development has accelerated, with new hardware generations emerging annually from vendors [29:06:00]. Accessing the “best” hardware is complex, as it depends on the workload pattern and specific bottlenecks [29:22:00]. Fireworks absorbs the burden of integrating and determining the optimal hardware for different workloads, even routing to mixed hardware for varying access patterns [29:49:00]. Their goal is to alleviate this concern for developers, allowing them to focus on product building [30:08:00].
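A toy sketch of the workload-aware routing described above: which hardware pool is “best” depends on whether a request is latency-bound (short interactive prompts) or throughput-bound (long prompts, large batches). The thresholds and pool names here are invented for illustration.

```python
# Hypothetical workload-aware hardware routing.
def pick_pool(prompt_tokens: int, max_batch: int, latency_slo_ms: int) -> str:
    if latency_slo_ms < 200 and prompt_tokens < 1_000:
        return "low-latency-pool"      # e.g. fewer, faster chips per replica
    if max_batch >= 32 or prompt_tokens > 8_000:
        return "high-throughput-pool"  # e.g. chips with more memory bandwidth
    return "general-pool"

print(pick_pool(prompt_tokens=512, max_batch=1, latency_slo_ms=150))
```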
Local Model Deployment
Arguments for running models locally include cost savings (avoiding cloud GPU payments) and privacy [33:25:00].
- Desktop: Offloading compute from the cloud to the desktop makes sense for consumer-facing applications, similar to how Zoom offloads processing to desktop clients to save cost [34:51:00].
- Mobile: Offloading to mobile is a different story due to limited compute and battery power, where application metrics like power consumption directly affect user adoption and experience [34:02:00]. Models practically deployable on mobile are tiny (1B-10B parameters) with limited capabilities [34:31:00].
- Privacy: The privacy argument is nuanced, as most personal data already resides in the cloud [35:03:00].
Open Source and Meta’s Role
Meta has provided a significant service to the AI ecosystem by training and open-sourcing larger models, such as Llama [35:21:00]. They are also building a standard called “Llama Stack” to standardize tools around Llama models [36:00:00], aiming to create an “Android world” where components are easily pluggable and adoptable [36:28:00]. Lin Qiao expects continued investment from Meta in this area, anticipating Llama 4 [36:45:00].
The investment in pre-training by model providers will continue as long as there is sufficient ROI [36:57:00]. However, the industry is approaching a “soft wall” on data: everyone has crawled the same internet data, and combinations of synthetic and multimedia data are also being exhausted [37:07:00]. This will shift investment and ROI from pre-training toward post-training and, ultimately, inference [37:43:00].
Competitive Landscape and Fireworks’ Differentiation
Fireworks maintains compatibility with imperative agentic tools like LangChain, viewing them as complementary rather than competitive, and aims to simplify the layer above single-model services [38:17:00].
The concept of “compound AI systems” (coined by Berkeley) is a new and meaningful category where multiple players, including Databricks, are emerging [39:32:00]. Fireworks aims to be a key player in this space by providing tools to make development more efficient [40:11:00]. Unlike GPU cloud providers, Fireworks builds on top of GPU clouds to offer a complex inference stack, rather than providing cheap GPU access [40:28:00].
Fireworks’ Journey and Evolution
Fireworks was founded a few months before ChatGPT’s release [40:48:00], at a time when there was still debate about whether AI had truly arrived [41:15:00]. Lin Qiao, coming from Meta (which was heavily AI-powered), saw the coming wave of AI through the growth of PyTorch adoption [41:39:00].
ChatGPT fundamentally changed the accessibility of AI [42:01:00]. Before generative AI, companies needed to hire large machine learning teams to train models from scratch and curate data [42:27:00]. Generative AI’s foundation models absorb a majority of knowledge, allowing companies to build directly on top of them (as-is or with fine-tuning) without needing extensive ML teams [43:22:00]. This massive unblocking of accessibility led to the rapid adoption of generative AI [43:43:00]. Fireworks pivoted to laser focus on generative AI due to this strong pull and their expertise in operating PyTorch models in production [43:57:00].
Unsolved Problems and Future Research in AI Infrastructure
Key areas of ongoing development and research include:
- Agentic Workflows: The industry is still in the early stages of defining the right user experience and abstraction for building agentic workflows [47:03:00].
- Model-System Codesign: Research focusing on optimizing quality, latency, and cost by considering model and system design together [45:32:00].
- Next-Generation Transformer Architectures: Research into disruptive technologies beyond current Transformer architectures that will change how models are trained and inference is performed [46:27:00].
- Agent Communication in Latent Space: Exploring new ways agents can communicate, for instance, by “thinking” in latent space rather than words, potentially leading to more efficient thinking processes [46:43:00].
Challenges of an Infrastructure Company in a Fast-Changing Field
The rapid pace of change in model capabilities and enterprise adoption poses a challenge for an infrastructure company [47:01:00]. Fireworks avoids constantly chasing new trends by anchoring to core beliefs:
- Specialization and Customization: The fundamental trend of models becoming more specialized and customizable will not change [48:16:00]. There is no “one size fits all” solution for proprietary data and diverse workloads [48:27:00].
- Control over Destiny: Providing users with the ability to customize, steer, and control their AI destiny is crucial [48:42:00].
Fireworks addresses this by offering an inference engine that fits all workloads, alongside an “Optimizer” that takes a workload and customization objectives as input and produces an optimized inference deployment configuration, and potentially an adjusted model [48:55:00]. This closes the loop and makes customization easy, a fundamental principle that Lin Qiao believes will persist [49:23:00].
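A hypothetical sketch of that closed loop is below: the developer declares the workload and objectives, and an optimizer emits a deployment configuration. None of these field names come from Fireworks’ actual product; they only illustrate the declarative shape of the idea.

```python
# Hypothetical declarative workload spec and a toy optimizer that consumes it.
workload_spec = {
    "traffic": {"qps": 50, "avg_prompt_tokens": 900, "avg_output_tokens": 200},
    "objectives": {"p95_latency_ms": 400, "max_cost_per_1m_tokens": 0.50},
    "customization": {"base_model": "llama-v3p1-8b", "dataset": "train.jsonl"},
}

def optimize(spec: dict) -> dict:
    """Toy stand-in: a real optimizer would search over quantization levels,
    batch sizes, hardware pools, and fine-tuning options."""
    long_prompts = spec["traffic"]["avg_prompt_tokens"] > 4_000
    return {
        "model": spec["customization"]["base_model"] + "-finetuned",
        "quantization": "fp8",
        "replicas": max(1, spec["traffic"]["qps"] // 25),
        "hardware": "high-throughput-pool" if long_prompts else "low-latency-pool",
    }

print(optimize(workload_spec))
```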
Overhyped vs. Underhyped
- Overhyped: The perception that generative AI is “magical”, a panacea capable of providing correct answers to any question [49:50:00]. This view is currently undergoing a correction [50:00:00].
- Underhyped: The “agentic world” has not been fully figured out yet but holds significant potential [54:10:00].
Shift in Go-to-Market Strategy
Initially, Lin Qiao hypothesized a sequential adoption curve: startups first, then digital natives, and finally traditional enterprises [50:25:00]. In reality, however, Fireworks works with all segments simultaneously, reflecting a dramatically accelerated adoption curve driven by a “revolution” in AI [50:47:00]. This has led to shorter sales cycles and different procurement processes [51:35:00].
For startups, there’s typically a desire for lower-level abstractions, allowing them more control [51:57:00]. Traditional enterprises, conversely, often prefer higher-level abstractions that hide complex details [52:22:00]. Fireworks builds both, as the lowest-level abstraction is necessary for internal development, and anticipates adoption across different layers [52:52:00].
To learn more about Fireworks, visit their self-serve platform at fireworks.ai [55:01:00].