From: redpointai

Lin Qiao, co-founder and CEO of Fireworks.ai, a company focused on fast and efficient inference for compound AI systems, shared her insights on the evolving open source ecosystem [00:00:15]. Previously, she was one of the creators of PyTorch at Meta [00:00:06]. Fireworks.ai aims to deliver the best quality, lowest latency, and lowest cost for inference [00:01:19].

The Vision for Compound AI Systems

Fireworks.ai envisions a future where inference systems are complex, featuring logical reasoning and access to hundreds of small expert models [00:01:41]. Lin emphasizes that inference is not as simple as serving a single model behind an API [00:01:28]; the problems they are solving go well beyond simple API calls [00:01:51].

The limitations of single models include:

  • Non-deterministic nature: Models are probabilistic, which is undesirable for delivering factual and truthful results [00:02:21].
  • Complexity of business problems: Many customer problems require assembling multiple models across various modalities [00:02:40].
  • Multimodality: Modern applications often process audio, visual, and textual information simultaneously [00:03:00].
  • Expert models within modalities: Even within Large Language Models (LLMs), different expert models specialize in tasks like classification, summarization, or tool calling [00:03:22].
  • Knowledge limitations: Single models are limited by finite training data, while real-world information resides behind public or proprietary APIs [00:03:43].

These limitations necessitate a “compound AI system” approach, where multiple models across different modalities, along with various APIs, databases, and knowledge bases, work together to deliver optimal AI results [00:04:05].
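As a concrete illustration, the routing idea behind a compound AI system can be sketched in a few lines of Python. Everything here is a stand-in stub invented for illustration, not a real Fireworks.ai component: a small classifier routes each request either to a summarization expert or to a proprietary pricing API that holds knowledge the models were never trained on.

```python
# Minimal sketch of a compound AI system: a router picks a small expert
# "model" per task, and external knowledge comes from an "API" lookup.
# All model/API functions below are hypothetical stubs, not real services.

def classify_intent(query: str) -> str:
    """Stub expert model: tag the request so it can be routed."""
    return "lookup" if "price" in query.lower() else "summarize"

def summarizer_model(text: str) -> str:
    """Stub small expert model specialized for summarization."""
    return f"summary: {text[:40]}"

def pricing_api(item: str) -> str:
    """Stub proprietary API holding knowledge absent from training data."""
    prices = {"gpu": "$2.50/hr"}
    return prices.get(item, "unknown")

def compound_answer(query: str) -> str:
    """Route across expert models and APIs instead of one large model."""
    intent = classify_intent(query)
    if intent == "lookup":
        return f"gpu costs {pricing_api('gpu')}"
    return summarizer_model(query)
```

The point of the sketch is the shape, not the stubs: quality comes from composing narrow components, each of which is easy to specialize, rather than from a single probabilistic model.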

The Rise of Small Expert Models and Customization

Lin believes the future lies in hundreds of small expert models [00:08:57]. When a problem is narrowed down, it becomes easier for smaller models to excel and push quality boundaries [00:09:01]. This trend is highly beneficial for the open source community because open source base models offer significant control for customization [00:09:15]. Many model providers fine-tune and specialize models, contributing back to the open source community with models highly effective for specific problems [00:09:28]. Enterprises are moving towards having more control and steerability in this multi-model world [00:09:47].

Fireworks.ai deeply believes in customization [00:10:25]. The process of customization is not straightforward, but Fireworks.ai is working to make it extremely easy [00:10:51]. They observe a strong trade-off between prompt engineering and fine-tuning [00:10:43].

  • Prompt Engineering: Many developers start with prompt engineering due to its immediate results and responsiveness [00:11:07]. However, it can lead to thousands of lines of system prompts, making management and changes difficult [00:11:16].
  • Fine-tuning: When prompt engineering becomes unwieldy, fine-tuning is the next step to absorb long system prompts into the model itself [00:11:59]. This is typically done after the model has proven steerable for the problem [00:12:07]. Fine-tuning results in faster, cheaper, and higher-quality model performance [00:12:28].
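The step from prompt engineering to fine-tuning can be sketched as a data-preparation exercise: logged interactions that worked under a long system prompt are converted into chat-format training records, with the system prompt deliberately omitted so its behavior is absorbed into the weights. The JSONL chat schema below is a common convention for fine-tuning data, not any specific provider's API.

```python
import json

# Sketch: turning prompt-engineered traffic into fine-tuning data.
# The long system prompt is dropped from the training records -- the goal
# of fine-tuning is that the model absorbs that behavior into its weights.

# The prompt we want the model to absorb (NOT included in the records):
LONG_SYSTEM_PROMPT = "You are a support agent. Always answer concisely. ..."

# Hypothetical logged calls that produced good answers under that prompt:
logged_calls = [
    {"user": "How do I reset my password?",
     "assistant": "Go to Settings > Security and choose Reset."},
]

def to_finetune_records(calls):
    """Convert logged calls into chat-format training records."""
    records = []
    for call in calls:
        records.append({
            "messages": [  # note: no system message -- behavior moves into weights
                {"role": "user", "content": call["user"]},
                {"role": "assistant", "content": call["assistant"]},
            ]
        })
    return records

jsonl = "\n".join(json.dumps(r) for r in to_finetune_records(logged_calls))
```

Once the fine-tuned model reproduces the prompted behavior, the thousands of lines of system prompt disappear from every request, which is exactly where the latency and cost savings come from.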

While pre-training is becoming concentrated among hyperscalers [00:12:54], some enterprises do pre-train models for core business reasons [00:13:08]. However, pre-training is very expensive, and the Return on Investment (ROI) is much stronger with post-training on strong base models, allowing for more agile testing of ideas [00:13:28].

Fireworks.ai’s Strategic Use of Open Source

Fireworks.ai heavily builds on the open source community, believing that hundreds of small expert models will emerge from it [00:23:55]. Their vision aligns with leveraging this energy to build compound AI systems that utilize the best models [00:24:08].

The F1 Model and Function Calling

Fireworks.ai’s F1 model, offered as an API, is a compound inference system that orchestrates multiple models and performs explicit logical reasoning steps [00:19:37]. Building such a system is complex because quality is hard to control once models start interacting with each other [00:20:14].

A critical aspect of compound AI systems is function calling, which serves as an extension point for models to call other tools and enhance answer quality [00:21:40]. Function calling is complex because:

  • Models need to maintain long context in multi-turn chats to influence tool selection [00:21:57].
  • They often need to call multiple tools (potentially hundreds) in parallel or sequentially [00:22:11].
  • Precision in tool calling and driving composition is crucial [00:23:18].
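The execution side of parallel tool calling described above can be sketched as follows. The hard-coded tool calls mimic the OpenAI-style shape many providers use for function calling; in a real system they would come from the model's response, and the tools here are stubs invented for illustration.

```python
import json
from concurrent.futures import ThreadPoolExecutor

# Stub tools standing in for real external APIs:
def get_weather(city: str) -> str:
    return f"{city}: 20C"

def get_flights(city: str) -> str:
    return f"{city}: 3 flights"

TOOLS = {"get_weather": get_weather, "get_flights": get_flights}

# What a tool-calling model might emit for a single turn: two independent
# calls that can safely run in parallel (shape is OpenAI-style, hard-coded
# here rather than returned by an actual model API).
model_tool_calls = [
    {"name": "get_weather", "arguments": json.dumps({"city": "Paris"})},
    {"name": "get_flights", "arguments": json.dumps({"city": "Paris"})},
]

def run_parallel(calls):
    """Execute independent tool calls concurrently, preserving order."""
    with ThreadPoolExecutor() as pool:
        futures = [pool.submit(TOOLS[c["name"]], **json.loads(c["arguments"]))
                   for c in calls]
        return [f.result() for f in futures]

results = run_parallel(model_tool_calls)
```

Sequential plans, where one tool's output feeds the next call's arguments, require an orchestration loop instead of a single parallel fan-out, which is part of why precise tool selection and composition are hard.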

Fireworks.ai’s function calling capability handles parallel and sequential complex planning and orchestration [00:22:37]. They strategically invest in this area, which is a critical ingredient for tying everything together in a compound AI system [00:24:48].

Integration with Other Tools

Fireworks.ai maintains compatibility with imperative agentic tools like LangChain, seeing them as strong partners rather than competitors. This allows Fireworks.ai to simplify the layer above single models by composing multiple models where it makes sense [00:38:17].

The Role of Hyperscalers and Meta’s Open Source Strategy

Hyperscalers aim to build vertically integrated stacks, like Apple’s iPhone, leveraging massive resources for data centers, power, and machine deployment [00:30:57]. Fireworks.ai specializes in problems requiring engineering craftsmanship and deep research, deploying scalable systems on top of GPU clouds [00:32:03].

Meta is a significant contributor to the open source ecosystem through its Llama models and the “Llama Stack” initiative to standardize tools around Llama [00:35:58]. Meta’s ambition is to create an “Android world” for AI, with standardized, easily pluggable components [00:36:26]. Lin anticipates continuous investment from Meta in training new models like Llama 4 [00:36:45].

The investment in pre-training may slow down when the ROI diminishes, likely due to hitting a “data wall” as common internet and synthetic data sources become exhausted [00:37:00]. Lin observes that investment is already shifting from pre-training to post-training and then to inference [00:37:43].

The Impact of Generative AI

The advent of Generative AI (GenAI), particularly ChatGPT, fundamentally changed the accessibility of AI technology [00:42:01]. Before GenAI, companies needed to hire large machine learning teams to train models from scratch, which was time-consuming and required scarce talent [00:42:27]. GenAI’s foundation models, which absorb vast amounts of knowledge, allow developers to build on top of them directly or with minimal fine-tuning [00:43:22]. This drastically lowers the barrier to entry, enabling application and product teams to build without needing large machine learning teams, leading to rapid adoption [00:43:48]. The fact that most GenAI models are PyTorch-based also aligned well with Fireworks.ai’s expertise in operating complex PyTorch models in production [00:44:04].

Overhyped vs. Underhyped

Lin considers the perception of GenAI as “magical” and a “recipe for all problems” to be overhyped [00:49:50]. She reiterates her belief that no single model can solve all problems in the best or correct way [00:50:06]. She does not name an underhyped counterpart explicitly, but her emphasis on specialized models and compound systems suggests that is where she sees undervalued potential.

Lin also noted a shift in the adoption curve for AI. Initially, she expected startups to lead, followed by digital natives, and then traditional enterprises [00:50:25]. Instead, all three are adopting AI simultaneously, with significantly shorter sales cycles and a tremendous appetite for the technology [00:50:47].

Ultimately, Fireworks.ai’s strategy remains anchored in the belief that specialization and customization are the future, regardless of how core model capabilities evolve [00:48:16]. They provide an inference optimizer that takes a workload and customization objectives as input and produces an optimized deployment configuration, and potentially an adjusted model, making customization simple and closing the loop for developers [00:48:58].
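A purely hypothetical sketch of what such an optimizer's interface might look like; every field name and heuristic below is invented for illustration and does not describe Fireworks.ai's actual product.

```python
# Hypothetical illustration of the optimizer described above: a workload
# description plus customization objectives go in, a deployment
# configuration comes out. All names and heuristics here are invented.

def optimize_deployment(workload: dict, objectives: dict) -> dict:
    """Toy heuristic trading quantization and batching against latency."""
    latency_sensitive = objectives.get("max_latency_ms", 1000) < 200
    return {
        # cheaper numeric format when cost matters more than peak quality:
        "quantization": "fp8" if objectives.get("minimize_cost") else "bf16",
        # small batches keep per-request latency low; large batches cut cost:
        "max_batch_size": 1 if latency_sensitive else 32,
        # scale replicas with expected peak traffic:
        "replicas": max(1, workload.get("peak_qps", 1) // 50),
    }

config = optimize_deployment(
    workload={"peak_qps": 200, "avg_prompt_tokens": 1500},
    objectives={"max_latency_ms": 150, "minimize_cost": True},
)
```

The "closing the loop" point is that developers state objectives rather than hand-tuning serving parameters; the trade-offs above (precision, batching, replica count) are the kinds of knobs such an optimizer would turn automatically.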