From: redpointai

Fireworks, an inference-focused generative AI platform, envisions a future where AI systems are complex, involving logical reasoning and access to hundreds of small, expert models [00:01:41]. The company aims to deliver the best quality, lowest latency, and lowest cost inference [00:01:19].

Limitations of Single AI Models

Traditional single AI models have several inherent limitations:

  • They are probabilistic by nature, making them ill-suited to delivering consistently factual or truthful results [00:02:21].
  • Solving complex business problems often requires assembling capabilities from multiple models across different modalities [00:02:40].
  • Even within the same modality, such as Large Language Models (LLMs), there are many expert models specializing in tasks like classification, summarization, multi-turn chats, or tool calling, each with slight differences [00:03:23].
  • Single models are limited by their training data, which is inherently finite [00:03:43]. Much real-world information lives behind APIs, both public and proprietary, which models cannot access directly [00:03:52].

To overcome these limitations, the industry needs to move beyond single-model-as-a-service to “compound AI systems” [00:04:09]. These systems involve multiple models across various modalities, integrated with different APIs that hold knowledge from databases, storage systems, and knowledge bases, all working together to deliver optimal AI results [00:04:19].
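The composition described above can be sketched in a few lines. This is a hedged toy illustration, not Fireworks' implementation: the expert handlers and the knowledge API are hypothetical stand-ins for real fine-tuned models and proprietary API endpoints.

```python
# Minimal sketch of a compound AI system: a router dispatches each request
# to a small "expert" handler, and an API-backed knowledge lookup supplies
# facts that no model's training data contains. All names and handlers
# here are hypothetical stand-ins for real model and API calls.

from typing import Callable, Dict

# Stand-ins for small expert models (in practice: separate fine-tuned LLMs).
def summarizer(text: str) -> str:
    return "summary: " + text[:40]

def classifier(text: str) -> str:
    return "label: question" if text.endswith("?") else "label: statement"

# Stand-in for proprietary knowledge that lives behind an API.
def knowledge_api(query: str) -> str:
    facts = {"inference": "Fireworks focuses on fast, low-cost inference."}
    return facts.get(query, "no record")

EXPERTS: Dict[str, Callable[[str], str]] = {
    "summarize": summarizer,
    "classify": classifier,
}

def route(task: str, text: str) -> str:
    """Compose an expert model with an API lookup into one answer."""
    expert_output = EXPERTS[task](text)
    grounding = knowledge_api("inference")  # enrich with external knowledge
    return f"{expert_output} | grounded by: {grounding}"

print(route("classify", "What is compound AI?"))
```

The point of the sketch is the shape, not the handlers: quality comes from narrowing each expert's job and grounding answers in knowledge the models cannot memorize.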

The Role of Customization

Fireworks believes deeply in customization [00:10:25]. The nature of model training means that no single model fits all problems [00:08:13]. Training is an iterative process where developers pick specific problem subsets to optimize, devoting significant resources to data acquisition and quality in those areas [00:08:24]. Consequently, a model will excel at certain tasks but perform poorly at others [00:08:48].

The future of AI models is seen as hundreds of small, expert models [00:08:57]. When a problem is narrowed down, it becomes easier for smaller, specialized models to achieve high quality [00:09:01].

Customization Strategies

  1. Prompt Engineering: Many enterprises start with prompt engineering due to its immediate, responsive, and interactive nature [00:11:00]. However, this approach can become cumbersome with thousands of lines of system prompts, making further steerability difficult [00:11:22].
  2. Fine-tuning: When prompt engineering becomes unmanageable, it is prime time to fine-tune, absorbing the lengthy system prompts into the model itself [00:12:02]. By that point, the model has already proven it can solve the problem and follow instructions [00:12:12]. Fine-tuning then lets the model run faster, cheaper, and with higher quality [00:12:35].
  3. Pre-training: While some enterprises pre-train models for core business reasons, it is very expensive in terms of money and human resources [00:13:16]. The Return on Investment (ROI) is much stronger with post-training (fine-tuning) on top of strong base models, as it allows for more agile testing of ideas [00:13:43]. The industry is currently seeing investment shift from pre-training to post-training and then to inference, indicating a “soft wall” in data availability for pre-training [00:37:52].
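The prompt-engineering-to-fine-tuning handoff in step 2 amounts to a data-preparation exercise: examples collected under the long system prompt become training records, and the prompt itself is dropped. The sketch below assumes the common chat-style JSONL convention for fine-tuning data; the exact record shape varies by provider, and the prompt and examples are invented.

```python
# Hedged sketch: turning a prompt-engineered setup into fine-tuning data.
# Once a long system prompt reliably steers the model, the behavior can be
# "absorbed" by training on input/output pairs collected under that prompt,
# so the deployed model no longer needs the prompt at inference time.

import json

LONG_SYSTEM_PROMPT = "You are a support agent. Always answer in one sentence..."

# Logged (input, good output) pairs produced while running under the prompt.
logged_examples = [
    ("How do I reset my password?", "Click 'Forgot password' on the login page."),
    ("Where is my invoice?", "Invoices are under Billing > History."),
]

def to_training_record(user_msg: str, assistant_msg: str) -> str:
    """One JSONL line; the system prompt is deliberately omitted so the
    fine-tuned model internalizes the behavior instead of reading it."""
    return json.dumps({
        "messages": [
            {"role": "user", "content": user_msg},
            {"role": "assistant", "content": assistant_msg},
        ]
    })

jsonl = "\n".join(to_training_record(u, a) for u, a in logged_examples)
print(jsonl.splitlines()[0])
```

At inference time the fine-tuned model is called with the user message alone, which is what makes it faster and cheaper than re-sending thousands of lines of system prompt on every request.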

The Role of Open Source Models

Open source models play a crucial role in enabling customization [00:09:20]. They provide control and allow model providers to focus on post-training or fine-tuning, delivering specialized models back to the community that are highly effective at solving specific problems [00:09:39].

Meta, through its Llama models and the Llama Stack standard, is working to standardize tools around Llama models, fostering an “Android world” where components are easily pluggable and adoptable [00:36:07]. This continuous investment from Meta (e.g., Llama 4 is expected) is driven by the ROI in pre-training, which will continue until a data wall is hit [00:37:27].

Fireworks’ Approach to Complex AI Systems

Fireworks builds its offerings on top of open source models [00:23:57], believing that hundreds of small expert models will emerge from the open-source community [00:24:05]. Their strategic investment is in the “compound AI system” layer, which focuses on composing these small expert models to solve complex business tasks efficiently [00:24:20].

F1: A Compound AI System

Fireworks developed F1, a model API that functions as a complex logical reasoning inference system [00:19:41]. Underneath, F1 comprises multiple models and implements logical reasoning steps [00:20:02]. Building such a system is more complex than a single-model-as-a-service inference [00:20:10], involving significant challenges in quality control when models communicate with each other [00:20:34].

Function Calling for Agents

A critical component of compound AI systems and agents is function calling [00:21:14]. This allows models to call external tools to enhance answer quality [00:21:44]. Function calling is complex, often requiring the model to:

  • Maintain a long context of conversation in multi-turn chats [00:22:05].
  • Select from potentially hundreds of tools [00:22:17].
  • Execute multiple tools in parallel or sequentially, requiring complex coordination and planning [00:22:30].
  • Possess strong capability to understand when and how to use a tool, driving precision [00:23:20].
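The host-side half of this loop can be sketched briefly. In a real system the model emits a structured tool call (name plus JSON arguments); here a toy stand-in does so directly, since the host code's job is the same either way: look up the tool, execute it, and feed the result back for the final answer. The tool names and the toy decision rule are invented for illustration.

```python
# Hedged sketch of a single-tool function-calling turn. A real LLM would
# return the {"tool": ..., "args": ...} structure; toy_model fakes that
# decision so the dispatch loop itself can be shown end to end.

from typing import Any, Dict

def get_weather(city: str) -> str:
    return f"22C and sunny in {city}"   # stub for an external weather API

def get_time(city: str) -> str:
    return f"14:05 in {city}"           # stub for an external time API

TOOLS = {"get_weather": get_weather, "get_time": get_time}

def toy_model(prompt: str) -> Dict[str, Any]:
    """Stand-in for an LLM deciding when and how to call a tool."""
    if "weather" in prompt:
        return {"tool": "get_weather", "args": {"city": "Paris"}}
    return {"tool": None, "answer": "No tool needed."}

def run_turn(prompt: str) -> str:
    call = toy_model(prompt)
    if call["tool"] is None:
        return call["answer"]
    result = TOOLS[call["tool"]](**call["args"])  # execute the chosen tool
    return f"Model answer using tool result: {result}"

print(run_turn("What's the weather?"))
```

The hard parts the section lists, long multi-turn context, choosing among hundreds of tools, and coordinating parallel or sequential calls, all live inside the model's decision step that `toy_model` trivializes here; the surrounding dispatch loop stays simple.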

Fireworks has invested in function calling capabilities for about a year [00:23:29], recognizing its strategic importance in tying everything together within a compound AI system [00:24:50].

Evolution of Product Development and Go-to-Market Strategy

Initially, Fireworks catered to developers who needed to train models from scratch [00:43:09]. The advent of generative AI, particularly foundation models like ChatGPT, fundamentally changed accessibility [00:43:26]. Companies no longer need large machine learning teams to curate data and train models from scratch [00:43:22]; instead, they can build directly on existing foundation models or fine-tune them with thousands of samples [00:43:33]. This increased accessibility led Fireworks to laser-focus on generative AI, leveraging its expertise in PyTorch [00:44:17].

The adoption curve for AI technology has changed dramatically, with startups, digital natives, and even traditional enterprises adopting it concurrently [00:50:47]. This differs from traditional sequential adoption models [00:50:27]. The sales cycle is shorter, and procurement processes are more open to new thinking [00:51:43].

While startups often prefer access to low-level abstractions to assemble components, traditional enterprises typically desire higher-level abstractions that hide complex details [00:52:30]. Fireworks offers multiple abstraction layers to cater to these different needs [00:52:46].