From: redpointai
Lin Qiao, co-founder and CEO of Fireworks AI, discusses the evolution of AI systems from single models to complex compound AI systems, with a focus on efficient inference. Fireworks AI aims to provide a platform for developers to build these advanced systems, emphasizing quality, low latency, and low cost in the inference stack [01:19:00].

What are Compound AI Systems?

A compound AI system is envisioned as a complex inference system that incorporates logical reasoning and can access hundreds of small expert models [01:41:00]. It goes beyond a simple API call in and out [01:54:00]. The core idea is to move beyond the notion of a “single model as a service” to a system where multiple models across different modalities work together with various APIs holding knowledge (databases, storage, knowledge bases) to deliver the best AI results [04:09:00].

Why Compound AI Systems are Necessary

Traditional single AI models have several limitations [02:17:00]:

  • Non-deterministic Nature: Models are probabilistic by nature, which is undesirable when factual or truthful results are required [02:21:00]. Controlling and verifying their output is crucial [02:34:00].
  • Complex Business Problems: Many business problems require assembling multiple models across various modalities (audio, visual, text) to solve them effectively [02:40:00]. For instance, processing audio and visual information for interactive experiences [03:02:00].
  • Specialized Models: Even within the same modality, like Large Language Models (LLMs), there are many expert models specializing in tasks like classification, summarization, multi-turn chats, and tool calling [03:23:00]. A single model is very limited for real-world problems [03:38:00].
  • Knowledge Limitations: Single models are limited by their finite training data [03:43:00]. Real-world information often resides behind APIs (public or proprietary) that models cannot access directly [03:52:00].
  • No One-Model-Fits-All: Due to the nature of the training process, models become highly specialized in certain areas and weak in others [08:13:00]. This leads to a future of hundreds of small expert models [08:57:00], which is beneficial for the open-source community as it allows for customization and specialization [09:15:00].

Developing Compound AI Systems

Developing these systems involves specific design tools and considerations:

Imperative vs. Declarative Design

  • Imperative Design: Developers have full control over the workflow, inputs, and outputs, aiming for deterministic results [04:49:00].
  • Declarative Design: Developers define what problem the system should solve, allowing the system to determine how to solve it. SQL is an example of a declarative approach [05:10:00].
  • Fireworks AI leans towards a more declarative system design, focusing on simplicity, debuggability, and maintainability, while hiding nitty-gritty details and complexity from the user [06:11:00].
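The contrast between the two design styles can be sketched in a few lines of Python. This is an illustrative toy, not Fireworks' API; every function name here is hypothetical:

```python
# Imperative style: the developer wires every step of the workflow explicitly,
# controlling inputs, outputs, and ordering.
def imperative_pipeline(document: str) -> str:
    # Chunk the document, summarize each chunk, then join the results.
    chunks = [document[i:i + 100] for i in range(0, len(document), 100)]
    summaries = [f"summary({c[:20]}...)" for c in chunks]  # stand-in for model calls
    return " ".join(summaries)

# Declarative style: the developer states *what* to solve plus constraints,
# and a runtime decides *how* (which models, which tools, in what order).
def declarative_request(goal: str, constraints: dict) -> dict:
    # A real runtime would plan model/tool calls here; we just echo the spec.
    return {"goal": goal, "constraints": constraints, "plan": "chosen by system"}
```

SQL is the classic declarative analogy from the discussion: you specify the result set, and the query planner picks the execution strategy.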

Building Blocks and Abstractions

Fireworks AI started with the lowest level of abstraction: single model as a service [06:51:00]. Today, they provide hundreds of models across various modalities (LLMs, audio, vision, embedding, image/video generation) as foundational building blocks [06:57:00]. However, developers face challenges in assembling these pieces and controlling quality due to the rapid release of new models [07:24:00]. This revealed a “huge gap in usability,” especially for enterprises [07:56:00].

Customization and Model Training

Customization is deeply valued at Fireworks AI [10:27:00].

  • Prompt Engineering: Often the starting point for developers to test a model’s steerability [11:07:00]. However, it can become unmanageable with thousands of lines of system prompts [11:16:00].
  • Fine-tuning: Fine-tuning hits its “prime time” once prompt engineering becomes too complex; long system prompts can then be absorbed into the model itself for faster, cheaper, and higher-quality inference [11:59:00]. Fireworks AI is working on making customization extremely easy [10:51:00].
  • Pre-training: While some enterprises pre-train models for core business reasons, it is very expensive [13:08:00]. The ROI is generally stronger for post-training (fine-tuning) on strong base models, offering more agility [13:39:00]. There is a general industry shift from pre-training investment to post-training and then inference, as data exhaustion becomes a concern [37:43:00].
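The “absorb the system prompt into the model” idea can be made concrete with a small sketch. The JSONL chat format below resembles common fine-tuning formats but is an assumption, not Fireworks' actual schema:

```python
import json

# In practice this can run to thousands of lines; at inference time its tokens
# are paid for on every single request.
LONG_SYSTEM_PROMPT = "You are a support agent. Always answer in strict JSON. ..."

def to_finetune_example(user_msg: str, good_reply: str) -> dict:
    # Training pairs capture the behavior the system prompt used to enforce,
    # so the deployed model no longer needs the prompt at all.
    return {
        "messages": [
            {"role": "user", "content": user_msg},
            {"role": "assistant", "content": good_reply},
        ]
    }

examples = [to_finetune_example("Reset my password", '{"action": "reset_password"}')]
jsonl = "\n".join(json.dumps(e) for e in examples)
```

Each example demonstrates the desired behavior directly, which is why the fine-tuned model can be faster (fewer prompt tokens) and cheaper at equal or better quality.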

F1 and Function Calling

Fireworks AI developed F1, a complex logical reasoning inference system offered as a model API [19:37:00]. Underneath, F1 comprises multiple models and logical reasoning steps, which is highly complex to build and maintain [19:52:00]. It focuses on solving quality-related problems when models interact with each other [20:14:00].

A critical aspect of compound AI systems, and a key feature of F1, is function calling [21:10:00].

  • Function calling allows models to call external tools to enhance answer quality [21:40:00].
  • It’s complex because it often involves multi-turn chat contexts, requiring the model to hold long conversations [21:57:00].
  • Models need to select from potentially hundreds of tools [22:14:00].
  • They must also coordinate calls, executing multiple tools in parallel and sequentially [22:25:00]. F1 can perform parallel and sequential complex planning and orchestration [22:37:00].
  • The precision of tool calling is crucial, making the tuning process complicated [23:20:00].
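The parallel-versus-sequential orchestration described above can be sketched as follows. This is a minimal illustration of the pattern, not F1's implementation; the tool names and registry are hypothetical:

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical tool registry; in a real system the model selects from
# potentially hundreds of tools like these.
TOOLS = {
    "get_weather": lambda city: f"sunny in {city}",
    "get_flights": lambda city: f"3 flights to {city}",
}

def run_parallel(calls):
    # Independent tool calls can execute concurrently.
    with ThreadPoolExecutor() as pool:
        futures = [pool.submit(TOOLS[name], arg) for name, arg in calls]
        return [f.result() for f in futures]

def run_sequential(calls):
    # Dependent calls run in order; each result would feed back into the
    # multi-turn conversation context before the next call is planned.
    results = []
    for name, arg in calls:
        results.append(TOOLS[name](arg))
    return results

# Weather and flights don't depend on each other, so they can run in parallel.
parallel_results = run_parallel([("get_weather", "Paris"), ("get_flights", "Paris")])
```

The hard parts the discussion points to, i.e. choosing the right tool out of hundreds and calling it with precise arguments, sit in the planning step that this sketch omits.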

Fireworks AI’s decision to build F1 and its function calling capabilities in-house stems from the need to strategically invest in critical areas that tie everything together, rather than waiting for open-source solutions [24:48:00]. They believe the hundreds of small expert models will come from the open-source community, and their role is to compose them efficiently [24:01:00].

Reasoning Models

Even for reasoning, there will be different models specialized in various paths [25:28:00]. Some approaches include:

  • Strong base models using self-inspection techniques like chain-of-thought and backtracking [25:34:00].
  • New models performing logical reasoning in the latent space, akin to human thought processes that don’t always use words [26:01:00]. Fireworks AI intends to integrate different flavors of logical reasoning into their system without being opinionated about which will “win” [26:41:00].

Challenges and Strategies in AI Infrastructure

One of the major challenges in building an AI infrastructure company is the rapid pace of change in models and enterprise adoption [47:01:00].

  • Hardware Agnosticism: Fireworks AI absorbs the burden of integrating and determining the best hardware for specific workloads [29:53:00]. They can route to different hardware even for mixed access patterns [30:04:00].
  • Staying Ahead of the Curve: Instead of constantly chasing new developments, Fireworks AI focuses on fundamental trends like specialization and customization, believing a single model won’t fit all [47:46:00]. Their Optimizer tool takes inference workload and customization objectives to suggest optimal deployment configurations [48:59:00].
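Routing workloads to hardware can be illustrated with a toy heuristic. The rule below, that prefill-heavy work favors compute-optimized parts while decode-heavy work favors memory bandwidth, is a common rule of thumb and an assumption for illustration, not Fireworks' actual routing logic:

```python
def choose_hardware(prefill_tokens: int, decode_tokens: int) -> str:
    # Assumption for illustration: long-prompt (prefill-heavy) requests are
    # compute-bound, while token-by-token decoding is bandwidth-bound.
    if prefill_tokens > 4 * decode_tokens:
        return "compute-optimized"
    return "bandwidth-optimized"
```

A real router would also weigh cost, availability, and quantization support, and could split mixed access patterns across different hardware pools, as the discussion notes.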

AI Use Cases with Product-Market Fit

Most successful AI applications currently involve “human-in-the-loop” automation rather than full “human-out-of-the-loop” automation [14:03:00]. This is because AI systems need to be human-debuggable, understandable, maintainable, and operable to gain adoption [14:21:00].

Examples of successful use cases include:

  • Assistants: medical scribes for doctors, education tools for teachers and students, and coding assistants (e.g., Cursor, Sourcegraph) [14:50:00].
  • B2B Automation: Call center automation, optimizing business logic, and workflow efficiency [15:38:00].
  • Digital SDRs/Marketing: Early applications showing good adoption [54:26:00].

Regarding model adoption, there’s a significant convergence around variations of Llama models due to their quality, strong base, good instruction following, and fine-tuning capabilities [16:16:00].

The Role of Evals

Many enterprises start with “vibe-based” evaluations for early product development [17:02:00]. However, they quickly realize the need to consciously build robust evaluation systems [17:26:00]. While A/B testing is the ultimate determinant of product impact, it has a longer cycle [17:51:00]. Investing in generating good eval data sets is crucial for understanding what matters and staying on top of the rapidly evolving state-of-the-art models [18:18:00]. As products mature, they move from open-ended design to more specialized features, requiring specialized models and corresponding evaluations [18:48:00].

Local vs. Cloud Inference

The case for running models locally rests on two arguments: cost savings and privacy [33:25:00].

  • Cost Savings: Offloading compute from cloud to desktop makes sense for applications like Zoom [33:53:00].
  • Privacy: While privacy is a concern, much personal data is already on the cloud, making the local privacy argument less straightforward [35:00:00].
  • Mobile Limitations: Offloading to mobile devices is more challenging due to limited power and the impact on application metrics like power consumption and latency [34:02:00]. Practically deployable models on mobile are tiny (1B-10B parameters) with limited capabilities [34:31:00].

The Open Source Ecosystem

Meta’s open-source contributions, particularly the Llama models, are seen as a huge service, providing a strong base for fine-tuning and giving developers control [09:20:00]. Meta is also building “Llama Stack” to standardize tools around Llama models, aiming for an “Android world” where components are standardized and easy to adopt [36:02:00]. Investment in pre-training by major players like Meta is expected to continue as long as there’s sufficient return on investment, which currently relies on access to diverse and high-quality data [36:46:00].

The Future of AI Infrastructure

The competitive landscape for compound AI systems includes players like Databricks, which also recognizes the complexity and potential of this space [39:38:00]. Fireworks AI differentiates itself by not being a GPU cloud provider but by building a complex inference stack on top of GPU clouds, specializing in the combination of engineering craftsmanship and deep research [40:30:00].

The company’s original vision, formed before ChatGPT, anticipated the rise of AI due to insights from hyperscalers like Meta, which were already AI-powered [41:39:00]. Generative AI fundamentally changed accessibility by providing foundation models that absorbed a majority of knowledge, allowing companies to build applications directly or with smaller machine learning teams, driving rapid adoption [43:16:00]. This accessibility shift led Fireworks AI to laser-focus on generative AI, leveraging their expertise in PyTorch models [43:57:00].

Future challenges in AI infrastructure include defining the right user experience and abstraction for agentic workflows [27:03:00]. Research areas of interest include:

  • Model-System Co-design: Optimizing quality, latency, and cost together, as seen at Meta [45:27:00].
  • Disruptive Technologies: Looking for the next generation of Transformers and new approaches to agent communication and reasoning in latent space [46:31:00].

The rapid adoption curve of AI, even among traditional enterprises, indicates a “revolution” that changes not just applications but also technology adoption curves and go-to-market strategies, with shorter sales cycles and open procurement processes [51:01:00]. While startups often prefer low-level abstractions to assemble components, traditional enterprises typically seek higher-level abstractions to avoid low-level details [52:03:00]. Fireworks AI builds both, as the lowest level abstraction is necessary for internal development [52:52:00].

Fireworks AI offers a self-serve platform for developers at fireworks.ai, providing access to their playground and hundreds of model capabilities [55:04:00].