From: redpointai
Fireworks is a platform focused on fast and efficient inference for compound AI systems [00:00:03]. The primary goal of Fireworks is to deliver the best quality, lowest latency, and lowest cost in the inference stack [01:19:19].
Limitations of Single-Model Inference
The traditional view of inference as a single model served behind an API is overly simplistic [01:28:09]. Single models have significant limitations that make them insufficient for real-world problems:
- Non-Determinism: Models are probabilistic by nature, which is undesirable when factual and truthful results are required [02:20:90].
- Complex Business Problems: Many business problems require assembling multiple models across different modalities to solve them effectively [02:40:90].
- For example, interactive applications today already process audio and visual information together, and GenAI-native applications likewise need to handle multiple modalities [03:00:70].
- Even within the same modality (like Large Language Models, LLMs), there are many expert LLMs specializing in classification, summarization, multi-turn chats, and tool calling [03:22:90].
- Knowledge Limitations: Single models are limited by their training data, which is finite [03:43:40]. A lot of real-world information exists behind public or proprietary APIs, which models cannot directly access [03:51:70].
The Need for Compound AI Systems
The industry is moving beyond single-model services to a concept called compound AI systems [04:09:40]. These systems involve multiple models across different modalities, combined with various APIs that hold knowledge (databases, storage systems, knowledge bases) working together to deliver the best AI results [04:16:10].
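A compound AI system of this kind can be sketched as a small pipeline that chains specialized models with a knowledge API. Everything below is a hypothetical, stubbed illustration, not a Fireworks API:

```python
# Minimal sketch of a compound AI system: specialized models plus an external
# knowledge API combined behind one entry point. All components are stubbed
# placeholders, not real Fireworks endpoints.

def transcribe(audio: bytes) -> str:
    """Stub for a speech-to-text (audio) model."""
    return "what is our refund policy?"

def retrieve(query: str) -> str:
    """Stub for a knowledge-base or proprietary-API lookup that grounds the answer."""
    return "Refunds are accepted within 30 days of purchase."

def answer(question: str, context: str) -> str:
    """Stub for an expert LLM that composes the final reply from retrieved facts."""
    return f"Per policy: {context}"

def compound_pipeline(audio: bytes) -> str:
    question = transcribe(audio)      # audio model
    context = retrieve(question)      # knowledge behind an API
    return answer(question, context)  # expert LLM

print(compound_pipeline(b"..."))
```

The point of the sketch is the shape, not the stubs: grounding the LLM's answer in an API call addresses both the non-determinism and the knowledge-limitation problems above.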
Building Compound AI Systems: Design and Tools
Developing these complex systems requires specific tools and design philosophies.
Imperative vs. Declarative Design
There are two main schools of thought for system design:
- Imperative: Developers have full control over the workflow, inputs, and outputs, aiming for deterministic outcomes [04:49:00].
- Declarative: Developers define what problem the system should solve, and the system figures out how to solve it [05:10:00]. SQL in databases is an example of a declarative approach [05:32:00].
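The contrast between the two schools can be sketched in a few lines of Python; both snippets are illustrative designs with invented names, not Fireworks APIs:

```python
# Imperative: the developer wires every step explicitly and controls the flow.
def imperative_summarize(doc: str) -> str:
    chunks = [doc[i:i + 1000] for i in range(0, len(doc), 1000)]
    partials = [f"summary({c[:20]}...)" for c in chunks]  # stand-in for model calls
    return " ".join(partials)

# Declarative: the developer states the goal and constraints; a planner
# (hypothetical here) decides which steps to run, like a SQL query optimizer.
spec = {
    "task": "summarize",
    "input": "doc",
    "constraints": {"max_latency_ms": 500, "quality": "high"},
}

def plan(spec: dict) -> list[str]:
    """Hypothetical planner: derive an execution strategy from the spec."""
    steps = ["chunk"] if spec["constraints"]["max_latency_ms"] < 1000 else []
    return steps + ["summarize", "merge"]

print(plan(spec))
```

The declarative spec is what allows the backend to hide complexity (step ordering, model choice) while keeping the plan inspectable for debugging.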
Fireworks leans towards a more declarative system design with full debuggability and maintainability [06:33:00]. The goal is to provide the simplest user experience by hiding complex details in the backend without sacrificing the speed of iteration [06:17:00].
Evolution of Offerings
Fireworks started with the lowest level of abstraction: single models as a service [06:51:00]. Today, they offer hundreds of models across various modalities (large language models, audio transcription, translation, speech synthesis, vision, embedding, image generation, and video models) [06:57:00].
However, assembling these numerous building blocks and maintaining quality control is difficult for developers, especially with new models being released weekly [07:24:00]. This led to the realization of a “huge gap of usability” for enterprises [07:56:00].
The “No One Model Fits All” Problem
A core observation is that there is no single model that fits all needs [08:13:00]. Due to the nature of the training process, models become exceptionally good at specific tasks they are optimized for, and weaker at others [08:24:00].
The future lies in hundreds of small expert models [08:57:00]. Shrinking a problem to a narrower space inevitably makes it easier for smaller models to achieve higher quality [09:04:00]. The open-source community plays a crucial role by providing base models that can be customized and fine-tuned to deliver specialized models [09:15:00].
Customization and Fine-tuning
Fireworks deeply believes in customization [10:25:00]. The challenge is making this process extremely easy for enterprises.
Prompt Engineering vs. Fine-tuning
- Prompt Engineering: Developers often start with prompt engineering to quickly test if a model can be steered in a desired direction [11:07:00].
- Limitations of Prompt Engineering: As systems become more complex, managing thousands of lines of system prompts becomes problematic and difficult to change [11:16:00].
- Transition to Fine-tuning: When prompt engineering becomes unwieldy, it’s an opportune time for fine-tuning [11:59:00]. Fine-tuning absorbs the long system prompt into the model itself, making inference faster, cheaper, and higher quality [12:28:00]. This often aligns with the transition from pre-product-market-fit exploration to post-product-market-fit scaling [12:22:00].
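Mechanically, "absorbing the prompt into the model" amounts to turning prompt-plus-completion logs into supervised training records. The JSONL chat layout below is a common convention for fine-tuning data, sketched with invented examples, not a Fireworks-specific format:

```python
import json

# A long, unwieldy system prompt we want the fine-tuned model to internalize.
SYSTEM_PROMPT = "You are a support agent. Always answer in one sentence. ..."

# Production logs: (user message, good completion) pairs collected while
# the system prompt was still in place.
logs = [
    ("Where is my order?", "Your order ships within 2 business days."),
    ("Can I get a refund?", "Refunds are accepted within 30 days."),
]

def to_training_records(logs):
    """Emit chat-style JSONL records; after fine-tuning on these, the system
    prompt can be dropped (or drastically shortened) at inference time."""
    for user, assistant in logs:
        yield json.dumps({
            "messages": [
                {"role": "user", "content": user},
                {"role": "assistant", "content": assistant},
            ]
        })

jsonl = "\n".join(to_training_records(logs))
print(jsonl.splitlines()[0])
```

Dropping the multi-thousand-token system prompt from every request is where the speed and cost wins come from.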
Pre-training for Enterprises
While the general trend is towards hyperscalers pre-training models [12:54:00], some enterprises do pre-train models if it’s core to their business [13:08:00]. However, pre-training is very expensive and requires significant money and human resources [13:28:00]. Post-training on top of strong base models often offers a much stronger ROI and greater agility for testing ideas [13:36:00].
Use Cases with Product Market Fit: Human-in-the-Loop Automation
Most successful GenAI applications today are human-in-the-loop automation, not fully automated systems [14:03:00].
The hypothesis is that a GenAI system must be human-debuggable, understandable, maintainable, and operable [14:16:00]. If humans cannot evaluate, maintain, or operate a system in production, it’s hard to gain adoption [14:27:00].
Examples of successful human-in-the-loop applications include:
- Assistants for Professionals: Doctors (scribing), teachers/students (education, foreign languages) [14:52:00].
- Coding Assistants: Cursor, Sourcegraph [15:04:00].
- Medical Assistants: Addressing the shortage of nurses [15:15:00].
- B2B Automation: Call center automation, where AI assists human agents to be more productive [15:38:00].
- Business Workflow Optimization: Improving efficiency in various business processes [16:05:00].
Model Adoption Trends
There’s a significant convergence toward variants of the Llama models, a testament to their quality as strong base models for instruction following and fine-tuning [16:16:00].
Challenges and Strategies in AI Model Evaluation (Evals)
Many enterprises start with “vibe-based” evaluations during early product development [16:59:00]. However, they quickly realize the need to consciously build dedicated evaluation datasets, as this is a crucial investment area [17:23:00].
- Importance of Evals: Evals are essential for tracking quality, especially when moving into deeper prompt engineering or fine-tuning [17:41:00]. While A/B testing is the ultimate determinant of product impact, it has longer cycles [17:47:00].
- Navigating a Dynamic Landscape: Enterprises invest in generating good eval datasets to understand what matters most in a constantly evolving landscape of specialized models [18:18:00]. This allows them to harden their product design and move from open-ended products to specialized features requiring specialized models [18:48:00].
Fireworks’ F1 Model and Function Calling
Fireworks developed its own model, F1, as a complex logical reasoning inference system offered via an API [19:23:00].
- Underlying Architecture: F1 consists of multiple models and logical reasoning steps implemented within the system [19:52:00].
- Complexity: Building such a system is more complex than simple single-model inference, especially regarding quality problems when models interact with each other [20:04:00]. The complexity is compared to building a database management system [20:36:00].
- Optimization: Fireworks, with its expertise in optimizing inference, focuses on managing overall inference latency and cost within this multi-step process [20:52:00].
Function Calling
Function calling is a critical extension point for models to call other tools and enhance answer quality [21:38:00].
- Growing Demand: Many users on the F1 waitlist are building agents and require function calling [21:32:00].
- Complexity: Function calling is not just about calling a single tool [21:50:00]. It often involves:
- Holding long contexts in multi-turn chat scenarios [21:57:00].
- Selecting from hundreds of tools [22:14:00].
- Executing multiple tools in parallel and sequentially [22:22:00].
- A complex coordination plan in one shot [22:27:00].
- Precision: The model’s ability to understand when to call a tool and drive precision is vital [23:16:00].
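The coordination described above can be sketched as a tool registry plus a dispatcher: the model (stubbed here) emits a one-shot plan of tool calls, and the runtime selects and executes them in parallel. The schema loosely follows common JSON function-calling conventions, and all names are illustrative:

```python
from concurrent.futures import ThreadPoolExecutor

# Registry the model can select from; real systems may expose hundreds of tools.
TOOLS = {
    "get_weather": lambda city: f"22C in {city}",
    "get_stock": lambda ticker: f"{ticker}: 101.5",
}

def model_plan(user_msg: str) -> list[dict]:
    """Stub for the model's one-shot coordination plan: which tools to call,
    with which arguments. A real model emits this as structured JSON."""
    return [
        {"tool": "get_weather", "args": {"city": "Paris"}},
        {"tool": "get_stock", "args": {"ticker": "ACME"}},
    ]

def execute(plan: list[dict]) -> list[str]:
    """Run independent tool calls in parallel, preserving plan order."""
    with ThreadPoolExecutor() as pool:
        futures = [pool.submit(TOOLS[c["tool"]], **c["args"]) for c in plan]
        return [f.result() for f in futures]

results = execute(model_plan("weather in Paris and ACME price?"))
print(results)
```

Sequential dependencies (tool B needs tool A's output) would require a second planning round or an explicit dependency graph, which is where the multi-turn, long-context complexity comes in.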
Strategic Decision to Build
Fireworks heavily leverages the open-source community, as it aligns with their vision of hundreds of small expert models [23:55:00]. However, they also invest in strategically critical areas, such as the orchestration layer for compound AI systems, where an intelligent model can call into different tools (including other expert models) [24:10:00].
Reasoning Models and Future Research
Even for reasoning, there are different approaches. One path involves strong base models using self-inspection techniques like Chain-of-Thought or Tree-of-Thoughts [25:24:00].
A very exciting area of research is new models that can perform logical reasoning not in the prompt space, but in the latent space [25:56:00]. This aims to make the thinking process more efficient and native to the model’s internal workings, similar to how human thinking doesn’t always rely on words [26:08:00].
Fireworks plans to integrate different flavors of logical reasoning into their system [26:41:00]. The company’s F1 model development serves as an exercise to understand system abstraction and the complexity of building logical reasoning engines [27:55:00]. This knowledge will then be used to expose developer-facing plugins, allowing others to build their own “F1s” [28:13:00].
Hardware and Infrastructure Challenges
A significant challenge in AI infrastructure is the scarcity of developers who understand low-level hardware optimization [29:00:00]. The hardware development cadence has also accelerated, with new hardware SKUs emerging annually from vendors [29:09:00].
- Workload-Dependent Optimization: There isn’t a single “best hardware” for all models or even for one model, as it depends on the workload pattern and specific bottlenecks [29:25:00].
- Fireworks’ Approach: Fireworks absorbs the burden of integrating and determining the best hardware for different workloads, even routing to different hardware for mixed access patterns [29:49:00]. This allows developers to focus on product building [30:12:00].
- Hyperscalers vs. Specialized Platforms: Hyperscalers aim to be vertically integrated, much as Apple builds the iPhone end to end, by offering massive infrastructure for data centers and machines [30:57:00]. Companies like Fireworks instead specialize in problems requiring a combination of engineering craftsmanship and deep research, which can then be deployed at scale [32:03:00]. Inference systems are complex and not easily solved by simply throwing more people and money at them [33:03:00].
Local Model Deployment
Arguments for running models locally include cost saving (avoiding cloud GPU fees) and privacy [33:25:00].
- Desktop vs. Mobile: Offloading compute from the cloud to desktop can save costs, similar to applications like Zoom [33:53:00]. However, offloading to mobile is a different story due to limited power, which affects application metrics like cold start time and power consumption, influencing user adoption [34:02:00]. Models practically deployable on mobile are tiny (1B, 10B parameters) and have limited capabilities [34:31:00].
- Privacy Nuance: The privacy argument for local deployment is debatable, as much personal data is already stored in the cloud, not on local disks [35:03:00].
Open Source Ecosystem and Future of AI Investment
Meta’s strategy with open-source Llama models and the Llama Stack aims to standardize the tool stack around these models [35:58:00]. This creates an “Android world” where components are standardized and easy to adopt [36:28:00].
Investment in pre-training by large model providers will likely continue until the ROI diminishes, primarily when hitting a “data wall” (exhausting internet, synthetic, and multimedia data) [36:57:00]. However, there’s already a shift in investment and ROI from pre-training to post-training, and then to inference [37:43:00].
Navigating Rapid Change and Market Trends
The rapid pace of change in models and enterprise adoption poses a challenge for infrastructure companies [47:01:00]. The key is to anticipate trends rather than constantly chasing new developments [47:40:00].
- Enduring Trends: Fundamental trends like specialization and customization are unlikely to change [48:10:00]. Fireworks’ stack is built to enable easy customization, including an “Optimizer” that takes an inference workload and customization objectives as input and produces optimized deployment configurations [48:47:00].
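The “Optimizer” idea (workload profile in, deployment configuration out) can be sketched as a simple rule-based function; the knobs, thresholds, and hardware names below are invented for illustration and are not Fireworks parameters:

```python
def optimize(workload: dict, objective: str) -> dict:
    """Hypothetical optimizer: map a workload profile and an objective
    ("latency" or "cost") to a deployment configuration."""
    config = {"hardware": "gpu-a", "batch_size": 1, "quantization": "fp16"}
    if objective == "cost":
        config["quantization"] = "int8"  # cheaper, with some quality trade-off
        config["batch_size"] = 32        # amortize cost across larger batches
    if workload.get("avg_prompt_tokens", 0) > 4000:
        config["hardware"] = "gpu-b"     # long prompts favor larger memory
    return config

print(optimize({"avg_prompt_tokens": 8000, "qps": 5}, "cost"))
```

A real optimizer searches a much larger space (hardware, parallelism, quantization, caching), but the contract is the same: developers state objectives, the system picks the deployment.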
- Overhyped/Underhyped: The perception of GenAI as a magical solution to all problems is overhyped [49:50:00]. There is no single magical model that solves all problems in the best or correct way [50:06:00].
- Shifting Adoption Curve: The adoption curve for GenAI has been much faster and broader than initially imagined [50:58:00]. Startups, digital natives, and traditional enterprises are all adopting GenAI simultaneously [50:47:00]. This revolution is changing not only applications and technology adoption but also go-to-market strategies, with shorter sales cycles and different procurement processes [51:22:00].
- Tailoring for Users: Startups typically prefer access to lower-level abstractions for more control, while traditional enterprises often prefer higher-level abstractions that hide low-level details [51:57:00]. Fireworks aims to provide both layers of abstraction [52:52:00].
Further Information
To learn more about Fireworks, visit their self-serve platform at fireworks.ai, which offers access to their playground and hundreds of model capabilities [55:01:00].