From: redpointai

Lin Qiao, co-founder and CEO of Fireworks, a company focused on fast and efficient inference for compound AI systems, highlights the critical role of hardware and computation in the evolving AI landscape. Fireworks aims to deliver the best quality at the lowest latency and cost in the inference stack [00:01:19].

The Complexity of AI Inference

Inference is not a simple “single model as a service” operation [00:01:28]. The future of inference involves complex systems that perform logical reasoning and access hundreds of small expert models [00:01:41]. Fireworks envisions a world where each user query is routed to the model that performs best for that specific query [00:02:02].
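
To make the routing idea concrete, here is a minimal sketch of a per-query router. The model names and keyword-based scoring are illustrative assumptions, not Fireworks APIs; a production router would use a trained classifier or reward model to score query fit.

```python
# Toy per-query model router: pick the expert model whose profile best
# matches the query. MODELS and score_fit() are hypothetical stand-ins.

MODELS = {
    "code-expert":  {"keywords": {"python", "bug", "compile", "refactor"}},
    "sql-expert":   {"keywords": {"query", "table", "join", "schema"}},
    "general-chat": {"keywords": set()},  # fallback when nothing matches
}

def score_fit(query: str, keywords: set) -> int:
    """Toy scoring: count keyword overlap; a real router would use a
    trained classifier instead of string matching."""
    return len(set(query.lower().split()) & keywords)

def route(query: str) -> str:
    """Return the name of the best-fitting expert model for this query."""
    return max(MODELS, key=lambda name: score_fit(query, MODELS[name]["keywords"]))

print(route("why does this python refactor not compile"))  # -> code-expert
```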

This complexity arises because models are probabilistic rather than deterministic, which is a poor fit for applications that need factual results [00:02:21]. Solving complex business problems often requires assembling solutions across multiple models and modalities [00:02:40]. Additionally, a single model's knowledge is limited by its finite training data [00:03:43]. The next barrier in AI is moving beyond single models to compound AI systems that combine multiple models across modalities and integrate with APIs, databases, and knowledge bases [00:04:09].
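
A minimal sketch of the compound idea, with hypothetical stand-ins for the knowledge base and the model call: a deterministic retrieval step grounds the probabilistic generation step, addressing both the factuality and the finite-knowledge limitations above.

```python
# Compound pipeline sketch: deterministic knowledge-base lookup feeds a
# probabilistic model. KNOWLEDGE_BASE and generate() are illustrative.

KNOWLEDGE_BASE = {
    "refund policy": "Refunds are accepted within 30 days of purchase.",
}

def retrieve(query: str) -> str:
    """Deterministic lookup; a real system would query a vector database."""
    return next((v for k, v in KNOWLEDGE_BASE.items() if k in query.lower()),
                "no relevant document found")

def generate(prompt: str) -> str:
    """Stand-in for an LLM inference call."""
    return f"[model answer conditioned on: {prompt!r}]"

def answer(query: str) -> str:
    context = retrieve(query)  # knowledge-base step grounds the model step
    return generate(f"Context: {context}\nQuestion: {query}")

print(answer("What is your refund policy?"))
```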

Hardware Landscape and Optimization

The AI hardware space is experiencing rapid change, with new hardware generations emerging annually instead of every three years [00:29:09]. There is a scarcity of developers with low-level hardware optimization expertise [00:28:54].

Key considerations in the hardware landscape:

  • No “One Size Fits All”: There is no single “best” hardware for every workload pattern, even for the same model [00:29:27]. Different hardware SKUs are best suited to removing different bottlenecks (a back-of-envelope sketch follows this list) [00:29:42].
  • Abstraction: Fireworks absorbs the burden of integrating hardware and determining the best fit for a given workload [00:29:53]. It can even route mixed access patterns to different hardware [00:30:02].
  • Support for Diverse Chips: Fireworks supports AMD chips for inference, alongside Nvidia [00:28:45].
  • Focus on Product: Developers should focus on building products, while Fireworks manages the complexity of hardware optimization [00:30:12].
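
The back-of-envelope sketch below illustrates the “no one size fits all” point for one common bottleneck: single-stream decode is typically memory-bandwidth bound, so the tokens-per-second ceiling follows from weight size and chip bandwidth. The bandwidth figures are approximate spec-sheet values used only for illustration.

```python
# Why no single chip wins every workload: each decoded token must read
# every weight once (ignoring KV cache), so the decode ceiling is roughly
# memory bandwidth divided by model-weight bytes.

CHIPS_TB_PER_S = {"NVIDIA A100": 2.0, "NVIDIA H100": 3.35, "AMD MI300X": 5.3}

def max_decode_tokens_per_s(params_b: float, bytes_per_param: float,
                            bw_tb_s: float) -> float:
    weight_gb = params_b * bytes_per_param   # e.g. 70B * 2 bytes = 140 GB
    return bw_tb_s * 1000 / weight_gb        # GB/s divided by GB per token

for chip, bw in CHIPS_TB_PER_S.items():
    t = max_decode_tokens_per_s(params_b=70, bytes_per_param=2, bw_tb_s=bw)
    print(f"{chip}: ~{t:.0f} tokens/s ceiling for a 70B fp16 model")
```

Under this bound, higher-bandwidth parts win decode-heavy traffic, while compute-bound phases such as prefill rank chips differently, which is exactly why routing mixed access patterns to different hardware pays off.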

Cost and Efficiency in Inference

The goal of Fireworks is to deliver the best quality, lowest latency, and lowest cost in the inference stack [00:01:22].

  • Prompt Engineering vs. Fine-tuning: While prompt engineering is useful for quick testing, long system prompts become difficult to manage and drive up cost while slowing the model down [00:11:16]. Fine-tuning can absorb these long prompts into the model itself, yielding faster, cheaper, and higher-quality results (a minimal data-prep sketch follows this list) [00:12:00].
  • Pre-training: Pre-training models is very expensive, requiring significant money and human resources [00:13:28]. The ROI is much stronger for post-training (fine-tuning) on top of a strong base model [00:13:39]. Some enterprises still pre-train for core business reasons, but it is ultimately a question of differentiation [00:13:08].
  • Compound AI Systems for Cost Optimization: Fireworks’ F1 system, a complex logical reasoning inference system, internally orchestrates multiple models and logical steps [00:19:37]. This complexity makes overall inference latency and cost a key area of focus [00:20:49].
  • Hyperscalers vs. Specialized Providers: Hyperscalers aim to be vertically integrated like Apple, building data centers, acquiring power, and deploying vast fleets of machines for large-scale storage and compute [00:31:01]. Fireworks instead specializes in deploying scalable inference systems, a problem that requires a combination of engineering craftsmanship and deep research and cannot be solved by simply throwing more people or money at it [00:32:03].
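
As a concrete example of the fine-tuning point above, the sketch below converts logged interactions into a chat-format JSONL training file with the long system prompt omitted; the tuned model learns the prompt's behavior from the (input, output) pairs. The schema shown is a common convention, not a specific provider's format, so check your provider's fine-tuning docs.

```python
# Absorbing a long system prompt into a fine-tuning dataset. Before
# fine-tuning, every request carried something like:
#   {"role": "system", "content": "<2,000-token support-agent prompt>"}
# The training rows below drop it; its behavior is learned from the
# (input, output) pairs it previously produced.
import json

logged_interactions = [
    {"user": "My order arrived damaged.",
     "assistant": "I'm sorry to hear that. Let's arrange a replacement..."},
    {"user": "How do I reset my password?",
     "assistant": "Go to Settings > Security and choose Reset..."},
]

with open("finetune.jsonl", "w") as f:
    for ex in logged_interactions:
        row = {"messages": [
            {"role": "user", "content": ex["user"]},
            {"role": "assistant", "content": ex["assistant"]},
        ]}
        f.write(json.dumps(row) + "\n")
```

At inference time the tuned model is called with only the short user message, which is what makes it faster and cheaper per request.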

On-Device vs. Cloud Inference

Two main arguments are typically made for running models locally (on desktop or mobile):

  1. Cost Saving: Avoiding GPU costs on the cloud [00:33:34].
  2. Privacy: Keeping data on local disk [00:33:42].

However, there are nuances:

  • Mobile Limitations: Offloading compute to mobile is a different matter because of limited power, which affects application metrics such as cold-start time and power consumption that in turn drive user adoption [00:34:02]. Models practically deployable on mobile are tiny (1B-10B parameters) and have limited capabilities (a rough sizing sketch follows this list) [00:34:33].
  • Privacy Concerns: Much personal data is already on the cloud, making the privacy argument for local execution less straightforward [00:35:05].
  • Desktop Use Cases: For many consumer-facing applications, offloading to desktop makes sense [00:34:54].
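
The sizing sketch below shows why practically deployable mobile models stay small: weight memory alone is parameters times bytes per weight, before counting activations and KV cache.

```python
# On-device memory back-of-envelope: weight footprint in GB is simply
# params (billions) * bits-per-weight / 8.
def weight_gb(params_b: float, bits: int) -> float:
    return params_b * bits / 8

for params_b in (1, 3, 10, 70):
    for bits in (16, 4):
        print(f"{params_b:>3}B @ {bits:>2}-bit: "
              f"{weight_gb(params_b, bits):6.1f} GB of weights")
# A 70B model needs ~140 GB at 16-bit and still ~35 GB at 4-bit, far beyond
# a phone's RAM budget, while a 3B model at 4-bit fits in ~1.5 GB.
```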

Future of AI Development and Research

The future of AI lies in hundreds of small expert models [00:09:59]. This vision aligns with the open-source community, whose models give developers control and the opportunity to customize specialized models through fine-tuning [00:09:15].

Key areas of focus and excitement:

  • Agentic Workflows: The industry is still in the early stages of defining the right user experience and abstraction for agentic workflows [00:27:03].
  • F1 System: Fireworks is building F1, a logical reasoning engine that helps it understand system abstraction and complexity, and will make it generally available [00:27:51]. This will lead to developer-facing plugins, allowing developers to build their own F1-like systems [00:28:13].
  • Function Calling: Function calling, crucial for building agents, involves complex multi-turn chat contexts, selecting from hundreds of tools, and often parallel and sequential orchestration (a minimal sketch follows this list) [00:21:50]. This capability ties together different models and tools [00:24:48].
  • Reasoning Models: Research into different paths to solve reasoning problems is ongoing [00:25:25]. This includes self-inspection techniques like Chain of Thought, as well as new models that can do logical reasoning in latent space, mimicking human thought processes without explicit words [00:25:56].
  • Model-System Co-design: Optimizing across quality, latency, and cost requires thinking about models and systems together [00:45:32].
  • Disruptive Technologies: The search for the “next generation of Transformers” that can fundamentally change how models are trained and inference is performed is a significant area of research [00:46:27].
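
Here is a minimal function-calling sketch against an OpenAI-compatible chat API (Fireworks exposes one, though the base_url and model name below are assumptions to verify against current docs): the model selects a tool, the application executes it, and the result is fed back as part of the multi-turn context.

```python
# Minimal tool-use loop with the openai client pointed at an
# OpenAI-compatible endpoint. Endpoint and model name are assumptions.
import json
from openai import OpenAI

client = OpenAI(base_url="https://api.fireworks.ai/inference/v1",  # assumed
                api_key="YOUR_API_KEY")
MODEL = "accounts/fireworks/models/llama-v3p1-70b-instruct"         # assumed

TOOLS = [{
    "type": "function",
    "function": {
        "name": "get_order_status",
        "description": "Look up the shipping status of an order.",
        "parameters": {
            "type": "object",
            "properties": {"order_id": {"type": "string"}},
            "required": ["order_id"],
        },
    },
}]

def get_order_status(order_id: str) -> str:
    return f"Order {order_id}: shipped"        # stand-in for a real API

messages = [{"role": "user", "content": "Where is order 1234?"}]
resp = client.chat.completions.create(model=MODEL, messages=messages,
                                      tools=TOOLS)

call = resp.choices[0].message.tool_calls[0]   # the model picked a tool
args = json.loads(call.function.arguments)
messages += [resp.choices[0].message,          # keep the multi-turn context
             {"role": "tool", "tool_call_id": call.id,
              "content": get_order_status(**args)}]

final = client.chat.completions.create(model=MODEL, messages=messages,
                                       tools=TOOLS)
print(final.choices[0].message.content)
```

Parallel orchestration follows the same pattern: a single turn may return several entries in tool_calls, which the application can execute concurrently before replying.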

Industry Shifts

The AI industry is undergoing a revolution driven by accessibility. Before generative AI, companies needed to hire large machine learning teams to train models from scratch, curate data, and invest significant resources [00:42:27]. Generative AI changed this by providing foundation models that absorb most knowledge, allowing companies to build on top of them directly or with small fine-tuning efforts, dramatically lowering the barrier to adoption [00:43:12]. This has led to a much faster adoption curve and shorter sales cycles for AI products [00:51:01].

The ROI is shifting from pre-training to post-training, and then to inference [00:37:43]. While pre-training investment will continue until it hits a data wall, there is a clear trend towards optimizing and customizing models for specific uses [00:37:07].