From: aidotengineer
Shipping products at the frontier of AI involves a constant tension between the opportunities new AI models present and the risks of introducing regressions or unintended consequences to existing products [01:37:44]. The rapid pace of AI model development creates a unique set of challenges for companies building agentic systems [00:00:00] [01:54:49].
The Pain of Progress: An Ever-Evolving Frontier
The rate of progress in AI model development is breathtaking, with major labs like Anthropic, Google Gemini, and OpenAI consistently shipping new models and capabilities [00:30:52] [00:41:20]. While these advances offer powerful new functionality to integrate into applications, they also bring unexpected and unintended consequences due to the probabilistic nature of AI systems [01:21:05] [01:27:00]. This constant evolution of the underlying tech stack makes product development uniquely challenging [01:46:17] [01:55:09].
Introducing the “Prompt Tax”
As new AI models advance, a “hidden prompt tax” emerges when their functionality is integrated into applications [01:00:04] [01:01:59]. This concept describes the effort required to adapt existing prompts and systems to new models, which can behave in unexpected ways [01:01:59] [01:03:52]. More prompts equate to a higher prompt tax [09:30:00].
The prompt tax is not merely technical debt, which often arises from optimizing for speed with a plan to fix later [10:32:04]. Instead, the prompt tax is driven by the desire to upgrade now to unlock new capabilities [11:00:00]. However, the uncertainties of new models mean developers don’t know exactly what will improve and what will break [11:09:03].
Orbital’s Journey: Shipping at the Frontier
Orbital, a company focused on automating real estate due diligence, provides a practical example of shipping AI products at the frontier [01:56:07]. Their mission is to fast-forward property transactions by helping real estate lawyers quickly find “needles in a haystack” within mountains of paperwork [02:01:09].
Orbital Co-Pilot
Orbital’s first agentic product, Orbital Co-Pilot, was developed in January 2024 and designed to “think like a real estate lawyer” [03:14:02]. The system automates tasks typically performed manually, such as reading documents, OCRing text, structuring information, and compiling reports [03:46:00]. The agentic system breaks down plans into subtasks, each using multiple LLM calls to achieve objectives like finding lease dates or annual rent [04:30:00].
Scaling and Evolution
Orbital has seen significant growth:
- From burning less than a billion tokens 18 months ago to consuming almost 20 billion tokens monthly [05:34:00].
- From zero revenue 18 months ago to multiple seven figures in annual recurring revenue [06:06:00].
They have migrated through successive models, starting with GPT-3.5 and evolving through different versions of GPT-4 and on to System 2 models like o1-preview [06:27:00].
Key Decisions in AI Development
Orbital made three crucial decisions to navigate the challenges of AI development:
- Optimize for Prompting over Fine-Tuning: This maximizes development speed, allowing real-time adjustments to prompts based on user feedback [07:00:00].
- Heavy Reliance on Domain Experts: Private practice real estate lawyers with decades of experience write many of the domain-specific prompts, effectively teaching the AI system their expertise [07:34:00].
- “Vibes over Evals”: Instead of a rigorous, objective evaluation system, they rely on subjective human testing by domain experts before release, logging potential regressions but without comprehensive metrics [07:57:00]. This approach has scaled surprisingly well due to real-time user feedback and quick tooling adjustments [08:11:00] [21:14:00].
The Prompting Process
Orbital uses two main types of prompts:
- Agentic prompts: Owned by AI engineers, these are system prompts that help the model choose and use tools effectively [08:54:00].
- Domain-specific prompts: Created by real estate lawyers, these teach the system expertise in the real estate domain [09:09:00]. Orbital has grown from near zero to over 1,000 domain-specific prompts [09:21:00].
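The two prompt layers can be pictured as composing into a single model request: the engineer-owned agentic prompt as the system message, and the lawyer-written domain prompt framing the user turn. This is a hedged sketch of the general pattern, not Orbital's code; `build_messages` is a hypothetical helper.

```python
def build_messages(agentic_system_prompt: str, domain_prompt: str, document: str) -> list[dict]:
    """Compose the agentic layer (owned by AI engineers) and the
    domain-specific layer (written by real estate lawyers) into one
    chat-style request."""
    return [
        {"role": "system", "content": agentic_system_prompt},
        {"role": "user", "content": f"{domain_prompt}\n\n{document}"},
    ]
```

Keeping the layers separate is what lets the two groups iterate independently: lawyers edit domain prompts without touching tool-use instructions, and vice versa.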
When a new AI model drops, Orbital’s team rigorously experiments with it to unlock new features or gain inspiration [09:39:00]. A key step is calculating the “prompt tax” required to migrate existing prompts to the new model [10:04:00]. There is also an inherent fear of unknown unknowns when shipping a new AI model [10:10:00].
Battle-Tested Tactics for Navigating the Frontier
Orbital has developed several tactics to navigate the challenges and strategies in AI production:
- Model Migration Strategy (System 1 vs. System 2) [12:02:00]:
- Specificity: System 1 models require specific instructions on how to accomplish a task; System 2 models need only a clear objective of what to do [12:12:00].
- Leaner Instructions: Repetitive instructions needed for System 1 models can be removed for System 2 models [12:26:00].
- Unblocking the Model: System 2 models prefer fewer constraints, allowing them time to reason and rationalize [12:40:00].
- Leveraging System 1 Models: While System 2 models are favored, System 1 models can be cheaper and faster [13:07:00]. Their “thought tokens” provide valuable explainability for users and aid in debugging when issues arise [13:13:00].
- Feature Flags for AI Model Upgrades: Similar to progressive rollouts in software development, feature flags can mitigate risk when introducing new AI models [13:46:00].
- Addressing Change Aversion Bias: Users inherently feel more anxiety towards new systems due to unknown potential downsides, even if the previous system had known flaws [14:00:00]. Simply announcing a new model can heighten awareness and lead users to seek out issues [14:37:00].
- Team Mantra: “Bet on the Model”: The team’s mantra encourages imagining where AI models will be in 3, 6, or 12 months and building features that will improve as models become smarter, cheaper, and faster [14:56:00]. This future-oriented approach prevents stagnation and fosters growth [15:10:00].
- Using System 2 Models for Prompt Migration: New, more capable System 2 models can assist in migrating older, domain-specific prompts, drastically reducing manual human effort [15:44:00].
- Making Tough Calls and Shipping: Despite the uncertainty of probabilistic models, teams must be brave enough to take risks, ship, and deal with consequences post-release, mitigating risks along the way [16:11:00].
- Strong Feedback Loop: Building systems that allow real-time user feedback (e.g., thumbs up/down) to reach AI engineers and domain experts quickly is crucial [17:10:00]. This enables rapid identification of issues, prompt changes, and deployment of fixes, often within minutes or hours [17:27:00].
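The feature-flag tactic above amounts to routing a deterministic slice of users to the new model so the same user always sees consistent behavior. A minimal sketch, assuming hashed user-ID bucketing; the function and model names are placeholders, not a real flag service's API.

```python
import hashlib

def model_for_user(user_id: str, rollout_pct: float,
                   new_model: str = "model-next",
                   old_model: str = "model-stable") -> str:
    """Assign each user a stable bucket in [0, 100) via a hash of their ID,
    then route buckets below the rollout percentage to the new model."""
    bucket = int(hashlib.sha256(user_id.encode()).hexdigest(), 16) % 100
    return new_model if bucket < rollout_pct else old_model
```

Hashing (rather than random sampling per request) keeps the rollout sticky per user, which matters when users compare answers across sessions.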
The Evolving Landscape and Future of AI Engineering
Demis Hassabis, CEO of Google DeepMind, highlights that the underlying AI tech stack is evolving incredibly fast, making it uniquely challenging for product development across companies of all sizes [18:14:00]. This demands “deeply technical product people” who can predict where the technology will be in a year to design future-proof products [19:05:00].
This environment creates an opportunity for “product AI engineers” who understand both technical capabilities and customer problems, translating model capabilities into solutions for real user problems [19:40:00].
Paying the Prompt Tax: The “Ship Now” Imperative
The overarching question in this environment is: what gives teams more confidence to ship at the AI frontier [20:38:00]? As agentic product surface areas grow, the “vibes over evals” approach may become harder to scale [21:10:00].
The challenges in building AI applications are exacerbated by the complexity of evaluating LLMs, especially in fields like real estate legal, where correctness, style, conciseness, and citation accuracy are all critical [21:49:00]. Creating comprehensive evaluation systems for all edge cases can be prohibitively expensive, slow, and potentially impossible given product velocity [22:11:00].
Progressive Delivery
Progressive delivery offers a potential path forward: “upgrade now and fix on the fly” [22:30:00]. This involves rolling out new models internally first, then progressively to a limited cohort of users, scaling up incrementally based on feedback [22:42:00]. The goal is to calibrate the rollout against that feedback: scale up until internal teams are swamped, dial back, and repeat until feedback is minimal at 100% user adoption [22:50:00].
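The calibration loop described above can be expressed as a simple control rule: expand the rollout while negative feedback stays quiet, and dial back when the team is swamped. This is an illustrative sketch only; the function name, threshold, and step size are assumptions, not values from the talk.

```python
def next_rollout_pct(current_pct: float, negative_feedback_rate: float,
                     threshold: float = 0.02, step: float = 10.0) -> float:
    """One calibration step: grow the rollout when negative feedback is
    below the threshold, shrink it otherwise. Clamped to [0, 100]."""
    if negative_feedback_rate > threshold:
        return max(current_pct - step, 0.0)   # swamped: dial back
    return min(current_pct + step, 100.0)     # quiet: scale up
```

Run after each feedback window, this converges to 100% only when the new model's regressions have been fixed on the fly faster than they are reported.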
The central thesis is that to stay on the edge of the AI frontier and maximize opportunities, companies must “buy now” – ship new capabilities into their agentic products and into users’ hands [23:37:00]. While the prompt tax might seem daunting, the anxiety may not always be warranted, and progressive rollout tactics can mitigate downside risks [24:00:00]. Staying on the frontier means prioritizing shipping, even if the “payment” of the prompt tax is determined on a case-by-case basis [24:23:00].