Prompt tax concept in AI development

From: aidotengineer

The “prompt tax” is a concept introduced for builders of agentic systems, referring to the hidden costs and unintended consequences of incorporating new AI model functionalities into applications [00:00:05], [00:01:01]. It arises from the rapid advancement of AI models, which constantly introduce new capabilities but also bring the risk of regressions or unexpected behaviors in probabilistic systems [00:01:21], [00:01:28]. This creates a tension between seizing new opportunities and mitigating risks when shipping products at the AI frontier [00:01:38].

The Pain of Progress

Keeping up with the pace of AI model progress feels like having a birthday every month, with major labs such as Anthropic, Google (Gemini), and OpenAI frequently releasing new models and functionalities [00:00:37], [00:00:41]. This constant innovation, while beneficial, leads to the “hidden prompt tax” [00:01:01].

Orbital’s Experience at the AI Frontier

Orbital, a company with offices in New York and London, aims to automate real estate due diligence to expedite property transactions [00:01:56]. Their agentic software helps lawyers by radically reducing the time needed to find “needles in a haystack” within mountains of paperwork and compile necessary documents for real estate transactions [00:02:29].

Orbital Co-Pilot: An Agentic Product

In January 2024, Orbital developed its first agentic product, “Orbital Co-pilot,” designed to think like a real estate lawyer [00:03:14]. The product workflow involves:

Selecting a report type [00:04:01].
Uploading documents (e.g., deed, lease) [00:04:08].
OCR processing of documents with handwritten and typed text [00:04:17].
The agentic system creating a plan, breaking it into subtasks, each acting as its own agentic system with multiple LLM calls [00:04:30].
The system reading legal documents to find specific information (e.g., lease date, annual rent) [00:04:41].
Generating a final report that lawyers can review quickly, with clickable citations to the source documents [00:04:58].
Downloading the report for storage and client delivery [00:05:15].

In the 18 months since it was commercialized, Orbital’s agentic system has seen significant growth:

Token consumption: From less than 1 billion tokens to almost 20 billion tokens per month [00:05:34], [00:05:41]. These 20 billion tokens’ worth of work was previously done manually by lawyers [00:05:57].
Revenue: Scaled from zero to multiple seven figures in annual recurring revenue [00:06:08].

Model Migration and Key Decisions

Orbital has migrated through various models, starting with GPT-3.5 and progressing through GPT-4 versions (32K, Turbo, 4o, 4.1) and System 2 models (o1-preview to o4-mini) [00:06:27].

Three key decisions were made:

Optimize for prompting over fine-tuning: This maximized the speed of development, allowing real-time prompt adjustments based on user feedback to be quickly incorporated [00:07:00].
Heavy reliance on domain experts: Private practice real estate lawyers with decades of experience write many prompts, teaching the AI system their expertise [00:07:34].
“Vibes over evals”: While an evaluation system is on the roadmap, growth has been driven by subjective human testing by domain experts and user feedback, rather than a rigorous, objective evaluation system [00:07:57]. Issues are sometimes logged in spreadsheets to track regressions from prior model changes, but the process is largely subjective [00:08:38].

Orbital’s prompts are divided into:

Agentic prompts: Owned by AI engineers, these are system prompts that help the model choose and use tools [00:08:54].
Domain-specific prompts: Written by real estate lawyers to instill their real estate expertise into the system [00:09:09].

The number of domain-specific prompts has grown from near zero to over 1,000 [00:09:18]. The challenge is that “more prompts equals more prompt tax” [00:09:30].

Dealing with New AI Models

When a new AI model is released, Orbital’s team rigorously experiments with it to:

Unlock new features that have been envisioned [00:09:39].
Determine if the new models are “fit for purpose” [00:09:55].
Get inspired by new ideas [00:09:58].
Assess the “prompt tax” required to migrate existing prompts to the new model [00:10:04].
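
One way to make the size of that migration visible is to record which models each prompt has already been checked against. The sketch below is a minimal illustration under stated assumptions, not Orbital's actual tooling: the `Prompt` class, the `validated_on` field, the prompt names, and the model names `model-a` and `model-b` are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Prompt:
    """One prompt plus the models it has been checked against (hypothetical schema)."""
    name: str
    kind: str  # "agentic" or "domain", the two groups described above
    text: str
    validated_on: set[str] = field(default_factory=set)


def outstanding_prompts(prompts: list[Prompt], target_model: str) -> list[Prompt]:
    """Return the prompts not yet validated against target_model."""
    return [p for p in prompts if target_model not in p.validated_on]


def migration_summary(prompts: list[Prompt], target_model: str) -> dict[str, int]:
    """Count outstanding prompts per kind for target_model."""
    summary: dict[str, int] = {}
    for p in outstanding_prompts(prompts, target_model):
        summary[p.kind] = summary.get(p.kind, 0) + 1
    return summary


# Illustrative data only; prompt texts are placeholders.
prompts = [
    Prompt("tool_router", "agentic", "placeholder", {"model-a", "model-b"}),
    Prompt("lease_date", "domain", "placeholder", {"model-a"}),
    Prompt("annual_rent", "domain", "placeholder", {"model-a"}),
]

print(migration_summary(prompts, "model-b"))  # {'domain': 2}
```

With over 1,000 domain-specific prompts, a count like this gives only a rough upper bound on the work; how "validated" is decided (subjective expert review or an eval) is left open here.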

Shipping new AI models carries inherent fear and uncertainty, as they can introduce regressions or unintended consequences [00:10:10].

Prompt tax is distinct from technical debt [00:10:32]. Technical debt is often incurred by optimizing for shipping quickly and fixing later, where a feature might never find product-market fit or a prototype becomes core and needs rebuilding [00:10:36]. Prompt tax, however, stems from the desire to upgrade to new models that unlock immense value but come with unknowns about what will improve or break [00:10:58]. The goal is to fix things on the fly by progressively rolling out new models to users, gathering feedback, and quickly addressing issues to maximize benefits and mitigate risks [00:11:16].

Battle-Tested Tactics

Orbital has discovered several tactics over 18 months of migrating models:

Migrating from System 1 (e.g., GPT-4o) to System 2 models (e.g., o1-preview):
- System 1 models required specific instructions on how to accomplish tasks [00:12:12].
- System 2 models only need to be told what to do, with less specific or repeated instructions [00:12:19].
- System 2 models prefer minimal constraints, a clear objective, and time to reason [00:12:40].
Leveraging System 1 Models: While System 2 models are favored, System 1 models can be cheaper and faster [00:13:07]. They also often provide “thought tokens” (internal reasoning) that can be embedded for user explainability (especially in complex legal domains) or used for debugging when something goes wrong [00:13:13].
Applying Feature Flags to AI Models: Similar to software development, progressively rolling out AI model upgrades with feature flags helps mitigate risk [00:13:46]. However, there’s a “change aversion bias” and natural anxiety when moving to a new system, as users are hyper-aware of potential issues, which can sometimes outweigh the positives [00:14:00].
“Betting on the Model” Mantra: The team’s mantra is to build for where AI models will be in 3, 6, or 12 months, assuming they will get smarter, cheaper, faster, and more capable [00:14:56]. This ensures features improve as models evolve rather than stagnating [00:15:10].
Using System 2 Models for Prompt Migration: Newer models can help migrate prompts created for older models, radically decreasing manual human effort [00:15:44].
Making Tough Calls and Shipping: Despite the uncertainty of probabilistic models and new capabilities, it’s crucial for teams to take on the risk, ship, and then deal with the consequences, mitigating risks along the way [00:16:10]. Bravery is needed to overcome the inherent anxiety of shipping new models [00:16:56].
Strong Feedback Loops: Implementing rapid feedback loops (e.g., in-product thumbs up/down) is essential. Feedback should quickly reach AI engineers and domain experts, allowing prompt changes to be made and shipped to production within minutes or hours, fixing issues for all users [00:17:10].
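
The feature-flag tactic above can be sketched as a percentage rollout that assigns each user to a stable bucket. This is a minimal sketch under assumptions: the flag name, the model names, and the `choose_model` helper are hypothetical, and a real deployment would more likely use an existing feature-flag service than hand-rolled hashing.

```python
import hashlib


def bucket(user_id: str, flag: str) -> int:
    """Map a user to a stable bucket in [0, 100) for a given flag."""
    digest = hashlib.sha256(f"{flag}:{user_id}".encode("utf-8")).hexdigest()
    return int(digest, 16) % 100


def choose_model(user_id: str, rollout_percent: int,
                 old_model: str = "model-old",
                 new_model: str = "model-new",
                 flag: str = "new-model-rollout") -> str:
    """Serve new_model to roughly rollout_percent of users, old_model to the rest."""
    return new_model if bucket(user_id, flag) < rollout_percent else old_model


# 0% keeps everyone on the old model; 100% moves everyone to the new one.
print(choose_model("lawyer-42", 0))    # model-old
print(choose_model("lawyer-42", 100))  # model-new
```

Because the bucket depends only on the flag name and the user ID, a user who receives the new model at a low percentage keeps it as the percentage rises, which avoids flipping individual users back and forth between models during a rollout.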

The Uniquely Challenging AI Development Environment

Demis Hassabis of Google DeepMind highlights that a key challenge in AI development is the unbelievably fast pace of underlying technology [00:18:14]. Unlike previous revolutionary technologies like the internet or mobile, where the tech stack eventually stabilized, the AI tech stack itself is constantly evolving [00:18:22]. This makes it uniquely challenging for product development, as what one bets on today could be 100% better in a year [00:18:55]. It requires deeply technical product people who can predict where the technology will be in a year to design products that leverage future capabilities [00:19:06]. This environment fosters experimentation, and when something works, there’s a need to “double down quick” [00:19:30].

There is a significant opportunity for “product AI engineers” who understand both customer problems and the capabilities of AI models to turn them into real product features. This direct connection between technical understanding and user needs is incredibly powerful for the future of the AI engineering community [00:19:40].

Paying the Prompt Tax: Shipping with Confidence

The meta-question for AI developers is what gives them more confidence to ship at the frontier as AI advances and agentic product surface areas grow [00:20:36]. While Orbital has built its product mostly on “vibes” (subjective human testing and real-time user feedback) [00:21:03], the question remains whether this approach will scale as the product grows [00:21:10].

Eval Systems vs. Progressive Delivery

An eval system (rigorous, objective evaluation) might be the answer to bending the curve and pushing further [00:21:29]. However, for complex domains like real estate legal, evaluating correctness, style, conciseness, and citations across all edge cases for probabilistic LLMs becomes prohibitively expensive and slow, and keeping up with product velocity may be an impossible task [00:21:42].

Progressive delivery, or “upgrade now and fix on the fly,” offers a potential way forward [00:22:30]. This involves:

Rolling out internally first [00:22:42].
Then to a limited number of progressive users [00:22:44].
Incrementally scaling up to more users based on feedback [00:22:50].
Dialing back if internal teams are swamped with feedback until feedback is minimal [00:22:59].
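
The steps above can be sketched as a small controller that moves between rollout stages according to how much feedback is open. The stage percentages and both thresholds below are illustrative assumptions, not figures from the talk.

```python
# Hypothetical rollout stages: percent of external users on the new model.
# Stage 0 means internal users only.
STAGES = [0, 5, 25, 50, 100]


def next_stage(index: int, open_items: int,
               minimal_at: int = 3, swamped_at: int = 20) -> int:
    """Step back when swamped, advance when feedback is minimal, else hold."""
    if open_items >= swamped_at:
        return max(index - 1, 0)
    if open_items <= minimal_at:
        return min(index + 1, len(STAGES) - 1)
    return index


def simulate(feedback_counts: list[int]) -> list[int]:
    """Return the rollout percentage in effect after each feedback reading."""
    index = 0
    history = []
    for count in feedback_counts:
        index = next_stage(index, count)
        history.append(STAGES[index])
    return history


print(simulate([1, 2, 25, 10, 0, 2]))  # [5, 25, 5, 5, 25, 50]
```

In this run the rollout reaches 25% of users, drops back to 5% when 25 feedback items pile up, holds while the team works through them, and resumes once the count is minimal again.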

The central thesis for staying at the AI frontier is to “buy now” (ship new models and capabilities into the agentic product and to users) [00:23:37]. The feared downside risks may not materialize, or they can be managed through progressive rollout and quick fixes [00:24:00]. The emphasis remains on staying at the frontier, accepting that whether the prompt tax needs to be fully paid later is determined on a case-by-case basis [00:24:23].
