From: aidotengineer
Developing and shipping AI products at the frontier involves a “prompt tax” – the hidden costs and challenges associated with incorporating rapidly evolving AI models into applications [00:00:05]. This concept highlights the tension between the opportunities new AI models offer and the risks of introducing regressions or unintended consequences into products [00:01:40].
The Pain of Progress
The pace of AI model releases is akin to having a birthday every month, with new advancements from major labs like Anthropic, Google Gemini, and OpenAI [00:00:41]. While these advancements in AI offer incredible new functionality to integrate into applications, they also come with “unintended consequences” due to the probabilistic nature of these systems, which can behave unexpectedly [00:01:21].
Understanding the Prompt Tax
The prompt tax is distinct from technical debt [00:10:32]. Technical debt often involves optimizing for quick shipping with the intention to fix later, or dealing with prototypes that become core products and require rebuilding [00:10:36]. In contrast, the prompt tax arises from the desire to upgrade to new models immediately to unlock new capabilities [00:11:00]. The challenge lies in the unknowns: it’s unclear what will improve and what will break with each new model [00:11:10]. This requires a strategy of “fixing on the fly” and optimally releasing new models to users to gather feedback and mitigate risks [00:11:16].
Case Study: Orbital’s Experience
Orbital, a company with offices in New York and London, aims to automate real estate due diligence to expedite property transactions [00:01:56]. This involves using AI to read extensive paperwork and find “red flags” for clients, a process traditionally done manually by real estate lawyers [00:02:17]. Their agentic software, “Orbital Co-pilot,” supercharges this process by radically reducing the time to find crucial information [00:02:29].
Product Demonstration: Orbital Co-pilot
Orbital Co-pilot automates the manual tasks of lawyers, such as reading paperwork and compiling extracted information [00:03:52]. The process includes:
- Selecting a report type, such as an occupational lease report [00:04:04].
- Uploading documents, like a deed and a lease (e.g., 100 pages total) [00:04:08].
- OCR (Optical Character Recognition) of documents with handwritten and typed text to structure them [00:04:17].
- The agentic system creating a plan, breaking it into subtasks, and using multiple LLM calls to find objectives (e.g., lease date, annual rent) by reading legal documents [00:04:30].
- After completion, a final report can be quickly reviewed by a lawyer, with clickable citations to the “ground truth” documents [00:04:58].
- The report can then be downloaded and sent to clients [00:05:17].
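The workflow above can be sketched as a minimal agentic pipeline. All names and types here are hypothetical, invented for illustration; the LLM call is stubbed with simple string matching, since the talk does not describe Orbital's actual implementation.

```python
from dataclasses import dataclass, field

@dataclass
class Finding:
    objective: str   # e.g. "lease date", "annual rent"
    answer: str
    citation: str    # reference back to the ground-truth document

@dataclass
class Report:
    report_type: str
    findings: list[Finding] = field(default_factory=list)

def extract_objective(objective: str, pages: list[str]) -> tuple[int, str]:
    # Stub standing in for an LLM call that reads the relevant pages.
    for page_no, page in enumerate(pages, start=1):
        if objective in page:
            return page_no, page
    return 0, "not found"

def run_copilot(report_type: str, ocr_pages: list[str],
                objectives: list[str]) -> Report:
    """Plan -> subtasks -> one extraction pass per objective -> final report."""
    report = Report(report_type)
    for objective in objectives:  # each subtask in the plan
        page_no, answer = extract_objective(objective, ocr_pages)
        report.findings.append(Finding(objective, answer, f"page {page_no}"))
    return report
```

The clickable citations in the real product correspond to the `citation` field here: every finding carries a pointer back to the source document so a lawyer can verify it quickly.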
Growth and Evolution
Over 18 months, Orbital’s agentic system has scaled significantly:
- From burning less than a billion tokens to consuming almost 20 billion tokens monthly [00:05:36]. This represents 20 billion tokens worth of work previously done manually by lawyers, now automated [00:05:57].
- Revenue has grown from zero to multiple seven figures in annual recurring revenue [00:06:08].
The company has migrated through successive models, from GPT-3.5 and GPT-4 32K (whose larger context window first made agentic systems viable) through GPT-4 Turbo, GPT-4o, and GPT-4.1, to System 2 reasoning models such as o1-preview and o4-mini [00:06:28].
Key Development Decisions
Orbital made three key decisions in their AI development:
- Optimize for Prompting Over Fine-Tuning: This maximized development speed, allowing prompt adjustments driven by user feedback to be incorporated in near real time, which was especially valuable while finding product-market fit [00:07:00].
- Heavy Reliance on Domain Experts: Private practice real estate lawyers with decades of experience write many of the prompts, imparting their expertise to the AI system [00:07:34].
- “Vibes Over Evals”: Despite evaluation systems being popular, Orbital has primarily relied on subjective testing by human domain experts prior to release, focusing on a general “feel” of the system’s performance, sometimes logging regressions in spreadsheets [00:07:58]. This approach, while less rigorous, has supported their rapid growth [00:08:11].
Prompt Management
Orbital categorizes prompts into two areas:
- Agentic prompts: Owned by AI engineers, these are system prompts that help the model choose when and which tools to use [00:08:56].
- Domain-specific prompts: Used by real estate lawyers, these teach the system its expertise in the real estate domain [00:09:09].
The number of domain-specific prompts has grown from near zero to over 1,000, creating a challenge where “more prompts equals more prompt tax” [00:09:21].
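One way to keep a prompt library of this size manageable is a registry keyed by category and owner. This is an illustrative sketch only, assuming the two prompt categories described above; it is not Orbital's implementation.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Prompt:
    name: str
    category: str   # "agentic" (owned by AI engineers) or "domain" (lawyers)
    owner: str
    text: str

class PromptRegistry:
    """Tracks prompts by name so each category can be audited separately."""
    def __init__(self):
        self._prompts: dict[str, Prompt] = {}

    def register(self, prompt: Prompt) -> None:
        self._prompts[prompt.name] = prompt

    def by_category(self, category: str) -> list[Prompt]:
        return [p for p in self._prompts.values() if p.category == category]
```

Splitting ownership this way makes the "more prompts equals more prompt tax" problem at least measurable: on a model upgrade, each category can be re-tested by the team that owns it.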
Shipping at the AI Frontier
When a new AI model is released, the process involves rigorous experimentation by AI engineers and domain experts to unlock envisioned features or inspire new ideas [00:09:39]. A key concern is assessing the prompt tax required to migrate existing prompts to the new model [00:10:04]. There is an inherent fear of “unknown unknowns” when shipping a new AI model, which needs to be pinpointed and mitigated [00:10:10].
Battle-Tested Tactics for Managing Prompt Tax
Orbital has developed several best practices for building AI systems and managing the prompt tax over 18 months:
Model Migration Strategies
- System 1 vs. System 2 Models: When migrating from System 1 models (e.g., GPT-4o) to System 2 models (e.g., o1-preview), the approach to prompts changes:
- System 1 models require specific instructions on how to accomplish tasks [00:12:12].
- System 2 models only need specification of what to do, with leaner prompts and fewer repeated instructions [00:12:19].
- System 2 models thrive when “unblocked” – given clear objectives and time to reason and find appropriate plans without excessive constraints [00:12:40].
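The difference in prompting style can be illustrated with two prompt templates. The wording below is invented for illustration and is not taken from Orbital's actual prompts.

```python
def system1_prompt(task: str) -> str:
    # System 1 models need explicit step-by-step *how* instructions.
    return (
        f"Task: {task}\n"
        "Follow these steps exactly:\n"
        "1. Read each page of the lease in order.\n"
        "2. Quote the clause that states the answer.\n"
        "3. If the clause is ambiguous, list every candidate reading.\n"
        "4. Output the answer with a page citation.\n"
    )

def system2_prompt(task: str) -> str:
    # System 2 (reasoning) models only need the *what*: a clear objective,
    # leaving the model unblocked to plan its own approach.
    return f"Task: {task}\nReturn the answer with a page citation."
```

The System 2 version is deliberately leaner: the objective and output format are specified, but the procedural scaffolding and repeated constraints are dropped.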
Enhancing Explainability and Debugging
- Thought Tokens: The reasoning traces (thought tokens) that come with System 2 models can be surfaced to users for explainability, which is especially valuable in a complex domain like real estate law, or used for debugging when something goes wrong [00:13:13].
Mitigating Risk in Deployment
- Feature Flags: Similar to software development, feature flags can be used to progressively roll out new AI model upgrades, mitigating risk [00:13:46].
- Addressing Change Aversion Bias: There is often more anxiety associated with moving to a new system than staying with a known one, even if the new system has benefits [00:14:00]. Simply announcing a new model can heighten awareness and lead users to seek out issues, potentially outweighing the positives [00:14:30].
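The feature-flag approach described above can be sketched as a cohort-gated model selector. The flag names, cohorts, and model identifiers below are hypothetical placeholders, not Orbital's configuration.

```python
# Flag state per cohort: internal users first, then beta, then everyone.
FLAGS = {
    "new_model_upgrade": {"internal": True, "beta": True, "general": False},
}

def flag_enabled(flag: str, cohort: str) -> bool:
    """Unknown flags and cohorts default to off, keeping rollouts safe."""
    return FLAGS.get(flag, {}).get(cohort, False)

def choose_model(cohort: str) -> str:
    # Route the request to the new model only for cohorts with the flag on.
    return "model-next" if flag_enabled("new_model_upgrade", cohort) else "model-current"
```

Because the flag defaults to off, flipping it back is an instant rollback path if the new model introduces a regression, without redeploying.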
Strategic Product Development
- “Betting on the Model” Mantra: This involves designing products not just for current AI capabilities but anticipating future improvements in intelligence, cost, speed, and capabilities (3, 6, or 12 months out) [00:14:56]. This allows products to grow with the models rather than stagnating [00:15:15].
- Using System 2 Models for Prompt Migration: Newer, more capable models can be used to migrate domain-specific prompts originally written for older models, significantly reducing manual human effort [00:15:45].
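Using a newer model to migrate a legacy prompt can be sketched as constructing a migration meta-prompt. The actual model call is omitted here because the talk gives no API details; the template wording is an assumption.

```python
def migration_prompt(old_prompt: str, source_model: str, target_model: str) -> str:
    """Build a meta-prompt asking a reasoning model to rewrite a legacy prompt."""
    return (
        f"The prompt below was written for {source_model}. Rewrite it for "
        f"{target_model}: state the objective clearly, drop step-by-step "
        "how-to instructions and repeated constraints, and preserve every "
        "domain-specific requirement verbatim.\n\n"
        f"--- ORIGINAL PROMPT ---\n{old_prompt}"
    )
```

In practice the rewritten prompt would still go through the domain-expert review described earlier; the meta-prompt only removes the rote part of the migration.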
Decision-Making and Feedback
- Making Tough Calls and Shipping: Given the uncertainty of probabilistic models and new capabilities, a team needs to be brave enough to take the risk, ship the product, and deal with consequences post-release, rather than being paralyzed by anxiety [00:16:12].
- Strong Feedback Loop: Establishing a rapid feedback loop, whether manual or built into the product’s UX (e.g., thumbs up/down), is crucial [00:17:10]. Feedback should quickly reach AI engineers and domain experts so they can identify prompt changes and ship fixes to production within minutes or hours, resolving issues for all users [00:17:22].
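A minimal version of the in-product feedback loop might look like the following. The class and field names are invented for illustration, assuming a thumbs up/down widget in the UX.

```python
import time
from dataclasses import dataclass, field

@dataclass
class Feedback:
    user_id: str
    prompt_name: str        # which prompt produced the output being rated
    thumbs_up: bool
    note: str = ""
    ts: float = field(default_factory=time.time)

class FeedbackQueue:
    """Collects in-product ratings for triage by engineers and domain experts."""
    def __init__(self):
        self.items: list[Feedback] = []

    def submit(self, fb: Feedback) -> None:
        self.items.append(fb)

    def negatives(self) -> list[Feedback]:
        # Triage view: regressions to turn into prompt fixes within hours.
        return [fb for fb in self.items if not fb.thumbs_up]
```

Tagging each rating with the prompt that produced the output is what lets a fix ship to production quickly: the negative signal points directly at the prompt a domain expert needs to amend.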
Expert Insights: Demis Hassabis on the Evolving Tech Stack
Demis Hassabis, CEO of Google DeepMind, highlights that the “underlying tech is moving unbelievably fast,” which differs from previous revolutionary technologies like the internet or mobile, where the tech stack eventually stabilized [00:18:17]. This constant evolution makes current AI implementation uniquely challenging for product development, as what is 100% better in a year is unknown [00:18:46]. It necessitates “deeply technical product people” who can predict where the technology will be in a year to design products that leverage future capabilities [00:19:06]. When something works, there’s a need to “double down quick” [00:19:34].
The Future of AI Engineering
There’s a significant opportunity for “product AI engineers” who understand both customer problems and the capabilities of AI models [00:19:41]. This connective tissue between technical understanding and user needs is incredibly promising for turning model capabilities into real-world product features that solve user problems [00:20:00].
Paying the Prompt Tax (or Shipping Now)
The meta-question for shipping at the AI frontier is what provides more confidence [00:20:36]. As AI evolves and agentic product surface areas grow, maintaining confidence is key to continued innovation [00:20:55].
While Orbital has largely built its product on “vibes” complemented by real-time user feedback and rapid tooling, the question remains whether this approach will scale as the product surface area increases [00:21:03]. An evaluation system could be an answer, potentially alleviating concerns and allowing further progress [00:21:29]. However, evaluating dimensions like correctness, style, conciseness, and citation accuracy across numerous edge cases for probabilistic LLMs can be prohibitively expensive, slow, and potentially unable to keep pace with product velocity [00:21:42].
Progressive Delivery
Progressive delivery is a potential way forward, embodying the “upgrade now and fix on the fly” approach [00:22:30]. This involves:
- Rolling out internally first [00:22:42].
- Then to a limited cohort of users [00:22:44].
- Incrementally scaling up to 50% or 100% of users over time, calibrated by feedback volume [00:22:48]. This minimizes downside risks while maximizing benefits of new AI models [00:11:37].
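The incremental ramp above can be sketched with deterministic user bucketing, so a user's assignment never flips mid-rollout. This is an illustrative sketch of the progressive-delivery idea, not Orbital's implementation.

```python
import hashlib

def bucket(user_id: str) -> int:
    """Stable 0-99 bucket derived from the user ID, not from random draws."""
    return int(hashlib.sha256(user_id.encode()).hexdigest(), 16) % 100

def on_new_model(user_id: str, rollout_percent: int) -> bool:
    # Raising rollout_percent (e.g. 5 -> 50 -> 100, calibrated by feedback
    # volume) only ever adds users; no one silently loses the new model.
    return bucket(user_id) < rollout_percent
```

Because buckets are stable, dialing the percentage up or down is a one-line config change, which is what makes "upgrade now and fix on the fly" operationally safe.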
The central thesis for staying at the edge of the AI frontier and maximizing its opportunities is to ship now: get new models into agentic products and into users’ hands quickly [00:23:37]. The anxiety about potential downsides may never materialize, and progressive rollout tactics can absorb feedback and fixes incrementally [00:24:00].