From: aidotengineer

While 2025 is widely considered the “AI agent moment,” with reasoning models beginning to outperform humans and computation becoming more efficient, AI agents are not yet reliable enough for practical applications [03:38:00]. Despite significant advancements, autonomous AI systems still face challenges that prevent consistent, reliable performance [04:56:00].

Defining AI Agents

For the purposes of this discussion, an AI agent is defined as a fully autonomous system in which large language models (LLMs) direct their own actions [05:12:00].

The Problem of Cumulative Errors

Often, discussions about AI errors focus on “hallucinations” or “fabrications” [06:47:00]. However, a more pervasive issue is the accumulation of “tiny cumulative errors” that compound over multi-step tasks [06:54:00].

Consider a seemingly simple task like booking a flight from New York to San Francisco under specific constraints (e.g., departing after 3 PM, avoiding rush hour, preferred airlines, under $500, an aisle seat not near the bathroom, arriving before midnight) [05:32:00]. An AI agent can struggle with such a request, as demonstrated when attempts using OpenAI’s Operator failed to meet the criteria [06:06:00].

These errors, even when individually minor, compound into large performance gaps over multiple steps [09:00:00]. For example, an agent with 99% per-step accuracy drops to roughly 60% end-to-end accuracy after 50 consecutive steps, while a 95% accurate agent performs far worse [09:03:00].
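
The arithmetic behind this drop-off is simple compounding. A minimal sketch, assuming each step succeeds independently with the same per-step probability:

```python
# Probability that every step of a multi-step task succeeds,
# assuming independent steps with uniform per-step accuracy.
for p in (0.99, 0.95):
    end_to_end = p ** 50  # chance all 50 consecutive steps succeed
    print(f"{p:.0%} per step -> {end_to_end:.1%} over 50 steps")

# 99% per step -> 60.5% over 50 steps
# 95% per step ->  7.7% over 50 steps
```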

Types of Cumulative Errors

  • Decision Error: The AI chooses the wrong fact, such as booking a flight to “San Francisco, Peru” instead of “San Francisco, California” [07:10:00].
  • Implementation Error: The AI hits access or integration problems, such as a CAPTCHA interrupting a flow or getting locked out of a database [07:23:00].
  • Heuristic Error: The AI applies the wrong criteria, such as not accounting for New York City rush hour traffic when booking a 5:30 PM flight from JFK, or not asking the user’s starting borough [07:41:00].
  • Taste Error: The AI fails to account for personal preferences not explicitly stated in the prompt, such as booking a flight on a specific aircraft type the user dislikes [08:03:00].
  • Perfection Paradox: Human expectations of AI’s seemingly magical capabilities run so high that users become frustrated when agents are slow or inconsistent, even if they ultimately complete the task [08:22:00].

Strategies for Mitigating AI Errors

To optimize complex AI agents for consistent, reliable decision-making, several strategies can be applied in production [09:30:00]. Each aims to mitigate cumulative errors and improve overall agent performance.

1. Data Curation

Data is paramount, and its quality and organization directly impact an AI agent’s effectiveness [10:09:00].

  • Comprehensive Data: Data is messy, unstructured, siloed, and includes various modalities beyond text and web, such as design, image, video, audio, sensor, and manufacturing data [10:14:00].
  • Proprietary Data: Focus on curating proprietary data, including data the AI agent generates itself, and data used for quality control within the model workflow [10:32:00].
  • Agent Data Flywheel: Design an agent data flywheel from day one, allowing the product to automatically improve in real-time and at scale every time a user interacts with it [10:49:00]. For example, collecting a curated dataset of a user’s travel preferences, including specific airline and aircraft dislikes, and recycling this content to adapt to preferences over time [11:01:00].
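
As an illustration, such a flywheel might look like the following sketch, where every interaction is recycled into curated preference data; all names and fields here are hypothetical, not from the talk:

```python
from dataclasses import dataclass, field

@dataclass
class PreferenceStore:
    """Hypothetical per-user preference store fed by agent interactions."""
    likes: set = field(default_factory=set)
    dislikes: set = field(default_factory=set)

    def record_feedback(self, item: str, liked: bool) -> None:
        # Each interaction becomes curated proprietary data.
        (self.likes if liked else self.dislikes).add(item)

    def to_prompt_context(self) -> str:
        # Recycled into the agent's context so behavior adapts over time.
        return (f"User likes: {sorted(self.likes)}; "
                f"avoid: {sorted(self.dislikes)}")

prefs = PreferenceStore()
prefs.record_feedback("aisle seat", liked=True)
prefs.record_feedback("Boeing 737 MAX", liked=False)
print(prefs.to_prompt_context())
```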

2. Evals (Evaluations)

Rigorous evaluations are crucial: collect a model’s responses, measure them, and select the best answer [11:19:00].

  • Verifiable Domains: Evals are straightforward in verifiable domains with clear yes/no answers, like math and science benchmarks [11:33:00].
  • Non-Verifiable Systems: The challenge lies in setting up evaluations for non-verifiable systems where answers are subjective (e.g., “Did Grace like this plane seat?”) [11:50:00].
  • Human Preferences: It’s essential to collect human preferences and build evaluations that are truly personal [12:27:00]. Sometimes the best evaluation is simply trying out the agent yourself, relying on “vibes” grounded in your own needs rather than numbers or leaderboards [12:33:00].
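
One way to approximate an evaluation for non-verifiable outputs is to collect pairwise human preferences rather than right/wrong scores. A sketch under that assumption (the agent_a, agent_b, and ask_human hooks are hypothetical placeholders):

```python
import random

def pairwise_preference_eval(prompts, agent_a, agent_b, ask_human):
    """Collect which of two agents' outputs a human prefers per prompt."""
    wins = {"agent_a": 0, "agent_b": 0}
    for prompt in prompts:
        out_a, out_b = agent_a(prompt), agent_b(prompt)
        a_shown_first = random.random() < 0.5  # randomize order to avoid position bias
        first, second = (out_a, out_b) if a_shown_first else (out_b, out_a)
        choice = ask_human(prompt, first, second)  # 0 = first, 1 = second
        preferred_a = (choice == 0) == a_shown_first
        wins["agent_a" if preferred_a else "agent_b"] += 1
    return wins
```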

3. Scaffolding Systems

Building resilient AI workflows requires scaffolding systems that prevent a single error from cascading through the organization or the agentic system [12:44:00].

  • Infrastructure Logic: Implement infrastructure logic to ensure that a failure in a new applied AI feature does not impact the broader production infrastructure [12:56:00].
  • Compound Systems & Human in the Loop: Mitigate failures by building compound systems in which different components work together and, when needed, bringing a human back into the loop for reasoning [13:06:00].
  • Self-Healing Agents: For stronger agents, adapt the scaffold so they can self-heal and grow: realizing when they are wrong and correcting their own path, or breaking execution when unsure in order to get back on track [13:18:00]. Checkpoints can be added, for instance, to verify traffic conditions before a flight booking is confirmed [13:33:00].
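
A checkpointed execution loop of this kind might look like the following sketch (the step list, validate hook, and retry policy are illustrative assumptions, not the talk’s implementation):

```python
def run_with_checkpoints(steps, validate, max_retries=2):
    """Validate after every step and retry from the last known-good
    state, so one bad step cannot cascade through the rest of the plan."""
    state = {}
    for name, step in steps:
        last_good = dict(state)  # snapshot before attempting the step
        for _ in range(max_retries + 1):
            state = step(dict(last_good))  # always retry from the snapshot
            if validate(name, state):
                break  # checkpoint passed; continue to the next step
        else:
            # Still failing after retries: break execution and escalate
            # to a human rather than compounding the error downstream.
            raise RuntimeError(f"checkpoint failed after step {name!r}")
    return state
```

In the flight-booking example, a concrete checkpoint could verify that the chosen departure time still clears rush-hour traffic before the booking step commits.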

4. User Experience (UX)

UX is critical because AI apps often sit on the same underlying foundation models, which are rapidly depreciating assets [13:43:00]. What distinguishes successful AI applications is the quality of the product experience and the user workflow.

  • Human-Machine Collaboration: Reimagine product experiences to deeply understand user workflows and promote elegant human-machine collaboration [14:02:00].
  • Concrete Examples:
    • Asking clarifying questions to fully understand the user’s intent, like with Deep Research [14:13:00].
    • Understanding the user’s psyche to predict their next step, as seen with Windsurf from Codeium for developers [14:20:00].
    • Seamlessly integrating with legacy systems to create real ROI for users, such as Harvey in the legal world [14:27:00].
  • Proprietary Data & Workflow Knowledge: Companies with proprietary data sources and deep understanding of specific user workflows (e.g., robotics, hardware, defense, manufacturing, life sciences) are well-positioned to create magical experiences for end-users [14:55:00].
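
The clarifying-questions pattern from the first example above can be sketched as a small loop; the find_ambiguities, ask_user, and act hooks here are hypothetical:

```python
def clarify_then_act(request, find_ambiguities, ask_user, act):
    """Resolve ambiguities with the user before acting, instead of
    silently guessing and risking a heuristic error."""
    answers = {q: ask_user(q) for q in find_ambiguities(request)}
    return act(request, answers)

# e.g., find_ambiguities("flight NYC -> SF after 3 PM") might yield
# "Which borough are you starting from?" before any booking happens.
```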

5. Multimodality

Moving beyond traditional interfaces like chatbots, multimodality lets builders reimagine interfaces and create 10x more personalized user experiences [15:22:00].

  • Human-like Senses: Make AI more human by incorporating senses such as eyes, ears, nose, voice, and touch [15:43:00]. Voice improvements have been significant, and digitizing the sense of smell is also being explored [15:50:00].
  • Embodiment & Memory: Instill a more human feeling and sense of embodiment through robotics [16:01:00]. Furthermore, enable AI to have “memories” to know users on a deeper, truly personal level [16:07:00].
  • Visionary Product Experience: Even if an agent is inconsistent or unreliable, a visionary product that exceeds expectations through novel multimodal experiences can reframe what “perfection” means to a human [16:18:00]. An example is Tlop, which reimagines the visual canvas by implementing AI through brush strokes and combining various AI models seamlessly in the background [16:28:00].

By focusing on these strategies, organizations can work towards building more robust, reliable, and user-friendly AI agents despite the inherent challenges in AI development [16:51:00].