From: aidotengineer

Introduction

The rise of AI agents and complex AI applications has introduced significant challenges for traditional infrastructure, particularly at the compute layer [00:00:04]. Unlike Web 2.0 services designed for low-latency, stateless requests, modern AI applications often involve long-running workflows that can extend from minutes to hours [00:02:22]. This shift necessitates a re-evaluation of infrastructure design to support these extended execution times and ensure reliability [00:02:01].

The Evolution of AI Workflows

The journey of an AI engineer often begins with simple prompts and tool calls, which can quickly evolve into complex workflows [00:00:42]. Initially, the focus is on making non-deterministic AI code as deterministic as possible by chaining tailored prompts and controlling context [00:00:44]. This process can extend workflow runtimes from 30 seconds to several minutes [00:00:59]. Ultimately, many AI engineers find themselves tackling data engineering challenges, as the most difficult aspect becomes providing the correct context to prompts, often requiring extensive LLM processing of diverse data sources like inboxes or GitHub code [00:01:08].
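
As an illustration of this chaining step, the sketch below uses the official OpenAI Node SDK to split one fuzzy task into two tailored prompts, feeding the controlled output of the first call into the second. The prompts and model name are placeholder assumptions, not anything prescribed by the talk:

```typescript
import OpenAI from "openai";

const openai = new OpenAI();

// Step 1: a narrowly scoped extraction prompt.
async function extractActionItems(email: string): Promise<string> {
  const res = await openai.chat.completions.create({
    model: "gpt-4o-mini", // placeholder model name
    messages: [
      { role: "system", content: "List the action items in this email as bullet points." },
      { role: "user", content: email },
    ],
  });
  return res.choices[0].message.content ?? "";
}

// Step 2: a second tailored prompt that only sees the controlled
// output of step 1, not the raw email.
async function draftReply(actionItems: string): Promise<string> {
  const res = await openai.chat.completions.create({
    model: "gpt-4o-mini",
    messages: [
      { role: "system", content: "Draft a short reply addressing each action item." },
      { role: "user", content: actionItems },
    ],
  });
  return res.choices[0].message.content ?? "";
}

// Chaining makes the overall behavior more predictable than one large,
// open-ended prompt -- at the cost of a longer total runtime.
const reply = await draftReply(await extractActionItems("..."));
```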

Infrastructure Challenges for AI Applications

Traditional Web 2.0 infrastructure, designed for API requests that complete in tens of milliseconds, is ill-suited for current AI applications [00:01:26]. AI application latency, even with fast models or prompt caches, typically ranges from a few seconds to minutes [00:01:39]. Latency that would once have triggered alerts is now the best-case scenario [00:01:51].

Key challenges include:

  • Reliability: Building reliable AI applications is difficult due to frequent outages and rate limits from underlying dependencies like OpenAI and Gemini [00:02:09]. Outages can coincide across providers, and bursty traffic patterns from batch processing or new customer onboarding often hit rate limits unless significant investment is made in higher tiers [00:03:13] (see the backoff-and-fallback sketch after this list).
  • Long Runtimes: Workflows that run for minutes or hours strain infrastructure designed for short, synchronous interactions [00:02:22]. This often forces AI engineers to become data engineers to manage these processes effectively [00:03:26].
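
A minimal sketch of the backoff-and-fallback mitigation mentioned above. The provider wrappers here are hypothetical stand-ins for OpenAI/Gemini SDK calls, not any specific library's API:

```typescript
// Hypothetical provider wrappers -- stand-ins for OpenAI/Gemini SDK calls.
type LlmCall = (prompt: string) => Promise<string>;

async function withBackoff(
  call: LlmCall,
  prompt: string,
  maxAttempts = 5,
): Promise<string> {
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    try {
      return await call(prompt);
    } catch (err) {
      if (attempt === maxAttempts - 1) throw err;
      // Exponential backoff with jitter (~1s, 2s, 4s, ...) to ride out
      // rate limits from bursty traffic (batch jobs, onboarding spikes).
      const delayMs = 1000 * 2 ** attempt * (0.5 + Math.random());
      await new Promise((r) => setTimeout(r, delayMs));
    }
  }
  throw new Error("unreachable");
}

// Fall back to a second provider when the primary stays down -- while
// remembering that outages can coincide across providers.
async function resilientCall(
  primary: LlmCall,
  fallback: LlmCall,
  prompt: string,
): Promise<string> {
  try {
    return await withBackoff(primary, prompt);
  } catch {
    return await withBackoff(fallback, prompt);
  }
}
```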

Existing Solutions and Their Limitations

To manage long-running agents and workflows in production, AI engineers often reach for existing data engineering tools:

  • Queues: Tools like SQS are used to build complex, “Rube Goldberg” machines for process orchestration [00:03:42].
  • Batch Processing Tools: Airflow is a common choice for batch processing [00:03:44].
  • Durable Execution Engines: Temporal provides capabilities for durable execution [00:03:46].

However, these tools are often not ideal for full-stack AI engineers, especially those preferring modern web technologies like TypeScript [00:03:50].

Serverless providers (e.g., AWS Lambda, Vercel Functions) also have limitations for long-running workflows [00:04:04]:

  • Timeouts: Most time out after 5 minutes [00:04:08].
  • HTTP Request Limits: Some limit outgoing HTTP requests [00:04:11].
  • Lack of Native Streaming Support: Streaming usually needs to be bolted on at the application layer, not natively supported by the infrastructure [00:04:18].
  • No Resumability: If a user refreshes a page or leaves, the context and progress are lost, which is critical for multi-minute processes [00:04:29].

Use Cases for Long-Running Workflows

Long-running workflows are essential for various AI product experiences:

  1. Onboarding/Data Ingestion:

    • Users provide a URL, initiating a single LLM call to extract initial information and identify pages for scraping [00:04:53].
    • A background scraping job then runs for multiple minutes, making hundreds of LLM calls to extract and enrich content [00:05:06].
    • The goal is to allow users to use the product immediately while ingestion happens in the background, showing status updates to prevent fall-off in the funnel [00:05:15].
  2. Content Generation Agents:

    • An AI agent might generate a blog post, a process explicitly communicated to the user as taking several minutes [00:05:43].
    • The user is shown intermediate steps like research, outlining, and section writing [00:05:58].
    • Crucially, if the user leaves or navigates away, they should not lose context, and the process should be resumable [00:06:10].
    • Streaming of both final content and intermediate status is vital, along with transparent error handling so users don’t lose progress and get frustrated [00:07:11] (a sketch of this status-streaming pattern follows below).
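
A sketch of the status-streaming pattern behind both use cases. The `runBlogAgent` generator and the event shape are illustrative assumptions, not a specific framework's API:

```typescript
// Illustrative event shape for intermediate status plus final output.
type WorkflowEvent =
  | { type: "status"; step: "research" | "outline" | "writing"; detail: string }
  | { type: "output"; chunk: string }
  | { type: "done" };

// An async generator lets the caller stream status and content as the
// multi-minute generation progresses, instead of blocking until the end.
async function* runBlogAgent(topic: string): AsyncGenerator<WorkflowEvent> {
  yield { type: "status", step: "research", detail: `Researching "${topic}"...` };
  // ... research LLM calls ...
  yield { type: "status", step: "outline", detail: "Drafting outline..." };
  // ... outline LLM call ...
  yield { type: "status", step: "writing", detail: "Writing sections..." };
  yield { type: "output", chunk: `# ${topic}\n\n...` };
  yield { type: "done" };
}

// The UI consumes the stream; if events are also persisted (e.g. to a
// Redis stream), a refreshed page can replay them instead of starting over.
for await (const event of runBlogAgent("Long-running AI workflows")) {
  console.log(event);
}
```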

Architectural Solutions for Long-Running Workflows

A robust approach to developing AI agents and agentic workflows involves an infrastructure-aware component model:

Component Model

  • Infrastructure Awareness: The framework is aware of the infrastructure it runs on, and vice-versa, enabling features like resumable streams for status and output [00:07:38].
  • Building Blocks: Adopting an “anti-framework” philosophy, it focuses on reusable, idempotent, and independently testable components [00:07:50].
  • Components: These are simple functions, often wrapping SDKs (like OpenAI’s) to provide tooling for retries and tracing, taking prompts/context and returning responses [00:08:07].
  • Workflows: Collections of components that run together, with each component providing a retry boundary and error boundary [00:08:44] (see the sketch after this list).
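
The component model described above can be sketched as plain TypeScript functions. This illustrates the shape, not the framework's actual API; the model name is a placeholder:

```typescript
import OpenAI from "openai";

const openai = new OpenAI();

// A "component" is just a function: prompt/context in, response out.
// Wrapping the SDK call in one place gives a single spot to hang
// retries and tracing, and keeps the component independently testable.
async function summarize(text: string): Promise<string> {
  const res = await openai.chat.completions.create({
    model: "gpt-4o-mini", // placeholder
    messages: [{ role: "user", content: `Summarize:\n\n${text}` }],
  });
  return res.choices[0].message.content ?? "";
}

async function categorize(summary: string): Promise<string> {
  const res = await openai.chat.completions.create({
    model: "gpt-4o-mini",
    messages: [{ role: "user", content: `Pick one category for:\n\n${summary}` }],
  });
  return res.choices[0].message.content ?? "";
}

// A "workflow" composes components; each call site is a natural retry
// and error boundary.
async function triageDocument(text: string): Promise<string> {
  const summary = await summarize(text);
  return categorize(summary);
}
```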

Key Features

  • Tracing: Workflows provide detailed traces of nested components, including token usage and OpenAI call details, simplifying debugging [00:09:29].
  • Built-in Retries: Components can be configured with fluent APIs for things like exponential retry policies and caching, addressing specific problematic components [00:09:44].
  • Automatic REST APIs: Workflows can automatically be exposed as REST APIs supporting synchronous and asynchronous invocation, with APIs to retrieve intermediate and final output streams [00:09:11] (sketched below).
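
A hedged sketch of invoking such an auto-generated REST API asynchronously and then reading its output stream. The endpoints, base URL, and response shape here are hypothetical illustrations of the pattern, not documented routes:

```typescript
// Hypothetical endpoints illustrating the async-invocation pattern.
const BASE = "https://api.example.com/workflows/blog-agent";

// Asynchronous invocation: returns immediately with an execution ID
// instead of holding the connection open for the whole run.
const start = await fetch(`${BASE}/start`, {
  method: "POST",
  headers: { "Content-Type": "application/json" },
  body: JSON.stringify({ topic: "Long-running AI workflows" }),
});
const { executionId } = await start.json();

// Stream intermediate status and output for that execution. Reading
// from the server rather than from in-memory state is what allows a
// client to reconnect later and pick up where it left off.
const stream = await fetch(`${BASE}/executions/${executionId}/stream`);
const reader = stream.body!.getReader();
const decoder = new TextDecoder();
for (;;) {
  const { done, value } = await reader.read();
  if (done) break;
  console.log(decoder.decode(value));
}
```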

Infrastructure Architecture (Example: Genisys)

A tailored serverless platform for long-running workflows can separate the API and compute layers [00:10:04]:

  • API Layer: Invokes the compute layer and passes a Redis stream ID [00:10:20].
  • Compute Layer: Executes the sandbox program. Communication after initial invocation happens via Redis streams [00:10:32].
  • Redis Streams: Used for transmitting status, output, and heartbeats from the executing sandbox program [00:10:33]. This allows background processes to monitor workflow completion and automatically restart or notify users if a workflow dies [00:10:43] (see the sketch below).
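
A minimal sketch of the compute side using the `ioredis` client. The stream entry fields and the heartbeat interval are assumptions for illustration:

```typescript
import Redis from "ioredis";

const redis = new Redis(process.env.REDIS_URL ?? "redis://localhost:6379");

// The API layer hands the sandbox a stream ID at invocation time;
// everything after that flows through the Redis stream.
async function runWorkflow(streamId: string) {
  // Periodic heartbeats let a background monitor detect dead workflows
  // and restart them or notify the user.
  const heartbeat = setInterval(() => {
    void redis.xadd(streamId, "*", "type", "heartbeat", "ts", Date.now().toString());
  }, 5_000);

  try {
    await redis.xadd(streamId, "*", "type", "status", "msg", "scraping pages");
    // ... long-running work, emitting status/output entries as it goes ...
    await redis.xadd(streamId, "*", "type", "output", "chunk", "...");
    await redis.xadd(streamId, "*", "type", "status", "msg", "done");
  } finally {
    clearInterval(heartbeat);
  }
}
```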

This separation offers several benefits:

  • Independent Scaling: The API and compute layers can scale independently [00:11:00].
  • Pluggable Compute: The compute layer can be stateless and pluggable, allowing users to bring their own compute [00:11:07].
  • Resumability: Since the API layer only reads from the Redis stream (not directly from compute), UIs can be built to support refreshing the page, navigating away, and transparently handling errors while maintaining the full history of status messages and output [00:11:27]. This ensures no work is lost even if the user or browser connection is terminated [00:11:59].
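
On the read side, resumability falls out of Redis stream semantics: a reconnecting client replays history from entry ID `0` (or from its last-seen ID) instead of depending on a live connection to the compute layer. A sketch, again with `ioredis` and the same assumed field layout:

```typescript
import Redis from "ioredis";

const redis = new Redis(process.env.REDIS_URL ?? "redis://localhost:6379");

// Replay the full history, then block for new entries. Passing the
// last-seen entry ID on reconnect means a page refresh loses nothing.
async function followWorkflow(streamId: string, lastSeenId = "0") {
  let cursor = lastSeenId;
  for (;;) {
    // BLOCK 10000: wait up to 10s for entries newer than the cursor.
    const results = await redis.xread("BLOCK", 10_000, "STREAMS", streamId, cursor);
    if (!results) continue; // timed out; poll again
    for (const [, entries] of results) {
      for (const [id, fields] of entries) {
        cursor = id; // persist client-side to resume later
        console.log(id, fields);
      }
    }
  }
}
```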

Key Considerations for Building Resilient AI Workflows

When developing AI agents and agentic workflows that involve longrunning processes, consider the following:

  • Start Simple, Plan for Long-Running: Begin with simple workflows but design the infrastructure with the future in mind, anticipating that agents will increasingly handle complex, extended tasks [00:12:13].
  • Separate Compute and API Planes: Keep these layers distinct and leverage Redis streams for resumability [00:12:39].
  • User Experience: Make it easy for users to navigate away without losing progress and handle errors transparently [00:12:44].
  • Deployment Care: When workflows can run for 60 minutes, meticulous attention must be paid to deployment patterns like draining workers and blue-green deployments to avoid disruptions [00:12:53] (see the drain sketch after this list).
  • Complexity: While easy to prototype, long-running AI workflows are challenging to get right in production [00:13:06].
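
As a hedged sketch of the worker-draining point (the `inFlight` bookkeeping is illustrative, not a specific platform feature), a worker can refuse new workflows on SIGTERM and let in-flight executions finish before exiting:

```typescript
// Track in-flight workflow executions so deploys can drain them.
const inFlight = new Set<Promise<unknown>>();
let draining = false;

function track<T>(p: Promise<T>): Promise<T> {
  inFlight.add(p);
  // Remove when settled; the trailing catch keeps this bookkeeping
  // chain from surfacing an unhandled rejection.
  p.finally(() => inFlight.delete(p)).catch(() => {});
  return p;
}

process.on("SIGTERM", async () => {
  // Blue-green style: the new version is already serving traffic; this
  // instance just refuses new work and finishes what it has. The
  // orchestrator's grace period must exceed the longest workflow.
  draining = true;
  await Promise.allSettled([...inFlight]);
  process.exit(0);
});

function acceptWorkflow(run: () => Promise<void>) {
  if (draining) throw new Error("draining: route to the new deployment");
  return track(run());
}
```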

Cost and Latency Optimization in AI Deployments

While not the primary focus, the design choices that support long-running workflows, such as independently scaling compute, per-component retries and caching, and careful rate-limit management, also contribute to cost and latency optimization by ensuring efficient resource use and improving overall system reliability.

For those not wishing to build this infrastructure from scratch, open-source libraries like Genisys implement many of these architectural patterns [00:13:13].