From: aidotengineer

The evolution of AI applications has significantly impacted infrastructure requirements, particularly concerning the compute layer [00:00:06]. While traditional web services focused on millisecond response times, AI applications often require runtimes of several seconds or even minutes, leading to new infrastructure challenges [00:01:39].

Evolution of AI Applications and Infrastructure Needs

Initially, AI engineers develop prototypes with simple prompts and tool calls [00:00:31]. However, these prototypes quickly evolve into complex workflows, as non-deterministic agent behavior is made more deterministic [00:00:42]. That means chaining specific prompts and carefully controlling context, which extends workflow runtimes from roughly 30 seconds to several minutes [00:00:56]. Ultimately, the AI engineer's job shifts toward data engineering, because the hard part becomes gathering and ingesting the right context for each prompt; this may involve crawling user inboxes, ingesting code from GitHub, or processing numerous other artifacts that require extensive LLM processing [00:01:05].

Traditional Web 2.0 infrastructure, designed for API requests that hit a database and return a response in tens of milliseconds, is not a good fit for modern AI applications [00:01:28]. Even with fast models and prompt caches, AI applications typically see P1 latencies of a couple of seconds [00:01:39]. LLM providers are also still unreliable, with frequent outages and rate limits, which further complicates deployment [00:02:06]. Bursty traffic patterns, common during batch processing or new-customer onboarding, make rate limits worse, and reaching the high traffic tiers needed just to experiment is expensive [00:02:55].

Challenges with Traditional Serverless Platforms for AI

While long-running workflows and data engineering have existing tools like SQS, Airflow, or Temporal, these are often designed for different paradigms [00:03:38]. For full-stack developers accustomed to serverless environments, existing serverless providers pose several limitations for long-running AI workflows:

  • Timeouts: Most time out after 5 minutes [00:04:08].
  • HTTP Request Limitations: Some limit outgoing HTTP requests [00:04:11].
  • Lack of Native Streaming Support: Streaming must be bolted on at the application layer, not natively supported by the infrastructure [00:04:18].
  • No Resumability: Users lose context if they refresh the page or navigate away during a multi-minute process [00:04:29]. This is critical for product experiences like data ingestion during onboarding or multi-step content generation, where waiting 10 minutes for completion can increase user fall-off [00:05:25].

Key Requirements for Serverless AI Infrastructure

For effective AI deployments, especially with agentic workflows, the infrastructure needs to support:

  • Long-running processes: Workflows can take minutes or even hours [00:06:06].
  • Resumability: Users should not lose progress if they leave or refresh the page [00:06:08].
  • Streaming: Both final content and intermediate status updates need to be streamed to keep users engaged [00:06:52] (see the sketch after this list).
  • Transparent Error Handling: Errors should be handled gracefully without forcing users to restart long processes [00:07:04].
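
As a concrete illustration of the event shape these streaming and status requirements imply, here is a minimal TypeScript sketch; WorkflowEvent, runWorkflow, gatherContext, and callModel are hypothetical names, not part of any specific framework.

```typescript
// Minimal sketch: a long-running workflow that yields intermediate status
// events and streamed content chunks so the UI stays engaged.
// WorkflowEvent, runWorkflow, gatherContext, and callModel are all hypothetical.

type WorkflowEvent =
  | { type: "status"; message: string }
  | { type: "chunk"; content: string }
  | { type: "done" };

async function* runWorkflow(input: string): AsyncGenerator<WorkflowEvent> {
  yield { type: "status", message: "Gathering context..." };
  const context = await gatherContext(input); // e.g. crawl an inbox, ingest a repo

  yield { type: "status", message: "Generating output..." };
  for await (const chunk of callModel(context)) {
    yield { type: "chunk", content: chunk }; // forward tokens as they arrive
  }
  yield { type: "done" };
}

// The consumer (an API route or worker) forwards every event to the client,
// so users see progress instead of a blank spinner for several minutes.
async function consume() {
  for await (const event of runWorkflow("onboard new customer")) {
    console.log(event);
  }
}

// Stub implementations so the sketch is self-contained.
async function gatherContext(input: string): Promise<string> {
  return `context for ${input}`;
}
async function* callModel(context: string): AsyncGenerator<string> {
  for (const token of ["summary", " of ", context]) yield token;
}

consume();
```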

Architectural Solution for Agentic Workflows

A tailored serverless platform for long-running workflows and agentic UIs addresses these challenges by separating the API and compute layers [00:10:04].

Separate API and Compute Layers

  • Independent Scaling: The API layer can scale independently of the compute layer [00:11:00].
  • Pluggable Compute: The compute layer can be pluggable, allowing users to bring their own compute on top of existing solutions like ECS [00:11:08].
  • Redis Stream Communication: The API layer invokes the compute layer and passes a Redis stream ID [00:10:23]. All subsequent status and output are communicated via this Redis stream [00:10:32].
  • Heartbeats: The sandbox program emits heartbeats, allowing background processes to monitor whether a workflow is still running and to trigger restarts or notifications if it crashes [00:10:41]. A sketch of this handoff follows this list.
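
A minimal sketch of this handoff, assuming the ioredis client; invokeCompute, doWork, and the stream key format are illustrative placeholders rather than a real platform API.

```typescript
// Minimal sketch of the API/compute split, assuming the ioredis client.
// invokeCompute, doWork, and the stream key format are illustrative placeholders.
import Redis from "ioredis";
import { randomUUID } from "crypto";

const redis = new Redis();

// API layer: allocate a stream ID, hand it to the compute layer, return immediately.
async function startWorkflow(input: string): Promise<string> {
  const streamId = `workflow:${randomUUID()}`;
  await invokeCompute({ input, streamId }); // e.g. kick off a task on ECS
  return streamId;                          // clients read status/output by this ID
}

// Compute layer: write status, output, and heartbeats to the Redis stream.
async function runOnCompute({ input, streamId }: { input: string; streamId: string }) {
  const heartbeat = setInterval(
    () => redis.xadd(streamId, "*", "type", "heartbeat"),
    5_000 // a background monitor restarts the workflow if heartbeats stop
  );
  try {
    await redis.xadd(streamId, "*", "type", "status", "message", "started");
    const result = await doWork(input); // the long-running agentic workflow
    await redis.xadd(streamId, "*", "type", "output", "content", result);
    await redis.xadd(streamId, "*", "type", "done");
  } finally {
    clearInterval(heartbeat);
  }
}

// Placeholders so the sketch stands alone.
async function invokeCompute(msg: { input: string; streamId: string }) {
  await runOnCompute(msg);
}
async function doWork(input: string): Promise<string> {
  return `processed ${input}`;
}

startWorkflow("example input").then((id) => console.log("stream:", id));
```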

Resumability via Redis Streams

  • The API layer reads directly from the Redis stream rather than from the compute layer [00:11:29].
  • This enables UIs where users can refresh the page or navigate away and still retrieve the full history of status messages and output without losing any work [00:11:34], as sketched below.
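
A minimal sketch of that replay-then-follow pattern, again assuming ioredis; the stream ID and the field conventions follow the previous sketch and are hypothetical.

```typescript
// Minimal sketch of replay-then-follow reads from the stream, assuming ioredis.
// The stream ID and the "type"/"done" field convention follow the previous sketch.
import Redis from "ioredis";

const redis = new Redis();

async function replayAndFollow(streamId: string, onEvent: (fields: string[]) => void) {
  // 1. Replay everything written so far, so a returning user gets the full history.
  const history = await redis.xrange(streamId, "-", "+");
  let lastId = "0-0";
  for (const [id, fields] of history) {
    onEvent(fields);
    lastId = id;
  }

  // 2. Block for new entries after the last seen ID until the workflow signals completion.
  for (;;) {
    const batch = await redis.xread("BLOCK", 15_000, "STREAMS", streamId, lastId);
    if (!batch) continue; // read timed out; keep waiting
    for (const [, entries] of batch) {
      for (const [id, fields] of entries) {
        onEvent(fields);
        lastId = id;
        if (fields.includes("done")) return;
      }
    }
  }
}

replayAndFollow("workflow:1234", (fields) => console.log(fields));
```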

Component Model for AI Workflows

A framework designed for these needs often utilizes an “infra-aware” component model [00:07:32]. This model provides:

  • Reusable Components: Independent, idempotent, and testable steps [00:08:07], for example an OpenAI SDK call wrapped with retries and tracing [00:08:18] (see the sketch after this list).
  • Workflows: Collections of components that run together [00:08:11].
  • Automatic REST APIs: Workflows can be automatically exposed as REST APIs supporting synchronous and asynchronous invocation, with APIs to retrieve intermediate and final output streams [00:09:11].
  • Retry and Error Boundaries: Every component includes built-in retry mechanisms and error boundaries, along with comprehensive tracing for debugging [00:08:57].
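
The following sketch is illustrative only, not the framework's actual API; it shows the general shape of a retryable component wrapping the OpenAI SDK and a workflow that composes it.

```typescript
// Illustrative sketch only; this is not the framework's actual API. It shows the
// shape of a reusable, retryable component and a workflow that composes it.
import OpenAI from "openai";

const openai = new OpenAI(); // reads OPENAI_API_KEY from the environment

// A "component": an independent, idempotent step with built-in retries and
// a stand-in for tracing.
function withRetry<I, O>(name: string, fn: (input: I) => Promise<O>, attempts = 3) {
  return async (input: I): Promise<O> => {
    for (let attempt = 1; ; attempt++) {
      try {
        console.log(`[trace] ${name} attempt ${attempt}`);
        return await fn(input);
      } catch (err) {
        if (attempt >= attempts) throw err; // error boundary: surface after final retry
      }
    }
  };
}

// A wrapped OpenAI SDK call, reusable across workflows.
const summarize = withRetry("summarize", async (text: string) => {
  const res = await openai.chat.completions.create({
    model: "gpt-4o-mini",
    messages: [{ role: "user", content: `Summarize:\n${text}` }],
  });
  return res.choices[0].message.content ?? "";
});

// A "workflow": components chained together. A platform can expose a function like
// this as a REST API with sync/async invocation and stream its intermediate output.
async function ingestAndSummarize(doc: string): Promise<string> {
  const cleaned = doc.trim(); // another (trivial) component could live here
  return summarize(cleaned);
}

ingestAndSummarize("Example document text.").then(console.log);
```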

Lessons for Building AI Infrastructure

When developing infrastructure for agentic workflows:

  • Start Simple: Don’t over-engineer for hour-long workflows on day one [00:12:13].
  • Plan for Long-Running Processes: Anticipate that agents will increasingly perform extended, independent work [00:12:19].
  • Separate Compute and API Planes: This allows for independent scaling and flexibility [00:12:39].
  • Leverage Redis Streams for Resumability: Make it easy for users to navigate away without losing progress [00:12:41].
  • Careful Deployment: For long-running workflows, pay close attention to worker draining and blue/green deployment patterns [00:12:54]; a minimal draining sketch follows this list.
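
As one possible shape for that draining behavior, here is a minimal Node.js sketch; nextJob and executeWorkflow are hypothetical placeholders for pulling and running workflow jobs.

```typescript
// Minimal sketch of worker draining during a deploy, assuming a Node.js worker
// that polls a queue. nextJob and executeWorkflow are hypothetical placeholders.
const inFlight = new Set<Promise<void>>();
let draining = false;

async function pollForWork() {
  while (!draining) {
    const job = await nextJob(); // e.g. pop from a queue
    if (!job) {
      await new Promise((r) => setTimeout(r, 1_000));
      continue;
    }
    const run = executeWorkflow(job).finally(() => inFlight.delete(run));
    inFlight.add(run);
  }
}

// On SIGTERM (sent to old workers in a blue/green rollout), stop taking new work,
// let in-flight workflows finish, then exit.
process.on("SIGTERM", async () => {
  draining = true;
  await Promise.allSettled([...inFlight]);
  process.exit(0);
});

// Placeholders so the sketch stands alone.
async function nextJob(): Promise<string | null> {
  return null;
}
async function executeWorkflow(job: string): Promise<void> {
  console.log("running workflow for", job);
}

pollForWork();
```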

While building this infrastructure can be complex, open-source libraries like Genisx aim to simplify the process for developers [00:13:13].