From: aidotengineer
When agents are ready for a production setting, several considerations arise, particularly concerning human involvement and system resilience [00:00:06]. Key challenges include:
- The need for human approval during agent processing steps [00:00:12].
- Ensuring agents can run for long periods of time in the presence of failures [00:00:18].
- Addressing issues with scaling AI agents in production and distributed environments [00:02:48].
Challenges with Agents in Production
Human in the Loop
A significant concern for agent automation is where human oversight fits into agent processing [00:01:18]. For users to be comfortable with agents, key aspects of their execution require human approval [00:01:26]. This is particularly relevant for high-value or high-risk tasks, such as transferring money or deleting an account [00:01:51]. When an agent reaches such a point, it needs a mechanism to involve a human for final determination [00:01:58].
Long-Running Agents and Failure Tolerance
Many agents are designed to be long-running workflows in AI deployment [00:02:12]. The more steps involved in agentic processing, the longer a process runs, and the greater the chance of failure [00:02:17]. To prevent the loss of significant work, there’s a need to checkpoint agent state, allowing for resumption from a specific point rather than restarting from the beginning [00:02:24]. This is crucial for sophisticated agents performing complex tasks or extensive research across multiple systems [00:05:17].
Distributed Environments
Increasingly, agents will operate in distributed environments rather than just on a single desktop [00:02:40]. This raises additional considerations for scaling AI agents and executing them reliably in production [00:02:48].
Agent Loop Persistence
Standard agent execution often involves an LLM-tool loop [00:03:01]. This loop code must run continuously on a physical machine, whether in the cloud or on a desktop, to interact with the user or third-party channels like Slack [00:06:56]. Most existing frameworks require this agent loop to persist, even when waiting for human input [00:07:23]. This continuous running adds overhead and limits flexibility.
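The LLM-tool loop described above can be sketched roughly as follows. This is a minimal illustration with stubbed `call_llm` and `run_tool` functions standing in for real LLM inference and tool dispatch (they are not part of any actual framework); the point is that the loop itself must stay alive, even while idle, until a final answer arrives.

```python
# Minimal sketch of the standard LLM-tool loop. The stubs below are
# hypothetical stand-ins for a real LLM call and a real tool dispatcher.

def call_llm(messages):
    # Stub: a real implementation would send the messages array to an LLM.
    if any(m["role"] == "tool" for m in messages):
        return {"role": "assistant", "content": "done"}
    return {"role": "assistant", "tool_call": {"name": "lookup", "args": {"q": "x"}}}

def run_tool(tool_call):
    # Stub: a real implementation would execute the named tool.
    return {"role": "tool", "name": tool_call["name"], "content": "result"}

def agent_loop(prompt):
    # The loop runs continuously until the LLM stops requesting tools.
    messages = [{"role": "user", "content": prompt}]
    while True:
        reply = call_llm(messages)
        messages.append(reply)
        if "tool_call" not in reply:
            return messages  # final answer; only now can the loop end
        messages.append(run_tool(reply["tool_call"]))

final = agent_loop("look something up")
```

Note that if a human approval step were inserted mid-loop, this process would have to keep running while it waits, which is exactly the overhead the talk identifies.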
Introducing Agent Continuations
To address these challenges, Snaplogic has developed a new mechanism called agent continuations [00:00:44].
Definition and Inspiration
Agent continuations allow for capturing the full state of complex agents [00:00:50]. This state can be used for:
- Arbitrary human-in-the-loop processing [00:00:56].
- Providing a basis for reliable agent continuation through snapshots [00:01:02].
The concept is inspired by “continuations” from programming language theory, which allow stopping program execution, bundling up its state, and resuming from that point later [00:08:14].
Core Concepts and State Management
A key insight for agent continuations is that LLM interactions already maintain a “messages array,” which acts as a log of all interactions and is replayed to the LLM for its next inference [00:10:04]. This messages array already saves a significant portion of the agent’s state [00:11:00].
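Because the messages array is a complete log that gets replayed on every inference, serializing it is already most of a checkpoint. A small sketch (the roles and fields shown are illustrative, not a prescribed schema):

```python
import json

# The messages array is a log of the whole interaction, so serializing it
# captures much of the agent's state. Field names here are illustrative.
messages = [
    {"role": "user", "content": "Create an account for Kim"},
    {"role": "assistant", "tool_call": {"name": "create_account", "args": {"user": "kim"}}},
    {"role": "tool", "name": "create_account", "content": "ok"},
]

checkpoint = json.dumps(messages)   # persist anywhere: disk, database, queue
restored = json.loads(checkpoint)   # replay to the LLM when resuming
```

The continuation object builds on this by wrapping the messages array with the extra metadata needed to resume at an exact point.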
Using Agent Continuations
Standard Agent Execution
In a typical agent framework, tools are defined (e.g., using a decorator for Python functions), and an agent is instantiated with a list of these tools [00:11:33]. A user prompt is sent to the agent, which then performs LLM requests and tool calls to generate a response [00:12:02].
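The pattern above might look like the following in code. The `tool` decorator and `Agent` class are hypothetical names that mirror the description, not SnapLogic's actual interface:

```python
# Hypothetical framework API mirroring the pattern described above.
# `tool` and `Agent` are illustrative names, not a real framework's interface.

TOOLS = {}

def tool(fn):
    # Decorator that registers a plain Python function as an agent tool.
    TOOLS[fn.__name__] = fn
    return fn

@tool
def send_email(to: str, body: str) -> str:
    return f"emailed {to}"

class Agent:
    def __init__(self, tools):
        self.tools = tools

    def run(self, prompt):
        # A real agent would loop over LLM requests and tool calls;
        # this sketch invokes a single registered tool directly.
        return self.tools["send_email"]("kim@example.com", prompt)

agent = Agent(TOOLS)
result = agent.run("Welcome aboard!")
```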
Continuation-Enabled Execution
With continuation support, a tool can be designated as “needing approval” [00:12:37]. Instead of a standard agent, a continuation agent class is used [00:12:50].
When the agent needs to suspend (e.g., for human approval or another condition), the response becomes a continuation object [00:13:16]. This object contains metadata indicating the reason for suspension [00:13:24]. The application layer can then inspect this object, provide input (e.g., human approval), and send the updated continuation object back to the agent [00:13:54]. The agent, recognizing it’s a continuation object, will resume execution from the suspended point [00:14:09].
Crucially, once the agent is suspended and the continuation object is created, the agent loops do not need to keep running; they can be shut down [00:16:54]. The captured information in the continuation object is sufficient to restart everything where it left off [00:17:04].
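The suspend/approve/resume flow described above can be sketched as follows. The object shapes and field names (`continuation`, `resume_request`, `processed`) are assumptions loosely based on the description, not the exact format:

```python
# Sketch of the suspend/approve/resume flow. The dictionary shapes here are
# illustrative assumptions, not the actual continuation object format.

def continuation_agent(input_):
    if isinstance(input_, dict) and input_.get("processed") == "approved":
        # Resuming: replay the saved state and finish the suspended tool call.
        return {"result": "account authorized"}
    # A fresh prompt reaches a tool that needs approval, so instead of
    # calling it, the agent suspends and returns a continuation object.
    return {
        "continuation": True,
        "resume_request": {"tool": "authorize_account", "args": {"user": "kim"}},
        "messages": [{"role": "user", "content": input_}],
    }

resp = continuation_agent("authorize kim's account")
if resp.get("continuation"):        # application layer inspects the object
    resp["processed"] = "approved"  # a human grants approval
    resp = continuation_agent(resp) # send the object back to resume
```

Between the two calls, no agent process needs to be running: the continuation object alone carries the state.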
Structure of the Continuation Object
A continuation object typically wraps the standard messages array and includes additional metadata:
- Resume Request: Indicates the exact tool call or point to resume from [00:17:39].
- Processed: Populated with the outcome of the suspension, such as approval or disapproval [00:17:50].
For complex, multi-level agents with sub-agents, the continuation object format is recursive, allowing for arbitrary layers of nesting [00:18:20]. This means a nested resume request can contain its own continuations object with its own messages and resume details, preserving the full state of the sub-agent [00:18:34].
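A two-level version of this recursive structure might look like the following. The exact schema is an assumption based on the description; the helper shows how an application could walk inward to the suspended call at arbitrary depth:

```python
# Illustrative nested continuation object for a two-level agent. Field names
# follow the description above, but the exact schema is an assumption.
continuation = {
    "messages": [{"role": "user", "content": "create an account for Kim"}],
    "resume_request": {
        "tool": "account_agent",
        "continuation": {  # the sub-agent's own continuation object
            "messages": [{"role": "assistant", "tool_call": "authorize_account"}],
            "resume_request": {"tool": "authorize_account"},
            "processed": None,  # later filled in with the approval outcome
        },
    },
}

def innermost(c):
    # Recurse through nested continuations to the innermost suspended call.
    sub = c["resume_request"].get("continuation")
    return innermost(sub) if sub else c["resume_request"]
```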
Example: Multi-level HR Agent
Consider an HR agent that uses an email tool and an account agent sub-tool responsible for creating accounts and setting privileges [00:18:50]. The account agent, in turn, has create account and authorize account tools, with authorize account requiring approval [00:19:20].
When a user prompts the HR agent to create a new account, the process flows until the sub-agent’s authorize account tool is invoked [00:19:42]. At this point, human approval is needed, causing the agent to suspend and create a nested continuation object [00:19:55]. This object propagates back to the application layer, expanding at each level until the full state is available for inspection and action [00:21:00]. Once the application layer provides the approval, the continuation object is sent back to the HR agent, and the framework restores the agent’s and sub-agent’s states to continue processing [00:21:20].
Implementation and Future Directions
The prototype implementation is built on the OpenAI Python API with no other dependencies [00:24:12]. It is available on GitHub [00:24:20].
Future work aims to implement more general agent suspension beyond just human approval, allowing for arbitrary suspension points based on time, turns, or asynchronous requests [00:24:27]. The goal is not to develop a new, separate agent framework, but to extend existing ones like Strands or Pydantic AI with continuation capabilities [00:24:50].
While other frameworks offer forms of state management, they often lack the explicit human approval element or the sophistication to handle arbitrary depths of nested sub-agents [00:25:03]. The agent continuations approach is novel in combining both a human approval mechanism and arbitrary nesting for complex agents [00:25:36].
This work originated from the Agent Creator research group at Snaplogic, which provides a visual agent building interface and platform [00:25:48]. The continuations were prototyped both at the Python layer and within the higher-level Snaplogic Agent Creator environment [00:26:17].
Conclusion
Agent continuations offer a new mechanism for managing agent state and human-in-the-loop processing for long-running workflows in AI deployment [00:26:34].