From: aidotengineer
When agents are ready for a production setting, several considerations arise, particularly concerning human involvement and system resilience [00:00:06]. Key challenges include:
- The need for human approval during agent processing steps [00:00:12].
- Ensuring agents can run for long periods of time in the presence of failures [00:00:18].
- Addressing issues with scaling AI agents in production and distributed environments [00:02:48].
Challenges with Agents in Production
Human in the Loop
A significant concern for agent automation is where human oversight fits into agent processing [00:01:18]. For users to be comfortable with agents, key aspects of their execution require human approval [00:01:26]. This is particularly relevant for high-value or high-risk tasks, such as transferring money or deleting an account [00:01:51]. When an agent reaches such a point, it needs a mechanism to involve a human for final determination [00:01:58].
Long-Running Agents and Failure Tolerance
Many agents are designed to be long-running workflows in AI deployment [00:02:12]. The more steps involved in agentic processing, the longer a process runs, and the greater the chance of failure [00:02:17]. To prevent the loss of significant work, there’s a need to checkpoint agent state, allowing for resumption from a specific point rather than restarting from the beginning [00:02:24]. This is crucial for sophisticated agents performing complex tasks or extensive research across multiple systems [00:05:17].
Distributed Environments
Increasingly, agents will operate in distributed environments rather than just on a single desktop [00:02:40]. This raises additional considerations for scaling AI agents and executing them reliably in production [00:02:48].
Agent Loop Persistence
Standard agent execution often involves an LLM-tool loop [00:03:01]. This loop code must run continuously on a physical machine, whether in the cloud or on a desktop, to interact with the user or third-party channels like Slack [00:06:56]. Most existing frameworks require this agent loop to persist, even when waiting for human input [00:07:23]. This continuous running adds overhead and limits flexibility.
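The LLM-tool loop described above can be sketched roughly as follows. This is a minimal illustration with stubbed `call_llm` and `run_tool` functions standing in for real LLM inference and tool dispatch (they are not part of any actual framework); the point is that the loop itself must stay alive, even while idle, until a final answer arrives.

```python
# Minimal sketch of the standard LLM-tool loop. The stubs below are
# hypothetical stand-ins for a real LLM call and a real tool dispatcher.

def call_llm(messages):
    # Stub: a real implementation would send the messages array to an LLM.
    if any(m["role"] == "tool" for m in messages):
        return {"role": "assistant", "content": "done"}
    return {"role": "assistant", "tool_call": {"name": "lookup", "args": {"q": "x"}}}

def run_tool(tool_call):
    # Stub: a real implementation would execute the named tool.
    return {"role": "tool", "name": tool_call["name"], "content": "result"}

def agent_loop(prompt):
    # The loop runs continuously until the LLM stops requesting tools.
    messages = [{"role": "user", "content": prompt}]
    while True:
        reply = call_llm(messages)
        messages.append(reply)
        if "tool_call" not in reply:
            return messages  # final answer; only now can the loop end
        messages.append(run_tool(reply["tool_call"]))

final = agent_loop("look something up")
```

Note that if a human approval step were inserted mid-loop, this process would have to keep running while it waits, which is exactly the overhead the talk identifies.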
Introducing Agent Continuations
To address these challenges, Snaplogic has developed a new mechanism called agent continuations [00:00:44].
Definition and Inspiration
Agent continuations allow for capturing the full state of complex agents [00:00:50]. This state can be used for:
- Arbitrary human-in-the-loop processing [00:00:56].
- Providing a basis for reliable agent continuation through snapshots [00:01:02].
The concept is inspired by “continuations” from programming language theory, which allow stopping program execution, bundling up its state, and resuming from that point later [00:08:14].
Core Concepts and State Management
A key insight for agent continuations is that LLM interactions already maintain a “messages array,” which acts as a log of all interactions and is replayed to the LLM for its next inference [00:10:04]. This messages array already saves a significant portion of the agent’s state [00:11:00].
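Because the messages array is a complete log that gets replayed on every inference, serializing it is already most of a checkpoint. A small sketch (the roles and fields shown are illustrative, not a prescribed schema):

```python
import json

# The messages array is a log of the whole interaction, so serializing it
# captures much of the agent's state. Field names here are illustrative.
messages = [
    {"role": "user", "content": "Create an account for Kim"},
    {"role": "assistant", "tool_call": {"name": "create_account", "args": {"user": "kim"}}},
    {"role": "tool", "name": "create_account", "content": "ok"},
]

checkpoint = json.dumps(messages)   # persist anywhere: disk, database, queue
restored = json.loads(checkpoint)   # replay to the LLM when resuming
```

The continuation object builds on this by wrapping the messages array with the extra metadata needed to resume at an exact point.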
Using Agent Continuations
Standard Agent Execution
In a typical agent framework, tools are defined (e.g., using a decorator for Python functions), and an agent is instantiated with a list of these tools [00:11:33]. A user prompt is sent to the agent, which then performs LLM requests and tool calls to generate a response [00:12:02].
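The pattern above might look like the following in code. The `tool` decorator and `Agent` class are hypothetical names that mirror the description, not SnapLogic's actual interface:

```python
# Hypothetical framework API mirroring the pattern described above.
# `tool` and `Agent` are illustrative names, not a real framework's interface.

TOOLS = {}

def tool(fn):
    # Decorator that registers a plain Python function as an agent tool.
    TOOLS[fn.__name__] = fn
    return fn

@tool
def send_email(to: str, body: str) -> str:
    return f"emailed {to}"

class Agent:
    def __init__(self, tools):
        self.tools = tools

    def run(self, prompt):
        # A real agent would loop over LLM requests and tool calls;
        # this sketch invokes a single registered tool directly.
        return self.tools["send_email"]("kim@example.com", prompt)

agent = Agent(TOOLS)
result = agent.run("Welcome aboard!")
```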
Continuation-Enabled Execution
With continuation support, a tool can be designated as “needing approval” [00:12:37]. Instead of a standard agent, a continuation agent class is used [00:12:50].
When the agent needs to suspend (e.g., for human approval or another condition), the response becomes a continuation object [00:13:16]. This object contains metadata indicating the reason for suspension [00:13:24]. The application layer can then inspect this object, provide input (e.g., human approval), and send the updated continuation object back to the agent [00:13:54]. The agent, recognizing it’s a continuation object, will resume execution from the suspended point [00:14:09].
Crucially, once the agent is suspended and the continuation object is created, the agent loops do not need to keep running; they can be shut down [00:16:54]. The captured information in the continuation object is sufficient to restart everything where it left off [00:17:04].
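The suspend/approve/resume flow described above can be sketched as follows. The object shapes and field names (`continuation`, `resume_request`, `processed`) are assumptions loosely based on the description, not the exact format:

```python
# Sketch of the suspend/approve/resume flow. The dictionary shapes here are
# illustrative assumptions, not the actual continuation object format.

def continuation_agent(input_):
    if isinstance(input_, dict) and input_.get("processed") == "approved":
        # Resuming: replay the saved state and finish the suspended tool call.
        return {"result": "account authorized"}
    # A fresh prompt reaches a tool that needs approval, so instead of
    # calling it, the agent suspends and returns a continuation object.
    return {
        "continuation": True,
        "resume_request": {"tool": "authorize_account", "args": {"user": "kim"}},
        "messages": [{"role": "user", "content": input_}],
    }

resp = continuation_agent("authorize kim's account")
if resp.get("continuation"):        # application layer inspects the object
    resp["processed"] = "approved"  # a human grants approval
    resp = continuation_agent(resp) # send the object back to resume
```

Between the two calls, no agent process needs to be running: the continuation object alone carries the state.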
Structure of the Continuation Object
A continuation object typically wraps the standard messages array and includes additional metadata:
- Resume Request: Indicates the exact tool call or point to resume from [00:17:39].
- Processed: Populated with the outcome of the suspension, such as approval or disapproval [00:17:50].
For complex, multi-level agents with sub-agents, the continuation object format is recursive, allowing for arbitrary layers of nesting [00:18:20]. This means a nested resume request can contain its own continuations object with its own messages and resume details, preserving the full state of the sub-agent [00:18:34].
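A two-level version of this recursive structure might look like the following. The exact schema is an assumption based on the description; the helper shows how an application could walk inward to the suspended call at arbitrary depth:

```python
# Illustrative nested continuation object for a two-level agent. Field names
# follow the description above, but the exact schema is an assumption.
continuation = {
    "messages": [{"role": "user", "content": "create an account for Kim"}],
    "resume_request": {
        "tool": "account_agent",
        "continuation": {  # the sub-agent's own continuation object
            "messages": [{"role": "assistant", "tool_call": "authorize_account"}],
            "resume_request": {"tool": "authorize_account"},
            "processed": None,  # later filled in with the approval outcome
        },
    },
}

def innermost(c):
    # Recurse through nested continuations to the innermost suspended call.
    sub = c["resume_request"].get("continuation")
    return innermost(sub) if sub else c["resume_request"]
```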
Example: Multi-level HR Agent
Consider an HR agent that uses an email tool and an account agent sub-tool responsible for creating accounts and setting privileges [00:18:50]. The account agent, in turn, has create account and authorize account tools, with authorize account requiring approval [00:19:20].
When a user prompts the HR agent to create a new account, the process flows until the sub-agent’s authorize account tool is invoked [00:19:42]. At this point, human approval is needed, causing the agent to suspend and create a nested continuation object [00:19:55]. This object propagates back to the application layer, expanding at each level until the full state is available for inspection and action [00:21:00]. Once the application layer provides the approval, the continuation object is sent back to the HR agent, and the framework restores the agent’s and sub-agent’s states to continue processing [00:21:20].
Implementation and Future Directions
The prototype implementation is built on the OpenAI Python API with no other dependencies [00:24:12]. It is available on GitHub [00:24:20].
Future work aims to implement more general agent suspension beyond just human approval, allowing for arbitrary suspension points based on time, turns, or asynchronous requests [00:24:27]. The goal is not to develop a new, separate agent framework, but to extend existing ones like Strands or Pydantic AI with continuation capabilities [00:24:50].
While other frameworks offer forms of state management, they often lack the explicit human approval element or the sophistication to handle arbitrary depths of nested sub-agents [00:25:03]. The agent continuations approach is novel in combining both a human approval mechanism and arbitrary nesting for complex agents [00:25:36].
This work originated from the Agent Creator research group at Snaplogic, which provides a visual agent building interface and platform [00:25:48]. The continuations were prototyped both at the Python layer and within the higher-level Snaplogic Agent Creator environment [00:26:17].
Conclusion
Agent continuations offer a new mechanism for managing agent state and human-in-the-loop processing for long-running workflows in AI deployment [00:26:34].