From: aidotengineer
Snapshotting is a crucial feature in AI sandboxes that enables AI agents to manage multi-step workflows and recover from failures [00:02:57]. Arachis, an open-source code execution and computer use sandboxing service, offers out-of-the-box support for this functionality [00:02:53].
Why Snapshotting is Needed
AI agents often struggle with large, complex tasks, such as creating an entire application, as they may fail partway through [00:32:39]. If a multi-stage plan fails, agents shouldn’t have to restart from scratch [00:33:08]. Snapshotting allows agents to:
- Checkpoint progress [00:04:50]
- Backtrack to the last successful state [00:02:57] [00:33:10]
- Replan and continue the workflow [00:33:12]
This capability leads to more reliable execution of complex tasks [00:05:02]. At scale, it allows agents to explore multiple paths in parallel [00:33:24].
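As a rough illustration of the checkpoint, backtrack, and replan loop, the sketch below runs a list of steps, snapshots after each success, and restores the last good snapshot when a step fails. Everything here is hypothetical: `InMemorySandboxManager`, `run_plan`, and the step functions are stand-ins that only mimic the `snapshot`/`restore` call shape of the Python SDK, and the "sandbox state" is just a Python list rather than a real VM.

```python
import copy


class InMemorySandboxManager:
    """Hypothetical stand-in for a sandbox manager; not the Arachis SDK."""

    def __init__(self):
        self.state = {}      # vm_name -> current sandbox state (a list of artifacts)
        self.snapshots = {}  # (vm_name, snapshot_id) -> saved copy of that state

    def snapshot(self, vm_name, snapshot_id):
        current = self.state.setdefault(vm_name, [])
        self.snapshots[(vm_name, snapshot_id)] = copy.deepcopy(current)

    def restore(self, vm_name, snapshot_id):
        self.state[vm_name] = copy.deepcopy(self.snapshots[(vm_name, snapshot_id)])


def run_plan(manager, vm_name, steps, max_retries=2):
    """Run steps in order; checkpoint after each success, backtrack on failure."""
    manager.snapshot(vm_name, "step-0")  # checkpoint the clean starting state
    for i, step in enumerate(steps, start=1):
        for attempt in range(max_retries + 1):
            try:
                step(manager.state[vm_name], attempt)
                manager.snapshot(vm_name, f"step-{i}")  # checkpoint progress
                break
            except RuntimeError:
                # Backtrack to the last successful state; a real agent
                # would replan here before retrying.
                manager.restore(vm_name, f"step-{i - 1}")
        else:
            raise RuntimeError(f"step {i} failed after {max_retries + 1} attempts")
    return manager.state[vm_name]


def scaffold(state, attempt):
    state.append("scaffold")


def flaky_build(state, attempt):
    # Fails on the first attempt and leaves debris that must be rolled back.
    if attempt == 0:
        state.append("half-written files")
        raise RuntimeError("build failed")
    state.append("build")


manager = InMemorySandboxManager()
final_state = run_plan(manager, "my_vm", [scaffold, flaky_build])
print(final_state)
```

The debris left by the failed first build attempt does not appear in the final state, because the retry starts from the snapshot taken after `scaffold` rather than from scratch.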
How Snapshotting Works in Arachis
Arachis leverages microVMs to provide fast snapshotting capabilities [00:21:43].
Key Components Saved
A snapshot saves the entire running state of an AI sandbox [00:33:30], including:
- Guest Memory: This includes the state of any processes spawned or even GUI windows opened [00:33:34] [00:33:51].
- File System: Specifically, the read-write layer of the overlay file system, which contains any new files created by the agent [00:25:33] [00:33:36]. The base read-only file system layer is shared and protected [00:25:24].
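To see why only the read-write layer needs to be saved, here is a toy model of an overlay file system in Python. It is a conceptual sketch, not how Arachis or the Linux overlay file system is implemented: `OverlayFS` and its methods are hypothetical names, files are plain dictionary entries, and deletions are not modeled.

```python
from types import MappingProxyType


class OverlayFS:
    """Toy overlay: a shared read-only base plus a private read-write upper layer."""

    def __init__(self, base):
        self.base = base  # shared lower layer; never modified by a sandbox
        self.upper = {}   # per-sandbox writes land here

    def read(self, path):
        # The upper layer shadows the base layer.
        if path in self.upper:
            return self.upper[path]
        return self.base[path]

    def write(self, path, data):
        self.upper[path] = data

    def snapshot(self):
        # Only the read-write layer is copied; the base is shared as-is.
        return dict(self.upper)

    @classmethod
    def restore(cls, base, saved_upper):
        fs = cls(base)
        fs.upper = dict(saved_upper)
        return fs


# MappingProxyType gives a read-only view, so the base cannot be written to.
base = MappingProxyType({"/bin/sh": "shell binary"})
fs = OverlayFS(base)
fs.write("/app/main.py", "print('hi')")
saved = fs.snapshot()

fs.write("/app/main.py", "broken edit")    # change made after the snapshot
restored = OverlayFS.restore(base, saved)  # roll back to the saved layer
print(restored.read("/app/main.py"))
```

The saved layer contains only the file the agent created, while the restored sandbox can still read everything in the shared base.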
Technical Steps for Snapshotting
The process involves four main steps [00:34:42]:
- Pause the VM: The virtual machine is temporarily paused [00:34:44].
- Dump Guest Memory: The snapshot API is called to save the entire guest memory [00:34:50]. This is more straightforward with microVMs than with containers [00:21:50].
- Persist Read-Write Layer: The stateful read-write overlay file system is manually persisted [00:34:57].
- Resume the VM: The virtual machine is resumed, continuing its operations from where it paused [00:35:08].
func SnapshotVM(vmID string, snapshotID string) error {
	// Pause the VM (obtaining the vm handle from vmID is elided here)
	// ...
	// Make sure to resume before exiting the function
	defer vm.Resume()
	// Create a copy of the stateful disk (read-write layer of overlay FS)
	// ...
	// Call snapshot API to dump guest memory
	// ...
	return nil
}
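The four steps can also be sketched end to end in Python against a fake VM. This is a minimal illustration under stated assumptions: `FakeVM`, `snapshot_vm`, and the `store` dictionary are hypothetical and only record the order of operations, with `try`/`finally` playing the role that `defer` plays in the Go outline so the VM is resumed even if a step fails.

```python
from dataclasses import dataclass, field


@dataclass
class FakeVM:
    """Hypothetical in-memory VM used only to illustrate step ordering."""
    memory: bytes = b"guest-memory"
    rw_layer: dict = field(default_factory=dict)
    paused: bool = False
    events: list = field(default_factory=list)

    def pause(self):
        self.paused = True
        self.events.append("pause")

    def resume(self):
        self.paused = False
        self.events.append("resume")


def snapshot_vm(vm, store, snapshot_id):
    """Pause, dump memory, persist the read-write layer, then resume."""
    vm.pause()
    try:
        # Copy guest memory while the VM is paused so it cannot change.
        memory_dump = bytes(vm.memory)
        vm.events.append("dump-memory")

        # Copy the read-write overlay layer.
        disk_copy = dict(vm.rw_layer)
        vm.events.append("persist-rw-layer")

        store[snapshot_id] = {"memory": memory_dump, "rw_layer": disk_copy}
    finally:
        # Always resume, even if a step above raised.
        vm.resume()


vm = FakeVM(rw_layer={"/app/index.html": "<h1>docs clone</h1>"})
store = {}
snapshot_vm(vm, store, "snapshot_id")
print(vm.events)
```

Pausing first matters because memory and disk are captured at the same instant from the guest's point of view; copying either one while the guest is still running could produce a snapshot whose two halves disagree.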
Performance
Snapshots in Arachis currently complete in single-digit seconds, and work is ongoing to reduce this time further [00:04:08].
Arachis API for Snapshotting
Arachis provides a simple REST-based API for managing snapshots [00:06:49].
Using the Python SDK, snapshotting is a single command [00:36:26]:
# Snapshot a VM
manager.snapshot(vm_name="my_vm", snapshot_id="snapshot_id")
To restore a VM from a snapshot:
# Restore a VM
manager.restore(vm_name="my_vm", snapshot_id="snapshot_id")
Demonstration
In a demonstration, an AI sandbox was used to create a Google Docs clone [00:37:28]. A snapshot of the initial working clone was taken [00:37:53], and a new “dark mode” feature was then added [00:38:03]. Restoring the earlier snapshot reverted the sandbox to its state before the change, removing dark mode without re-creating the application [00:38:19] [00:38:25].
Future Work
Ongoing work in Arachis aims to further enhance snapshotting and persistence [00:39:24]. This includes moving to Btrfs, a file system designed for incremental snapshots [00:39:27] [00:34:21]. Additionally, dynamic memory and resource management, such as hot-plugging or removing memory at runtime, is being explored so that more sandboxes can be bin-packed onto a single server [00:39:40].