Linux virtualization and container security

From: aidotengineer

Linux sandboxing is crucial for modern applications, particularly those involving AI agents and untrusted code execution [00:00:07]. This article explores the evolution of Linux sandboxing techniques, from basic execution models to advanced virtualization methods like MicroVMs, emphasizing their role in ensuring security and stability.

Linux Execution Model

The fundamental unit of execution on Linux is a thread [00:08:21]. The kernel represents each thread with a task_struct, which the scheduler places on a run queue [00:08:27]. A process is a logical construct comprising one or more threads that share page tables and other resources [00:08:42].
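The thread/process relationship can be observed from user space. The following is a minimal sketch, assuming Python 3.8+ on Linux; the helper name collect_ids is hypothetical. Each thread reports the same process ID but its own kernel thread ID.

```python
import os
import threading


def collect_ids(n_threads=3):
    """Start n_threads threads and record (pid, tid) as seen from each one."""
    results = []
    lock = threading.Lock()
    barrier = threading.Barrier(n_threads)

    def worker():
        # Wait until every thread is alive, so no kernel thread ID can be
        # reused by a thread that starts after another has already exited.
        barrier.wait()
        ids = (os.getpid(), threading.get_native_id())
        with lock:
            results.append(ids)

    threads = [threading.Thread(target=worker) for _ in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results


if __name__ == "__main__":
    for pid, tid in collect_ids():
        print(f"pid={pid} tid={tid}")
```

All lines printed share one pid, while every tid differs, which matches the description of a process as a group of threads.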

The kernel has privileged access to hardware [00:08:56]. To prevent buggy or malicious code from crashing the machine or performing harmful actions, user code cannot perform privileged operations directly; it must execute a special instruction (such as int 0x80 on 32-bit x86, or syscall on x86-64) that switches the processor into kernel (supervisor) mode [00:09:01].
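As a rough illustration of crossing into the kernel, the sketch below calls getpid through libc's generic syscall() wrapper, which issues the architecture's trap instruction on the caller's behalf. It assumes Linux on x86_64 or aarch64; the helper name raw_getpid and the small number table are illustrative, and the function returns None anywhere else.

```python
import ctypes
import os
import platform

# System call numbers are architecture specific. These two values are the
# getpid entries in the Linux syscall tables for x86_64 and aarch64.
SYS_GETPID = {"x86_64": 39, "aarch64": 172}


def raw_getpid():
    """Invoke getpid via libc's syscall() wrapper.

    Returns None on non-Linux systems or architectures not listed above.
    """
    if platform.system() != "Linux":
        return None
    number = SYS_GETPID.get(platform.machine())
    if number is None:
        return None
    libc = ctypes.CDLL(None, use_errno=True)
    libc.syscall.restype = ctypes.c_long
    return libc.syscall(ctypes.c_long(number))


if __name__ == "__main__":
    print("raw syscall:", raw_getpid(), "os.getpid():", os.getpid())
```

Both values should agree, since os.getpid() ends up asking the kernel the same question.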

Containers

Containers offer a solution for packaging an application’s dependencies along with its core logic, enabling arbitrary user code to run on a machine [00:10:13]. This is a core feature needed for an AI sandbox [00:10:26].

Technically, a container on Linux is a collection of namespaces, each abstracting a different resource, such as process IDs, mounts, and the network [00:10:32]. For example, inside a container’s PID namespace its processes appear as pid 1, pid 2, and so on, while the same processes have arbitrary PIDs in the root namespace outside the container [00:10:44]. The host can inspect a child container’s namespaces, but a container cannot look upwards into its host’s namespaces [00:11:08]. Cgroups are used alongside namespaces to limit how much of a resource, such as CPU time or memory, a specific container may consume [00:11:41].
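Both mechanisms are visible through the filesystem. The sketch below assumes a Linux host with /proc mounted and the cgroup v2 file format, where cpu.max holds a quota and a period in microseconds; the function names are hypothetical. Two processes in the same namespace report the same identifier, and a process inside a container reports different ones.

```python
import os


def list_namespaces(pid="self"):
    """Map namespace type to its identifier, e.g. 'pid' -> 'pid:[4026531836]'.

    Returns an empty dict where /proc/<pid>/ns is unavailable.
    """
    ns_dir = f"/proc/{pid}/ns"
    result = {}
    try:
        names = os.listdir(ns_dir)
    except OSError:
        return result
    for name in names:
        try:
            result[name] = os.readlink(os.path.join(ns_dir, name))
        except OSError:
            pass  # Entry not readable in this environment; skip it.
    return result


def parse_cpu_max(text):
    """Convert the content of a cgroup v2 cpu.max file into a CPU fraction.

    The file holds '<quota> <period>' in microseconds, or 'max <period>'
    when no limit is set. Returns None when the cgroup is unlimited.
    """
    quota, period = text.split()
    if quota == "max":
        return None
    return int(quota) / int(period)


if __name__ == "__main__":
    for kind, ident in sorted(list_namespaces().items()):
        print(kind, ident)
    print(parse_cpu_max("50000 100000"))  # half of one CPU
```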

Container Security

Containers run as native processes directly on top of the host kernel [00:12:20]. This means that any malicious or buggy process within a container can exploit a kernel vulnerability to gain root access on the host, opening the door to data exfiltration, full system compromise, and further attacks [00:12:33].

To mitigate these risks, techniques are used to jail containers by restricting Linux capabilities (caps) and system calls [00:13:21].

Capabilities: Linux capabilities govern which privileged operations a process can perform, allowing only necessary capabilities to be granted [00:13:38].
Seccomp (Secure Computing Mode): Filters system calls, either by checking their arguments or by blocking them entirely, further reducing the attack surface [00:14:01].
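To see which capabilities a process actually holds, the effective set can be read from the CapEff line of /proc/self/status and decoded bit by bit. This is a minimal sketch: the bit positions follow the kernel's capability.h header, the function names are hypothetical, and the mask used in the usage lines is only an illustrative value.

```python
# Capability names in bit order (bit 0 is CAP_CHOWN), per linux/capability.h.
CAP_NAMES = [
    "CAP_" + name
    for name in (
        "CHOWN DAC_OVERRIDE DAC_READ_SEARCH FOWNER FSETID KILL SETGID SETUID "
        "SETPCAP LINUX_IMMUTABLE NET_BIND_SERVICE NET_BROADCAST NET_ADMIN "
        "NET_RAW IPC_LOCK IPC_OWNER SYS_MODULE SYS_RAWIO SYS_CHROOT "
        "SYS_PTRACE SYS_PACCT SYS_ADMIN SYS_BOOT SYS_NICE SYS_RESOURCE "
        "SYS_TIME SYS_TTY_CONFIG MKNOD LEASE AUDIT_WRITE AUDIT_CONTROL "
        "SETFCAP MAC_OVERRIDE MAC_ADMIN SYSLOG WAKE_ALARM BLOCK_SUSPEND "
        "AUDIT_READ PERFMON BPF CHECKPOINT_RESTORE"
    ).split()
]


def decode_caps(mask):
    """Return the capability names whose bits are set in mask."""
    names = []
    bit = 0
    while mask:
        if mask & 1:
            # Bits newer than this table are reported by number.
            names.append(CAP_NAMES[bit] if bit < len(CAP_NAMES) else f"CAP_{bit}")
        mask >>= 1
        bit += 1
    return names


def effective_caps(status_path="/proc/self/status"):
    """Decode the CapEff line of a status file; None if it cannot be read."""
    try:
        with open(status_path) as fh:
            for line in fh:
                if line.startswith("CapEff:"):
                    return decode_caps(int(line.split()[1], 16))
    except OSError:
        pass
    return None


if __name__ == "__main__":
    restricted = decode_caps(0xA80425FB)  # illustrative restricted mask
    print(len(restricted), "CAP_SYS_ADMIN" in restricted)
    print(effective_caps())
```

A jailed container would typically show a short list here, with broad capabilities such as CAP_SYS_ADMIN absent.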

Despite these measures, sandboxing and jailing have limits, and code running in a container can potentially bypass them [00:14:31].

Virtualization on Linux

Virtualization offers a stronger primitive for running untrusted or arbitrary code [00:14:47]. Unlike containers, each Virtual Machine (VM) has its own guest user space and guest kernel, providing isolated environments [00:14:53]. This model presents a significantly smaller attack surface to the host kernel [00:15:10].

VMs access host resources via a Virtual Machine Monitor (VMM), such as QEMU, crosvm, or Firecracker [00:15:47]. The VMM communicates with /dev/kvm, a Linux kernel device that exposes the processor’s virtualization support and provides an API for spawning VMs and granting privileged resource access [00:16:01]. When a VM needs to access host resources like disk or network, it triggers a “VM exit” back to the host [00:16:56]. The VMM handles the request with the host kernel and sends the response back to the guest with a “VM resume” [00:17:13]. Minimizing VM exits and resumes is crucial for performance [00:17:22]. CPU-bound work within a guest incurs little to no penalty because guest instructions execute directly on the processor, whereas I/O-bound workloads pay a performance cost for their frequent exits [00:17:37].
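The exit and resume cycle can be modelled without any real virtualization. The sketch below is a toy simulation, not the KVM API: the guest is a Python generator, each yield stands in for a VM exit, and send() stands in for a VM resume. All names (guest_program, run_vmm, the two device handlers) are hypothetical.

```python
def guest_program():
    """Toy guest: yields a request whenever it needs a host resource."""
    total = 0
    for i in range(1000):
        total += i  # CPU-bound work: runs without involving the VMM
    data = yield ("disk_read", 0)  # I/O: control leaves the guest
    total += len(data)
    yield ("net_send", total)  # a second exit for network I/O


def run_vmm(guest):
    """Toy VMM loop: service each exit, then resume the guest."""
    handlers = {
        "disk_read": lambda sector: b"\x00" * 512,  # pretend one sector
        "net_send": lambda payload: True,           # pretend it was sent
    }
    exits = []
    try:
        request = next(guest)  # first entry into the guest
        while True:
            kind, arg = request
            exits.append(kind)                # one VM exit observed
            response = handlers[kind](arg)    # VMM does the host-side work
            request = guest.send(response)    # VM resume with the result
    except StopIteration:
        pass  # guest finished
    return exits


if __name__ == "__main__":
    print(run_vmm(guest_program()))
```

The thousand additions cost the VMM nothing, while each I/O request costs one full round trip, which is why I/O-heavy guests are the ones that feel exit overhead.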

MicroVMs for Secure AI Sandboxing

MicroVMs are a distinct evolution of virtualization that prioritizes security and speed [00:19:39]. The concept originated from the crosvm project at Chrome OS [00:18:38].

Key differences from traditional VMs include:

Rust-based VMMs: VMMs like crosvm, Firecracker, and Cloud Hypervisor are written in Rust, providing memory-safe implementations that mitigate the memory-safety bugs often found in device emulation code written in C [00:18:40]. This reduces the attack surface that untrusted guest code can use to reach the host [00:18:57].
Jailed Emulated Devices: MicroVMs jail their emulated devices separately. For instance, a block device is restricted to only block-related system calls, preventing network access if compromised, and vice versa [00:19:11].
Performance Optimization (“Micro”): MicroVMs are designed to boot rapidly and consume less memory [00:19:47]. Unlike older, general-purpose VMMs like QEMU that support many architectures and emulated devices, MicroVM VMMs typically support only one or two architectures (such as x86-64 and ARM) and the major devices, resulting in less code and fewer code paths at boot [00:19:54]. The “micro” refers to the lightweight nature of the VMM process itself, not necessarily the guest [00:20:26].
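The per-device jailing described above can be pictured as one system call allowlist per emulated device. The sketch below is a toy model only: the allowlists are hypothetical and are not taken from crosvm, Firecracker, or Cloud Hypervisor, and a real VMM would enforce them in the kernel with seccomp rather than in Python.

```python
# Hypothetical allowlists: each emulated device may use only the system
# calls its job requires.
DEVICE_ALLOWLISTS = {
    "block": frozenset({"read", "write", "pread64", "pwrite64", "fsync", "lseek"}),
    "net": frozenset({"read", "write", "sendto", "recvfrom", "epoll_wait"}),
}


def is_allowed(device, syscall):
    """Return True if the named device is permitted to issue syscall."""
    return syscall in DEVICE_ALLOWLISTS.get(device, frozenset())


if __name__ == "__main__":
    # A compromised block device still cannot reach the network ...
    print(is_allowed("block", "sendto"))
    # ... and a compromised net device cannot do positioned disk writes.
    print(is_allowed("net", "pwrite64"))
    print(is_allowed("block", "pread64"))
```

Keeping the lists separate means that compromising one device yields only that device's narrow set of operations.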

Arrakis’s Choice of MicroVMs

Arrakis, an open-source code execution and computer use sandboxing service for AI agents, uses a MicroVM runtime as its final execution environment [00:00:04]. The choice of MicroVMs is driven by several design factors:

Security: Paramount for AI sandboxes, especially for multi-tenant code execution where LLM-generated code might access different clients’ data [00:21:02]. Preventing untrusted code from gaining root access is critical [00:21:23].
Fast Boot Times: Arrakis currently boots in less than 7 seconds, with ongoing efforts to reduce that to under a second [00:03:54].
Snapshotting: MicroVMs enable fast snapshots by dumping the entire guest memory, allowing agents to backtrack to a good checkpoint if multi-step workflows fail [00:21:42]. This provides more reliable, higher-order complex task execution for agents [00:05:01].
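The backtracking idea behind snapshotting can be sketched with an in-memory stand-in for a sandbox. This is a toy model under stated assumptions: ToySandbox, run_workflow, and the two step functions are hypothetical, and a deep copy of a dictionary stands in for dumping guest memory. It does not use the Arrakis or Cloud Hypervisor APIs.

```python
import copy


class ToySandbox:
    """Stand-in for a MicroVM whose whole state can be saved and restored."""

    def __init__(self):
        self.state = {"files": {}}
        self._snapshots = {}

    def snapshot(self, name):
        # A real MicroVM would dump guest memory; here we deep-copy a dict.
        self._snapshots[name] = copy.deepcopy(self.state)

    def restore(self, name):
        self.state = copy.deepcopy(self._snapshots[name])


def run_workflow(sandbox, steps):
    """Run steps in order; on failure, roll back to the last good checkpoint.

    Returns the number of steps that completed successfully.
    """
    completed = 0
    for step in steps:
        sandbox.snapshot("last_good")
        try:
            step(sandbox.state)
            completed += 1
        except Exception:
            sandbox.restore("last_good")  # discard the partial changes
            break
    return completed


def write_config(state):
    state["files"]["config.txt"] = "ok"


def broken_install(state):
    state["files"]["half_written.bin"] = "partial"
    raise RuntimeError("install failed midway")


if __name__ == "__main__":
    box = ToySandbox()
    done = run_workflow(box, [write_config, broken_install])
    print(done, sorted(box.state["files"]))
```

After the failed second step, the half-written file is gone and the first step's result survives, so an agent could retry from that checkpoint instead of starting over.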

Arrakis considered various VMMs:

crosvm: Initiated the MicroVM revolution [00:22:11].
Firecracker: Underpins AWS Lambda for serverless workloads, featuring a fleshed-out REST API and a better jailing architecture [00:22:27].
Cloud Hypervisor: A more general-purpose enterprise VMM [00:22:40]. It offered hot plugging of devices, GPU support, and snapshot support at the time of choice [00:22:44]. Its open-source project structure, not controlled by a single company, also made it a sensible choice for Arrakis [00:22:58].

Another option for sandboxing is gVisor, which is closer to a container in performance but offers slightly better security [00:23:12]. While untrusted code can still attack the host kernel under gVisor, it can be a good intermediate option, especially for scenarios requiring easier GPU access than MicroVMs allow [00:23:29]. Ultimately, the choice depends on specific needs and security guarantees [00:23:39]. Arrakis opted for Cloud Hypervisor as its underlying MicroVM VMM [00:23:45].
