From: aidotengineer

This article explores an open-source video editing agent designed to automate video production, addressing the limitations of traditional methods and leveraging AI for greater efficiency and flexibility [00:01:06]. The project emerged from the need for an automated tool to edit videos for Reskill, a platform focused on personalized learning [00:01:12].

Evolution of the Toolset

Initially, traditional tools like FFmpeg presented limitations, prompting a search for more intuitive and flexible alternatives [00:01:22]. Remotion was considered but exhibited unreliable server-side rendering [00:01:26]. The Core library proved more suitable because its API does not require a separate rendering backend [00:01:35]. This led to a collaboration with the Core library’s author to build the agent [00:01:40].

The Core library, developed by Diffusion Studio, enables complex compositions through a JavaScript/TypeScript-based programmatic interface [00:01:46]. This allows Large Language Models (LLMs) to generate the code that drives the video editing process [00:01:54].
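As a rough illustration of what such generated code might look like, the sketch below holds a short JavaScript editing script as a Python string, ready to be executed in the browser later; the core.Composition, core.VideoClip, and core.TextClip names are illustrative assumptions modeled on a generic programmatic editing API, not the library’s documented surface.

```python
# Hypothetical LLM-generated editing code (JavaScript held as a Python string).
# The class and method names are illustrative assumptions, not the documented API.
editing_code = """
const composition = new core.Composition({ width: 1920, height: 1080 });
await composition.add(new core.VideoClip(videoFile, { position: 0 }));
await composition.add(new core.TextClip('Chapter 1', { start: 0, stop: 3 }));
"""
# The agent later evaluates this string inside its browser session (see Current Architecture).
```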

AI Integration and Code Generation

A core aspect of this agent is that the LLM writes its own actions as code [00:02:02]. Code is considered the optimal way to express actions performed by a computer [00:02:08]. Multiple research papers support the idea that LLM tool calling implemented in code outperforms JSON-based tool calling [00:02:13].
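A hypothetical comparison (not taken from the source) of the two styles: a JSON tool call encodes one operation per round trip, while a code action can compose several operations with ordinary control flow.

```python
# Illustrative only: JSON tool calling vs. a code action. All names are assumptions.
json_call = {"tool": "trim_clip", "args": {"start": 0, "end": 5}}  # one call, one edit

code_action = """
for (let i = 0; i < clips.length; i++) {          // generated JavaScript: one call, many edits
  composition.add(new core.VideoClip(clips[i], { position: i * 5 }));
}
"""
```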

Current Architecture

The agent’s architecture involves several key components (a setup sketch follows the list):

  • Browser Session: The agent initiates a browser session using Playwright [00:02:27].
  • Operator UI: It connects to an Operator UI, which is a web application specifically designed as a video editing interface for AI agents [00:02:32].
  • In-browser Rendering: Video is rendered directly in the browser using the WebCodecs API [00:02:42].
  • File Transfer: Helper functions facilitate file transfer between Python and the browser via the Chrome DevTools Protocol [00:02:46].
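A minimal sketch of that setup in Python with Playwright, assuming a locally hosted Operator UI (the URL and helper comments are assumptions):

```python
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch()                 # agent-controlled browser session
    page = browser.new_page()
    page.goto("http://localhost:3000")            # Operator UI: editing interface for agents
    cdp = page.context.new_cdp_session(page)      # Chrome DevTools Protocol channel
    # Helper functions would push input media to the page and pull rendered output
    # back over this CDP session; rendering itself happens in-browser via WebCodecs.
```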

Agent Workflow

The typical flow of the agent involves three primary tools [00:03:00]; a simplified loop is sketched after the list:

  1. Video Editing Tool: This tool generates code based on user prompts and executes it in the browser [00:03:08].
  2. Doc Search Tool: If additional context is required, this tool uses RAG (Retrieval-Augmented Generation) to retrieve relevant information [00:03:14].
  3. Visual Feedback Tool: After each execution step, compositions are sampled at one frame per second and fed to this tool [00:03:21]. Together with the code-generating editing tool, it plays a role similar to the discriminator and generator in a Generative Adversarial Network (GAN) [00:03:33]. Once the visual feedback tool gives a “green light,” the agent proceeds to render the composition [00:03:42].
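A simplified sketch of that loop, with hypothetical function names standing in for the three tools:

```python
# Hypothetical agent loop; generate_editing_code, doc_search, sample_composition,
# and visual_feedback are assumed helper names, not the project's actual API.
def run_agent(task: str, page) -> None:
    context = doc_search(task)                        # RAG lookup for extra context
    while True:
        code = generate_editing_code(task, context)   # LLM writes the action as code
        page.evaluate(code)                           # execute it in the browser
        frames = sample_composition(page, fps=1)      # one frame per second
        verdict = visual_feedback(frames, task)       # critic reviews the result
        if verdict.approved:                          # the "green light"
            break
        context += verdict.notes                      # feed critique into the next attempt
    page.evaluate("composition.render()")             # final render once approved
```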

llms.txt and Flexibility

The system includes an llms.txt file, which serves a purpose similar to robots.txt but aimed at agents [00:03:51]. When combined with specific template prompts, llms.txt can significantly aid the video editing process [00:04:00].
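For instance, the guidance file could simply be spliced into a template prompt before each task; the file name, template text, and task below are assumptions:

```python
from pathlib import Path

llms_txt = Path("llms.txt").read_text()   # API guidance written for agents
template = (
    "You are a video editing agent. Use only the documented API.\n\n"
    "{docs}\n\nTask: {task}"
)
system_prompt = template.format(docs=llms_txt, task="Add captions to clip.mp4")
```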

The setup offers flexibility, allowing users to bring their own browser or connect the agent to a remote browser session via WebSocket [00:04:11]. Each agent can obtain a separate, GPU-accelerated browser session, supported by a load balancer [00:04:25].
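Attaching to such a remote session might look like the following, where the WebSocket endpoint (here a placeholder URL) would be handed out by the load balancer:

```python
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    # Placeholder endpoint; a load balancer would assign a GPU-accelerated session.
    browser = p.chromium.connect_over_cdp("ws://render-cluster.example/session")
    page = browser.contexts[0].pages[0]    # reuse the remote session's open page
```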

The initial version of the agent is implemented in Python, with a TypeScript implementation currently underway [00:04:37]. This project is a collaborative effort between Diffusion Studio and Reskill [00:04:53].