From: aidotengineer
The speaker introduced the world’s first open-source video editing agent [01:08:00]. The motivation for creating this agent stemmed from the need for an automatic tool to edit videos for Reskill, a platform focused on personalized learning while doing [01:12:00].
Development and Core Technology
Initially, the limitations of FFmpeg became apparent, leading the team to seek more intuitive and flexible alternatives [01:22:00]. While Remotion was considered, its server-side rendering proved unreliable [01:30:00]. The core library from Diffusion Studio, on the other hand, was favored for its API, which does not require a separate rendering backend [01:35:00]. This led to a collaboration with the library's author to build the agent together [01:40:00].
The core library can perform complex compositions through a JavaScript/TypeScript-based programmatic interface [01:52:00]. This design allows Large Language Models (LLMs) to generate code that runs these compositions [01:54:00]. Letting an LLM write its own actions in code is considered a perfect match, since code is the most effective way to express actions performed by a computer [02:00:00]. Research also suggests that LLM tool calling via code is superior to tool calling via JSON [02:13:00].
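To make the idea concrete, below is a hypothetical sketch of the kind of composition code an LLM might emit against such a programmatic interface. The class names, constructor options, and import path are illustrative assumptions, not the core library's documented API.

```typescript
// Hypothetical LLM-generated composition script (illustrative identifiers, not the real core API).
import { Composition, VideoClip, TextClip } from '@diffusionstudio/core'; // assumed import path

async function buildComposition(): Promise<Composition> {
  // A 1080p composition that trims an intro clip and overlays a title.
  const composition = new Composition({ width: 1920, height: 1080 });

  const intro = new VideoClip('assets/intro.mp4', { start: 0, stop: 150 });                  // assumed options
  const title = new TextClip('Personalized Learning', { position: 'center', duration: 90 }); // assumed options

  await composition.add(intro);
  await composition.add(title);
  return composition;
}
```

Because the agent expresses actions as code like this, it can chain many editing operations in a single step instead of issuing one JSON tool call per operation.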
Architecture
The current architecture of the agent operates as follows (a code sketch follows the list):
- The agent initiates a browser session using Playwright [02:27:00].
- It then connects to Operator UI, a web application designed specifically as a video editing interface for AI agents [02:32:00]. Operator UI renders video directly in the browser using the WebCodecs API [02:42:00].
- Helper functions facilitate file transfers between Python and the browser using the Chrome DevTools Protocol [02:46:00].
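As a rough illustration of this plumbing, the sketch below launches a Chromium session with Playwright, opens the agent-facing editor, and hands a local file to the page. The URL and file-input selector are placeholders; the Playwright calls themselves (launch, goto, setInputFiles) are standard API.

```typescript
// Minimal sketch of the browser-session setup, assuming a Playwright-driven agent.
import { chromium } from 'playwright';

async function openEditingSession() {
  const browser = await chromium.launch();               // local Chromium; a remote session also works
  const page = await browser.newPage();

  await page.goto('https://operator-ui.example/editor'); // placeholder URL for the agent-facing editor

  // Hand a local asset to the page; under the hood Playwright drives Chromium via the DevTools Protocol.
  await page.setInputFiles('input[type="file"]', 'assets/raw_footage.mp4'); // placeholder selector and path

  return { browser, page };
}
```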
Agent Workflow and Tools
A typical flow for the agent involves three primary tools, wired together as sketched after this list [03:00:00]:
- Video Editing Tool: Generates code based on user prompts and executes it in the browser [03:08:00].
- Doc Search Tool: If additional context is required, this tool utilizes RAG to retrieve relevant information [03:14:00].
- Visual Feedback Tool: After each execution step, compositions are sampled (currently at one frame per second) and fed into this tool [03:25:00]. It acts like the discriminator in a GAN architecture, with the video editing tool playing the role of the generator [03:33:00]. Once the visual feedback tool gives a “green light,” the agent proceeds to render the composition [03:42:00].
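A rough sketch of how these three tools could be wired into that loop is shown below. The interfaces, function names, and step limit are assumptions for illustration; only the overall generate, inspect, and render flow comes from the description above.

```typescript
// Hypothetical agent loop: generate editing code, run it, check sampled frames, render on approval.
// All interfaces and names below are illustrative stand-ins, not the project's actual code.

interface Feedback {
  approved: boolean; // the "green light"
  notes: string;     // critique to feed back into the next attempt
}

interface AgentTools {
  videoEditing(prompt: string, context: string): Promise<string>; // returns composition code to execute
  docSearch(query: string): Promise<string>;                      // RAG over the library's documentation
  visualFeedback(frames: Uint8Array[]): Promise<Feedback>;        // discriminator-style check of sampled frames
}

interface BrowserSession {
  execute(code: string): Promise<void>;             // run generated code inside the editor page
  sampleFrames(fps: number): Promise<Uint8Array[]>; // sample the current composition
  render(): Promise<void>;                          // final render of the approved composition
}

async function editUntilApproved(
  tools: AgentTools,
  session: BrowserSession,
  prompt: string,
  maxSteps = 5,
): Promise<void> {
  let context = await tools.docSearch(prompt);      // pull extra context when the prompt alone is not enough
  for (let step = 0; step < maxSteps; step++) {
    const code = await tools.videoEditing(prompt, context);
    await session.execute(code);

    const frames = await session.sampleFrames(1);   // currently one frame per second
    const feedback = await tools.visualFeedback(frames);
    if (feedback.approved) {
      await session.render();                       // green light: proceed to render
      return;
    }
    context = feedback.notes;                       // otherwise iterate with the critique as new context
  }
}
```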
Agent Directives and Connectivity
The project also includes lm.txt, which is analogous to robots.txt but for agents [03:51:00]. This, combined with specific template prompts, significantly aids in the video editing process [04:00:00].
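The talk does not show the file's contents; purely as an illustration of the robots.txt analogy, such an agent-directives file might look something like the sketch below (every entry is hypothetical).

```
# lm.txt (hypothetical sketch, not the project's actual file)
# Directives and pointers for AI agents working with the editor.
Docs: /docs/composition-api
Templates: /prompts/video-editing
Disallow: /internal/
```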
While users can run the agent with their own browser, the setup is flexible enough to allow the agent to connect to a remote browser session via WebSocket [04:11:00]. Each agent can obtain a separate, GPU-accelerated browser session, supported by a load balancer [04:25:00].
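For the remote case, a minimal sketch of attaching to an existing browser over a WebSocket endpoint is shown below. The endpoint URL is a placeholder handed out by the load balancer in this scenario; `connectOverCDP` is a standard Playwright call for attaching to a running Chromium instance.

```typescript
// Sketch: attach the agent to a remote, GPU-accelerated Chromium session instead of launching one locally.
import { chromium } from 'playwright';

async function connectRemoteSession(wsEndpoint: string) {
  // e.g. wsEndpoint = "ws://session-pool.example:9222/devtools/browser/<id>" (placeholder from the load balancer)
  const browser = await chromium.connectOverCDP(wsEndpoint);
  const context = browser.contexts()[0] ?? (await browser.newContext());
  const page = context.pages()[0] ?? (await context.newPage());
  return { browser, page };
}
```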
The initial version of the agent is implemented in Python, with a TypeScript implementation currently underway [04:37:00]. The project is a collaboration between Diffusion Studio and Reskill [04:56:00].