From: aidotengineer
The speaker introduced the world’s first open-source video editing agent [01:08:00]. The motivation for creating this agent stemmed from the need for an automatic tool to edit videos for Reskill, a platform focused on personalized learning while doing [01:12:00].
Development and Core Technology
Initially, the limitations of FFmpeg became apparent, leading the team to seek more intuitive and flexible alternatives [01:22:00]. While Remotion was considered, its server-side rendering proved unreliable [01:30:00]. The core library from Diffusion Studio, on the other hand, was favored for its API, which does not require a separate rendering backend [01:35:00]. This led to a collaboration with the library's author to build the agent together [01:40:00].
The core library can perform complex compositions through a JavaScript/TypeScript-based programmatic interface [01:52:00]. This design allows Large Language Models (LLMs) to generate code that runs these compositions [01:54:00]. Letting an LLM write its own actions in code is considered a perfect match, since code is the most effective way to express actions performed by a computer [02:00:00]. Research also suggests that LLM tool calling via code is superior to tool calling via JSON [02:13:00].
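To make the idea concrete, below is a hypothetical sketch of the kind of composition code an LLM might emit against such a programmatic interface. The class names, constructor options, and import path are illustrative assumptions, not the core library's documented API.

```typescript
// Hypothetical LLM-generated composition script (illustrative identifiers, not the real core API).
import { Composition, VideoClip, TextClip } from '@diffusionstudio/core'; // assumed import path

async function buildComposition(): Promise<Composition> {
  // A 1080p composition that trims an intro clip and overlays a title.
  const composition = new Composition({ width: 1920, height: 1080 });

  const intro = new VideoClip('assets/intro.mp4', { start: 0, stop: 150 });                  // assumed options
  const title = new TextClip('Personalized Learning', { position: 'center', duration: 90 }); // assumed options

  await composition.add(intro);
  await composition.add(title);
  return composition;
}
```

Because the agent expresses actions as code like this, it can chain many editing operations in a single step instead of issuing one JSON tool call per operation.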
Architecture
The current architecture of the agent operates as follows (a code sketch follows the list):
- The agent initiates a browser session using Playwright [02:27:00].
- It then connects to Operator UI, a web application designed specifically as a video editing interface for AI agents [02:32:00]. Operator UI renders video directly in the browser using the WebCodecs API [02:42:00].
- Helper functions facilitate file transfers between Python and the browser using the Chrome DevTools Protocol [02:46:00].
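As a rough illustration of this plumbing, the sketch below launches a Chromium session with Playwright, opens the agent-facing editor, and hands a local file to the page. The URL and file-input selector are placeholders; the Playwright calls themselves (launch, goto, setInputFiles) are standard API.

```typescript
// Minimal sketch of the browser-session setup, assuming a Playwright-driven agent.
import { chromium } from 'playwright';

async function openEditingSession() {
  const browser = await chromium.launch();               // local Chromium; a remote session also works
  const page = await browser.newPage();

  await page.goto('https://operator-ui.example/editor'); // placeholder URL for the agent-facing editor

  // Hand a local asset to the page; under the hood Playwright drives Chromium via the DevTools Protocol.
  await page.setInputFiles('input[type="file"]', 'assets/raw_footage.mp4'); // placeholder selector and path

  return { browser, page };
}
```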
Agent Workflow and Tools
A typical flow for the agent involves three primary tools, wired together as sketched after this list [03:00:00]:
- Video Editing Tool: Generates code based on user prompts and executes it in the browser [03:08:00].
- Doc Search Tool: If additional context is required, this tool utilizes RAG to retrieve relevant information [03:14:00].
- Visual Feedback Tool: After each execution step, compositions are sampled (currently at one frame per second) and fed into this tool [03:25:00]. It acts like the discriminator in a GAN architecture, with the video editing tool playing the role of the generator [03:33:00]. Once the visual feedback tool gives a “green light,” the agent proceeds to render the composition [03:42:00].
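A rough sketch of how these three tools could be wired into that loop is shown below. The interfaces, function names, and step limit are assumptions for illustration; only the overall generate, inspect, and render flow comes from the description above.

```typescript
// Hypothetical agent loop: generate editing code, run it, check sampled frames, render on approval.
// All interfaces and names below are illustrative stand-ins, not the project's actual code.

interface Feedback {
  approved: boolean; // the "green light"
  notes: string;     // critique to feed back into the next attempt
}

interface AgentTools {
  videoEditing(prompt: string, context: string): Promise<string>; // returns composition code to execute
  docSearch(query: string): Promise<string>;                      // RAG over the library's documentation
  visualFeedback(frames: Uint8Array[]): Promise<Feedback>;        // discriminator-style check of sampled frames
}

interface BrowserSession {
  execute(code: string): Promise<void>;             // run generated code inside the editor page
  sampleFrames(fps: number): Promise<Uint8Array[]>; // sample the current composition
  render(): Promise<void>;                          // final render of the approved composition
}

async function editUntilApproved(
  tools: AgentTools,
  session: BrowserSession,
  prompt: string,
  maxSteps = 5,
): Promise<void> {
  let context = await tools.docSearch(prompt);      // pull extra context when the prompt alone is not enough
  for (let step = 0; step < maxSteps; step++) {
    const code = await tools.videoEditing(prompt, context);
    await session.execute(code);

    const frames = await session.sampleFrames(1);   // currently one frame per second
    const feedback = await tools.visualFeedback(frames);
    if (feedback.approved) {
      await session.render();                       // green light: proceed to render
      return;
    }
    context = feedback.notes;                       // otherwise iterate with the critique as new context
  }
}
```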
Agent Directives and Connectivity
The project also includes lm.txt, which is analogous to robots.txt but for agents [03:51:00]. This, combined with specific template prompts, significantly aids in the video editing process [04:00:00].
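The talk does not show the file's contents; purely as an illustration of the robots.txt analogy, such an agent-directives file might look something like the sketch below (every entry is hypothetical).

```
# lm.txt (hypothetical sketch, not the project's actual file)
# Directives and pointers for AI agents working with the editor.
Docs: /docs/composition-api
Templates: /prompts/video-editing
Disallow: /internal/
```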
While users can run the agent with their own browser, the setup is flexible enough to allow the agent to connect to a remote browser session via WebSocket [04:11:00]. Each agent can obtain a separate, GPU-accelerated browser session, supported by a load balancer [04:25:00].
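For the remote case, a minimal sketch of attaching to an existing browser over a WebSocket endpoint is shown below. The endpoint URL is a placeholder handed out by the load balancer in this scenario; `connectOverCDP` is a standard Playwright call for attaching to a running Chromium instance.

```typescript
// Sketch: attach the agent to a remote, GPU-accelerated Chromium session instead of launching one locally.
import { chromium } from 'playwright';

async function connectRemoteSession(wsEndpoint: string) {
  // e.g. wsEndpoint = "ws://session-pool.example:9222/devtools/browser/<id>" (placeholder from the load balancer)
  const browser = await chromium.connectOverCDP(wsEndpoint);
  const context = browser.contexts()[0] ?? (await browser.newContext());
  const page = context.pages()[0] ?? (await context.newPage());
  return { browser, page };
}
```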
The initial version of the agent is implemented in Python, with a TypeScript implementation currently underway [04:37:00]. The project is a collaboration between Diffusion Studio and Reskill [04:56:00].