From: aidotengineer

An open-source video editing agent has been developed by va [01:08:00]. This agent was created to address the need for an automatic tool to edit videos for rskill, a platform focused on personalized learning [01:14:00].

Development Backstory and Core Libraries

Initial attempts with FFmpeg revealed limitations, prompting a search for more intuitive and flexible alternatives [01:22:00]. Remotion offered server-side rendering, but it proved unreliable [01:30:00]. The core library from Diffusion Studio was chosen for its intuitive API, which does not require a separate rendering backend [01:35:00]. This led to a collaboration with the library’s author to build the agent [01:40:00].

The core library facilitates complex video compositions through a JavaScript/TypeScript-based programmatic interface [01:52:00]. This means large language models (LLMs) can be used to generate the code that runs compositions [01:54:00].
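
For illustration, the sketch below shows the kind of composition code an LLM might generate against such an interface. The import path matches Diffusion Studio's published package name, but the class and method names (Composition, VideoClip, TextClip, add) are assumptions made for this example, not the library's confirmed API.

```typescript
// Hypothetical example of LLM-generated composition code.
// Class and method names are illustrative assumptions, not the
// confirmed Diffusion Studio core API.
import { Composition, VideoClip, TextClip } from '@diffusionstudio/core';

const composition = new Composition();

// Add the source footage and overlay a short title card.
const footage = new VideoClip('lesson-intro.mp4');
const title = new TextClip('Welcome to Lesson 1');

await composition.add(footage);
await composition.add(title);
```

Because the interface is plain TypeScript, a snippet like this can be produced by a model, inspected, and executed directly, with no intermediate DSL in between.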

Why Code for LLM Actions?

Having LLMs write their own actions as code is considered a natural fit for AI integration, because code is the most effective way to express actions a computer should perform [02:02:00]. In addition, research indicates that LLM tool calling implemented in code outperforms JSON-based tool calling [02:16:00].
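
A hypothetical comparison makes the distinction concrete: a JSON tool call encodes one rigid action per round trip, while code lets the model compose several operations, including control flow, in a single step. The tool name, arguments, and clip API below are illustrative only.

```typescript
// Illustrative only: the same user intent as a JSON tool call vs. as code.

// JSON-style tool call: one fixed schema, one action per round trip.
const jsonToolCall = {
  name: 'trim_video',
  arguments: { file: 'input.mp4', startSeconds: 0, endSeconds: 10 },
};

// Code-as-action: the model freely combines operations in one step.
const codeAction = `
  const clip = new VideoClip('input.mp4'); // hypothetical API
  clip.subclip(0, 10);                     // trim to the first 10 seconds
  await composition.add(clip);
  await composition.add(new TextClip('Intro'));
`;
```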

Architecture of the AI Video Editing Agent

The current architecture of the AI video editing agent operates as follows:

  1. Browser Session: The agent initiates a browser session using Playwright and connects to an operator UI [02:27:00].
  2. Operator UI: This web application serves as a video editing interface designed specifically for AI agents [02:35:00]. It renders video directly in the browser using the WebCodecs API [02:42:00]. Helper functions handle file transfers between Python and the browser via the Chrome DevTools Protocol [02:46:00]. A sketch of this session setup follows the list.
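
A minimal TypeScript sketch of the session setup, assuming a locally served operator UI; the URL is a placeholder, and the production agent described here is written in Python.

```typescript
// Minimal sketch of the agent's browser session, assuming a locally
// served operator UI. The URL is a placeholder, not the project's endpoint.
import { chromium } from 'playwright';

async function startSession() {
  const browser = await chromium.launch({ headless: true });
  const context = await browser.newContext();
  const page = await context.newPage();

  // The operator UI renders compositions in-browser via the WebCodecs API.
  await page.goto('http://localhost:3000'); // placeholder URL

  // A raw CDP session supports helper tasks such as moving files between
  // the agent process and the browser.
  const cdp = await context.newCDPSession(page);

  return { browser, page, cdp };
}
```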

Agent Workflow and Tools

A typical flow for the agent involves three primary tools [03:00:00] (a sketch of the loop follows the list):

  1. Video Editing Tool: Generates code based on a user prompt and executes it in the browser [03:08:00].
  2. Doc Search Tool: If additional context is required, this tool uses RAG (Retrieval Augmented Generation) to retrieve relevant information [03:17:00].
  3. Visual Feedback Tool: After each execution step, compositions are sampled (currently at one frame per second) and fed to this tool [03:25:00]. It functions similarly to a generator and discriminator in a GAN (Generative Adversarial Network) architecture [03:33:00]. Once the visual feedback tool gives a “green light,” the agent proceeds to render the final composition [03:42:00].
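
The loop below sketches how these tools might fit together. The Tools interface and its method names are hypothetical stand-ins for the three tools described above, not the project's actual API, and the retry limit is arbitrary.

```typescript
// Hypothetical sketch of the agent loop; this interface is an assumption
// made for illustration, not the project's actual API.
interface Tools {
  generateEditCode(prompt: string, context: string): Promise<string>; // video editing tool (LLM)
  executeInBrowser(code: string): Promise<void>;                      // run generated code in the operator UI
  searchDocs(query: string): Promise<string>;                         // doc search tool (RAG)
  sampleFrames(fps: number): Promise<Uint8Array[]>;                   // sample the composition
  reviewFrames(frames: Uint8Array[], prompt: string): Promise<{ approved: boolean; notes: string }>; // visual feedback
  render(): Promise<void>;                                            // final render
}

async function editUntilApproved(prompt: string, tools: Tools): Promise<boolean> {
  let context = '';

  for (let attempt = 0; attempt < 5; attempt++) {          // arbitrary retry limit
    const code = await tools.generateEditCode(prompt, context);
    await tools.executeInBrowser(code);

    const frames = await tools.sampleFrames(1);            // ~one frame per second
    const feedback = await tools.reviewFrames(frames, prompt);

    if (feedback.approved) {                               // the "green light"
      await tools.render();                                // render the final composition
      return true;
    }

    context = await tools.searchDocs(feedback.notes);      // pull extra context via RAG
  }
  return false;
}
```

The generator/discriminator analogy maps directly onto this loop: the editing tool proposes a composition and the visual feedback tool critiques it until the result passes.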

llms.txt

An llms.txt file has been shipped, which is analogous to robots.txt but designed for agents [03:55:00]. This file, combined with specific template prompts, is intended to make the video editing workflow significantly smoother [04:00:00].

Deployment and Future Development

While users can bring their own browser to run the agent, the current setup also supports connecting the agent to a remote browser session via WebSocket [04:11:00]. Each agent can obtain a separate, GPU-accelerated browser session, backed by a load balancer [04:25:00].
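
A sketch of that remote attachment, assuming the service exposes a Playwright-compatible WebSocket endpoint; the URL is a placeholder.

```typescript
// Sketch of attaching the agent to a remote, GPU-accelerated browser
// session instead of launching Chromium locally. Endpoint is a placeholder.
import { chromium } from 'playwright';

async function connectRemote() {
  const browser = await chromium.connect('ws://browser-pool.example.com/session');
  const page = await browser.newPage();
  return { browser, page };
}
```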

The initial version of the agent is implemented in Python, with a TypeScript implementation currently underway [04:40:00]. This aligns with the saying, “Any application that can be written in TypeScript will be written in TypeScript” [04:45:00].

This project is a collaboration between Diffusion Studio and rskill [04:56:00].