From: aidotengineer

“Attention Is All You Need” is often described as a “legendary” paper within Natural Language Processing (NLP) [00:00:30]. Beyond its catchy title, it is a foundational work in the field [00:00:39].

AI Video Editing Agent

An open-source video editing agent has been developed to automate video editing for ReSkill, a personalized learning platform [00:01:12]. Initial attempts with FFmpeg exposed its limitations, prompting a search for more intuitive and flexible alternatives [00:01:22]. Remotion offered server-side rendering but proved unreliable [00:01:30].

The development team found success with the “core” library from Diffusion Studio, appreciating its API that didn’t require a separate rendering backend [00:01:32]. A collaboration with the library’s author led to the creation of this agent [00:01:40].

Leveraging Large Language Models in Code Generation

The “core” library enables complex compositions through a JavaScript/TypeScript-based programmatic interface [00:01:46]. This capability allows for the use of LLMs to generate code to run these compositions [00:01:54].

A key insight is that letting an LLM write its own actions as code is a natural fit, since code is arguably the best way to express actions performed by a computer [00:02:00]. Research papers have also reported that LLM tool calling via code outperforms JSON-based tool calling [00:02:13].
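
To make the contrast concrete, here is a minimal sketch of the two calling styles. The `trim` and `concat` tools, the clip dictionaries, and `run_code_action` are hypothetical names chosen for illustration; they are not part of the “core” library or the agent. The empty `__builtins__` below only keeps the example tidy and should not be mistaken for a real sandbox.

```python
import json

# Hypothetical tools, for illustration only.
def trim(clip: dict, start: float, end: float) -> dict:
    """Return a copy of the clip limited to [start, end] seconds."""
    return {**clip, "start": start, "end": end}

def concat(clips: list) -> dict:
    """Join clips back to back and report the total duration."""
    return {"clips": clips, "duration": sum(c["end"] - c["start"] for c in clips)}

TOOLS = {"trim": trim, "concat": concat}

# JSON-style calling: the model names one tool and its arguments per turn.
json_call = json.loads(
    '{"tool": "trim", "args": {"clip": {"name": "a.mp4"}, "start": 0, "end": 5}}'
)
json_result = TOOLS[json_call["tool"]](**json_call["args"])

# Code-style calling: the model emits a snippet that composes several tools
# in one step, using ordinary variables and loops.
generated_code = """
parts = [trim({"name": name}, 0, 5) for name in ["a.mp4", "b.mp4"]]
result = concat(parts)
"""

def run_code_action(source: str, tools: dict) -> dict:
    """Execute model-written code with only the given tools in scope."""
    namespace = {"__builtins__": {}, **tools}  # not a security boundary
    exec(source, namespace)
    return namespace["result"]
```

In this sketch the code action trims two clips and joins them in a single step, whereas the JSON style would need one model turn per tool call.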

Agent Architecture and Workflow

The current architecture of the agent operates as follows [00:02:27]:

  1. The agent initiates a browser session using Playwright and connects to an “operator UI” [00:02:27].
  2. This web app serves as the video editing UI, designed specifically for AI agents, rendering video directly in the browser using the WebCodecs API [00:02:35].
  3. Helper functions facilitate file transfer between Python and the browser via the Chrome DevTools Protocol (CDP) [00:02:46].
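
The source does not show how the file-transfer helpers work. One plausible approach for a text-based channel such as the DevTools protocol is to split a file into base64 chunks and reassemble them on the other side. The sketch below illustrates only that idea; `encode_chunks`, `decode_chunks`, and the chunk size are assumptions, not the agent's actual helpers.

```python
import base64
from typing import Iterable, Iterator

def encode_chunks(data: bytes, chunk_size: int = 256 * 1024) -> Iterator[str]:
    """Yield base64 text chunks, each small enough to send as one message."""
    for offset in range(0, len(data), chunk_size):
        piece = data[offset:offset + chunk_size]
        yield base64.b64encode(piece).decode("ascii")

def decode_chunks(chunks: Iterable[str]) -> bytes:
    """Rebuild the original bytes; every chunk was encoded independently."""
    return b"".join(base64.b64decode(chunk) for chunk in chunks)
```

With Playwright for Python, each chunk could be passed as the argument of `page.evaluate(expression, arg)` and decoded in the page with `atob`; that wiring is left out here.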

The agent’s typical workflow involves three main tools [00:03:00]:

  • Video Editing Tool: Generates code based on user prompts and executes it in the browser [00:03:08].
  • Doc Search Tool: Utilizes retrieval-augmented generation (RAG) to retrieve relevant information if additional context is needed [00:03:14].
  • Visual Feedback Tool: After each execution step, the composition is sampled (currently at one frame per second) and the frames are fed to this tool [00:03:22]. The editing and feedback tools interact much like the generator and discriminator in a Generative Adversarial Network (GAN) [00:03:33]. Once the visual feedback tool gives a “green light,” the agent proceeds to render the composition [00:03:42].
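
The generate, execute, and review cycle described above can be sketched as a simple loop. Everything here is an assumption made for illustration: the callables `generate_code`, `execute`, and `review` stand in for the LLM, the browser, and the visual feedback tool, and `max_rounds` is an invented safety limit.

```python
import math
from dataclasses import dataclass
from typing import Callable, List, Tuple

@dataclass
class Feedback:
    approved: bool   # the "green light"
    notes: str = ""  # what to fix on the next attempt

def sample_timestamps(duration_s: float, fps: float = 1.0) -> List[float]:
    """Timestamps in seconds, starting at 0 and staying below the duration."""
    return [i / fps for i in range(math.ceil(duration_s * fps))]

def edit_until_approved(
    prompt: str,
    generate_code: Callable[[str, str], str],   # (prompt, notes) -> code
    execute: Callable[[str], float],            # code -> composition duration
    review: Callable[[List[float]], Feedback],  # sampled timestamps -> verdict
    max_rounds: int = 3,
) -> Tuple[str, int]:
    """Regenerate and re-run code until the reviewer approves the result."""
    notes = ""
    for round_no in range(1, max_rounds + 1):
        code = generate_code(prompt, notes)
        duration = execute(code)
        feedback = review(sample_timestamps(duration))
        if feedback.approved:
            return code, round_no  # ready to render
        notes = feedback.notes
    raise RuntimeError("no approved composition within max_rounds")
```

A real reviewer would inspect rendered frames rather than timestamps; passing timestamps keeps the sketch self-contained.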

Integrating AI into Natural Workflows

The concept of llms.txt is introduced: a file that plays a role similar to robots.txt, but aimed at AI agents [00:03:51]. Combined with specific template prompts, it is intended to significantly aid the video editing process [00:04:00].

Users can run the agent with their own browser, but the setup also supports connecting the agent to a remote browser session via a WebSocket [00:04:11]. Each agent can obtain a separate, GPU-accelerated browser session, backed by a load balancer [00:04:25].
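
As a rough sketch of the remote setup, Playwright for Python can attach to an already running Chromium instance through a WebSocket endpoint with `connect_over_cdp`. The source does not say which connection method the agent uses, so treat this as one possibility; the endpoint, `app_url`, and both function names are hypothetical.

```python
from urllib.parse import urlparse

def validate_ws_endpoint(endpoint: str) -> str:
    """Reject endpoints that are not WebSocket URLs before trying to connect."""
    if urlparse(endpoint).scheme not in ("ws", "wss"):
        raise ValueError(f"expected a ws:// or wss:// endpoint, got {endpoint!r}")
    return endpoint

def open_remote_page(endpoint: str, app_url: str) -> str:
    """Attach to a remote Chromium over CDP, open app_url, return its title."""
    validate_ws_endpoint(endpoint)
    # Imported lazily so the validation helper works without Playwright installed.
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        browser = p.chromium.connect_over_cdp(endpoint)
        # Reuse the remote browser's default context when one exists.
        context = browser.contexts[0] if browser.contexts else browser.new_context()
        page = context.new_page()
        page.goto(app_url)
        title = page.title()
        browser.close()
        return title
```

Giving each agent its own endpoint would match the one-session-per-agent setup described above; how those endpoints are issued by the load balancer is not covered in the source.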

The initial version of the agent is implemented in Python, with a TypeScript implementation currently underway [00:04:37]. This aligns with a saying adapted from Atwood's Law about JavaScript: “any application that can be written in TypeScript will be written in TypeScript” [00:04:45].

This project is a collaboration between Diffusion Studio and ReSkill [00:04:56].