AI Pulse

Natural-Language Video Edit Agent

An agent that accepts plain-English editing instructions (‘tighten the opening, cut the awkward pause at 0:42, add a zoom on the product’) and executes them as real FFmpeg operations.

Difficulty: 1-week | Stack: Python, Claude API with tool use, FFmpeg (via subprocess), Whisper (OpenAI) for transcript grounding, Gradio for UI

Who this is for

Non-editors (marketers, indie founders, educators) who know what they want changed in a video but don’t want to learn Premiere — they describe the edit in English and get a rendered file back.

Build steps

  1. Transcribe the uploaded video with Whisper to get a word-level timestamped transcript, and separately build a list of likely scene changes as a heuristic (for example with FFmpeg's select filter and its scene score, or the scdet filter).
  2. Define a tool schema for the LLM agent: cut_segment(start, end), speed_ramp(start, end, factor), add_text_overlay(start, end, text), zoom_crop(start, end, x, y, scale) — each maps to a known FFmpeg filtergraph.
  3. Run an agentic loop: give the LLM the user instruction + transcript + scene list, let it emit a sequence of tool calls, validate each call’s timestamps against actual video duration, and build an FFmpeg filter_complex string.
  4. Execute the final FFmpeg command, render to a new file, and show a diff-style summary (‘removed 3 segments totaling 18s, added 1 text overlay’).
  5. Add a Gradio UI: video upload, instruction text box, rendered output with download button.
  6. Handle the agentic retry case: if FFmpeg exits with an error, feed its stderr back to the LLM with a ‘fix the filter_complex’ prompt and re-run once.
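
As a rough illustration of step 3, here is a minimal sketch of how validated cut_segment calls might be turned into a filter_complex string. The names Cut, validate_cut, kept_ranges and build_filter_complex are hypothetical, and the sketch assumes the input has exactly one video and one audio stream and that timestamps are in seconds. It inverts the cuts into the ranges to keep, then trims and concatenates those ranges with FFmpeg's trim, atrim, setpts, asetpts and concat filters.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Cut:
    """One cut_segment(start, end) tool call, in seconds (hypothetical type)."""
    start: float
    end: float


def validate_cut(cut: Cut, duration: float) -> None:
    """Reject tool calls whose timestamps fall outside the video."""
    if not (0.0 <= cut.start < cut.end <= duration):
        raise ValueError(f"cut {cut.start}-{cut.end} is outside 0-{duration}")


def kept_ranges(cuts: list[Cut], duration: float) -> list[tuple[float, float]]:
    """Invert the cut list into the source ranges that survive the edit."""
    kept: list[tuple[float, float]] = []
    cursor = 0.0
    for cut in sorted(cuts, key=lambda c: c.start):
        validate_cut(cut, duration)
        if cut.start > cursor:
            kept.append((cursor, cut.start))
        # max() keeps overlapping cuts from moving the cursor backwards.
        cursor = max(cursor, cut.end)
    if cursor < duration:
        kept.append((cursor, duration))
    return kept


def build_filter_complex(ranges: list[tuple[float, float]]) -> str:
    """Trim each kept range from input 0 and concatenate them in order."""
    if not ranges:
        raise ValueError("the cuts would remove the entire video")
    parts: list[str] = []
    labels: list[str] = []
    for i, (start, end) in enumerate(ranges):
        parts.append(
            f"[0:v]trim=start={start:.3f}:end={end:.3f},setpts=PTS-STARTPTS[v{i}]"
        )
        parts.append(
            f"[0:a]atrim=start={start:.3f}:end={end:.3f},asetpts=PTS-STARTPTS[a{i}]"
        )
        labels.append(f"[v{i}][a{i}]")
    parts.append(f"{''.join(labels)}concat=n={len(ranges)}:v=1:a=1[outv][outa]")
    return ";".join(parts)


if __name__ == "__main__":
    ranges = kept_ranges([Cut(42.0, 45.0), Cut(0.0, 2.0)], duration=60.0)
    print(build_filter_complex(ranges))
```

The resulting string would be passed to ffmpeg via -filter_complex, with -map "[outv]" and -map "[outa]" selecting the concatenated streams. Building the string in code from validated tool calls, rather than letting the LLM write raw filtergraph syntax, is one way to shrink the surface for invalid filtergraphs.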

Risks

  • FFmpeg filter_complex syntax is brittle — the LLM will generate plausible-looking but invalid filtergraphs that silently produce wrong output or hard-crash; robust stderr-feedback loops are non-trivial to get right.
  • Whisper timestamps can drift by 200–500 ms on compressed audio, so ‘cut the pause at 0:42’ may cut the wrong moment; the UI needs an affordance that lets users nudge timestamps before rendering.
  • Scope creep is real — color grading, audio ducking, and B-roll insertion all feel like obvious next features but each is a multi-day rabbit hole that will kill the weekend buffer.
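
One possible mitigation for the timestamp-drift risk is to resolve an instruction like ‘cut the pause at 0:42’ against the transcript instead of trusting the literal timestamp, and then let the user adjust the result. The sketch below assumes word-level timestamps are available as (text, start, end) records, for instance from Whisper run with word timestamps enabled. Word, find_pause_near and nudge are hypothetical names, and the window and min_gap defaults are arbitrary starting points, not tuned values.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Word:
    """One transcribed word with timestamps in seconds (hypothetical type)."""
    text: str
    start: float
    end: float


def find_pause_near(
    words: list[Word],
    target: float,
    window: float = 2.0,
    min_gap: float = 0.3,
) -> Optional[tuple[float, float]]:
    """Return the longest between-word gap within `window` seconds of `target`."""
    best: Optional[tuple[float, float]] = None
    for prev, nxt in zip(words, words[1:]):
        gap_start, gap_end = prev.end, nxt.start
        if gap_end - gap_start < min_gap:
            continue  # too short to count as a pause
        # Skip gaps that lie entirely outside the search window.
        if gap_end < target - window or gap_start > target + window:
            continue
        if best is None or (gap_end - gap_start) > (best[1] - best[0]):
            best = (gap_start, gap_end)
    return best


def nudge(
    span: tuple[float, float],
    delta_start: float = 0.0,
    delta_end: float = 0.0,
) -> tuple[float, float]:
    """Apply a user adjustment to a proposed cut, keeping it well-formed."""
    start = max(0.0, span[0] + delta_start)
    end = span[1] + delta_end
    if end <= start:
        raise ValueError("nudged span is empty")
    return (start, end)


if __name__ == "__main__":
    words = [
        Word("so", 40.0, 40.4),
        Word("um", 40.5, 40.9),
        Word("anyway", 42.6, 43.1),
        Word("next", 43.2, 43.5),
    ]
    pause = find_pause_near(words, target=42.0)
    print(pause)
    if pause is not None:
        # Leave 100 ms of padding on each side of the cut.
        print(nudge(pause, delta_start=0.1, delta_end=-0.1))
```

In a Gradio UI the two nudge deltas could be exposed as sliders next to a preview of the proposed cut, so a drifted timestamp gets corrected before the render step.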