Natural-Language Video Edit Agent
An agent that accepts plain-English editing instructions (‘tighten the opening, cut the awkward pause at 0:42, add a zoom on the product’) and executes them as real FFmpeg operations.
Difficulty: 1-week | Stack: Python, Claude API with tool use, FFmpeg (via subprocess), Whisper (OpenAI) for transcript grounding, Gradio for UI
Who this is for
Non-editors (marketers, indie founders, educators) who know what they want changed in a video but don’t want to learn Premiere — they describe the edit in English and get a rendered file back.
Build steps
- Transcribe the uploaded video with Whisper to produce a word-level timestamped transcript, and separately build a scene-change list with FFmpeg's scene detection (the scdet filter, or the select filter using its scene score).
- Define a tool schema for the LLM agent: cut_segment(start, end), speed_ramp(start, end, factor), add_text_overlay(start, end, text), zoom_crop(start, end, x, y, scale). Each maps to a known FFmpeg filtergraph.
- Run an agentic loop: give the LLM the user instruction + transcript + scene list, let it emit a sequence of tool calls, validate each call’s timestamps against actual video duration, and build an FFmpeg filter_complex string.
- Execute the final FFmpeg command, render to a new file, and show a diff-style summary (‘removed 3 segments totaling 18s, added 1 text overlay’).
- Add a Gradio UI: video upload, instruction text box, rendered output with download button.
- Handle the agentic retry case: if FFmpeg exits with an error, feed its stderr back to the LLM with a ‘fix the filter_complex’ prompt and re-run once.
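The validate-then-build step can be sketched for the simplest tool, cut_segment. This is a minimal illustration under stated assumptions, not a definitive implementation: the names Cut, validate_cuts, kept_segments, build_filter_complex and summarize are hypothetical, the input is assumed to have exactly one video and one audio stream, and only FFmpeg's trim, atrim, setpts, asetpts and concat filters are used.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Cut:
    """One cut_segment(start, end) tool call, in seconds (hypothetical type)."""
    start: float
    end: float


def validate_cuts(cuts, duration):
    """Clamp cuts to [0, duration], drop empty ones, and merge overlaps."""
    cleaned = []
    for c in cuts:
        start = max(0.0, c.start)
        end = min(duration, c.end)
        if end > start:
            cleaned.append((start, end))
    cleaned.sort()
    merged = []
    for start, end in cleaned:
        if merged and start <= merged[-1][1]:
            # Overlaps or touches the previous cut: extend it.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def kept_segments(merged, duration):
    """Invert the merged cuts into the spans of video that survive."""
    keep = []
    cursor = 0.0
    for start, end in merged:
        if start > cursor:
            keep.append((cursor, start))
        cursor = end
    if cursor < duration:
        keep.append((cursor, duration))
    return keep


def build_filter_complex(keep):
    """Build a trim/concat filtergraph string for the kept spans."""
    if not keep:
        raise ValueError("the cuts remove the entire video")
    parts = []
    labels = []
    for i, (s, e) in enumerate(keep):
        parts.append(
            f"[0:v]trim=start={s:.3f}:end={e:.3f},setpts=PTS-STARTPTS[v{i}]"
        )
        parts.append(
            f"[0:a]atrim=start={s:.3f}:end={e:.3f},asetpts=PTS-STARTPTS[a{i}]"
        )
        labels.append(f"[v{i}][a{i}]")
    parts.append(f"{''.join(labels)}concat=n={len(keep)}:v=1:a=1[outv][outa]")
    return ";".join(parts)


def summarize(merged):
    """Diff-style summary line for the UI."""
    total = sum(end - start for start, end in merged)
    return f"removed {len(merged)} segments totaling {total:.1f}s"


if __name__ == "__main__":
    # Two overlapping cuts plus one that runs past the end of a 120 s video.
    tool_calls = [Cut(42.0, 44.5), Cut(43.0, 47.0), Cut(118.0, 130.0)]
    merged = validate_cuts(tool_calls, duration=120.0)
    print(summarize(merged))
    print(build_filter_complex(kept_segments(merged, duration=120.0)))
```

The resulting string would be passed to FFmpeg as the value of -filter_complex, with -map "[outv]" and -map "[outa]" selecting the outputs. Passing the arguments to subprocess as a list rather than a shell string avoids quoting problems. Clamping and merging in plain Python before anything reaches FFmpeg means a bad timestamp from the LLM is handled deterministically instead of going through the retry loop.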
Risks
- FFmpeg filter_complex syntax is brittle: the LLM will generate plausible-looking filtergraphs that either fail outright or run cleanly and produce the wrong output. A stderr-feedback loop only catches the first kind, and even that is non-trivial to get right.
- Whisper timestamps can drift by 200-500ms on compressed audio, causing ‘cut the pause at 0:42’ to cut the wrong moment; needs a UX affordance to let users nudge timestamps.
- Scope creep is real — color grading, audio ducking, and B-roll insertion all feel like obvious next features but each is a multi-day rabbit hole that will kill the weekend buffer.
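One way to soften the timestamp-drift risk is to snap a request like ‘the pause at 0:42’ to the longest silence between transcript words near that time, then let the user shift the result by a small offset. The sketch below assumes the transcript has already been reduced to (text, start, end) word tuples. Word, find_pause_near and nudge are hypothetical names, and the window and min_gap defaults are illustrative guesses rather than tuned values.

```python
from typing import List, NamedTuple, Optional, Tuple


class Word(NamedTuple):
    """One transcript word with start/end times in seconds (hypothetical type)."""
    text: str
    start: float
    end: float


def find_pause_near(
    words: List[Word],
    target: float,
    window: float = 2.0,
    min_gap: float = 0.3,
) -> Optional[Tuple[float, float]]:
    """Return the longest inter-word gap that touches [target - window, target + window]."""
    best = None
    for prev, nxt in zip(words, words[1:]):
        gap_start, gap_end = prev.end, nxt.start
        if gap_end - gap_start < min_gap:
            continue  # too short to count as a pause
        if gap_end < target - window or gap_start > target + window:
            continue  # entirely outside the search window
        if best is None or (gap_end - gap_start) > (best[1] - best[0]):
            best = (gap_start, gap_end)
    return best


def nudge(span: Tuple[float, float], offset: float, duration: float) -> Tuple[float, float]:
    """Shift a span by a user-chosen offset, clamped to the video bounds."""
    start, end = span
    start = min(max(0.0, start + offset), duration)
    end = min(max(0.0, end + offset), duration)
    return (start, end)


if __name__ == "__main__":
    words = [
        Word("so", 40.0, 40.4),
        Word("um", 40.5, 40.9),
        Word("anyway", 42.6, 43.1),
        Word("next", 43.2, 43.5),
    ]
    pause = find_pause_near(words, target=42.0)
    print("suggested cut:", pause)
    if pause is not None:
        print("after user nudge:", nudge(pause, 0.25, duration=60.0))
```

In the UI this could back a pair of small ‘earlier / later’ buttons next to each proposed cut, so a drifted timestamp costs the user one click rather than a rewritten instruction.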