Video World-State Agent with Persistent Character Memory

A stateful agent that maintains a structured ‘world model’ (characters, props, locations, timeline) across a multi-session video project and uses it to enforce continuity in every new generation call.

Difficulty: 1-month | Stack: Python, FastAPI, Claude API (extended thinking for scene reasoning), Replicate API (video + image gen), SQLite via SQLModel, React + shadcn/ui for project dashboard

Who this is for

Solo filmmakers and YouTube storytellers producing serialized content (web series, explainer sequences) who lose hours to continuity errors between sessions, such as a wrong costume, a missing prop, or changed lighting.

Build steps

Design a world-state schema in SQLite: characters (name, appearance_description, last_seen_frame_url), locations (name, visual_description, reference_image_url), props, timeline_events. Populate it from the user’s initial project brief via an LLM extraction pass.
Build a ‘continuity injection’ layer: before every video generation API call, query the world state for relevant entities in the scene, have the LLM compose a dense visual-consistency preamble, and prepend it to the generation prompt.
Add a post-generation verification step: extract a frame from each generated clip, run it through a vision LLM with the world-state description, and flag continuity breaks (e.g., ‘character shirt changed from blue to red’), each with a confidence score, rolled up into a per-clip continuity score.
Implement a refinement loop: if a clip’s continuity score falls below the threshold, automatically re-generate the clip with an augmented prompt that explicitly corrects the flagged inconsistency, up to 3 retries.
Build a FastAPI backend exposing project CRUD, generation jobs (queued via background tasks), and world-state inspection endpoints.
Ship a React dashboard showing the project timeline, each clip’s continuity score, flagged issues with frame thumbnails, and a world-state editor so users can manually correct character descriptions when the model drifts.
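
The first two build steps (world-state schema and continuity injection) can be sketched as follows. This is a minimal illustration under stated assumptions, not a definitive design: it uses the standard-library sqlite3 module rather than SQLModel to stay dependency-free, it covers only the characters and locations tables, and it replaces the LLM-composed preamble with a plain string template. The function names open_world_state, build_preamble, and inject_continuity are hypothetical.

```python
import sqlite3

# Column names follow the schema described in the build steps.
# props and timeline_events are omitted to keep the sketch short.
SCHEMA = """
CREATE TABLE IF NOT EXISTS characters (
    id INTEGER PRIMARY KEY,
    name TEXT NOT NULL UNIQUE,
    appearance_description TEXT NOT NULL,
    last_seen_frame_url TEXT
);
CREATE TABLE IF NOT EXISTS locations (
    id INTEGER PRIMARY KEY,
    name TEXT NOT NULL UNIQUE,
    visual_description TEXT NOT NULL,
    reference_image_url TEXT
);
"""


def open_world_state(path=":memory:"):
    """Open (or create) the world-state database and ensure the tables exist."""
    conn = sqlite3.connect(path)
    conn.row_factory = sqlite3.Row
    conn.executescript(SCHEMA)
    return conn


def build_preamble(conn, character_names, location_name):
    """Collect stored descriptions for the entities in a scene.

    Assumption: a fixed template stands in for the LLM composition pass.
    Entities that are not in the world state are silently skipped.
    """
    lines = []
    for name in character_names:
        row = conn.execute(
            "SELECT name, appearance_description FROM characters WHERE name = ?",
            (name,),
        ).fetchone()
        if row is not None:
            lines.append(f"{row['name']}: {row['appearance_description']}")
    loc = conn.execute(
        "SELECT name, visual_description FROM locations WHERE name = ?",
        (location_name,),
    ).fetchone()
    if loc is not None:
        lines.append(f"Location ({loc['name']}): {loc['visual_description']}")
    return "Continuity constraints:\n" + "\n".join(lines)


def inject_continuity(conn, scene_prompt, character_names, location_name):
    """Prepend the continuity preamble to the generation prompt."""
    preamble = build_preamble(conn, character_names, location_name)
    return f"{preamble}\n\nScene: {scene_prompt}"
```

In a fuller version, the string returned by inject_continuity would be what gets sent to the video generation API, and build_preamble would hand the retrieved rows to the LLM instead of formatting them directly.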

Risks

Vision LLM continuity checks are noisy: false positives will trigger expensive, unnecessary re-generations, and false negatives will let real errors through. Calibrating the threshold requires a real test set of video pairs, which takes time to collect.
Most video generation models today do not accept reference images as a strong conditioning signal, so the world-state preamble is purely textual and continuity enforcement is probabilistic rather than guaranteed. Model capability limits therefore undermine the core value proposition.
Managing async generation jobs (retries, partial failures, cost guardrails) for a multi-clip project is a significant infrastructure problem that can easily consume 2-3 weeks of the 1-month budget if not scoped carefully from the start.
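
One way to keep the retry and cost risks bounded is to give the refinement loop both an attempt cap and a spending cap. The sketch below is illustrative only: generate and check are hypothetical callables standing in for the Replicate generation call and the vision LLM continuity check, and the threshold, cost_per_clip, and budget defaults are made-up numbers, not measured values.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass
class RefinementResult:
    clip_url: str          # best clip seen across all attempts
    score: float           # continuity score of that clip
    attempts: int          # total generations, including the first
    spent: float           # total cost charged against the budget
    issues: List[str] = field(default_factory=list)
    accepted: bool = False


def refine_clip(
    prompt: str,
    generate: Callable[[str], str],
    check: Callable[[str], Tuple[float, List[str]]],
    threshold: float = 0.8,      # assumed value; needs calibration
    max_retries: int = 3,        # matches "up to 3 retries" in the build steps
    cost_per_clip: float = 0.5,  # assumed flat cost per generation
    budget: float = 2.0,         # assumed per-clip spending cap
) -> RefinementResult:
    """Generate a clip, re-generating while the continuity score is too low.

    Stops when the score reaches the threshold, the retry cap is hit,
    or the next generation would exceed the budget.
    """
    if cost_per_clip > budget:
        raise ValueError("budget does not cover a single generation")

    spent, attempts = 0.0, 0
    best_url, best_score, best_issues = "", -1.0, []
    current_prompt = prompt

    # One initial attempt plus up to max_retries re-generations.
    while attempts < 1 + max_retries and spent + cost_per_clip <= budget:
        clip_url = generate(current_prompt)
        attempts += 1
        spent += cost_per_clip
        score, issues = check(clip_url)
        if score > best_score:
            best_url, best_score, best_issues = clip_url, score, issues
        if score >= threshold:
            break
        # Augment the original prompt with the flagged inconsistencies.
        corrections = "; ".join(issues)
        current_prompt = f"{prompt}\nFix these continuity errors: {corrections}"

    return RefinementResult(
        clip_url=best_url,
        score=best_score,
        attempts=attempts,
        spent=spent,
        issues=best_issues,
        accepted=best_score >= threshold,
    )
```

Returning the best clip seen, even when none is accepted, lets the dashboard show the closest attempt alongside its flagged issues instead of an empty failure.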