Video World-State Agent with Persistent Character Memory
A stateful agent that maintains a structured ‘world model’ (characters, props, locations, timeline) across a multi-session video project and uses it to enforce continuity in every new generation call.
Difficulty: 1-month | Stack: Python, FastAPI, Claude API (extended thinking for scene reasoning), Replicate API (video + image gen), SQLite via SQLModel, React + shadcn/ui for project dashboard
Who this is for
Solo filmmakers and YouTube storytellers producing serialized content (web series, explainer sequences) who lose hours to continuity errors — wrong costume, missing prop, changed lighting — between shooting sessions.
Build steps
- Design a world-state schema in SQLite: characters(name, appearance_description, last_seen_frame_url), locations(name, visual_description, reference_image_url), props, timeline_events. Populate it from the user’s initial project brief via an LLM extraction pass.
- Build a ‘continuity injection’ layer: before every video generation API call, query the world state for relevant entities in the scene, have the LLM compose a dense visual-consistency preamble, and prepend it to the generation prompt.
- Add a post-generation verification step: extract a frame from each generated clip, run it through a vision LLM with the world-state description, and flag continuity breaks (e.g., ‘character shirt changed from blue to red’) with a confidence score.
- Implement a refinement loop: if a clip’s continuity score falls below the threshold, automatically re-generate it with an augmented prompt that explicitly corrects the flagged inconsistency, up to 3 retries.
- Build a FastAPI backend exposing project CRUD, generation jobs (queued via background tasks), and world-state inspection endpoints.
- Ship a React dashboard showing the project timeline, each clip’s continuity score, flagged issues with frame thumbnails, and a world-state editor so users can manually correct character descriptions when the model drifts.
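A minimal sketch of the first two build steps, under stated assumptions: it uses Python's standard-library sqlite3 module rather than SQLModel to stay dependency-free, it covers only the two tables whose columns are listed above (props and timeline_events are left out because their columns are not specified), and it composes the preamble with a fixed template where the real layer would call the LLM. The function names open_world_state, build_preamble, and inject_continuity are hypothetical.

```python
import sqlite3
from typing import List

# Column names come from the schema in the build steps; the primary keys
# and NOT NULL constraints are assumptions made for this sketch.
SCHEMA = """
CREATE TABLE characters (
    name TEXT PRIMARY KEY,
    appearance_description TEXT NOT NULL,
    last_seen_frame_url TEXT
);
CREATE TABLE locations (
    name TEXT PRIMARY KEY,
    visual_description TEXT NOT NULL,
    reference_image_url TEXT
);
"""


def open_world_state(path: str = ":memory:") -> sqlite3.Connection:
    """Open a fresh world-state database and create its tables."""
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn


def build_preamble(
    conn: sqlite3.Connection, character_names: List[str], location_name: str
) -> str:
    """Collect stored descriptions for the entities in one scene.

    Entities missing from the world state are skipped silently.
    """
    lines = []
    for name in character_names:
        row = conn.execute(
            "SELECT appearance_description FROM characters WHERE name = ?",
            (name,),
        ).fetchone()
        if row is not None:
            lines.append(f"Character {name}: {row[0]}")
    row = conn.execute(
        "SELECT visual_description FROM locations WHERE name = ?",
        (location_name,),
    ).fetchone()
    if row is not None:
        lines.append(f"Location {location_name}: {row[0]}")
    return "\n".join(lines)


def inject_continuity(
    conn: sqlite3.Connection,
    scene_prompt: str,
    character_names: List[str],
    location_name: str,
) -> str:
    """Prepend the continuity preamble to the generation prompt."""
    preamble = build_preamble(conn, character_names, location_name)
    return f"{preamble}\n\n{scene_prompt}" if preamble else scene_prompt
```

Keeping the preamble builder separate from the prompt assembly means the template can later be swapped for an LLM call without touching the code that talks to the video generation API.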
Risks
- Vision LLM continuity checks are noisy: false positives will trigger expensive, unnecessary re-generations, and false negatives will let real errors through. Calibrating the threshold requires a real test set of video pairs, which takes time to collect.
- Most video generation models available through APIs today do not accept reference images as strong conditioning signals, so the world-state preamble is purely textual and continuity enforcement is probabilistic rather than guaranteed. Model capability limits therefore undermine the core value prop.
- Managing async generation jobs (retries, partial failures, cost guardrails) for a multi-clip project is a significant infrastructure problem that can easily consume 2-3 weeks of the 1-month budget if not scoped carefully from the start.
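One way to scope the retry and cost-guardrail problem early is to make the refinement loop from the build steps a small pure function with injected generate and verify callables. The sketch below is an assumption-laden illustration, not a prescribed design: the callable signatures, the RefinementResult type, the default threshold of 0.8, and the flat cost_per_call and budget figures are all hypothetical, and the cost model counts only generation calls, not verification calls.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Tuple

# Hypothetical signatures: generate(prompt) returns a clip URL;
# verify(clip_url) returns (continuity score in [0, 1], flagged issues).
GenerateFn = Callable[[str], str]
VerifyFn = Callable[[str], Tuple[float, List[str]]]


@dataclass
class RefinementResult:
    clip_url: str
    score: float
    attempt: int  # which attempt produced this clip (1 = first generation)
    passed: bool
    issues: List[str] = field(default_factory=list)


def refine_clip(
    prompt: str,
    generate: GenerateFn,
    verify: VerifyFn,
    threshold: float = 0.8,
    max_retries: int = 3,
    cost_per_call: float = 1.0,
    budget: float = 10.0,
) -> RefinementResult:
    """Generate a clip, retrying while the score is below the threshold.

    Stops after one initial call plus max_retries retries, or earlier if
    the next call would exceed the budget. Returns the best-scoring attempt.
    """
    spent = 0.0
    current_prompt = prompt
    best: Optional[RefinementResult] = None
    for attempt in range(1, max_retries + 2):
        if spent + cost_per_call > budget:
            break  # cost guardrail: never start a call we cannot afford
        clip_url = generate(current_prompt)
        spent += cost_per_call
        score, issues = verify(clip_url)
        candidate = RefinementResult(
            clip_url, score, attempt, score >= threshold, issues
        )
        if best is None or candidate.score > best.score:
            best = candidate
        if candidate.passed:
            break
        # Augment the original prompt with the flagged inconsistencies.
        corrections = "; ".join(issues)
        current_prompt = (
            f"{prompt}\n\nCorrect these continuity errors: {corrections}"
        )
    if best is None:
        raise RuntimeError("budget too small for a single generation call")
    return best
```

Because the external services are passed in as callables, the loop can be unit-tested with fakes before any paid API is wired up, and the same function can later run inside a background job.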