Video State Tracker CLI

CLI tool that feeds a video + structured question set to a multimodal LLM and scores its temporal state-tracking accuracy against ground-truth annotations.

Difficulty: weekend | Stack: Python, Claude claude-opus-4-5 API (vision), OpenCV, JSONL, rich

Who this is for

Researchers and engineers who want to benchmark their own video + LLM pipelines against VSTAT-style tasks without waiting for official eval infrastructure

Build steps

Define a JSONL schema: {video_path, frame_timestamps, questions: [{t, question, ground_truth}]}
Use OpenCV to extract frames at specified timestamps, encode as base64
Build a prompt template that presents frames in sequence with timestamps and asks the state-tracking question
Send to multimodal LLM API, parse answer, compare to ground truth with exact-match + fuzzy scoring
Output per-question results table via rich, aggregate accuracy by question type (object state / count / spatial)

Risks

Long videos exceed context window — need sliding-window frame sampling strategy that doesn’t lose critical state-change frames
Ground-truth annotation creation is the real time sink; even a 20-video test set takes hours to label correctly
LLM answers are free-form text — robust answer normalization is harder than it looks for spatial/count questions

Video State Tracker CLI

Who this is for

Build steps

Risks

Business Angle