Video State Tracker CLI
CLI tool that feeds a video + structured question set to a multimodal LLM and scores its temporal state-tracking accuracy against ground-truth annotations.
Difficulty: weekend | Stack: Python, Claude API (claude-opus-4-5, vision), OpenCV, JSONL, rich
Who this is for
Researchers and engineers who want to benchmark their own video + LLM pipelines against VSTAT-style tasks without waiting for official eval infrastructure
Build steps
- Define a JSONL schema: {video_path, frame_timestamps, questions: [{t, question, ground_truth}]}
- Use OpenCV to extract frames at specified timestamps, encode as base64
- Build a prompt template that presents frames in sequence with timestamps and asks the state-tracking question
- Send to multimodal LLM API, parse answer, compare to ground truth with exact-match + fuzzy scoring
- Output per-question results table via rich, aggregate accuracy by question type (object state / count / spatial)
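The scoring half of the steps above can be sketched with the standard library alone. This is a minimal sketch under stated assumptions, not a reference implementation: frame extraction and the API call are left out, the fuzzy score uses `difflib.SequenceMatcher` with an arbitrary 0.8 cutoff, and the helper names (`parse_record`, `score_answer`, `aggregate`) are hypothetical. It also assumes each question carries a type label (object state / count / spatial), which the aggregation step needs.

```python
import json
from collections import defaultdict
from difflib import SequenceMatcher

FUZZY_THRESHOLD = 0.8  # assumed cutoff; tune it on your own data

REQUIRED_FIELDS = {"video_path", "frame_timestamps", "questions"}


def parse_record(line):
    """Parse one JSONL line and check the top-level schema fields."""
    record = json.loads(line)
    missing = REQUIRED_FIELDS - record.keys()
    if missing:
        raise ValueError(f"missing fields: {sorted(missing)}")
    return record


def normalize(text):
    """Lowercase, trim, drop trailing periods, collapse whitespace."""
    return " ".join(text.lower().strip().rstrip(".").split())


def score_answer(predicted, ground_truth, threshold=FUZZY_THRESHOLD):
    """Return (exact, fuzzy) booleans for one model answer."""
    p, g = normalize(predicted), normalize(ground_truth)
    exact = p == g
    # An exact match always counts as a fuzzy match too.
    fuzzy = exact or SequenceMatcher(None, p, g).ratio() >= threshold
    return exact, fuzzy


def aggregate(results):
    """Aggregate (question_type, exact, fuzzy) tuples into per-type accuracy."""
    buckets = defaultdict(lambda: {"n": 0, "exact": 0, "fuzzy": 0})
    for qtype, exact, fuzzy in results:
        bucket = buckets[qtype]
        bucket["n"] += 1
        bucket["exact"] += int(exact)
        bucket["fuzzy"] += int(fuzzy)
    return {
        qtype: {
            "n": b["n"],
            "exact_acc": b["exact"] / b["n"],
            "fuzzy_acc": b["fuzzy"] / b["n"],
        }
        for qtype, b in buckets.items()
    }
```

The dictionary returned by `aggregate` maps directly onto rows of a rich table, one row per question type.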
Risks
- Long videos exceed the context window, so frame sampling needs a sliding-window strategy that doesn't drop critical state-change frames
- Ground-truth annotation creation is the real time sink; even a 20-video test set takes hours to label correctly
- LLM answers are free-form text, so robust answer normalization is harder than it looks, especially for spatial and count questions
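Two of the risks above lend themselves to a short sketch. The first function is one possible frame-budget strategy, not a solution to the problem: it keeps every frame inside a recent window before the question time and uniformly subsamples earlier frames, so a state change that falls between sampled early frames can still be missed. The second pulls an integer out of a free-form count answer. The function names, the `window_s` parameter, the choice to ignore frames after the question time, and the small number-word table are all assumptions made for illustration.

```python
import re

# Assumed vocabulary; extend it for larger counts.
NUMBER_WORDS = {
    "zero": 0, "one": 1, "two": 2, "three": 3, "four": 4, "five": 5,
    "six": 6, "seven": 7, "eight": 8, "nine": 9, "ten": 10,
}


def select_frames(timestamps, question_t, max_frames, window_s):
    """Pick at most max_frames timestamps at or before question_t.

    Frames within window_s seconds of the question are kept first;
    any leftover budget is spread evenly over the earlier frames.
    """
    candidates = sorted(ts for ts in timestamps if ts <= question_t)
    recent = [ts for ts in candidates if ts >= question_t - window_s]
    earlier = [ts for ts in candidates if ts < question_t - window_s]
    if len(recent) >= max_frames:
        return recent[-max_frames:]
    remaining = max_frames - len(recent)
    if len(earlier) <= remaining:
        return earlier + recent
    step = len(earlier) / remaining
    picked = [earlier[int(i * step)] for i in range(remaining)]
    return picked + recent


def normalize_count(answer):
    """Return the first count found in a free-form answer, or None."""
    digits = re.search(r"\d+", answer)
    if digits:
        return int(digits.group())
    for token in re.findall(r"[a-z]+", answer.lower()):
        if token in NUMBER_WORDS:
            return NUMBER_WORDS[token]
    return None
```

Returning `None` for answers with no recognizable count lets the scorer report unparseable answers separately from wrong ones.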
Business angle
CLI benchmark tool charging researchers per eval run to score LLM temporal video understanding against ground-truth annotations
Customer: ML engineer at a startup or university lab building video-understanding pipelines who has budget but no time to build eval infra, and needs reproducible numbers for a paper or demo
Pricing: one-time license at $49/seat, projected to bring in $800 in the first 3 months