AI Pulse

Video State Tracker CLI

CLI tool that feeds a video + structured question set to a multimodal LLM and scores its temporal state-tracking accuracy against ground-truth annotations.

Difficulty: weekend | Stack: Python, Claude API (claude-opus-4-5, vision), OpenCV, JSONL, rich

Who this is for

Researchers and engineers who want to benchmark their own video + LLM pipelines against VSTAT-style tasks without waiting for official eval infrastructure

Build steps

  1. Define a JSONL schema: {video_path, frame_timestamps, questions: [{t, question, ground_truth}]}
  2. Use OpenCV to extract frames at specified timestamps, encode as base64
  3. Build a prompt template that presents frames in sequence with timestamps and asks the state-tracking question
  4. Send to multimodal LLM API, parse answer, compare to ground truth with exact-match + fuzzy scoring
  5. Output per-question results table via rich, aggregate accuracy by question type (object state / count / spatial)
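
A minimal sketch of steps 1 through 5, minus the API call and the rich table. All function names are illustrative, and the 0.8 fuzzy threshold is an arbitrary default. The per-question type label used for aggregation is an assumption, since the schema in step 1 does not list one. The image blocks are shaped like the base64 image content the Anthropic Messages API accepts; check the current docs before relying on that.

```python
import base64
import json
from collections import defaultdict
from difflib import SequenceMatcher


def load_records(jsonl_path):
    """Step 1: read one JSON object per line, skipping blank lines."""
    with open(jsonl_path, encoding="utf-8") as fh:
        return [json.loads(line) for line in fh if line.strip()]


def extract_frames_b64(video_path, timestamps):
    """Step 2: grab one JPEG frame per timestamp (seconds), base64-encoded."""
    import cv2  # imported lazily so the scoring helpers work without OpenCV

    cap = cv2.VideoCapture(video_path)
    frames = []
    try:
        for t in timestamps:
            cap.set(cv2.CAP_PROP_POS_MSEC, t * 1000.0)  # seek by time
            ok, frame = cap.read()
            if not ok:
                raise ValueError(f"no frame at t={t}s in {video_path}")
            ok, buf = cv2.imencode(".jpg", frame)
            if not ok:
                raise ValueError(f"JPEG encoding failed at t={t}s")
            frames.append(base64.b64encode(buf.tobytes()).decode("ascii"))
    finally:
        cap.release()
    return frames


def build_content(frames_b64, timestamps, question):
    """Step 3: interleave timestamp labels and images, then the question."""
    content = []
    for t, data in zip(timestamps, frames_b64):
        content.append({"type": "text", "text": f"Frame at t={t:.1f}s:"})
        content.append({
            "type": "image",
            "source": {"type": "base64", "media_type": "image/jpeg", "data": data},
        })
    content.append({"type": "text", "text": f"{question}\nAnswer in a few words."})
    return content


def normalize(text):
    """Lowercase, replace punctuation with spaces, collapse whitespace."""
    kept = "".join(c if c.isalnum() or c.isspace() else " " for c in text.lower())
    return " ".join(kept.split())


def score(answer, ground_truth, fuzzy_threshold=0.8):
    """Step 4: return (exact, fuzzy) booleans for one answer."""
    a, g = normalize(answer), normalize(ground_truth)
    ratio = SequenceMatcher(None, a, g).ratio()
    return a == g, ratio >= fuzzy_threshold


def aggregate(results):
    """Step 5: results is an iterable of (question_type, exact, fuzzy)."""
    buckets = defaultdict(lambda: [0, 0, 0])  # n, exact hits, fuzzy hits
    for qtype, exact, fuzzy in results:
        b = buckets[qtype]
        b[0] += 1
        b[1] += int(exact)
        b[2] += int(fuzzy)
    return {q: {"n": n, "exact": e / n, "fuzzy": f / n}
            for q, (n, e, f) in buckets.items()}
```

The list returned by build_content would go in the content field of a single user message. Character-level fuzzy matching is a crude baseline: it treats "3" and "three" as unrelated, which is why count questions usually need their own normalization.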

Risks

  • Long videos exceed the context window, so the tool needs a sliding-window frame sampling strategy that keeps critical state-change frames
  • Ground-truth annotation creation is the real time sink; even a 20-video test set takes hours to label correctly
  • LLM answers are free-form text, so robust answer normalization is harder than it looks for spatial/count questions
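
Two of the risks above lend themselves to a small sketch: overlapping frame windows for long videos, and pulling a number out of a free-form count answer. Both helpers are hypothetical and deliberately naive. The overlap only carries context between consecutive windows, so it does not by itself guarantee that a state change between sampled frames is seen. The count extractor handles digits and the words zero through ten only.

```python
import re

_NUMBER_WORDS = {
    "zero": 0, "one": 1, "two": 2, "three": 3, "four": 4, "five": 5,
    "six": 6, "seven": 7, "eight": 8, "nine": 9, "ten": 10,
}


def sliding_windows(timestamps, max_frames, overlap=1):
    """Split sorted timestamps into windows of at most max_frames.

    The last `overlap` frames of each window are repeated at the start
    of the next one, so adjacent requests share some visual context.
    """
    if max_frames <= 0:
        raise ValueError("max_frames must be positive")
    if not 0 <= overlap < max_frames:
        raise ValueError("overlap must be in [0, max_frames)")
    step = max_frames - overlap
    windows = []
    start = 0
    while start < len(timestamps):
        windows.append(timestamps[start:start + max_frames])
        if start + max_frames >= len(timestamps):
            break  # this window already reached the final frame
        start += step
    return windows


def extract_count(answer):
    """Return the first number mentioned in the answer, or None."""
    for token in re.findall(r"[a-z]+|\d+", answer.lower()):
        if token.isdigit():
            return int(token)
        if token in _NUMBER_WORDS:
            return _NUMBER_WORDS[token]
    return None
```

With seven timestamps, max_frames=3 and overlap=1, sliding_windows yields three windows that each share one frame with their neighbor. extract_count takes the first number it finds, so an answer like "the one on the left has 4" would be misread as 1; a stricter prompt ("answer with a single number") reduces that risk.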

Business Angle

CLI benchmark tool charging researchers per eval run to score LLM temporal video understanding against ground-truth annotations

Customer: ML engineer at a startup or university lab building video-understanding pipelines — has budget, no time to build eval infra, needs reproducible numbers for paper/demo

Pricing: one-time license, $49/seat; $800 in the first 3 months from license sales
