CLI benchmark tool charging researchers per eval run to score LLM temporal video understanding against ground-truth annotations
Customer: ML engineer at a startup or university lab building video-understanding pipelines — has budget, no time to build eval infra, needs reproducible numbers for paper/demo
Problem: No lightweight, self-hostable way to run VSTAT-style temporal benchmarks locally; official eval infra is slow, gated, or nonexistent for custom datasets
Pricing: one-time license at $49/seat; target is $800 in the first 3 months, i.e. about 17 seats sold
Why now
The embodied-AI and video-LLM wave is arriving at the same time as growing demand for benchmarks: teams shipping video agents need eval scores fast, and Claude (claude-opus-4-5) vision plus OpenCV makes this buildable in days
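To give a sense of how small the core scoring step could be, here is a minimal sketch. It assumes predictions and ground truth are JSONL files where each line carries a time span, and it assumes temporal IoU with a 0.5 threshold as the metric. The field names (id, start, end), the function names, and the choice of metric are all hypothetical; the actual tool may score differently.

```python
import json


def temporal_iou(pred, truth):
    """Intersection-over-union of two (start, end) intervals in seconds."""
    inter = max(0.0, min(pred[1], truth[1]) - max(pred[0], truth[0]))
    union = (pred[1] - pred[0]) + (truth[1] - truth[0]) - inter
    return inter / union if union > 0 else 0.0


def load_spans(jsonl_text):
    """Parse JSONL text; each line is assumed to have id, start, end (hypothetical schema)."""
    spans = {}
    for line in jsonl_text.splitlines():
        if line.strip():
            row = json.loads(line)
            spans[row["id"]] = (float(row["start"]), float(row["end"]))
    return spans


def accuracy_at_iou(preds, truths, threshold=0.5):
    """Fraction of ground-truth items whose prediction meets the IoU threshold."""
    if not truths:
        return 0.0
    hits = sum(
        1
        for key, truth in truths.items()
        if key in preds and temporal_iou(preds[key], truth) >= threshold
    )
    return hits / len(truths)


# Tiny illustrative dataset (made up for this sketch, not real benchmark data).
truth_jsonl = (
    '{"id": "clip1", "start": 2.0, "end": 6.0}\n'
    '{"id": "clip2", "start": 10.0, "end": 12.0}'
)
pred_jsonl = (
    '{"id": "clip1", "start": 3.0, "end": 6.0}\n'
    '{"id": "clip2", "start": 14.0, "end": 15.0}'
)

score = accuracy_at_iou(load_spans(pred_jsonl), load_spans(truth_jsonl))
print(score)  # clip1 overlaps at IoU 0.75 (hit), clip2 does not overlap (miss): 0.5
```

Missing predictions count as misses here, so a model cannot improve its score by skipping hard clips.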
Go-to-market
- Post CLI + demo JSONL dataset on Hacker News ‘Show HN’ targeting ML practitioners — link to GitHub with one-command install
- Post in r/MachineLearning and relevant communities (EleutherAI Discord, Alignment Forum) with a short benchmark result screenshot showing temporal accuracy scores
- DM 10-15 authors of recent video-LLM papers on Twitter/X offering free license in exchange for a quote or benchmark comparison
- Write one technical blog post: ‘How we built a 200-line CLI to benchmark video temporal reasoning’ — submit to Towards Data Science or The Gradient
Moat (or lack thereof)
No moat. Anyone with Python and an API key can replicate it. Defensibility comes only from being the first result when people search for the problem, accumulating community benchmark datasets, and iterating quickly on feature requests. Treat it as a lead-gen tool for consulting, not a durable SaaS.