Sandboxed eval harness that runs LLM agents against fake-dollar financial tasks and flags deceptive/collusive behaviors before production deploy

Customer: ML engineer at a fintech or trading firm (5-200 person company) who owns agent deployment pipelines and gets blamed when an LLM does something weird in prod — not an academic, someone with a Slack channel full of incident alerts

Problem: No cheap way to stress-test financial agents for collusion, sandbagging, or deceptive alignment before they touch real money — synthetic benchmarks miss emergent behaviors, real-money pilots are too risky, and rolling your own harness takes 2-3 eng-weeks

Pricing: saas-mrr — $800 MRR in 4 months (8 seats × $100/mo or 2 teams × $400/mo)

Why now

Dollar-denominated multi-agent evals just entered the research mainstream (2025-2026 cluster), meaning eng teams are reading papers and asking ‘how do I actually run this?’ — there’s a 6-12 month window before big players ship native eval tooling

Go-to-market

Post a open-source CLI (pip install stakes-eval) that runs 3 canned deception scenarios against any LiteLLM-compatible model — no signup, instant value, GitHub as top-of-funnel
Write one brutally specific post on LessWrong/EA Forum + one on Hacker News: ‘We caught GPT-4o sandbagging in a fake trading task — here’s the trace’ with real logs from your harness
DM 10 ML engineers who’ve posted about agent evals or LLM prod incidents on Twitter/LinkedIn — offer free 30-day access in exchange for a 20-min call and permission to quote them
Add a GitHub Action / CI integration so teams can gate PRs on ‘no new deceptive behaviors detected’ — turns one-time eval into sticky recurring usage

Moat (or lack thereof)

No real moat. Scenario library is the only defensible asset and competitors can clone it fast. Defensibility comes from iteration speed and community scenario contributions — not tech. If OpenAI or Anthropic ships built-in evals, this niche shrinks. Bet on being first and cheap, not on being uncopiable.