A self-hosted benchmark harness that pits LLMs against hidden-rule text puzzles and gives AI researchers a reproducible leaderboard they can run locally for pennies.
Customer: Independent ML researcher or senior AI engineer at a 5-50 person AI startup who runs model evaluations weekly, is frustrated that MMLU/HumanEval are saturated and gamed, and wants a cheap internal benchmark they control — not a hosted leaderboard they can’t customize.
Problem: Existing reasoning benchmarks are saturated (models overfit to them), expensive to run via APIs, or locked inside papers with no reusable harness. Researchers waste days wiring up eval scaffolding from scratch every time they want to test a new model or fine-tune.
Pricing: open-core. Target roughly $800 MRR within 4 months (8 teams × $99/mo = $792) from hosted result storage, multi-user dashboards, and private puzzle packs.
Why now
The text-game benchmarking paper, together with the broader wave of research on chain-of-thought compression and logical consistency, signals that the field is actively hunting for better reasoning evals right now. Researchers are reading these papers this month and asking, ‘How do I test my model on this?’
Go-to-market
- Post the open-source harness on HuggingFace, GitHub, and r/MachineLearning with a 60-second GIF showing a 3-model head-to-head run in the terminal. Time the post for the week a relevant arXiv paper drops to ride its traffic.
- Cold-DM 30 ML Twitter/X accounts who retweeted the text-game benchmarking paper or similar reasoning-eval threads; offer them a free ‘custom puzzle pack’ if they run it against their model and share results publicly.
- Write one detailed blog post on Substack/Towards Data Science: ‘We ran GPT-4o, Claude 3.5, and Llama 3 through 200 hidden-rule puzzles — here’s what we found.’ Link to the repo. This is the top-of-funnel asset.
- Add a one-command ‘share results’ flow that posts an anonymized JSON summary to a public leaderboard URL — every user who opts in becomes a distribution channel when they share their model’s rank on social.
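As a rough illustration of the ‘share results’ idea in the last bullet, here is a minimal Python sketch of building the anonymized JSON summary. Everything in it is an assumption made for illustration: the function name `build_share_summary`, the shape of the `run` dictionary, the field names, and the salted-hash approach to anonymization. The HTTP post to the leaderboard URL is deliberately left out, since no endpoint is specified.

```python
import hashlib
import json


def build_share_summary(run: dict, salt: str) -> str:
    """Return an anonymized JSON summary of one benchmark run.

    Hypothetical input shape (an assumption, not a defined format):
    {"user": str, "model": str, "results": [{"puzzle_id": str, "solved": bool}, ...]}
    """
    # Replace the user identity with a truncated salted hash so repeat
    # submissions can be grouped without exposing who submitted them.
    digest = hashlib.sha256((salt + run["user"]).encode("utf-8")).hexdigest()
    submitter = digest[:12]

    total = len(run["results"])
    solved = sum(1 for r in run["results"] if r["solved"])

    # Only aggregate numbers leave the machine; per-puzzle transcripts
    # and the raw user name are never included.
    summary = {
        "submitter": submitter,
        "model": run["model"],
        "puzzles_total": total,
        "puzzles_solved": solved,
        "accuracy": round(solved / total, 4) if total else 0.0,
    }
    return json.dumps(summary, sort_keys=True)


example_run = {
    "user": "alice@example.com",
    "model": "my-finetune-v2",
    "results": [
        {"puzzle_id": "p1", "solved": True},
        {"puzzle_id": "p2", "solved": False},
        {"puzzle_id": "p3", "solved": True},
    ],
}
print(build_share_summary(example_run, salt="local-secret"))
```

Keeping the summary to aggregates makes opting in low-risk for the user, which matters if every shared rank is meant to work as a distribution channel.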
Moat (or lack thereof)
No real moat. The puzzle generation logic and scoring rubric can be replicated in a weekend by any competent ML engineer. The only durable advantages are (1) network effect from a public leaderboard with historical runs across many models, and (2) being the first thing that shows up when someone googles ‘rule induction benchmark Python’ — which is an SEO/timing advantage, not a structural one. Compete on execution speed and community trust, not defensibility.