Self-hosted benchmark runner that proves your coding agent works across Python, TypeScript, and Go before you ship it to customers.
Customer: Solo AI developer or two-person founding team who has built a custom coding agent (wrapper around Claude/GPT-4o) and is about to pitch it to their first 10 enterprise or dev-tool customers — they need a credible eval story but can’t afford $5k/month for hosted eval platforms.
Problem: They’re demoing their agent on cherry-picked examples and secretly have no idea if it generalises. One bad live demo or a skeptical technical buyer asking ‘what’s your SWE-bench score?’ exposes them. Running SWE-bench themselves is a weekend of Docker pain they haven’t had time for.
Pricing: one-time license at $99. Target: roughly $1,200 in the first 60 days (12 licenses × $99 = $1,188), then reassess whether a $19/mo ‘new issues feed’ add-on has legs
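As a quick sanity check on the pricing line above, the arithmetic can be sketched in a few lines of Python. The function names are illustrative, and the sketch assumes gross revenue only; Gumroad fees, refunds, and the possible $19/mo add-on are not modeled.

```python
# Back-of-envelope check of the 60-day target.
# Figures come from the pricing line; everything else is an assumption.
LICENSE_PRICE_USD = 99
LICENSES_SOLD = 12


def one_time_revenue(licenses: int, price: int = LICENSE_PRICE_USD) -> int:
    """Gross revenue from one-time licenses, before any platform fees."""
    return licenses * price


def licenses_needed(target_usd: int, price: int = LICENSE_PRICE_USD) -> int:
    """Smallest license count whose gross revenue reaches the target."""
    return -(-target_usd // price)  # ceiling division on integers


print(one_time_revenue(LICENSES_SOLD))  # 1188
print(licenses_needed(1200))            # 13
```

Twelve licenses gross $1,188, so the $1,200 figure is a rounded target; strictly reaching it takes a thirteenth sale. By the same arithmetic, the $5k to $15k range mentioned under the moat section corresponds to roughly 51 to 152 licenses at $99.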
Why now
SWE-rebench V2 and the wave of new agent benchmarks (WorldMemArena, SciAgentGym) have made ‘what benchmark did you run?’ a standard due-diligence question in 2025-2026. Indie devs building coding agents now face that question before they have infra to answer it. Hosted eval services (Scale, Braintrust) are priced for Series A companies, not solos.
Go-to-market
- Post a detailed ‘I benchmarked my agent against 50 real GitHub issues across 3 languages: here’s what I learned’ write-up as a Show HN on Hacker News and on /r/LocalLLaMA, with your own agent as the subject; the tool is the artifact you link at the bottom
- DM 20 people who posted ‘I built a coding agent’ on X/Twitter or HN in the last 30 days, offering a free license in exchange for a 15-min call and a testimonial quote
- List on Gumroad at $99 with a free tier (5 issues, Python only) to lower friction; use the free-tier signups as a warm email list for the paid upsell
- File a GitHub issue or PR on one popular open-source coding-agent repo (e.g., Aider or a SWE-agent fork) offering to add your runner as an optional eval harness; if accepted, that is free distribution to exactly the right audience
Moat (or lack thereof)
No real moat. This is a dev-tool script that a determined engineer can replicate in a weekend. The only durable advantages are: (1) you ship it before they bother, (2) you accumulate a curated issue set that’s actually solvable and language-balanced (curation takes time), and (3) word-of-mouth from the first 20 buyers who trusted you. Don’t expect a defensible business — expect a $5–15k cash windfall and a portfolio piece that opens consulting conversations.