Hosted benchmark SaaS that measures how many in-context examples an LLM agent needs to reliably invoke a new tool — so teams skip the guesswork before picking fine-tune vs. few-shot architecture

Customer: ML engineer at a 2-10 person AI startup who is building a product-facing agent and needs to decide whether to fine-tune a model on proprietary tool schemas or rely on few-shot prompting — they have a GitHub account, read agent papers on weekends, and are blocked by lack of empirical data

Problem: There is no easy way to measure the few-shot tool-learning curve for a specific tool schema and model combination. Engineers guess, ship the wrong architecture, then spend weeks debugging hallucinated tool calls in production
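
To make "tool-learning curve" concrete, here is a minimal sketch of how one could be computed from recorded trials. Everything in it is an assumption for illustration: the function names (is_valid_call, learning_curve), the schema shape (a JSON-Schema-style "parameters" object with "properties" and "required"), and the validity rule, which only checks the tool name and argument names. A real harness would likely also check argument types and values.

```python
from collections import defaultdict


def is_valid_call(call: dict, schema: dict) -> bool:
    """Hypothetical validity rule: right tool name, all required
    arguments present, and no arguments the schema does not declare."""
    if call.get("name") != schema["name"]:
        return False
    args = call.get("arguments", {})
    params = schema.get("parameters", {})
    declared = params.get("properties", {})
    required = params.get("required", [])
    has_required = all(name in args for name in required)
    no_unknown = all(name in declared for name in args)
    return has_required and no_unknown


def learning_curve(trials, schema: dict) -> dict:
    """Map k (number of in-context examples) to the fraction of trials
    at that k that produced a valid tool call.

    `trials` is an iterable of (k, call) pairs, where `call` is the
    tool call the model emitted, already parsed into a dict.
    """
    hits = defaultdict(int)
    totals = defaultdict(int)
    for k, call in trials:
        totals[k] += 1
        hits[k] += is_valid_call(call, schema)  # True counts as 1
    return {k: hits[k] / totals[k] for k in sorted(totals)}
```

The result is a plain dict such as {0: 0.5, 2: 1.0}, which can be plotted or compared across models. Collecting the trials (calling the model with k examples and parsing its response) is left out on purpose, since that part depends on the provider API.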

Pricing: saas-mrr, targeting roughly $800 MRR within 4 months (8 teams × $99/mo = $792/mo)

Why now

The multi-agent architecture debate (few-shot vs fine-tune, disentangled loops, ontology grounding) is peaking in 2025-2026. Teams are making irreversible infra bets without empirical backing, and the Claude and OpenAI APIs now expose tool use natively, which makes benchmarking tractable

Go-to-market

Post open-source CLI version (the Python/pytest harness) on HN ‘Show HN’ with real benchmark results on 3 popular tool schemas (Stripe, GitHub, custom CRUD) — capture emails from people who star/fork
DM 20 founders in Latent Space Discord / AI Engineer Slack who have posted about tool-calling pain; offer free benchmark run of their tool schema in exchange for a 15-min feedback call
Write one concrete blog post: ‘We ran 500 tool-teaching experiments so you don’t have to’ with Plotly charts showing sample-efficiency curves by model — this is the SEO/content wedge
Add a hosted UI on top of the OSS CLI (upload tool schema JSON, pick models, get curve + recommendation), and charge $99/mo to teams that need more than 10 schemas per month or private result storage
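
The "curve + recommendation" output of the hosted UI could be as simple as a threshold rule over the measured curve. The sketch below is one possible rule, not a settled design: the function names (min_examples, recommend), the 0.95 reliability target, and the 8-example prompt budget are all assumed values picked for illustration.

```python
from typing import Dict, Optional


def min_examples(curve: Dict[int, float], target: float = 0.95) -> Optional[int]:
    """Smallest measured k such that the success rate at k and at every
    larger measured k is at least `target`. Returns None if the largest
    measured k still falls short."""
    best = None
    for k in sorted(curve, reverse=True):
        if curve[k] < target:
            break  # reliability is not sustained below this point
        best = k
    return best


def recommend(curve: Dict[int, float],
              target: float = 0.95,
              max_examples: int = 8) -> str:
    """Hypothetical rule: few-shot if the target is reached within the
    prompt budget, otherwise fine-tune."""
    k = min_examples(curve, target)
    if k is not None and k <= max_examples:
        return "few-shot"
    return "fine-tune"
```

For a curve like {0: 0.2, 2: 0.7, 4: 0.96, 8: 0.99}, this rule finds a minimum of 4 examples and recommends few-shot. Requiring the rate to hold at every larger k guards against a single lucky measurement at a small k.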

Moat (or lack thereof)

No meaningful moat. Any team can replicate the benchmark harness in a weekend. Defensibility comes only from dataset accumulation (aggregated benchmark results across hundreds of tool schemas become a reference corpus) and from brand as ‘the place that published the data first’. This is a classic indie-hacker first-mover-in-a-niche play, not a durable technical moat