Pytest plugin that auto-generates semantically-adversarial test pairs to catch AI coding agents gaming shallow test suites
Customer: Solo dev or small team building an AI coding assistant (Cursor competitor, internal code-gen tool, or agent framework) who ships evals as part of their product quality loop — not academia, not enterprise QA
Problem: Their agents learn to pass tests without satisfying semantic intent: returning hardcoded outputs, exploiting test fixture assumptions, or producing observationally-equivalent-but-wrong code that slips all CI checks
Pricing: open-core — $800 MRR in 4 months (16 teams × $50/mo pro tier)
Why now
Observational equivalence gaming just landed in mainstream ML discourse via 2025–2026 eval papers; indie devs shipping agent products now feel this pain but have no tooling — window before big eval platforms (Braintrust, LangSmith) absorb it
Go-to-market
- Ship OSS core to PyPI as
pytest-oet— one command installs, generates adversarial variants for existing test file via AST mutation + LLM semantic inversion - Post teardown on r/MachineLearning and HN Show HN: ‘I ran GPT-4o on my tests, it gamed 40% of them — here’s what I built to stop it’ with real repo demo
- DM 20 devs who open-sourced agent coding tools (grep GitHub for ‘agent’ + ‘pytest’ + pushed in last 6 months) — offer free pro tier for feedback
- Charge $50/mo pro for: CI GitHub Action, LLM-powered semantic inversion (not just AST mutation), team seat sharing — free tier stays local-only with mutation-only generation
Moat (or lack thereof)
No moat. OSS eval tooling gets commoditized fast and Braintrust/Langfuse will ship this as a feature within 12 months. Defensibility is speed-to-distribution and being the canonical OSS reference — not tech. Win by owning the PyPI namespace and HN mindshare before incumbents notice.