SaaS benchmark tool that exposes whether your interpretability probe measures reasoning or just format artifacts
Customer: Academic ML researcher (PhD student or postdoc) running probe-based interpretability experiments on transformer models, preparing a paper for NeurIPS/ICLR, worried a reviewer will ask ‘did you control for format?’
Problem: Probe papers get desk-rejected or torn apart in review when they can’t show that the probe captures reasoning rather than surface format cues (e.g., multiple-choice vs. open-ended prompts). Running the format-swap ablation manually takes 2-3 days of GPU time plus glue code, so most researchers skip it.
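A minimal sketch of what a format-swap ablation could look like, assuming activations have already been extracted for the same task in two prompt formats: fit a linear probe on one format, then compare held-out accuracy in that format against accuracy on the other. The function names, the ridge-regression probe, and the synthetic activations are all illustrative assumptions, not the product's actual implementation.

```python
import numpy as np

def fit_linear_probe(X, y, l2=1e-2):
    """Fit a ridge-regression probe on +/-1 targets; returns weights with a bias term."""
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])
    targets = 2.0 * y - 1.0
    gram = Xb.T @ Xb + l2 * np.eye(Xb.shape[1])
    return np.linalg.solve(gram, Xb.T @ targets)

def probe_accuracy(w, X, y):
    """Fraction of examples where the sign of the probe output matches the label."""
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])
    preds = (Xb @ w > 0).astype(int)
    return float(np.mean(preds == y))

def format_swap_report(X_a, y_a, X_b, y_b, train_frac=0.7, seed=0):
    """Train on format A; report held-out accuracy on A and accuracy on format B."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(y_a))
    cut = int(train_frac * len(y_a))
    train, test = order[:cut], order[cut:]
    w = fit_linear_probe(X_a[train], y_a[train])
    same = probe_accuracy(w, X_a[test], y_a[test])
    swapped = probe_accuracy(w, X_b, y_b)
    return {"same_format_acc": same, "swapped_format_acc": swapped, "gap": same - swapped}

def synthetic_activations(n, dim, label_axis, rng, noise=0.5):
    """Hypothetical stand-in for real activations: the label lives on one axis."""
    y = rng.integers(0, 2, size=n)
    X = noise * rng.standard_normal((n, dim))
    X[:, label_axis] += 2.0 * (2.0 * y - 1.0)
    return X, y

def demo(seed=0):
    rng = np.random.default_rng(seed)
    # "MCQ" format: label encoded on axis 0.
    X_mcq, y_mcq = synthetic_activations(400, 16, label_axis=0, rng=rng)
    # "Open-ended" format, case 1: label encoded on the same axis (format-robust).
    X_shared, y_shared = synthetic_activations(400, 16, label_axis=0, rng=rng)
    # "Open-ended" format, case 2: label moves to another axis (format-dependent).
    X_moved, y_moved = synthetic_activations(400, 16, label_axis=1, rng=rng)
    robust = format_swap_report(X_mcq, y_mcq, X_shared, y_shared)
    confounded = format_swap_report(X_mcq, y_mcq, X_moved, y_moved)
    return robust, confounded

if __name__ == "__main__":
    for name, report in zip(("robust", "confounded"), demo()):
        print(name, report)
```

A large gap between the two accuracies is the signal a reviewer would ask about; on real models the expensive part is extracting activations for both formats, which this sketch skips.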
Pricing: $49 one-time license. Target: about $800 in the first 3 months (16 licenses × $49 = $784).
Why now
Recent papers (mid-2025) explicitly demonstrated that linear probes track task format, not reasoning type, so this is now a known line of attack that reviewers will use. Researchers submitting in the NeurIPS 2026 cycle need this ablation within the next 90 days.
Go-to-market
- Post a free open-source version on the Hugging Face Hub alongside an arXiv preprint describing the benchmark; link both on r/MachineLearning and the Alignment Forum. The free version runs only on Qwen3-1.7B to create upgrade pressure.
- Cold-email 30 first authors of probe/steering papers published in 2024-2025 (all publicly listed on Semantic Scholar). Offer a free license in exchange for a quote if the benchmark finds a format confound in their pipeline.
- Sell a $149 ‘reviewer-ready’ tier that auto-generates a Methods-section paragraph and a figure showing the format-controlled probe comparison, ready to paste directly into LaTeX.
- Post findings as a thread on LessWrong/AF whenever a notable probe paper drops, and offer the benchmark as a free replication kit. This drives inbound interest from researchers who want to run it on their own model.
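As a rough illustration of the ‘reviewer-ready’ tier, the Methods paragraph could be rendered from the two accuracies with a simple template. The wording, the `methods_paragraph` name, and the 0.10 gap threshold below are hypothetical choices for this sketch, not a specification of the product.

```python
# Hypothetical template; "\\%" renders as a literal \% in the LaTeX output.
TEMPLATE = (
    "We control for prompt format by training the probe on one format and "
    "evaluating it on the other. On {model}, held-out accuracy is "
    "{same:.1f}\\% in the training format and {swapped:.1f}\\% after the "
    "format swap (gap: {gap:.1f} points), {verdict}."
)

def latex_escape(text):
    """Escape a few characters that commonly break LaTeX in model names."""
    for ch in ("_", "%", "&", "#"):
        text = text.replace(ch, "\\" + ch)
    return text

def methods_paragraph(model_name, same_acc, swapped_acc, threshold=0.10):
    """Render a paste-ready paragraph; accuracies are fractions in [0, 1]."""
    gap = same_acc - swapped_acc
    # The threshold is an assumed cutoff for wording only, not a statistical test.
    if gap > threshold:
        verdict = "suggesting the probe relies partly on format cues"
    else:
        verdict = "suggesting the probe is largely robust to the format change"
    return TEMPLATE.format(
        model=latex_escape(model_name),
        same=100 * same_acc,
        swapped=100 * swapped_acc,
        gap=100 * gap,
        verdict=verdict,
    )

if __name__ == "__main__":
    print(methods_paragraph("Qwen3-1.7B", 0.95, 0.60))
```

A real version would presumably also report variance across seeds and attach the figure; the sketch covers only the text templating.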
Moat (or lack thereof)
No moat. This is a benchmark script, not a platform. Any lab can replicate it in a week once they see the idea. The only edge is being first and getting cited: citation gravity, not technology, is the real lock-in. Open-source the core and charge for polish and time savings.