A CLI audit tool that catches LLM ‘lying’ by comparing stated outputs against internal residual-stream representations — sold as a one-time license to ML engineers running pre-deployment model evals.

Customer: Solo ML engineer or small-team AI startup (2–10 people) running final-stage safety/trustworthiness evals before deploying a fine-tuned open-weight model (Llama 3, Mistral, Phi-3 variants) into a product — they have a GPU, know TransformerLens exists, and are under pressure to ship but need a defensible audit trail

Problem: They suspect their fine-tuned model has learned sycophancy or knowledge-editing artifacts — it ‘knows’ the right answer internally but outputs something else under certain prompts — but probing residual streams manually is a multi-day research task, not a 15-minute pre-deploy check

Pricing: one-time — $800 in one-time sales within 3 months (roughly 8–16 licenses at $50–$100 each)

Why now

The mechanistic interpretability wave (Anthropic, Neel Nanda, EleutherAI) has made residual-stream probing a known concept in ML circles but left it as a research artifact — no packaged tool exists. The moment fine-tuners and eval engineers google ‘how do I check if my model is lying internally,’ there’s currently nothing to buy.

Go-to-market

Post a detailed write-up on LessWrong / the Alignment Forum titled ‘I built a CLI to catch when your LLM knows the right answer but says the wrong one’ — link to a free 3-model demo (GPT-2, Llama-3-8B, Mistral-7B) with a paywall for larger models and batch eval mode
Open a thread on EleutherAI Discord and Neel Nanda’s mech-interp Discord showing a live demo catching a known sycophancy case; let the community reproduce it — this is your word-of-mouth engine
Create a $50 Gumroad / Lemon Squeezy listing with a one-page PDF of methodology + the pip-installable package; offer a $100 ‘extended’ tier that includes a 30-min async Q&A session via Loom to help them interpret results for their specific model
DM 5–10 indie fine-tuners actively posting on X/Twitter about deploying open-weight models and offer a free license in exchange for a public testimonial — target accounts with 500–5k followers who are clearly practitioners, not academics

Moat (or lack thereof)

No real moat. A motivated researcher can replicate this in a weekend with TransformerLens. The edge is packaging, documentation, and being first to be findable when someone searches for this specific problem — that’s a temporary distribution advantage, not a defensible barrier. Hugging Face or a well-funded lab could absorb this into an eval library tomorrow. Build it, sell it while the window exists, treat it as a consulting/reputation funnel rather than a durable SaaS.