Hidden-State Lie Detector
A CLI tool that probes an LLM’s internal residual stream to flag when its stated answer contradicts its internal representation.
Difficulty: weekend | Stack: Python, TransformerLens, HuggingFace Transformers, scikit-learn, rich (CLI)
Who this is for
ML engineers and researchers who want to audit model outputs and catch cases where the model ‘knows’ the right answer but says something else — useful for evaluating model trustworthiness before deployment.
Build steps
- Load a small open-weight model (e.g. GPT-2-XL or Mistral-7B) via TransformerLens to get hook access to residual stream activations.
- Build a dataset of 200–400 multiple-choice QA pairs with known ground-truth labels (e.g. an MMLU subset); run the model and collect the residual-stream vector at the last token position for every layer.
- Train a lightweight linear probe (logistic regression via scikit-learn) on mid-layer activations, one probe per layer, to predict the correct answer class independently of the model’s output token.
- Compare the probe prediction with the model’s actual output token; flag mismatches as ‘internal knowledge suppressed’ cases and log a confidence delta (for example, the probe’s probability for its own prediction minus its probability for the answer the model gave).
- Build a rich-powered CLI that accepts a question + answer choices and outputs: model answer, probe-predicted answer, agreement status, and top mismatching layers.
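The probe-and-compare steps above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not a definitive implementation: random arrays stand in for one layer's activations, the helper names `train_probe` and `compare` are hypothetical, and the confidence delta is one possible definition rather than a fixed part of the design.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for one layer's last-token residual-stream activations.
# In a real run these would come from the model, e.g. via TransformerLens
# `model.run_with_cache(tokens)` and `cache["resid_post", layer][:, -1, :]`.
rng = np.random.default_rng(0)
n_questions, d_model, n_choices = 300, 64, 4
labels = rng.integers(0, n_choices, size=n_questions)  # ground truth, A-D as 0-3
class_means = rng.normal(size=(n_choices, d_model))
acts = class_means[labels] + 0.5 * rng.normal(size=(n_questions, d_model))

# Simulated model answers: about 20% are replaced by a random choice.
model_answers = labels.copy()
flip = rng.random(n_questions) < 0.2
model_answers[flip] = rng.integers(0, n_choices, size=flip.sum())


def train_probe(train_acts, train_labels):
    """Fit a linear probe that predicts the correct answer from activations."""
    probe = LogisticRegression(max_iter=1000)
    probe.fit(train_acts, train_labels)
    return probe


def compare(probe, eval_acts, eval_model_answers):
    """Return probe predictions, a mismatch mask, and a confidence delta.

    Assumed delta: probe probability of its own prediction minus probe
    probability of the answer the model actually gave.
    """
    proba = probe.predict_proba(eval_acts)  # shape (n, n_classes)
    probe_pred = probe.classes_[proba.argmax(axis=1)]
    # probe.classes_ is sorted, so searchsorted maps answers to columns.
    answer_col = np.searchsorted(probe.classes_, eval_model_answers)
    rows = np.arange(len(eval_acts))
    delta = proba.max(axis=1) - proba[rows, answer_col]
    mismatch = probe_pred != eval_model_answers
    return probe_pred, mismatch, delta


# Train on the first 200 questions, evaluate on the held-out 100.
probe = train_probe(acts[:200], labels[:200])
probe_pred, mismatch, delta = compare(probe, acts[200:], model_answers[200:])
probe_accuracy = (probe_pred == labels[200:]).mean()

print(f"held-out probe accuracy: {probe_accuracy:.2f}")
print(f"flagged mismatches: {mismatch.sum()} of {len(mismatch)}")
```

Measuring probe accuracy on held-out questions matters here: a mismatch only suggests suppressed knowledge if the probe itself is reliable on that layer. Repeating the same loop per layer would give the per-layer view the CLI reports.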
Risks
- Linear probes may not generalize across question formats — probes trained on one QA style can fail silently on another, giving false confidence in the detector.
- A 7B model in fp16 needs about 14 GB for its weights alone, so running it locally calls for a GPU with 16GB+ VRAM; on CPU, inference will be too slow to be interactive.
- TransformerLens hook APIs change between versions and may not support all model architectures out of the box, requiring manual patching.
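The VRAM risk above comes down to simple arithmetic, sketched here as a back-of-the-envelope estimate. It assumes nominal parameter counts (7 billion, and 1.5 billion for GPT-2-XL), counts weights only, and uses 1 GB = 10^9 bytes; activations, cached hidden states, and framework overhead come on top. The helper name `weight_memory_gb` is hypothetical.

```python
# Bytes needed to store one parameter at each precision.
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "int8": 1}


def weight_memory_gb(n_params: int, dtype: str) -> float:
    """Weights-only memory estimate in GB (1 GB = 1e9 bytes)."""
    return n_params * BYTES_PER_PARAM[dtype] / 1e9


for name, n_params in [("7B model", 7_000_000_000), ("GPT-2-XL", 1_500_000_000)]:
    for dtype in BYTES_PER_PARAM:
        print(f"{name} @ {dtype}: {weight_memory_gb(n_params, dtype):.1f} GB")
```

By this estimate a 7B model fits a 16 GB card only at fp16 or lower precision, and with little headroom, while GPT-2-XL is a far more comfortable starting point for a weekend build.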
Business Angle
A CLI audit tool that catches LLM ‘lying’ by comparing stated outputs against internal residual-stream representations, sold as a one-time license to ML engineers running pre-deployment model evals.
Customer: Solo ML engineer or small-team AI startup (2–10 people) running final-stage safety/trustworthiness evals before deploying a fine-tuned open-weight model (Llama 3, Mistral, Phi-3 variants) into a product. They have a GPU, know TransformerLens exists, and are under pressure to ship but need a defensible audit trail.
Pricing: one-time license at $50–$100 each. Target: $800 in sales within 3 months (roughly 8–16 licenses).