AI Pulse
← Projects · weekend

Hidden-State Lie Detector

A CLI tool that probes an LLM’s internal residual stream to flag when its stated answer contradicts its internal representation.

Difficulty: weekend | Stack: Python, TransformerLens, HuggingFace Transformers, scikit-learn, rich (CLI)

Who this is for

ML engineers and researchers who want to audit model outputs and catch cases where the model ‘knows’ the right answer but says something else — useful for evaluating model trustworthiness before deployment.

Build steps

  1. Load a small open-weight model (e.g. GPT-2-XL or Mistral-7B) via TransformerLens to get hook access to residual stream activations.
  2. Build a dataset of 200-400 multiple-choice QA pairs where ground-truth labels are known (e.g. MMLU subset); run the model and collect final hidden-state vectors at the last token position per layer.
  3. Train a lightweight linear probe (logistic regression via scikit-learn) on mid-layer activations to predict the correct answer class, independent of the model’s output token.
  4. Compare probe prediction vs. model’s actual output token; flag mismatches as ‘internal knowledge suppressed’ cases and log confidence delta.
  5. Build a rich-powered CLI that accepts a question + answer choices and outputs: model answer, probe-predicted answer, agreement status, and top mismatching layers.

Risks

  • Linear probes may not generalize across question formats — probes trained on one QA style can fail silently on another, giving false confidence in the detector.
  • Running even a 7B model locally requires a GPU with 16GB+ VRAM; on CPU it will be too slow to be interactive.
  • TransformerLens hook APIs change between versions and may not support all model architectures out of the box, requiring manual patching.

Business Angle

A CLI audit tool that catches LLM 'lying' by comparing stated outputs against internal residual-stream representations — sold as a one-time license to ML engineers running pre-deployment model evals.

Customer: Solo ML engineer or small-team AI startup (2–10 people) running final-stage safety/trustworthiness evals before deploying a fine-tuned open-weight model (Llama 3, Mistral, Phi-3 variants) into a product — they have a GPU, know TransformerLens exists, and are under pressure to ship but need a defensible audit trail

Pricing: one-time — $800 in one-time sales within 3 months (roughly 8–16 licenses at $50–$100 each)

Full business breakdown →