Hidden-State Lie Detector

A CLI tool that probes an LLM’s internal residual stream to flag when its stated answer contradicts its internal representation.

Difficulty: weekend | Stack: Python, TransformerLens, HuggingFace Transformers, scikit-learn, rich (CLI)

Who this is for

ML engineers and researchers who want to audit model outputs and catch cases where the model ‘knows’ the right answer but says something else. This is useful for evaluating model trustworthiness before deployment.

Build steps

Load an open-weight model (e.g. GPT-2 XL, or Mistral-7B if you have the GPU memory for it) via TransformerLens to get hook access to the residual-stream activations.
Build a dataset of 200-400 multiple-choice QA pairs with known ground-truth labels (e.g. an MMLU subset); run the model on each question and collect the residual-stream vector at the last token position for every layer.
Train a lightweight linear probe (logistic regression via scikit-learn) on mid-layer activations to predict the correct answer class, independent of the model’s output token.
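Compare the probe’s prediction with the model’s actual output token; flag mismatches as ‘internal knowledge suppressed’ cases and log the confidence delta (for example, the probe’s probability for its answer minus the model’s probability for the answer it gave).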
Build a rich-powered CLI that accepts a question + answer choices and outputs: model answer, probe-predicted answer, agreement status, and top mismatching layers.
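
The probe-and-compare steps above can be sketched roughly as follows. This is a minimal illustration, not a reference implementation: the helper names (last_token_resid, fit_probe, flag_mismatches) are hypothetical, the "confidence delta" formula is one assumed definition, and the demo runs on synthetic vectors standing in for real cached activations so that it works without downloading a model. The last_token_resid helper shows how a vector could be pulled from a TransformerLens cache, but it is not exercised here.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split


def last_token_resid(model, prompt, layer):
    """Residual-stream vector at the last token of `prompt` for one layer.

    `model` is expected to be a TransformerLens HookedTransformer.
    Not exercised by the synthetic demo below.
    """
    _, cache = model.run_with_cache(model.to_tokens(prompt))
    return cache["resid_post", layer][0, -1].detach().cpu().numpy()


def fit_probe(acts, labels, seed=0):
    """Fit a logistic-regression probe; return it with its held-out accuracy."""
    X_tr, X_te, y_tr, y_te = train_test_split(
        acts, labels, test_size=0.25, random_state=seed, stratify=labels
    )
    probe = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    return probe, probe.score(X_te, y_te)


def flag_mismatches(probe, acts, model_choice, model_conf):
    """List the items where the probe's answer differs from the model's answer."""
    probs = probe.predict_proba(acts)
    probe_choice = probe.classes_[probs.argmax(axis=1)]
    probe_conf = probs.max(axis=1)
    flags = []
    for i in np.flatnonzero(probe_choice != model_choice):
        flags.append({
            "index": int(i),
            "model_answer": int(model_choice[i]),
            "probe_answer": int(probe_choice[i]),
            # Assumed definition: probe confidence minus model confidence.
            "confidence_delta": float(probe_conf[i] - model_conf[i]),
        })
    return flags


# Synthetic stand-in for one layer's activations: 4 answer classes, 32 dims.
rng = np.random.default_rng(0)
n_items, d_model, n_choices = 300, 32, 4
labels = rng.integers(0, n_choices, size=n_items)
class_means = rng.normal(scale=3.0, size=(n_choices, d_model))
acts = class_means[labels] + rng.normal(size=(n_items, d_model))

# Simulated model outputs: correct except for 30 items flipped to a wrong choice.
model_choice = labels.copy()
flipped = rng.choice(n_items, size=30, replace=False)
model_choice[flipped] = (labels[flipped] + 1) % n_choices
model_conf = rng.uniform(0.3, 0.9, size=n_items)  # made-up output probabilities

probe, held_out_acc = fit_probe(acts, labels)
# For brevity the demo flags over all items; a real audit would flag only
# items the probe was not trained on.
flags = flag_mismatches(probe, acts, model_choice, model_conf)
print(f"held-out probe accuracy: {held_out_acc:.2f}, flagged: {len(flags)}")
```

Repeating fit_probe once per layer and comparing held-out accuracies is one way to pick the mid-layer to probe and to populate the CLI's per-layer report.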

Risks

Linear probes may not generalize across question formats: a probe trained on one QA style can fail silently on another, giving false confidence in the detector.
Running even a 7B model locally requires a GPU with 16 GB or more of VRAM; on CPU it will be too slow to be interactive.
TransformerLens hook APIs change between versions and may not support all model architectures out of the box, requiring manual patching.
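
One way to make the first risk visible rather than silent is to score the probe on held-out questions from a second prompt format and warn when accuracy drops. The sketch below is illustrative only: cross_format_gap and the max_gap threshold are hypothetical, and the "second format" is simulated by adding a constant offset to synthetic activations, which is an assumption about how a format change might show up rather than a measured effect.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression


def cross_format_gap(probe, acts_a, y_a, acts_b, y_b, max_gap=0.1):
    """Compare probe accuracy on held-out format-A vs. format-B activations."""
    acc_a = float(probe.score(acts_a, y_a))
    acc_b = float(probe.score(acts_b, y_b))
    gap = acc_a - acc_b
    return {
        "in_format_acc": acc_a,
        "cross_format_acc": acc_b,
        "gap": gap,
        "warn": bool(gap > max_gap),  # hypothetical threshold, tune per project
    }


# Synthetic activations: 4 answer classes in a 32-dimensional space.
rng = np.random.default_rng(1)
d_model, n_choices = 32, 4
class_means = rng.normal(scale=3.0, size=(n_choices, d_model))


def sample(n, offset=0.0):
    """Draw n labelled vectors; `offset` simulates a prompt-format shift."""
    y = rng.integers(0, n_choices, size=n)
    x = class_means[y] + offset + rng.normal(size=(n, d_model))
    return x, y


x_train, y_train = sample(300)                            # format A, training
x_a, y_a = sample(100)                                    # format A, held out
x_b, y_b = sample(100, offset=3.0 * class_means[0])       # simulated format B

probe = LogisticRegression(max_iter=1000).fit(x_train, y_train)
report = cross_format_gap(probe, x_a, y_a, x_b, y_b)
print(report)
```

In a real run, the format-B set would be the same questions rephrased (for example, lettered options versus free-text options), with activations collected the same way as for the training set.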
