Probe Format Confounder Benchmark
Minimal benchmark that tests whether a linear probe is detecting reasoning type or just task format by swapping MCQ/open-ended wrappers around identical logic problems.
Difficulty: 1-week | Stack: Python, HuggingFace Transformers, scikit-learn, datasets, Qwen3-1.7B or Phi-3-mini
Who this is for
Interpretability researchers who want to validate (or invalidate) probe-based reasoning claims before publishing; engineers building reasoning monitors for production models.
Build steps
- Take 200 logic problems from ARC-Challenge or LogiQA 2.0; rewrite each in 3 surface formats: standard MCQ, yes/no, and open-ended completion — same underlying reasoning required.
- Run all format variants through a small model; capture last-token hidden states at layers 8, 16, and final.
- Train logistic probes to predict ‘reasoning type’ (deductive/inductive/abductive) on one format, then test generalization to the other two formats on held-out problems. Poor transfer relative to in-format accuracy indicates a format confound.
- Add a control: train probes to predict the format label (MCQ vs. yes/no vs. open-ended); compare probe accuracy and representation similarity (CKA) between reasoning-type probes and format probes.
- Generate a report showing per-layer accuracy drop when format shifts, and a CKA matrix showing how much reasoning-type probes and format probes share structure.
- Package as a pytest-compatible benchmark so others can swap in their own model checkpoints.
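The build steps above can be sketched roughly as follows. This is a minimal illustration, not a reference implementation: the format keys, the function names (`last_token_states`, `fit_probe`, `transfer_matrix`, `format_probe_accuracy`), and the 150/50 problem split are assumptions made here, and the activations in the demo are synthetic stand-ins rather than real model output. `last_token_states` assumes a HuggingFace causal LM and tokenizer are passed in and is not called in the demo.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

FORMATS = ("mcq", "yes_no", "open_ended")  # hypothetical format keys


def last_token_states(model, tokenizer, prompts, layer):
    """Return an (n_prompts, hidden_dim) array of last-token states at `layer`.

    Assumes a HuggingFace causal LM. hidden_states[0] is the embedding output,
    so layer=-1 selects the final layer.
    """
    import torch  # lazy import so the probe code below runs without torch

    rows = []
    for prompt in prompts:  # one prompt at a time avoids padding bookkeeping
        enc = tokenizer(prompt, return_tensors="pt").to(model.device)
        with torch.no_grad():
            out = model(**enc, output_hidden_states=True)
        rows.append(out.hidden_states[layer][0, -1].float().cpu().numpy())
    return np.stack(rows)


def fit_probe(X, y):
    """Standardize features, then fit a logistic-regression probe."""
    return make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)).fit(X, y)


def transfer_matrix(feats, labels, train_idx, test_idx):
    """acc[i, j]: probe trained on format i, scored on format j (held-out problems).

    feats maps format -> (n_problems, dim) array, rows aligned across formats.
    """
    acc = np.zeros((len(FORMATS), len(FORMATS)))
    for i, src in enumerate(FORMATS):
        probe = fit_probe(feats[src][train_idx], labels[train_idx])
        for j, dst in enumerate(FORMATS):
            acc[i, j] = probe.score(feats[dst][test_idx], labels[test_idx])
    return acc


def format_probe_accuracy(feats, train_idx, test_idx):
    """Control: how well can a probe predict the format itself?"""
    X_tr = np.vstack([feats[f][train_idx] for f in FORMATS])
    X_te = np.vstack([feats[f][test_idx] for f in FORMATS])
    y_tr = np.repeat(np.arange(len(FORMATS)), len(train_idx))
    y_te = np.repeat(np.arange(len(FORMATS)), len(test_idx))
    return fit_probe(X_tr, y_tr).score(X_te, y_te)


# Synthetic stand-in for real activations (NOT model output).
rng = np.random.default_rng(0)
n_problems, dim = 200, 16
labels = rng.integers(0, 3, size=n_problems)        # 3 reasoning types
class_means = rng.normal(scale=3.0, size=(3, dim))  # shared across formats
feats = {
    fmt: class_means[labels]
    + rng.normal(size=(n_problems, dim))  # per-sample noise
    + rng.normal(size=dim)                # per-format offset
    for fmt in FORMATS
}
perm = rng.permutation(n_problems)
train_idx, test_idx = perm[:150], perm[150:]  # split by problem, not by row

acc = transfer_matrix(feats, labels, train_idx, test_idx)
transfer_gap = np.diag(acc)[:, None] - acc  # gap[i, j]: drop from format i to j
format_acc = format_probe_accuracy(feats, train_idx, test_idx)
```

Splitting by problem rather than by row keeps the same problem from appearing in both the training format and a test format, which would otherwise inflate apparent transfer. In a pytest wrapper, the per-layer `transfer_gap` could be asserted against a threshold chosen by the user.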
Risks
- Rewriting 200 problems across 3 formats without changing difficulty or reasoning type is labor-intensive — use GPT-4o for drafts but manually verify 20% to catch reasoning-type drift.
- Small models may have low baseline reasoning accuracy, making probe signal too weak — need >65% task accuracy on at least one format to get meaningful hidden-state signal.
- CKA is sensitive to sample size; need at least 150 samples per cell to get stable similarity scores.
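One rough way to check the sample-size concern is to compute CKA on repeated subsamples and look at the spread. The sketch below uses linear CKA, which is an assumption since the text does not say which CKA variant is intended. The inputs are taken to be two matrices with one row per sample (activations, or probe outputs over the same samples). The names `linear_cka` and `cka_subsample_spread` are illustrative, and the demo data is synthetic.

```python
import numpy as np


def linear_cka(X, Y):
    """Linear CKA between two matrices with the same number of rows (samples)."""
    X = X - X.mean(axis=0, keepdims=True)  # center each feature
    Y = Y - Y.mean(axis=0, keepdims=True)
    cross = np.linalg.norm(X.T @ Y, "fro") ** 2
    return cross / (np.linalg.norm(X.T @ X, "fro") * np.linalg.norm(Y.T @ Y, "fro"))


def cka_subsample_spread(X, Y, n, repeats=20, seed=0):
    """Mean and standard deviation of linear CKA over random subsamples of size n."""
    rng = np.random.default_rng(seed)
    scores = []
    for _ in range(repeats):
        idx = rng.choice(len(X), size=n, replace=False)
        scores.append(linear_cka(X[idx], Y[idx]))
    return float(np.mean(scores)), float(np.std(scores))


# Synthetic stand-in data (NOT model activations).
rng = np.random.default_rng(0)
acts_a = rng.normal(size=(300, 16))
acts_b = acts_a @ rng.normal(size=(16, 16)) + 0.5 * rng.normal(size=(300, 16))

spread_small = cka_subsample_spread(acts_a, acts_b, n=30)
spread_large = cka_subsample_spread(acts_a, acts_b, n=150)
```

If the mean shifts or the standard deviation stays large as `n` approaches the available cell size, the similarity scores in the report are probably not yet stable.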
Business Angle
A SaaS benchmark tool that shows whether an interpretability probe measures reasoning or just format artifacts.
Customer: Academic ML researcher (PhD student or postdoc) running probe-based interpretability experiments on transformer models, preparing a paper for NeurIPS/ICLR, worried a reviewer will ask 'did you control for format?'
Pricing: $49 one-time license; target of roughly $800 in the first 3 months (about 16 licenses).