Probe Format Confounder Benchmark

Minimal benchmark that tests whether a linear probe is detecting reasoning type or just task format by swapping MCQ/open-ended wrappers around identical logic problems.

Difficulty: 1-week | Stack: Python, HuggingFace Transformers, scikit-learn, datasets, Qwen3-1.7B or Phi-3-mini

Who this is for

Interpretability researchers who want to validate (or invalidate) probe-based reasoning claims before publishing; engineers building reasoning monitors for production models.

Build steps

Take 200 logic problems from ARC-Challenge or LogiQA 2.0 and rewrite each in 3 surface formats (standard MCQ, yes/no, and open-ended completion) so that every variant requires the same underlying reasoning.
Run all format variants through a small model; capture last-token hidden states at layers 8, 16, and final.
Train logistic probes to predict ‘reasoning type’ (deductive/inductive/abductive) on one format, then test generalization to the other two formats. Poor transfer is evidence of a format confound.
Add a control: train probes to predict format label (MCQ vs. open-ended); compare probe accuracy and representation similarity (CKA) between reasoning-type probes and format probes.
Generate a report showing per-layer accuracy drop when format shifts, and a CKA matrix showing how much reasoning-type probes and format probes share structure.
Package as a pytest-compatible benchmark so others can swap in their own model checkpoints.
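
The cross-format transfer test from the build steps can be sketched roughly as follows. This is a minimal sketch under stated assumptions, not the benchmark's actual implementation: the function name `cross_format_transfer`, the format keys, and the 25% held-out split are hypothetical, and it assumes the hidden states for one layer have already been extracted into NumPy arrays whose rows line up across formats (row i is the same underlying problem in every format). Scoring only on held-out problems keeps a problem from appearing in training under one wrapper and in testing under another.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler


def cross_format_transfer(feats, labels, train_format, test_frac=0.25, seed=0):
    """Train a probe on one format and score it on held-out problems in every format.

    feats:  dict mapping format name -> array of shape (n_problems, hidden_dim);
            row i must describe the same underlying problem in every format.
    labels: array of shape (n_problems,) holding reasoning-type labels.
    Returns a dict mapping format name -> accuracy on the held-out problems.
    """
    labels = np.asarray(labels)
    n_problems = len(labels)

    # Split by problem index so no problem is seen in training under any format.
    rng = np.random.default_rng(seed)
    order = rng.permutation(n_problems)
    n_test = max(1, int(round(test_frac * n_problems)))
    test_idx, train_idx = order[:n_test], order[n_test:]

    # Standardize with statistics from the training format only, then fit
    # a logistic-regression probe on that format's hidden states.
    probe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
    probe.fit(np.asarray(feats[train_format])[train_idx], labels[train_idx])

    # Score the same probe on the held-out problems in each format.
    return {
        fmt: float(probe.score(np.asarray(x)[test_idx], labels[test_idx]))
        for fmt, x in feats.items()
    }
```

Calling `cross_format_transfer(feats, labels, train_format="mcq")` once per layer returns one accuracy per format; the gap between the training-format entry and the other entries is the per-layer accuracy drop the report is meant to show. The same call could be wrapped in a pytest test that loads features from whichever checkpoint is being evaluated.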

Risks

Rewriting 200 problems across 3 formats without changing difficulty or reasoning type is labor-intensive. Use GPT-4o for drafts, but manually verify 20% of them to catch reasoning-type drift.
Small models may have low baseline reasoning accuracy, making the probe signal too weak. Aim for >65% task accuracy on at least one format to get a meaningful hidden-state signal.
CKA is sensitive to sample size; need at least 150 samples per cell to get stable similarity scores.
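
One rough way to check the sample-size concern is to compute CKA on random subsamples and look at how much the score moves. The sketch below uses the linear variant of CKA, which is an assumption since the brief does not say which kernel is intended; the helper name `cka_subsample_spread` and its defaults are hypothetical. Both inputs are activation matrices with one row per sample, in the same row order.

```python
import numpy as np


def linear_cka(x, y):
    """Linear CKA between two activation matrices with matching rows (samples)."""
    x = np.asarray(x, dtype=np.float64)
    y = np.asarray(y, dtype=np.float64)
    if x.shape[0] != y.shape[0]:
        raise ValueError("x and y must have the same number of rows (samples)")

    # Center each feature so constant offsets do not affect the score.
    x = x - x.mean(axis=0, keepdims=True)
    y = y - y.mean(axis=0, keepdims=True)

    # ||Y^T X||_F^2 / (||X^T X||_F * ||Y^T Y||_F)
    cross = np.linalg.norm(y.T @ x, ord="fro") ** 2
    self_x = np.linalg.norm(x.T @ x, ord="fro")
    self_y = np.linalg.norm(y.T @ y, ord="fro")
    return float(cross / (self_x * self_y))


def cka_subsample_spread(x, y, n_samples, n_repeats=50, seed=0):
    """Mean and standard deviation of linear CKA over random row subsamples."""
    x = np.asarray(x)
    y = np.asarray(y)
    rng = np.random.default_rng(seed)
    scores = []
    for _ in range(n_repeats):
        # Draw the same rows from both matrices so samples stay paired.
        idx = rng.choice(x.shape[0], size=n_samples, replace=False)
        scores.append(linear_cka(x[idx], y[idx]))
    return float(np.mean(scores)), float(np.std(scores))
```

Running `cka_subsample_spread` at a few values of `n_samples` (for example 50, 100, and 150) gives a direct view of whether the scores have settled at the planned cell size for the model and layer in question.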
