AI Pulse

Probe Format Confounder Benchmark

Minimal benchmark that tests whether a linear probe is detecting reasoning type or just task format by swapping MCQ/open-ended wrappers around identical logic problems.

Difficulty: 1-week | Stack: Python, HuggingFace Transformers, scikit-learn, datasets, Qwen3-1.7B or Phi-3-mini

Who this is for

Interpretability researchers who want to validate (or invalidate) probe-based reasoning claims before publishing; engineers building reasoning monitors for production models.

Build steps

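  1. Take 200 logic problems from ARC-Challenge or LogiQA 2.0 and rewrite each in 3 surface formats (standard MCQ, yes/no, and open-ended completion), keeping the underlying reasoning identical across formats. Record a reasoning-type label (deductive/inductive/abductive) for every problem, since step 3 needs one.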
  2. Run all format variants through a small model; capture last-token hidden states at layers 8, 16, and final.
  3. Train logistic probes to predict ‘reasoning type’ (deductive/inductive/abductive) on one format, then test generalization to the other two formats. If accuracy falls sharply on the unseen formats, the probe is likely tracking format rather than reasoning type.
  4. Add a control: train probes to predict the format label (MCQ, yes/no, or open-ended); compare probe accuracy and representation similarity (CKA) between the reasoning-type probes and the format probes.
  5. Generate a report showing per-layer accuracy drop when format shifts, and a CKA matrix showing how much reasoning-type probes and format probes share structure.
  6. Package as a pytest-compatible benchmark so others can swap in their own model checkpoints.
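Step 3 can be sketched as follows. This is a minimal illustration, not a reference implementation: it assumes the hidden states have already been extracted into one matrix per format with rows aligned by problem, and it substitutes synthetic Gaussian features for real activations. The names `make_toy_states`, `transfer_matrix`, and `transfer_gap` are hypothetical helpers introduced here. Every probe is scored on the same held-out problems in all three formats, so in-format and cross-format accuracy are directly comparable.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

FORMATS = ("mcq", "yes_no", "open_ended")


def make_toy_states(labels, shared_signal, seed=0, dim=32):
    """Synthetic stand-in for last-token hidden states (one matrix per format).

    shared_signal=True: every format encodes reasoning type along the same directions.
    shared_signal=False: each format uses its own directions, mimicking a format confound.
    """
    rng = np.random.default_rng(seed)
    n_classes = int(labels.max()) + 1
    shared_means = rng.normal(size=(n_classes, dim))
    states = {}
    for fmt in FORMATS:
        own_means = rng.normal(size=(n_classes, dim))
        means = shared_means if shared_signal else own_means
        noise = rng.normal(size=(len(labels), dim))
        states[fmt] = 3.0 * means[labels] + noise
    return states


def transfer_matrix(states, labels, train_frac=0.7, seed=0):
    """Fit one probe per format, then score it on held-out problems in every format."""
    order = np.random.default_rng(seed).permutation(len(labels))
    split = int(train_frac * len(labels))
    train_idx, test_idx = order[:split], order[split:]
    results = {}
    for train_fmt in FORMATS:
        probe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
        probe.fit(states[train_fmt][train_idx], labels[train_idx])
        for test_fmt in FORMATS:
            acc = probe.score(states[test_fmt][test_idx], labels[test_idx])
            results[(train_fmt, test_fmt)] = float(acc)
    return results


def transfer_gap(results):
    """Mean in-format accuracy minus mean cross-format accuracy."""
    same = [acc for (a, b), acc in results.items() if a == b]
    cross = [acc for (a, b), acc in results.items() if a != b]
    return float(np.mean(same) - np.mean(cross))


labels = np.arange(200) % 3  # 200 problems, 3 balanced reasoning types
for shared in (True, False):
    results = transfer_matrix(make_toy_states(labels, shared), labels)
    print(f"shared_signal={shared}: transfer gap = {transfer_gap(results):.2f}")
```

With real activations, `states` would hold the layer-8, layer-16, or final-layer vectors from step 2, and the same function would be run once per layer. A gap near zero suggests the probe transfers; a large gap is the format-confound signature the benchmark is looking for.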

Risks

  • Rewriting 200 problems across 3 formats without changing difficulty or reasoning type is labor-intensive — use GPT-4o for drafts but manually verify 20% to catch reasoning-type drift.
  • Small models may have low baseline reasoning accuracy, which can leave the probe signal too weak to interpret. Aim for >65% task accuracy on at least one format before trusting the hidden-state signal.
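  • CKA is sensitive to sample size; need at least 150 samples per cell to get stable similarity scores. With 200 problems, that holds if a cell is one format (200 samples each) but not if a cell is one reasoning type within one format (about 67 each when the three types are balanced), so either compute CKA per format or collect more problems.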
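The sample-size risk can be made concrete with a small sketch. It assumes the linear variant of CKA (the brief does not say which variant is intended) and uses independent Gaussian matrices as stand-ins for unrelated activations; `cka_of_unrelated` is a hypothetical helper. Two unrelated representations should ideally score near 0, so any value well above that at small sample counts is estimator bias rather than shared structure.

```python
import numpy as np


def linear_cka(x, y):
    """Linear CKA between two (n_samples, n_features) activation matrices."""
    if x.shape[0] != y.shape[0]:
        raise ValueError("x and y must describe the same samples, row for row")
    x = x - x.mean(axis=0, keepdims=True)  # center each feature
    y = y - y.mean(axis=0, keepdims=True)
    cross = np.linalg.norm(y.T @ x, ord="fro") ** 2
    self_x = np.linalg.norm(x.T @ x, ord="fro")
    self_y = np.linalg.norm(y.T @ y, ord="fro")
    return float(cross / (self_x * self_y))


def cka_of_unrelated(n_samples, dim=64, seed=0):
    """CKA between two independent Gaussian matrices; ideally close to 0."""
    rng = np.random.default_rng(seed)
    a = rng.normal(size=(n_samples, dim))
    b = rng.normal(size=(n_samples, dim))
    return linear_cka(a, b)


# The score for unrelated data shrinks as the sample count grows.
for n in (20, 150, 2000):
    print(n, round(cka_of_unrelated(n), 3))
```

The toy dimension of 64 is far smaller than a real model's hidden size, and the inflation gets worse as the feature dimension grows relative to the sample count. Reporting the score of a shuffled or random baseline alongside each CKA value is one way to keep the matrix in step 5 interpretable.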

Business Angle

SaaS benchmark tool that exposes whether your interpretability probe measures reasoning or just format artifacts

Customer: Academic ML researcher (PhD student or postdoc) running probe-based interpretability experiments on transformer models, preparing a paper for NeurIPS/ICLR, worried a reviewer will ask 'did you control for format?'

Pricing: $49 one-time license; about $800 in the first 3 months (roughly 16 licenses)
