Benchmark Blindspot Detector
A research tool that ingests any LLM benchmark’s task descriptions and flags which tasks have weak verification properties — predicting where benchmark scores are least trustworthy.
Difficulty: 1 week | Stack: Python, FastAPI, Pandas, Claude API, Streamlit, HuggingFace Datasets
Who this is for
AI researchers and practitioners deciding which benchmarks to trust when selecting models. The tool operationalizes Verifier’s Law as a meta-evaluation lens, helping them distinguish ‘benchmark solved’ from ‘capability genuinely acquired’.
Build steps
- Build a HuggingFace Datasets loader that accepts a dataset name and samples up to 200 tasks, extracting the question, reference answer, and any existing scoring rubric.
- For each sampled task, call Claude to rate it on the five Verifier’s Law dimensions (objective truth, verification speed, scalability, noise, reward continuity), asking for chain-of-thought reasoning followed by structured JSON output.
- Aggregate dimension scores per task and compute a dataset-level ‘verifiability profile’ — a radar chart showing where the benchmark’s verification is strongest and weakest.
- Add a ‘description-execution gap’ pass that estimates how hard each task is to describe versus to perform, and flags tasks where the gap is suspiciously small, since those are easy to game with surface-level pattern matching.
- Build a Streamlit dashboard showing the radar chart, a sortable task table with scores, and a summary paragraph generated by Claude interpreting the benchmark’s overall trustworthiness.
- Export a one-page PDF report per benchmark so researchers can attach it as supplementary material in papers.
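The rating and aggregation steps above can be sketched in a few lines. This is a minimal illustration, not a fixed design: it assumes each Claude response is a JSON object with one numeric score per dimension, and the snake_case key names, the 1-5 scale, and the "higher means easier to verify" convention are all hypothetical choices made for the example. It uses only the standard library; a Pandas version would follow the same shape.

```python
import json
from statistics import mean

# The five dimensions named in the build steps. The snake_case keys and the
# 1-5 scale (higher = easier to verify) are assumptions for this sketch.
DIMENSIONS = (
    "objective_truth",
    "verification_speed",
    "scalability",
    "noise",
    "reward_continuity",
)


def parse_rating(raw_json: str) -> dict:
    """Parse one model response and keep only validated dimension scores."""
    data = json.loads(raw_json)
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object with one score per dimension")
    scores = {}
    for dim in DIMENSIONS:
        value = data.get(dim)
        # bool is a subclass of int, so reject it explicitly.
        is_number = isinstance(value, (int, float)) and not isinstance(value, bool)
        if not is_number or not 1 <= value <= 5:
            raise ValueError(f"bad or missing score for {dim!r}: {value!r}")
        scores[dim] = float(value)
    return scores


def verifiability_profile(ratings: list) -> dict:
    """Average each dimension across tasks to get the dataset-level profile."""
    if not ratings:
        raise ValueError("no ratings to aggregate")
    return {dim: mean(r[dim] for r in ratings) for dim in DIMENSIONS}


# Two made-up responses standing in for real model output.
raw_responses = [
    '{"objective_truth": 5, "verification_speed": 4, "scalability": 4, "noise": 3, "reward_continuity": 2}',
    '{"objective_truth": 3, "verification_speed": 4, "scalability": 2, "noise": 3, "reward_continuity": 2}',
]
profile = verifiability_profile([parse_rating(r) for r in raw_responses])
weakest = min(profile, key=profile.get)
print(profile, weakest)
```

The `profile` dictionary holds one value per radar-chart axis, and `weakest` names the dimension a summary paragraph would most likely call out. Rejecting malformed responses early keeps a single bad rating from silently skewing the averages.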
Risks
- Some benchmarks have thousands of tasks. Sampling 200 keeps cost manageable but may miss systematic blindspots in the tail of the task distribution; make the sample size configurable and show a cost estimate upfront.
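- Claude’s ratings of verification difficulty are themselves hard to verify (meta-irony), so include a consistency check: run a 10-task subset twice and report the per-dimension score variance. Because both passes use the same rater, this measures test-retest consistency rather than true inter-rater reliability.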
- Benchmark tasks behind paywalls or restrictive licenses (e.g., some medical or legal datasets) may not be available via HuggingFace at all, or may load only after accepting terms and authenticating. Detect access errors early and surface a clear message rather than silently skipping tasks.
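Two of the mitigations above, the upfront cost estimate and the repeated-run variance check, reduce to simple arithmetic. The sketch below is one possible way to compute them. The function names are hypothetical, and the token counts and per-million-token prices in the example are placeholders, not real pricing: callers should pass in whatever rates apply to their account.

```python
from statistics import mean, pvariance


def estimate_cost_usd(n_tasks, input_tokens_per_task, output_tokens_per_task,
                      usd_per_million_input, usd_per_million_output):
    """Rough upfront cost of rating n_tasks; all rates are caller-supplied."""
    input_cost = n_tasks * input_tokens_per_task / 1_000_000 * usd_per_million_input
    output_cost = n_tasks * output_tokens_per_task / 1_000_000 * usd_per_million_output
    return input_cost + output_cost


def rerun_variance(runs, dimensions):
    """Per dimension: variance of each task's score across runs, averaged over tasks.

    `runs` is a list of runs; each run is a list of score dicts in the same
    task order.
    """
    if len(runs) < 2:
        raise ValueError("need at least two runs to measure variance")
    n_tasks = len(runs[0])
    if n_tasks == 0 or any(len(run) != n_tasks for run in runs):
        raise ValueError("every run must rate the same non-empty set of tasks")
    return {
        dim: mean(pvariance([run[i][dim] for run in runs]) for i in range(n_tasks))
        for dim in dimensions
    }


# Placeholder numbers: 200 tasks, 1500 input and 500 output tokens per task,
# priced at 1.0 and 2.0 USD per million tokens (illustrative only).
cost = estimate_cost_usd(200, 1500, 500, 1.0, 2.0)

# Two made-up passes over the same two tasks.
run_a = [{"noise": 3, "scalability": 4}, {"noise": 4, "scalability": 2}]
run_b = [{"noise": 3, "scalability": 4}, {"noise": 2, "scalability": 2}]
variance = rerun_variance([run_a, run_b], ["noise", "scalability"])
print(f"estimated cost: ${cost:.2f}", variance)
```

A dimension whose averaged variance is well above the others is a signal that its rating prompt is ambiguous, and that scores on that axis should be reported with more caution.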