AI Pulse

Benchmark Blindspot Detector

A research tool that ingests any LLM benchmark’s task descriptions and flags which tasks have weak verification properties — predicting where benchmark scores are least trustworthy.

Difficulty: 1-week | Stack: Python, FastAPI, Pandas, Claude API, Streamlit, HuggingFace Datasets

Who this is for

AI researchers and practitioners evaluating which benchmarks to trust when selecting models — it operationalizes Verifier’s Law as a meta-evaluation lens so they can distinguish ‘benchmark solved’ from ‘capability genuinely acquired’.

Build steps

  1. Build a HuggingFace Datasets loader that accepts a dataset name and samples up to 200 tasks, extracting the question, reference answer, and any existing scoring rubric.
  2. For each sampled task, call Claude to rate it across the five Verifier’s Law dimensions (objective truth, verification speed, scalability, noise, reward continuity) with chain-of-thought and a structured JSON output.
  3. Aggregate dimension scores per task and compute a dataset-level ‘verifiability profile’ — a radar chart showing where the benchmark’s verification is strongest and weakest.
  4. Add a ‘description-execution gap’ pass that estimates, for each task, how hard it is to describe compared with how hard it is to perform, and flags tasks where that gap is suspiciously small, since those are the easiest to game with surface-level pattern matching.
  5. Build a Streamlit dashboard showing the radar chart, a sortable task table with scores, and a summary paragraph generated by Claude interpreting the benchmark’s overall trustworthiness.
  6. Export a one-page PDF report per benchmark so researchers can attach it as supplementary material in papers.
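
Steps 2 and 3 can be sketched as follows. This is a minimal illustration, not a fixed design: the JSON key names, the 1-5 scale, and the convention that a higher score always means "easier to verify" are all assumptions, since the brief names the five dimensions but does not define a schema. The call to Claude is left out; `parse_rating` takes the raw JSON text of one reply.

```python
import json
from statistics import mean

# Assumed key names for the five dimensions; the brief does not fix a schema.
DIMENSIONS = (
    "objective_truth",
    "verification_speed",
    "scalability",
    "noise",
    "reward_continuity",
)


def parse_rating(raw_reply: str) -> dict[str, float]:
    """Validate one JSON reply and return a score per dimension.

    Assumes a 1-5 scale where higher means "easier to verify"
    (so a high "noise" score means low noise).
    """
    data = json.loads(raw_reply)
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object with one key per dimension")
    scores = {}
    for dim in DIMENSIONS:
        value = data.get(dim)
        # bool is a subclass of int, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, (int, float)):
            raise ValueError(f"{dim}: expected a number, got {value!r}")
        if not 1 <= value <= 5:
            raise ValueError(f"{dim}: score {value} is outside 1-5")
        scores[dim] = float(value)
    return scores


def verifiability_profile(ratings: list[dict[str, float]]) -> dict[str, float]:
    """Average each dimension across tasks; these values feed the radar chart."""
    if not ratings:
        raise ValueError("no ratings to aggregate")
    return {dim: round(mean(r[dim] for r in ratings), 2) for dim in DIMENSIONS}


def weakest_dimension(profile: dict[str, float]) -> str:
    """Return the dimension with the lowest average score."""
    return min(profile, key=profile.get)
```

Rejecting malformed replies at parse time, rather than coercing them, keeps a single bad rating from silently skewing the dataset-level averages.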

Risks

  • Some benchmarks have thousands of tasks — sampling 200 keeps cost manageable but may miss systematic blindspots in tail distributions; make the sample size configurable with a cost estimate shown upfront.
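  • Claude’s ratings of verification difficulty are themselves hard to verify (meta-irony). Include a test-retest reliability check: run a 10-task subset twice and report the score variance between the two runs. Because both runs come from the same model, this measures self-consistency rather than agreement between independent raters.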
  • Benchmark tasks behind paywalls or licenses (e.g., some medical or legal datasets) may be gated or missing on HuggingFace, so loading them can fail. Detect access errors early and surface a clear message rather than silently skipping tasks.
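
Two of the mitigations above come down to simple arithmetic, sketched here under stated assumptions. The function names and parameters are hypothetical. Token counts and per-million-token prices are inputs the caller must supply from current pricing; none are built in. The variance helper uses the sample variance of each task's two scores, averaged per dimension, which is one reasonable reading of "score variance".

```python
from statistics import mean, variance


def estimate_cost_usd(
    n_tasks: int,
    input_tokens_per_task: int,
    output_tokens_per_task: int,
    usd_per_million_input: float,
    usd_per_million_output: float,
) -> float:
    """Rough upfront cost for rating n_tasks, given caller-supplied prices."""
    total_in = n_tasks * input_tokens_per_task
    total_out = n_tasks * output_tokens_per_task
    return (
        total_in * usd_per_million_input + total_out * usd_per_million_output
    ) / 1_000_000


def rerun_variance(
    run_a: list[dict[str, float]], run_b: list[dict[str, float]]
) -> dict[str, float]:
    """Per dimension, the mean over tasks of the two-run sample variance.

    run_a[i] and run_b[i] must be ratings of the same task.
    """
    if not run_a or len(run_a) != len(run_b):
        raise ValueError("both runs must rate the same non-empty task subset")
    return {
        dim: mean(variance([a[dim], b[dim]]) for a, b in zip(run_a, run_b))
        for dim in run_a[0]
    }
```

A value of 0 for a dimension means the two runs agreed on every task; larger values point to dimensions where the ratings are unstable and the profile should be read with more caution.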