Domain Capability Ceiling Tracker
A dashboard that monitors benchmark progress across AI capability domains and automatically flags when a domain appears to be hitting a data or evaluation bottleneck
Difficulty: 1-week | Stack: Python, FastAPI, SQLite, APScheduler, React, Recharts
Who this is for
AI researchers and journalists who want an empirical, living view of which capability domains are improving quickly and which are stalling. The tracker directly tests the blog post’s claim that improvement is uneven across domains.
Build steps
- Curate a list of 10-15 public AI benchmarks spanning the domain gradient described in the post (MMLU, HumanEval, GSM8K, WMT low-resource tracks, BIG-Bench hard tasks, etc.) and map each to a domain category
- Build a Python scraper/ingestion layer that pulls benchmark leaderboard data from the Papers With Code API or from static HTML snapshots, scheduled weekly with APScheduler
- Store historical scores per model per benchmark in SQLite; compute a ‘velocity’ metric (change in score per month) and a ‘ceiling proximity’ metric (remaining distance to a known human-level baseline or theoretical maximum)
- Build a FastAPI backend exposing domain-level velocity and ceiling-proximity endpoints, and a React frontend with per-domain sparklines and a sortable table highlighting stalling domains in red
- Add an alerting feature: when a domain’s 3-month velocity drops below a threshold, generate a plain-English ‘bottleneck report’ using a local LLM call (e.g., via Ollama) summarizing why progress may have slowed
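The velocity and ceiling-proximity metrics from the steps above could be sketched roughly as follows. This is a minimal sketch under stated assumptions: the `scores` table and its columns, the function names, the choice to measure velocity on the best score per benchmark, the 30.44-day month, and the 0.1 points-per-month stall threshold are all hypothetical and not specified by the plan.

```python
import sqlite3
from datetime import date

# Hypothetical schema: one row per (benchmark, model, date) leaderboard entry.
SCHEMA = """
CREATE TABLE IF NOT EXISTS scores (
    benchmark TEXT NOT NULL,
    model     TEXT NOT NULL,
    reported  TEXT NOT NULL,  -- ISO date string, e.g. '2024-03-01'
    score     REAL NOT NULL
)
"""

DAYS_PER_MONTH = 30.44  # assumed average month length


def best_score_as_of(conn, benchmark, as_of):
    """Best score reported on or before `as_of`, or None if there is no data."""
    row = conn.execute(
        "SELECT MAX(score) FROM scores WHERE benchmark = ? AND reported <= ?",
        (benchmark, as_of.isoformat()),  # ISO dates compare correctly as text
    ).fetchone()
    return row[0]


def velocity(conn, benchmark, start, end):
    """Change in best score per month between `start` and `end`."""
    s0 = best_score_as_of(conn, benchmark, start)
    s1 = best_score_as_of(conn, benchmark, end)
    if s0 is None or s1 is None:
        return None  # not enough history to compute a delta
    months = (end - start).days / DAYS_PER_MONTH
    return (s1 - s0) / months


def ceiling_proximity(best, ceiling):
    """Remaining gap to the reference ceiling, floored at zero."""
    return max(ceiling - best, 0.0)


def is_stalling(v, threshold=0.1):
    """True when a known velocity is below the (assumed) stall threshold."""
    return v is not None and v < threshold
```

A scheduled job could call `velocity` with a three-month window for each benchmark, average the results per domain, and pass that figure to `is_stalling` to decide whether to generate a bottleneck report.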
Risks
- Papers With Code leaderboards are inconsistently structured, and scraping will break as page layouts change. Budget significant time for resilient parsing, or find a stable API endpoint
- Benchmark saturation (scores near 100%) can look like a capability ceiling when it actually reflects benchmark exhaustion rather than a true capability limit. The UI must distinguish these two cases clearly
- Defining ‘domain categories’ is subjective and politically loaded in AI safety circles; document your taxonomy choices explicitly to avoid misleading users
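One way to keep saturation and genuine stalling apart, as the second risk requires, is to label each benchmark before it reaches the UI. The sketch below is only an illustration: the function name `classify_benchmark`, the label strings, the 0.1 points-per-month stall threshold, and the 2-point saturation gap are all assumed values that would need tuning per benchmark.

```python
def classify_benchmark(best, ceiling, velocity_per_month,
                       stall_threshold=0.1, saturation_gap=2.0):
    """Label a benchmark as 'improving', 'saturated', or 'stalled'.

    All thresholds are illustrative assumptions, not values from the plan.
    """
    if velocity_per_month >= stall_threshold:
        return "improving"
    # Slow progress with almost no headroom left: the benchmark is likely
    # exhausted, which says little about the underlying capability.
    if ceiling - best <= saturation_gap:
        return "saturated"
    # Slow progress with plenty of headroom left: a candidate bottleneck.
    return "stalled"


# Example usage with made-up numbers.
for name, best, v in [("bench-a", 98.5, 0.02), ("bench-b", 62.0, 0.02)]:
    print(name, classify_benchmark(best, 100.0, v))
```

With this split, the table could reserve red for "stalled" rows and show "saturated" rows in a neutral color, so that an exhausted benchmark is not mistaken for a capability limit.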