Local Inference Benchmark Dashboard
A cross-platform CLI + web dashboard that benchmarks LLM inference speed, memory bandwidth, and tokens/sec across Apple Silicon, Grace-Blackwell, and CUDA laptops.
Difficulty: weekend | Stack: Python, llama.cpp (Python bindings), MLX (Apple), FastAPI, SQLite, Chart.js
Who this is for
Developers and IT buyers deciding whether to standardize on Apple Silicon or wait for Grace-Blackwell Windows laptops for local AI workloads. The dashboard gives them reproducible, apples-to-apples numbers.
Build steps
- Define a standard benchmark suite: 7B, 13B, and 30B models at 4-bit quantization, measuring time-to-first-token, tokens/sec, and peak RAM usage
- Abstract a runner interface that dispatches to llama.cpp (cross-platform), MLX (Apple Silicon), or a CUDA backend, depending on the detected hardware
- Persist results with hardware fingerprint (chip name, unified memory size, OS) into SQLite so runs are comparable across machines
- Build a FastAPI endpoint that serves results as JSON and a small Chart.js frontend that renders side-by-side bar charts
- Add a one-command upload-and-share flow so community members can submit their own hardware results to a public leaderboard CSV
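A minimal sketch of the runner, timing, and persistence steps above, using only the standard library. Everything named here (`pick_backend`, `measure`, `BenchResult`, the `runs` table and its columns) is hypothetical and chosen for illustration. The sketch assumes the MLX and llama.cpp Python bindings import as `mlx` and `llama_cpp`, and that each backend can be wrapped as an iterable that yields tokens. It does not detect CUDA (in llama.cpp that is a build-time option, not something visible from an import check), and it does not measure peak RAM or unified memory size, which would need platform-specific queries.

```python
import platform
import sqlite3
import time
from dataclasses import dataclass
from importlib.util import find_spec
from typing import Callable, Iterable


def module_available(name: str) -> bool:
    return find_spec(name) is not None


def pick_backend(system: str, machine: str,
                 has_module: Callable[[str], bool] = module_available) -> tuple[str, str]:
    """Return (backend, reason) so a fallback is recorded rather than silent."""
    if system == "Darwin" and machine == "arm64":
        if has_module("mlx"):
            return "mlx", "Apple Silicon with MLX installed"
        if has_module("llama_cpp"):
            return "llama.cpp", "fallback: MLX not importable on Apple Silicon"
    elif has_module("llama_cpp"):
        return "llama.cpp", "default cross-platform backend"
    raise RuntimeError("no supported inference backend is installed")


@dataclass
class BenchResult:
    backend: str
    model: str
    ttft_s: float        # time to first token, in seconds
    tokens_per_s: float  # decode throughput measured after the first token
    n_tokens: int


def measure(backend: str, model: str, stream: Iterable[str],
            clock: Callable[[], float] = time.perf_counter) -> BenchResult:
    """Time any iterable of tokens; the backend only has to yield them."""
    start = clock()
    first = last = None
    n = 0
    for _ in stream:
        last = clock()
        if first is None:
            first = last
        n += 1
    if first is None:
        raise ValueError("backend produced no tokens")
    decode_time = last - first
    tps = (n - 1) / decode_time if decode_time > 0 else 0.0
    return BenchResult(backend, model, first - start, tps, n)


def hardware_fingerprint() -> dict:
    """Coarse fingerprint from the standard library only."""
    return {
        "chip": platform.processor() or platform.machine(),
        "machine": platform.machine(),
        "os": f"{platform.system()} {platform.release()}",
    }


def save(conn: sqlite3.Connection, r: BenchResult, hw: dict) -> None:
    """Append one run, with its hardware fingerprint, to the runs table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS runs ("
        " id INTEGER PRIMARY KEY, backend TEXT, model TEXT,"
        " ttft_s REAL, tokens_per_s REAL, n_tokens INTEGER,"
        " chip TEXT, machine TEXT, os TEXT)"
    )
    conn.execute(
        "INSERT INTO runs (backend, model, ttft_s, tokens_per_s, n_tokens,"
        " chip, machine, os) VALUES (?, ?, ?, ?, ?, ?, ?, ?)",
        (r.backend, r.model, r.ttft_s, r.tokens_per_s, r.n_tokens,
         hw["chip"], hw["machine"], hw["os"]),
    )
    conn.commit()
```

Returning a reason string alongside the backend name means the chosen path can be stored or printed with every run, which makes an unexpected fallback visible in the results instead of showing up only as a slower number.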
Risks
- Backend detection logic (MLX vs. llama.cpp vs. CUDA) is fragile: different driver versions and Python environments can silently fall back to slower paths
- 30B model benchmarks require 20+ GB RAM; many developer laptops will OOM, making the ‘full suite’ promise misleading for most users
- Grace-Blackwell laptops are not yet available, so the Windows side of the benchmark is hypothetical: the dashboard may ship before the hardware it is designed to evaluate
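The memory risk can be made concrete with a rough pre-flight check. The sketch below estimates weight size as parameters × bits ÷ 8, so a 30B model at 4 bits is about 15 GB of weights before the KV cache and runtime overhead. The `overhead` factor of 1.3 is an illustrative assumption, not a measured value, and real 4-bit formats store somewhat more than 4 bits per weight because of scaling data. The function names are hypothetical.

```python
def estimate_weights_gb(params_billion: float, bits_per_weight: float = 4.0) -> float:
    """Weights only: parameters * bits / 8 bytes, expressed in decimal GB."""
    return params_billion * bits_per_weight / 8


def plan_suite(available_gb: float, models=(7, 13, 30), overhead: float = 1.3) -> dict:
    """Mark each model size 'run' or 'skip' given the RAM that is available.

    overhead is an assumed multiplier covering KV cache and runtime buffers.
    """
    plan = {}
    for size in models:
        need_gb = estimate_weights_gb(size) * overhead
        plan[f"{size}B"] = "run" if need_gb <= available_gb else "skip"
    return plan


if __name__ == "__main__":
    for ram in (16, 32):
        print(ram, "GB:", plan_suite(ram))
```

Reporting skipped sizes explicitly, rather than letting the run crash with an out-of-memory error, keeps results from smaller machines comparable on the models they can actually load.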