Local Inference Benchmark Dashboard

A cross-platform CLI + web dashboard that benchmarks LLM inference speed, memory bandwidth, and tokens/sec across Apple Silicon, Grace-Blackwell, and CUDA laptops.

Difficulty: weekend | Stack: Python, llama.cpp (Python bindings), MLX (Apple), FastAPI, SQLite, Chart.js

Who this is for

Developers and IT buyers deciding whether to standardize on Apple Silicon or wait for Grace-Blackwell Windows laptops for local AI workloads. The dashboard gives them reproducible, apples-to-apples numbers.

Build steps

Define a standard benchmark suite: 7B, 13B, and 30B models at 4-bit quantization, measuring time-to-first-token, tokens/sec, and peak RAM usage
Abstract a runner interface that dispatches to llama.cpp (cross-platform), MLX (Apple Silicon), or a CUDA backend depending on the detected hardware
Persist results with a hardware fingerprint (chip name, unified memory size, OS) in SQLite so runs are comparable across machines
Build a FastAPI endpoint that serves results as JSON and a small Chart.js frontend that renders side-by-side bar charts
Add a one-command upload-and-share flow so community members can submit their own hardware results to a public leaderboard CSV
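
A minimal sketch of the measurement and persistence steps above, using only the Python standard library. The names RunResult, measure, fingerprint, and save, the runs table schema, and the injectable clock are illustrative assumptions, not an existing codebase. Any backend that yields tokens as an iterable could be timed this way. Peak RAM, chip name, and unified memory size need platform-specific calls and are left out here.

```python
import json
import platform
import sqlite3
import time
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass
class RunResult:
    model: str
    backend: str
    ttft_s: float        # time-to-first-token, in seconds
    tokens_per_s: float  # decode throughput after the first token
    n_tokens: int


def measure(model: str, backend: str, stream: Iterable[str],
            clock: Callable[[], float] = time.perf_counter) -> RunResult:
    """Time a token stream: first-token latency, then decode rate."""
    start = clock()
    first = last = None
    n = 0
    for _ in stream:
        now = clock()
        if first is None:
            first = now
        last = now
        n += 1
    if first is None:
        raise ValueError("backend produced no tokens")
    decode_time = last - first
    # n - 1 tokens arrive during the window that follows the first token.
    rate = (n - 1) / decode_time if decode_time > 0 else 0.0
    return RunResult(model, backend, first - start, rate, n)


def fingerprint() -> dict:
    """Coarse hardware/OS fingerprint (hypothetical field set)."""
    return {
        "machine": platform.machine(),
        "processor": platform.processor(),
        "os": f"{platform.system()} {platform.release()}",
    }


def save(conn: sqlite3.Connection, result: RunResult, hw: dict) -> None:
    """Append one run, with its fingerprint stored as JSON text."""
    conn.execute(
        """CREATE TABLE IF NOT EXISTS runs (
               id INTEGER PRIMARY KEY,
               model TEXT, backend TEXT,
               ttft_s REAL, tokens_per_s REAL, n_tokens INTEGER,
               hardware TEXT)"""
    )
    conn.execute(
        "INSERT INTO runs (model, backend, ttft_s, tokens_per_s, n_tokens, hardware) "
        "VALUES (?, ?, ?, ?, ?, ?)",
        (result.model, result.backend, result.ttft_s, result.tokens_per_s,
         result.n_tokens, json.dumps(hw, sort_keys=True)),
    )
    conn.commit()
```

In a real run, stream would be the token iterator returned by whichever backend the runner selected, and conn would point at the results database file rather than an in-memory one.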

Risks

Backend detection logic (MLX vs llama.cpp vs CUDA) is fragile: different driver versions and Python environments can silently fall back to slower paths
30B model benchmarks require 20+ GB of RAM; many developer laptops will run out of memory, making the ‘full suite’ promise misleading for most users
Grace-Blackwell hardware is not available yet, so the Windows side of the benchmark is hypothetical, and the dashboard may ship before the hardware it’s designed to evaluate
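
One way to soften the first two risks, sketched under stated assumptions: make any backend fallback loud instead of silent, and skip models that are unlikely to fit before loading them. The function names, the 1.2 overhead factor, and the selection order are hypothetical. The sketch only checks whether the mlx and llama_cpp Python modules are importable; it does not tell a CUDA build of llama.cpp from a CPU-only one.

```python
import importlib.util
import warnings
from typing import Callable, List, Optional


def weight_gb(params_billions: float, bits: int = 4,
              overhead: float = 1.2) -> float:
    """Rough RAM estimate: params * bits / 8 bytes, times an assumed overhead."""
    return params_billions * bits / 8 * overhead


def runnable_models(sizes_b: List[float], available_gb: float) -> List[float]:
    """Keep only the model sizes whose estimated footprint fits in memory."""
    return [s for s in sizes_b if weight_gb(s) <= available_gb]


def pick_backend(system: str, machine: str,
                 has_module: Optional[Callable[[str], bool]] = None) -> str:
    """Choose a backend and warn explicitly whenever a fallback happens."""
    if has_module is None:
        def has_module(name: str) -> bool:
            return importlib.util.find_spec(name) is not None
    if system == "Darwin" and machine == "arm64":
        if has_module("mlx"):
            return "mlx"
        warnings.warn(
            "Apple Silicon detected but MLX is not importable; "
            "falling back to llama.cpp",
            RuntimeWarning,
        )
    if has_module("llama_cpp"):
        return "llama.cpp"
    raise RuntimeError("no inference backend found")
```

Called as pick_backend(platform.system(), platform.machine()), this returns a backend label that can be stored with each result, so a run that fell back to a slower path is visible in the data. With the assumed overhead factor, a 30B model at 4-bit is estimated at about 18 GB before KV cache and runtime buffers, so runnable_models([7, 13, 30], 16) drops the 30B entry on a 16 GB machine.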