AI Pulse

Local Inference Benchmark Dashboard

A cross-platform CLI and web dashboard that benchmarks local LLM inference (time-to-first-token, tokens/sec, and memory bandwidth) across Apple Silicon, Grace-Blackwell, and CUDA laptops.

Difficulty: weekend | Stack: Python, llama.cpp (Python bindings), MLX (Apple), FastAPI, SQLite, Chart.js

Who this is for

Developers and IT buyers deciding whether to standardize on Apple Silicon or wait for Grace-Blackwell Windows laptops for local AI workloads. The dashboard gives them reproducible, apples-to-apples numbers.

Build steps

  1. Define a standard benchmark suite: 7B, 13B, and 30B models at 4-bit quantization, measuring time-to-first-token, tokens/sec, and peak RAM usage
  2. Abstract a runner interface that dispatches to llama.cpp (cross-platform), MLX (Apple Silicon), or a CUDA backend, depending on the detected hardware
  3. Persist each result with a hardware fingerprint (chip name, unified memory size, OS) in SQLite so that runs are comparable across machines
  4. Build a FastAPI endpoint that serves results as JSON and a small Chart.js frontend that renders side-by-side bar charts
  5. Add a one-command upload-and-share flow so community members can submit their own hardware results to a public leaderboard CSV
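
Steps 1 to 3 might be sketched roughly as follows. This is a minimal illustration using only the Python standard library, not a definitive design: the names RunResult, measure, hardware_fingerprint, and save, as well as the runs table layout, are hypothetical. The token stream stands in for whatever a real backend (llama.cpp bindings, MLX) would yield, and memory size is passed in by hand because the standard library has no portable way to read it.

```python
import platform
import sqlite3
import time
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass
class RunResult:
    model: str
    backend: str
    ttft_s: float        # time-to-first-token, in seconds
    tokens_per_s: float  # decode throughput, excluding the first token
    n_tokens: int


def measure(model: str, backend: str, stream: Iterable[str],
            clock: Callable[[], float] = time.perf_counter) -> RunResult:
    """Consume a token stream and time it (hypothetical helper)."""
    start = clock()
    first = None
    n = 0
    for _ in stream:
        n += 1
        if first is None:
            first = clock()  # timestamp of the first token
    end = clock()
    if first is None:
        raise ValueError("backend produced no tokens")
    decode_time = end - first
    # Throughput is computed over the tokens that follow the first one.
    tps = (n - 1) / decode_time if n > 1 and decode_time > 0 else 0.0
    return RunResult(model, backend, first - start, tps, n)


def hardware_fingerprint(memory_gb: float) -> dict:
    """Collect the fingerprint fields named in step 3."""
    return {
        "chip": platform.processor() or platform.machine(),
        "memory_gb": memory_gb,  # supplied by the caller in this sketch
        "os": f"{platform.system()} {platform.release()}",
    }


def save(conn: sqlite3.Connection, r: RunResult, hw: dict) -> None:
    """Append one run to a 'runs' table (assumed schema)."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS runs ("
        "model TEXT, backend TEXT, ttft_s REAL, tokens_per_s REAL, "
        "n_tokens INTEGER, chip TEXT, memory_gb REAL, os TEXT)"
    )
    conn.execute(
        "INSERT INTO runs VALUES (?, ?, ?, ?, ?, ?, ?, ?)",
        (r.model, r.backend, r.ttft_s, r.tokens_per_s, r.n_tokens,
         hw["chip"], hw["memory_gb"], hw["os"]),
    )
    conn.commit()


if __name__ == "__main__":
    fake_stream = iter(["Hello", ",", " world"])  # stand-in for a real backend
    result = measure("7B-q4", "fake", fake_stream)
    conn = sqlite3.connect(":memory:")
    save(conn, result, hardware_fingerprint(memory_gb=16.0))
    print(conn.execute("SELECT model, backend, n_tokens FROM runs").fetchall())
```

Taking the clock as a parameter keeps the timing logic testable without a model loaded. Storing the fingerprint columns alongside each run lets the dashboard group results by machine with a plain SQL query.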

Risks

  • Backend detection logic (MLX vs. llama.cpp vs. CUDA) is fragile: different driver versions and Python environments can silently fall back to slower paths
  • 30B model benchmarks require 20+ GB of RAM; many developer laptops will run out of memory, which makes the ‘full suite’ promise misleading for most users
  • Grace-Blackwell hardware is not yet available, so the Windows side of the benchmark is hypothetical; the dashboard may ship before the hardware it’s designed to evaluate
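
The first two risks could be softened with a pre-flight check along these lines. This is only a sketch under stated assumptions: estimate_gb, runnable_models, and resolve_backend are hypothetical names, and the constants are rough guesses rather than measurements. The 4.5 bits per weight reflects that 4-bit formats also store per-block scales, and the 1.2 overhead factor is an assumed allowance for KV cache and runtime buffers.

```python
import warnings
from typing import List, Sequence

# Assumptions for this sketch, not measured values.
BITS_PER_WEIGHT = 4.5  # 4-bit weights plus per-block scale data
OVERHEAD = 1.2         # assumed headroom for KV cache and runtime buffers


def estimate_gb(params_billion: float) -> float:
    """Rough memory need in GB: parameters * bits / 8, plus overhead."""
    weights_gb = params_billion * BITS_PER_WEIGHT / 8
    return weights_gb * OVERHEAD


def runnable_models(sizes_billion: Sequence[float], free_gb: float) -> List[float]:
    """Keep only the model sizes estimated to fit in the given memory."""
    return [s for s in sizes_billion if estimate_gb(s) <= free_gb]


def resolve_backend(preferred: str, available: Sequence[str],
                    strict: bool = False) -> str:
    """Pick a backend, and make any fallback loud instead of silent."""
    if preferred in available:
        return preferred
    if strict or not available:
        raise RuntimeError(
            f"backend {preferred!r} unavailable; found {list(available)}"
        )
    fallback = available[0]
    warnings.warn(
        f"backend {preferred!r} unavailable, falling back to {fallback!r}; "
        "results will be recorded under the fallback backend",
        RuntimeWarning,
        stacklevel=2,
    )
    return fallback


if __name__ == "__main__":
    print(round(estimate_gb(30), 2))                  # estimate for a 30B model
    print(runnable_models([7, 13, 30], free_gb=16))   # sizes that fit in 16 GB
    print(resolve_backend("mlx", ["mlx", "llama.cpp"]))
```

Under these assumptions a 30B model comes out at roughly 20 GB, in line with the figure above, so the CLI could report which parts of the suite were skipped rather than crash mid-run. Recording the backend that actually ran, not the one requested, keeps a silent fallback from polluting the leaderboard.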