LLM Architecture Throughput Benchmarker

A CLI tool that stress-tests multiple open-weight models under concurrent load and surfaces tokens/sec, latency percentiles, and cost-per-token side by side.

Difficulty: weekend | Stack: Python, asyncio, httpx, ollama, rich, plotext

Who this is for

ML engineers and indie hackers choosing between model architectures for a production feature. The tool lets them see Command A+-style throughput wins (or losses) on their own hardware before committing to a stack.

Build steps

Spin up 2-3 open-weight models locally via Ollama (e.g., GLM-4, Llama-3-8B, Mistral-7B) and record their declared architecture type in a config YAML.
Write an async load-runner with httpx that fires N concurrent chat-completion requests at each model endpoint, collecting wall-clock latency and token counts from streaming responses.
Aggregate into p50/p95/p99 latency and tokens-per-second metrics per model, then render a live Rich table and a plotext bar chart in the terminal.
Add a ‘saturation sweep’ mode that increases concurrency from 1→32 and plots where each architecture starts degrading, surfacing the production cliff MiniMax M2 documented.
Export results as a CSV and a static HTML report for sharing with teammates.
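
The load-runner, aggregation, and saturation-sweep steps above can be sketched roughly as follows. This is a minimal sketch under stated assumptions, not a definitive implementation: `send` is a hypothetical coroutine that issues one chat-completion request and returns the completion-token count (in the real tool it would wrap an httpx streaming call), and the names `Sample`, `run_load`, `summarize`, and `saturation_sweep` are illustrative. Percentiles use the simple nearest-rank method.

```python
import asyncio
import math
import time
from dataclasses import dataclass
from typing import Awaitable, Callable

# Hypothetical request function: takes a prompt, returns completion tokens.
RequestFn = Callable[[str], Awaitable[int]]


@dataclass
class Sample:
    latency_s: float        # wall-clock time for one request
    completion_tokens: int  # tokens generated by that request


async def run_load(send: RequestFn, prompt: str, concurrency: int):
    """Fire `concurrency` requests at once; return (samples, total wall time)."""

    async def one() -> Sample:
        start = time.perf_counter()
        tokens = await send(prompt)
        return Sample(time.perf_counter() - start, tokens)

    wall_start = time.perf_counter()
    samples = await asyncio.gather(*(one() for _ in range(concurrency)))
    return list(samples), time.perf_counter() - wall_start


def percentile(values, pct):
    """Nearest-rank percentile (pct in 0..100) over a non-empty list."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def summarize(samples, wall_s):
    """Reduce one load run to latency percentiles and aggregate throughput."""
    latencies = [s.latency_s for s in samples]
    total_tokens = sum(s.completion_tokens for s in samples)
    return {
        "p50_s": percentile(latencies, 50),
        "p95_s": percentile(latencies, 95),
        "p99_s": percentile(latencies, 99),
        "tokens_per_s": total_tokens / wall_s if wall_s > 0 else 0.0,
    }


async def saturation_sweep(send: RequestFn, prompt: str,
                           levels=(1, 2, 4, 8, 16, 32)):
    """Run the same load at increasing concurrency, one level at a time."""
    results = {}
    for n in levels:
        samples, wall_s = await run_load(send, prompt, n)
        results[n] = summarize(samples, wall_s)
    return results
```

Here throughput is total completion tokens divided by the wall time of the whole batch, so it reflects aggregate rather than per-request speed. Plotting `tokens_per_s` and `p95_s` against the concurrency level is one way to spot where a model starts degrading.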

Risks

Ollama’s single-process serving masks true architectural throughput differences, so you may need llama.cpp server or vLLM for meaningful parallel-request benchmarks.
GPU memory limits on a single dev machine force smaller quantized models, which can change relative rankings versus the full-precision architectures discussed in the blog.
Token counting from streaming chunks is error-prone; off-by-one errors in prompt vs. completion tokens will silently skew tokens/sec numbers.
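
One way to reduce the token-counting risk is to read the counts the server reports instead of counting streamed chunks. The sketch below assumes Ollama's newline-delimited JSON streaming format for `/api/chat`, where the final object has `"done": true` and carries `prompt_eval_count`, `eval_count`, and `eval_duration` (in nanoseconds). The function name `parse_ollama_stream` and the returned keys are illustrative, and any missing field is reported as `None` rather than guessed.

```python
import json
from typing import Iterable


def parse_ollama_stream(lines: Iterable[str]) -> dict:
    """Extract token counts from an NDJSON chat stream.

    Prompt and completion tokens are taken from the final `done` object,
    kept strictly separate, and never inferred from the number of chunks.
    """
    content_chunks = 0
    final = {}
    for line in lines:
        if not line.strip():
            continue  # skip blank keep-alive lines
        obj = json.loads(line)
        if obj.get("done"):
            final = obj
        elif obj.get("message", {}).get("content"):
            content_chunks += 1

    completion = final.get("eval_count")
    eval_ns = final.get("eval_duration")
    # Decode-only throughput: completion tokens over generation time.
    decode_tps = completion / (eval_ns / 1e9) if completion and eval_ns else None
    return {
        "prompt_tokens": final.get("prompt_eval_count"),
        "completion_tokens": completion,
        "content_chunks": content_chunks,  # for sanity checks only
        "decode_tokens_per_s": decode_tps,
    }
```

Comparing `content_chunks` with `completion_tokens` is a cheap sanity check: a large gap suggests that chunks and tokens do not map one to one for that model, which is exactly the case where chunk counting would skew tokens/sec.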