Architecture-Aware Model Router

A drop-in OpenAI-compatible proxy that routes each incoming request to the cheapest model that can meet a declared latency SLA, using live throughput telemetry per model backend.

Difficulty: 1-week | Stack: Python, FastAPI, httpx, Redis, Docker Compose, Prometheus, Grafana

Who this is for

Developers running self-hosted open-weight models who want to operationalize Command A+'s production-throughput insight: automatically shift traffic away from architectures that saturate under load, without changing client code.

Build steps

Stand up a FastAPI server that exposes a /v1/chat/completions endpoint and forwards requests to multiple backend model servers (Ollama or vLLM instances serving different model architectures).
Instrument every backend call with a sliding-window p95 latency tracker stored in Redis; expose these metrics on a /metrics Prometheus endpoint.
Implement a routing policy engine with three policies: ‘latency-first’ picks the backend with the lowest current p95, ‘cost-first’ picks the cheapest backend whose p95 is still under a user-supplied latency_budget_ms header, and ‘round-robin’ cycles through backends as a baseline for comparison.
Add a per-backend circuit breaker that temporarily removes a model from the pool when its error rate or p95 exceeds a threshold, directly targeting the production failure mode MiniMax M2 described for linear attention under load.
Wire up a Grafana dashboard (docker-compose included) showing per-architecture request share, latency distribution, and circuit-breaker events over time so users can see routing decisions in action.
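
The routing and circuit-breaker steps above can be sketched roughly as follows. This is a minimal illustration, not the project's implementation. The names Backend, Router, cost_per_1k_tokens, p95_trip_ms and cooldown_s are hypothetical, and an in-memory deque stands in for the Redis-backed window. The breaker here trips on p95 only and has no half-open probe, and the choice to fall back to the fastest backend when nothing meets the budget is an assumption.

```python
import itertools
import math
import time
from collections import deque
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Backend:
    name: str
    cost_per_1k_tokens: float           # hypothetical pricing figure
    window_s: float = 60.0              # sliding-window length in seconds
    samples: deque = field(default_factory=deque)  # (timestamp, latency_ms)
    open_until: float = 0.0             # breaker keeps backend out until this time

    def record(self, latency_ms: float, now: Optional[float] = None) -> None:
        now = time.monotonic() if now is None else now
        self.samples.append((now, latency_ms))

    def p95(self, now: Optional[float] = None) -> float:
        now = time.monotonic() if now is None else now
        # Drop samples that have aged out of the window.
        while self.samples and self.samples[0][0] < now - self.window_s:
            self.samples.popleft()
        if not self.samples:
            return 0.0                  # no data yet: let the backend receive traffic
        ordered = sorted(lat for _, lat in self.samples)
        rank = math.ceil(0.95 * len(ordered))   # nearest-rank percentile
        return ordered[rank - 1]


class Router:
    def __init__(self, backends, p95_trip_ms=5000.0, cooldown_s=30.0):
        self.backends = list(backends)
        self.p95_trip_ms = p95_trip_ms
        self.cooldown_s = cooldown_s
        self._rr = itertools.cycle(range(len(self.backends)))

    def _healthy(self, now):
        pool = []
        for b in self.backends:
            if now < b.open_until:
                continue                          # breaker still open
            if b.p95(now) > self.p95_trip_ms:
                b.open_until = now + self.cooldown_s   # trip the breaker
                continue
            pool.append(b)
        return pool

    def choose(self, policy, latency_budget_ms=None, now=None):
        now = time.monotonic() if now is None else now
        pool = self._healthy(now)
        if not pool:
            raise RuntimeError("no healthy backends")
        fastest = min(pool, key=lambda b: b.p95(now))
        if policy == "latency-first":
            return fastest
        if policy == "cost-first":
            within = [b for b in pool
                      if latency_budget_ms is None or b.p95(now) <= latency_budget_ms]
            # Assumption: if nothing meets the budget, serve from the fastest backend.
            return min(within, key=lambda b: b.cost_per_1k_tokens) if within else fastest
        if policy == "round-robin":
            healthy_names = {b.name for b in pool}
            # Advance the cursor until it lands on a healthy backend.
            for _ in range(len(self.backends)):
                candidate = self.backends[next(self._rr)]
                if candidate.name in healthy_names:
                    return candidate
        raise ValueError(f"unknown policy: {policy}")
```

In the proxy, the policy name and latency_budget_ms would come from request headers, and record() would be called with the measured duration of each forwarded request.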

Risks

Redis becomes a single point of failure for routing decisions. If it goes down, the proxy must degrade gracefully to round-robin rather than crash, which requires careful fallback logic.
Latency measurements from the proxy layer include network overhead between containers, which can dwarf actual model inference differences for small requests and mislead the router.
Different model backends return token-count metadata in inconsistent formats (or not at all), making cost-per-token calculations require backend-specific adapters rather than a generic parser.
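
Two of the risks above lend themselves to a small sketch: falling back to round-robin when the telemetry store is unreachable, and normalizing token counts across backends. The names FallbackRouter, smart_choice and normalize_usage are hypothetical. The field names assume that OpenAI-compatible servers report a usage object with prompt_tokens and completion_tokens, and that Ollama's native API reports prompt_eval_count and eval_count; other backends would need their own branch.

```python
import itertools
from typing import Callable, Optional, Sequence, Tuple, Type


class FallbackRouter:
    """Try the telemetry-driven choice; on failure, degrade to round-robin."""

    def __init__(self, backends: Sequence[str], smart_choice: Callable[[], str],
                 errors: Tuple[Type[BaseException], ...] = (Exception,)):
        self._rr = itertools.cycle(backends)
        self._smart = smart_choice
        # Assumption: in a real proxy this would be narrowed to the Redis
        # client's error types plus timeouts rather than bare Exception.
        self._errors = errors

    def choose(self) -> str:
        try:
            return self._smart()
        except self._errors:
            return next(self._rr)       # telemetry unavailable: plain rotation


def normalize_usage(body: dict) -> Optional[dict]:
    """Map a backend response to a common token-count shape, or None if absent."""
    usage = body.get("usage")
    if isinstance(usage, dict) and "prompt_tokens" in usage:
        # OpenAI-compatible shape.
        return {"prompt_tokens": int(usage["prompt_tokens"]),
                "completion_tokens": int(usage.get("completion_tokens", 0))}
    if "prompt_eval_count" in body or "eval_count" in body:
        # Ollama native shape; a missing count is treated as 0 here.
        return {"prompt_tokens": int(body.get("prompt_eval_count", 0)),
                "completion_tokens": int(body.get("eval_count", 0))}
    return None                          # backend reported no token counts
```

Returning None instead of zeros when no counts are present lets the cost calculation distinguish "unknown" from "free" and skip or estimate accordingly.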