AI Pulse

Architecture-Aware Model Router

A drop-in OpenAI-compatible proxy that routes each incoming request to the cheapest model that can meet a declared latency SLA, using live throughput telemetry per model backend.

Difficulty: 1-week | Stack: Python, FastAPI, httpx, Redis, Docker Compose, Prometheus, Grafana

Who this is for

Developers running self-hosted open-weight models who want to operationalize Command A+’s production-throughput insight: automatically shift traffic away from architectures that are saturating under load, without changing client code.

Build steps

  1. Stand up a FastAPI server that exposes a /v1/chat/completions endpoint and forwards to multiple backend model servers (Ollama or vLLM instances of different model architectures).
  2. Instrument every backend call with a sliding-window p95 latency tracker stored in Redis; expose these metrics on a /metrics Prometheus endpoint.
  3. Implement a routing policy engine with three policies: ‘latency-first’ picks the backend with the lowest current p95; ‘cost-first’ picks the cheapest backend whose p95 is still under a user-supplied latency_budget_ms header; ‘round-robin’ serves as a baseline for comparison.
  4. Add a per-backend circuit breaker that temporarily removes a model from the pool when its error rate or p95 exceeds a threshold, guarding against the production failure mode MiniMax M2 described for linear attention under load.
  5. Wire up a Grafana dashboard (docker-compose included) showing per-architecture request share, latency distribution, and circuit-breaker events over time so users can see routing decisions in action.
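
The core of steps 2 and 3 can be sketched without any infrastructure. The block below is a minimal illustration, not the project's actual code: the Backend class, its cost_per_1k_tokens field, the 200-sample window size, and the choose_backend function are all hypothetical names, and an in-memory deque stands in for the Redis-backed window. The fallback to latency-first when no backend meets the budget is also an assumption; a real proxy might reject the request instead.

```python
import math
from collections import deque
from dataclasses import dataclass, field
from itertools import count


@dataclass
class Backend:
    """Hypothetical backend record; field names are illustrative only."""
    name: str
    cost_per_1k_tokens: float  # assumed relative cost unit
    # In-memory stand-in for the Redis sliding window (last 200 samples).
    window: deque = field(default_factory=lambda: deque(maxlen=200))

    def record(self, latency_ms: float) -> None:
        # Appending past maxlen silently drops the oldest sample.
        self.window.append(latency_ms)

    def p95(self) -> float:
        # Nearest-rank p95; a backend with no samples is treated as unknown.
        if not self.window:
            return math.inf
        ordered = sorted(self.window)
        rank = (95 * len(ordered) + 99) // 100  # integer ceil(0.95 * n)
        return ordered[rank - 1]


_rr_cursor = count()  # shared cursor for the round-robin baseline


def choose_backend(policy, backends, latency_budget_ms=None):
    """Pick one backend from a non-empty pool according to the policy name."""
    if not backends:
        raise ValueError("no backends available")
    if policy == "round-robin":
        return backends[next(_rr_cursor) % len(backends)]
    if policy == "latency-first":
        return min(backends, key=lambda b: b.p95())
    if policy == "cost-first":
        if latency_budget_ms is None:
            raise ValueError("cost-first needs a latency budget")
        eligible = [b for b in backends if b.p95() <= latency_budget_ms]
        if not eligible:
            # Assumption: degrade to latency-first when nothing fits the budget.
            return min(backends, key=lambda b: b.p95())
        return min(eligible, key=lambda b: b.cost_per_1k_tokens)
    raise ValueError(f"unknown policy: {policy}")
```

In the proxy, the pool passed to choose_backend would already exclude backends whose circuit breaker is open, and latency_budget_ms would be parsed from the request header named in step 3.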

Risks

  • Redis becomes a single point of failure for routing decisions: if it goes down, the proxy must degrade gracefully to round-robin rather than crash, which requires careful fallback logic.
  • Latency measurements from the proxy layer include network overhead between containers, which can dwarf actual model inference differences for small requests and mislead the router.
  • Different model backends return token-count metadata in inconsistent formats (or not at all), so cost-per-token calculations need backend-specific adapters rather than a generic parser.
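
The token-count risk above can be contained with a small normalizing adapter. The sketch below assumes two response shapes: the OpenAI-style usage object (prompt_tokens, completion_tokens) and the prompt_eval_count / eval_count fields that Ollama's native API reports. The function names extract_token_counts and request_cost, and the per-1k-token price parameters, are hypothetical; other backends would need their own branches.

```python
from typing import Optional, Tuple


def extract_token_counts(payload: dict) -> Optional[Tuple[int, int]]:
    """Return (prompt_tokens, completion_tokens), or None if not reported."""
    # Shape 1: OpenAI-compatible responses carry a "usage" object.
    usage = payload.get("usage")
    if isinstance(usage, dict):
        prompt = usage.get("prompt_tokens")
        completion = usage.get("completion_tokens")
        if isinstance(prompt, int) and isinstance(completion, int):
            return prompt, completion
    # Shape 2: Ollama's native API reports top-level eval counters.
    prompt = payload.get("prompt_eval_count")
    completion = payload.get("eval_count")
    if isinstance(prompt, int) and isinstance(completion, int):
        return prompt, completion
    # No usable metadata: the caller decides how to account for this request.
    return None


def request_cost(payload: dict, price_in_per_1k: float,
                 price_out_per_1k: float) -> Optional[float]:
    """Cost of one response under assumed per-1k-token prices."""
    counts = extract_token_counts(payload)
    if counts is None:
        return None
    prompt, completion = counts
    return prompt / 1000 * price_in_per_1k + completion / 1000 * price_out_per_1k
```

Returning None instead of guessing keeps missing metadata visible, so the router can exclude such requests from cost statistics rather than silently treating them as free.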