Agentic Task Runner with Hardware-Aware Model Routing
A local-first agent framework that automatically routes sub-tasks to the largest model your current hardware can run without hitting swap, falling back to a cloud API only when no local model is sufficient.
Difficulty: 1-week | Stack: Python, llama-cpp-python, MLX-LM, LangGraph, psutil, pydantic, Click
Who this is for
Developers building long-running agentic workflows who want to minimize cloud API calls for cost, latency, and data-residency reasons. The framework gets the most out of whatever local hardware is present.
Build steps
- Build a hardware profiler that detects available unified memory or VRAM at runtime and maps it to a ranked list of locally runnable model sizes (e.g., 48 GB of unified memory can run a 30B model at Q4 quantization)
- Implement a model manager that lazily loads and unloads quantized GGUF or MLX models, driven by the profiler's output and by current memory pressure as reported by psutil
- Define a task graph structure (LangGraph StateGraph) where each node declares the minimum model capability tier (small/medium/large) it requires
- Write a router that assigns model instances to nodes at graph-compile time, substituting an API call (Anthropic/OpenAI) only when no local model meets the tier requirement
- Add an observability layer that logs, for each node, the model used, token count, latency, and whether it ran locally or via API, and surfaces these as a post-run cost/privacy report
- Package as a pip-installable CLI with a sample multi-step research agent (search → summarize → synthesize) as a demo
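The profiler and router steps above could be sketched roughly as follows. This is a minimal illustration, not a definitive implementation: the model names, the 4.5 bits-per-weight figure, the 2 GB runtime overhead, and the 75% headroom factor are all assumptions chosen for the example. In a real build, `available_gb` would likely come from `psutil.virtual_memory().available`, and the catalog would be measured per model rather than estimated.

```python
from dataclasses import dataclass
from enum import IntEnum


class Tier(IntEnum):
    SMALL = 1
    MEDIUM = 2
    LARGE = 3


@dataclass(frozen=True)
class LocalModel:
    name: str               # hypothetical label, not a real model ID
    params_b: float         # parameter count, in billions
    bits_per_weight: float  # assumed average for the quantization used
    tier: Tier

    def est_gb(self, overhead_gb: float = 2.0) -> float:
        # Weights plus a fixed allowance for KV cache and buffers (assumed).
        return self.params_b * self.bits_per_weight / 8 + overhead_gb


# Illustrative catalog; sizes and tiers are assumptions for this sketch.
CATALOG = [
    LocalModel("small-7b-q4", 7, 4.5, Tier.SMALL),
    LocalModel("medium-13b-q4", 13, 4.5, Tier.MEDIUM),
    LocalModel("large-30b-q4", 30, 4.5, Tier.LARGE),
]


def runnable_models(available_gb, catalog=CATALOG, headroom=0.75):
    """Return models that fit in a fraction of free memory, largest first."""
    budget = available_gb * headroom
    fits = [m for m in catalog if m.est_gb() <= budget]
    return sorted(fits, key=lambda m: m.params_b, reverse=True)


def route(required: Tier, available_gb: float) -> str:
    """Pick the largest runnable model, or 'api' if it misses the tier."""
    fits = runnable_models(available_gb)
    if fits and fits[0].tier >= required:
        return fits[0].name
    return "api"


if __name__ == "__main__":
    for gb in (48, 16, 4):
        print(gb, "GB ->", route(Tier.MEDIUM, gb))
```

Under these assumed numbers, a machine reporting 48 GB free can run all three catalog entries, which matches the 30B Q4 example in the first build step, while a node requiring the large tier on a 16 GB machine is routed to the API.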
Risks
- Memory pressure changes dynamically mid-run (browser opens, another process spikes) — a model loaded at task start may cause OOM partway through a long agent loop
- Model capability ‘tiers’ are subjective; a 7B model may outperform a 13B on specific tasks, making the routing heuristic misleading in practice
- Cross-platform model loading (MLX on Apple vs llama.cpp on Windows/Linux) requires maintaining two separate inference backends with different APIs, significantly increasing maintenance surface
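One possible mitigation for the first risk is to re-check free memory before every step rather than only at task start. The sketch below assumes a hypothetical `place_steps` helper and a 4 GB minimum-free threshold, neither of which comes from the brief. The memory reader is passed in as a callable so the logic can be exercised without real hardware; in practice it might wrap `psutil.virtual_memory().available`.

```python
from typing import Callable, List, Tuple


def place_steps(
    steps: List[str],
    read_available_gb: Callable[[], float],
    min_free_gb: float = 4.0,  # assumed safety threshold
) -> List[Tuple[str, str]]:
    """Decide, step by step, whether to stay local or fall back.

    Memory is sampled before each step, so a spike from another
    process between steps changes the placement of later steps.
    """
    placements = []
    for step in steps:
        if read_available_gb() >= min_free_gb:
            placements.append((step, "local"))
        else:
            # A real manager would unload or downsize the model here.
            placements.append((step, "fallback"))
    return placements


if __name__ == "__main__":
    # Simulated readings: memory dips during the second step.
    readings = iter([10.0, 3.0, 8.0])
    plan = place_steps(
        ["search", "summarize", "synthesize"],
        lambda: next(readings),
    )
    print(plan)
    # With psutil installed, a live reader could look like:
    #   lambda: psutil.virtual_memory().available / 1024**3
```

Sampling only between steps does not protect against a spike in the middle of a single long generation, so this narrows the risk rather than removing it.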