Sandboxed Agent Workbench
A local orchestration harness that spins up an isolated Docker container for each agent task and tracks memory, tool calls, and session state across runs: a personal, Devin-lite infrastructure.
Difficulty: 1-month | Stack: Python, Docker SDK, FastAPI, Redis (session memory), LangGraph, PostgreSQL, React (minimal dashboard)
Who this is for
Developers who want to learn cloud agent infrastructure patterns (sandboxed execution, persistent memory, orchestration) without paying for production cloud infrastructure.
Build steps
- Implement a task dispatcher that accepts a natural-language coding task and provisions a fresh Docker container with a cloned repo, pre-installed toolchain, and a time/resource budget
- Build a tool layer inside the container: bash execution, file read/write, and a web-fetch tool, with each call logged to PostgreSQL along with its inputs, outputs, and token cost
- Add a Redis-backed memory layer that persists key facts across sessions (file locations, past decisions, error patterns) and injects relevant context into each new agent turn via semantic search over stored embeddings
- Wire LangGraph to orchestrate multi-step agent loops: plan → act → observe → revise, with a human-in-the-loop checkpoint after every 5 actions that presents a diff for approval before continuing
- Build a minimal React dashboard that shows live container status, a scrollable action log, token spend per task, and a visual diff of repo changes at task completion
- Write an end-to-end test suite that runs three benchmark tasks (add a feature, fix a failing test, refactor a module) and measures success rate, cost, and wall-clock time to establish a personal baseline
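As a rough illustration of the plan → act → observe → revise loop with an approval checkpoint after every 5 actions, here is a minimal plain-Python sketch. It deliberately does not use LangGraph's API; it only shows the control flow that a graph would need to encode. The names `run_loop`, `LoopState`, `plan`, `act`, and `approve` are hypothetical, and the `max_actions` budget is an assumption added for the example.

```python
from dataclasses import dataclass, field
from typing import Callable, List

CHECKPOINT_EVERY = 5  # from the build steps: pause for approval after every 5 actions


@dataclass
class LoopState:
    task: str
    actions: List[str] = field(default_factory=list)       # actions executed so far
    observations: List[str] = field(default_factory=list)  # one observation per action
    done: bool = False              # True when the planner reports nothing left to do
    halted_by_human: bool = False   # True when a checkpoint was rejected


def run_loop(
    task: str,
    plan: Callable[[LoopState], str],      # hypothetical: next action, or "" when finished
    act: Callable[[str], str],             # hypothetical: runs an action, returns an observation
    approve: Callable[[LoopState], bool],  # hypothetical: shows a diff, True means continue
    max_actions: int = 20,                 # assumed hard budget so the loop always terminates
) -> LoopState:
    state = LoopState(task=task)
    while len(state.actions) < max_actions:
        # Plan / revise: the planner sees every prior action and observation.
        action = plan(state)
        if not action:
            state.done = True
            break
        # Act, then record the observation alongside the action.
        observation = act(action)
        state.actions.append(action)
        state.observations.append(observation)
        # Human-in-the-loop checkpoint after every fifth action.
        if len(state.actions) % CHECKPOINT_EVERY == 0 and not approve(state):
            state.halted_by_human = True
            break
    return state


if __name__ == "__main__":
    scripted = iter(["ls", "cat README.md", "pytest -q"])
    final = run_loop(
        "demo task",
        plan=lambda s: next(scripted, ""),
        act=lambda a: f"ran {a}",
        approve=lambda s: True,
    )
    print(final.done, len(final.actions))
```

In a real build, `plan` would call the model, `act` would dispatch to the in-container tool layer, and `approve` would block until the dashboard returns a decision. Keeping the three as injected callables makes the loop testable without a model or a container.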
Risks
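- Docker-in-Docker (which typically requires a privileged container) or mounting the host Docker socket creates real security exposure if the agent generates and executes malicious code, because access to the Docker socket is effectively root access on the host; network egress rules and read-only mounts must be enforced from the start, not retrofitted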
- Long-running agent sessions will exhaust context windows mid-task; implementing mid-session compaction and memory retrieval correctly is significantly harder than it appears and is likely to be the main time sink
- Redis session memory without a strong retrieval strategy will surface irrelevant past context and degrade agent performance; the embedding and chunking strategy needs to be designed upfront, not bolted on later
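To make the first risk concrete, the following sketch builds a locked-down set of keyword arguments for the Docker SDK for Python's `containers.run()`. The function only assembles a dictionary, so it runs without a Docker daemon. The helper name `sandbox_run_kwargs`, the `/workspace` mount point, the default limits, and the `sleep`-based time budget are all assumptions made for illustration, not settings taken from the project description.

```python
from typing import Any, Dict


def sandbox_run_kwargs(
    image: str,
    repo_dir: str,                 # host path of the cloned repo (hypothetical layout)
    mem_limit: str = "1g",         # assumed default memory budget
    cpus: float = 1.0,             # assumed default CPU budget
    timeout_s: int = 3600,         # assumed crude time budget
    read_only_repo: bool = False,
) -> Dict[str, Any]:
    """Build keyword arguments for docker-py's client.containers.run()."""
    return {
        "image": image,
        # The container exits on its own once the time budget elapses.
        "command": f"sleep {timeout_s}",
        "detach": True,
        # No network at all. A web-fetch tool would instead need a dedicated
        # network with egress filtering, which this sketch does not cover.
        "network_mode": "none",
        # Root filesystem is read-only; scratch space lives in a size-capped tmpfs.
        "read_only": True,
        "tmpfs": {"/tmp": "size=256m"},
        # Only the repo is mounted. The host Docker socket is never mounted.
        "volumes": {
            repo_dir: {"bind": "/workspace", "mode": "ro" if read_only_repo else "rw"}
        },
        "working_dir": "/workspace",
        "user": "1000:1000",                      # run as an unprivileged user
        "cap_drop": ["ALL"],                      # drop all Linux capabilities
        "security_opt": ["no-new-privileges"],    # block privilege escalation via setuid
        "mem_limit": mem_limit,
        "nano_cpus": int(cpus * 1_000_000_000),   # docker-py expects units of 1e-9 CPUs
        "pids_limit": 256,                        # cap process count to blunt fork bombs
    }


if __name__ == "__main__":
    for key, value in sandbox_run_kwargs("python:3.12-slim", "/tmp/repo").items():
        print(f"{key}: {value}")
```

With the `docker` package installed and a daemon running, the dictionary could be passed as `docker.from_env().containers.run(**kwargs)`. Building the options in one pure function keeps the security-relevant defaults in a single reviewable place and lets the test suite assert on them without starting a container.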