Sandboxed Agent Workbench

A local orchestration harness that spins up isolated Docker containers per agent task and tracks memory, tool calls, and session state across runs — a personal Devin-lite infrastructure.

Difficulty: 1-month | Stack: Python, Docker SDK, FastAPI, Redis (session memory), LangGraph, PostgreSQL, React (minimal dashboard)

Who this is for

Developers who want to learn cloud agent infrastructure patterns (sandboxed execution, persistent memory, orchestration) without paying for production cloud infrastructure

Build steps

Implement a task dispatcher that accepts a natural-language coding task and provisions a fresh Docker container with a cloned repo, pre-installed toolchain, and a time/resource budget
Build a tool layer inside the container: bash execution, file read/write, and a web-fetch tool — each call logged to PostgreSQL with inputs, outputs, and token cost
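Add a Redis-backed memory layer that persists key facts across sessions (file locations, past decisions, error patterns) and injects relevant context into each new agent turn via semantic search over stored embeddings; this requires a Redis deployment with vector search enabled (for example Redis Stack) or a separate vector index alongside plain Redis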
Wire LangGraph to orchestrate multi-step agent loops: plan → act → observe → revise, with a human-in-the-loop checkpoint after every 5 actions that presents a diff for approval before continuing
Build a minimal React dashboard that shows live container status, a scrollable action log, token spend per task, and a visual diff of repo changes at task completion
Write an end-to-end test suite that runs three benchmark tasks (add a feature, fix a failing test, refactor a module) and measures success rate, cost, and wall-clock time to establish a personal baseline
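
The plan → act → observe → revise loop with an approval checkpoint every 5 actions can be sketched in plain Python before any LangGraph wiring. This is a minimal illustration, not LangGraph's API: `LoopState`, `run_loop`, and the four callbacks are hypothetical names, and in a real build the `approve` callback would present the repo diff to a human rather than return a constant.

```python
from dataclasses import dataclass, field
from typing import Callable, List

CHECKPOINT_EVERY = 5  # actions between human approvals, as in the build step


@dataclass
class LoopState:
    """Hypothetical state carried through the agent loop."""
    task: str
    actions: List[str] = field(default_factory=list)
    observations: List[str] = field(default_factory=list)
    done: bool = False
    halted_by_human: bool = False


def run_loop(
    state: LoopState,
    plan: Callable[[LoopState], str],      # picks the next action; sees past observations, so it also "revises"
    act: Callable[[str], str],             # executes an action in the sandbox, returns its output
    is_done: Callable[[LoopState], bool],  # decides whether the task is complete
    approve: Callable[[LoopState], bool],  # human-in-the-loop gate (would show a diff)
    max_actions: int = 20,                 # hard action budget so the loop always terminates
) -> LoopState:
    while len(state.actions) < max_actions:
        action = plan(state)
        observation = act(action)
        state.actions.append(action)
        state.observations.append(observation)
        if is_done(state):
            state.done = True
            break
        # Pause for approval after every CHECKPOINT_EVERY-th action.
        if len(state.actions) % CHECKPOINT_EVERY == 0 and not approve(state):
            state.halted_by_human = True
            break
    return state


# Usage with stub callbacks: the task "finishes" after 12 actions.
checkpoints: List[int] = []
result = run_loop(
    LoopState(task="fix failing test"),
    plan=lambda s: f"step-{len(s.actions) + 1}",
    act=lambda a: f"ran {a}",
    is_done=lambda s: len(s.actions) >= 12,
    approve=lambda s: checkpoints.append(len(s.actions)) or True,
)
```

Keeping the loop as a plain function with injected callbacks makes the checkpoint cadence testable without a model or a container. LangGraph would replace the `while` loop with graph nodes and edges, but the control flow to reproduce is the same.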

Risks

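Mounting the host Docker socket gives code inside the container effective control of the host's Docker daemon, and Docker-in-Docker typically requires privileged mode, so either approach creates real security exposure if the agent generates and executes malicious code. Network egress rules and read-only mounts must be enforced from the start, not retrofitted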
Long-running agent sessions will exhaust context windows mid-task; implementing mid-session compaction and memory retrieval correctly is significantly harder than it appears and is likely to be the main time sink
Redis session memory without a strong retrieval strategy will surface irrelevant past context and degrade agent performance, so the embedding and chunking strategy needs to be designed upfront, not bolted on later
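
As one way to enforce the sandbox restrictions from the first risk at container creation time, the sketch below builds keyword arguments for the Docker SDK for Python (`docker` package, `containers.run`). The helper names `sandbox_run_kwargs` and `start_sandbox`, the `/workspace` mount point, and every limit value are assumptions chosen for illustration, not recommendations from the project description.

```python
from typing import Any, Dict


def sandbox_run_kwargs(repo_dir: str, mem_limit: str = "1g", cpus: float = 1.0) -> Dict[str, Any]:
    """Build keyword arguments for docker-py's `containers.run` (illustrative values)."""
    return {
        "detach": True,
        "network_mode": "none",             # no network egress at all
        "read_only": True,                  # root filesystem is read-only
        "tmpfs": {"/tmp": "rw,size=256m"},  # scratch space that vanishes with the container
        # Only the cloned repo is writable; the agent must be able to edit it.
        "volumes": {repo_dir: {"bind": "/workspace", "mode": "rw"}},
        "working_dir": "/workspace",
        "mem_limit": mem_limit,             # memory budget
        "nano_cpus": int(cpus * 1_000_000_000),  # CPU budget in units of 1e-9 CPUs
        "pids_limit": 256,                  # caps process count (fork bombs)
        "cap_drop": ["ALL"],                # drop all Linux capabilities
        "security_opt": ["no-new-privileges"],
    }


def start_sandbox(image: str, repo_dir: str):
    """Start a long-lived sandbox container; needs the third-party `docker` package and a running daemon."""
    import docker  # imported lazily so the options above can be inspected without Docker installed

    client = docker.from_env()
    return client.containers.run(image, command="sleep infinity", **sandbox_run_kwargs(repo_dir))
```

With `network_mode` set to `"none"`, a web-fetch tool cannot run inside the container. It would have to run on the host side of the tool layer, or the container would need a dedicated network with an egress allow-list instead. Note that nothing here mounts the host Docker socket into the sandbox.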