Agent Behavior Pattern Library (ADRA-Bank Clone)

A personal catalogue of recorded agent trajectories—tagged by failure mode—that you can replay, diff, and query to understand why an agent regressed between versions.

Difficulty: 1-week | Stack: Python, FastAPI, SQLite + SQLAlchemy, Pydantic, Next.js (minimal UI), OpenTelemetry-style trace format

Who this is for

Developer teams iterating rapidly on a production agent who need to answer ‘did this prompt change make tool selection worse?’ without re-running a full benchmark suite.

Build steps

  1. Define a canonical trace schema in Pydantic: each trace captures agent_version, task_id, step list (observation → reasoning → action → result), final_outcome, and a free-text failure_tag (e.g., ‘wrong_tool_order’, ‘hallucinated_api_call’, ‘early_stop’).
  2. Write a thin logging decorator that wraps any LangChain/LangGraph or raw SDK agent loop and serialises traces to SQLite automatically.
  3. Build a FastAPI backend with four endpoints: POST /traces (ingest), GET /traces?tag=&version= (filter), GET /traces/{id}/diff/{id2} (step-level diff between two runs), and GET /stats (per-tag counts by version).
  4. Create a minimal Next.js UI with a trace explorer: a filterable list on the left, a step-by-step timeline on the right, and a two-pane diff view when two traces are selected.
  5. Add a CLI command python bank.py regress --from v1 --to v2 --tag wrong_tool_order that prints whether the failure rate for a given tag went up or down between versions.
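The regression check in step 5 comes down to comparing two per-version failure rates. A minimal sketch, assuming a simplified `traces` table with one row per run and a nullable `failure_tag` column; the table layout and the function names `failure_rate` and `regress` are illustrative assumptions, not a fixed design:

```python
import sqlite3

# Hypothetical minimal table; a real schema would also store the step list.
SCHEMA = """
CREATE TABLE IF NOT EXISTS traces (
    id INTEGER PRIMARY KEY,
    agent_version TEXT NOT NULL,
    task_id TEXT NOT NULL,
    failure_tag TEXT          -- NULL means the run had no tagged failure
)
"""


def failure_rate(conn, version, tag):
    """Fraction of a version's traces carrying `tag`, or None if it has no traces."""
    total, failed = conn.execute(
        "SELECT COUNT(*), COALESCE(SUM(failure_tag = ?), 0) "
        "FROM traces WHERE agent_version = ?",
        (tag, version),
    ).fetchone()
    return failed / total if total else None


def regress(conn, old, new, tag):
    """Return a one-line summary of how the rate for `tag` moved from old to new."""
    before = failure_rate(conn, old, tag)
    after = failure_rate(conn, new, tag)
    if before is None or after is None:
        return f"{tag}: not enough traces to compare {old} and {new}"
    if after > before:
        direction = "up"
    elif after < before:
        direction = "down"
    else:
        direction = "unchanged"
    return f"{tag}: {before:.0%} ({old}) -> {after:.0%} ({new}), {direction}"


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    conn.execute(SCHEMA)
    conn.executemany(
        "INSERT INTO traces (agent_version, task_id, failure_tag) VALUES (?, ?, ?)",
        [
            ("v1", "t1", "wrong_tool_order"), ("v1", "t2", None),
            ("v1", "t3", "early_stop"), ("v1", "t4", None),
            ("v2", "t1", "wrong_tool_order"), ("v2", "t2", "wrong_tool_order"),
            ("v2", "t3", None), ("v2", "t4", None),
        ],
    )
    print(regress(conn, "v1", "v2", "wrong_tool_order"))
```

The rate is computed over all traces for a version, so the comparison is only fair when both versions ran the same task set. Wiring `regress` to the `--from`, `--to`, and `--tag` flags can be done with `argparse`.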

Risks

  • Trace schemas ossify quickly—if you hard-code the action format, adding a new tool type later requires a painful migration; use a JSON blob column for the step payload from day one.
  • The diff view is only useful if task IDs are stable across versions; if you don’t fix the random seed or the task sampling, two runs filed under the same task ID may not be the same task at all, and the diff between them is meaningless.
  • SQLite write locks become a bottleneck if you run parallel agent evaluations that all write traces simultaneously; switch to WAL mode (PRAGMA journal_mode=WAL) or queue writes through a single worker.
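Two of the mitigations above, the JSON payload column and WAL mode, fit in a few lines of stdlib `sqlite3`. This is a rough sketch rather than a recommended layout: the `steps` table, the helper names `open_trace_db`, `save_step`, and `load_steps`, and the 30-second lock timeout are all assumptions made for illustration.

```python
import json
import os
import sqlite3
import tempfile


def open_trace_db(path):
    """Open (or create) a trace database and switch it to WAL mode."""
    conn = sqlite3.connect(path, timeout=30)  # wait up to 30 s on a locked database
    # The PRAGMA returns the journal mode now in effect ("wal" on success).
    mode = conn.execute("PRAGMA journal_mode=WAL").fetchone()[0]
    conn.execute(
        "CREATE TABLE IF NOT EXISTS steps ("
        " trace_id TEXT NOT NULL,"
        " step_index INTEGER NOT NULL,"
        " payload TEXT NOT NULL,"  # JSON blob: the action format is not fixed by the schema
        " PRIMARY KEY (trace_id, step_index))"
    )
    return conn, mode


def save_step(conn, trace_id, index, payload):
    """Serialise one step's payload to JSON and store it."""
    with conn:  # commits on success, rolls back on an exception
        conn.execute(
            "INSERT INTO steps (trace_id, step_index, payload) VALUES (?, ?, ?)",
            (trace_id, index, json.dumps(payload)),
        )


def load_steps(conn, trace_id):
    """Return the decoded payloads for a trace, in step order."""
    rows = conn.execute(
        "SELECT payload FROM steps WHERE trace_id = ? ORDER BY step_index",
        (trace_id,),
    )
    return [json.loads(payload) for (payload,) in rows]


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        conn, mode = open_trace_db(os.path.join(tmp, "traces.db"))
        print("journal mode:", mode)
        # Two differently shaped actions share one column, so no migration is needed.
        save_step(conn, "run-1", 0, {"action": {"tool": "search", "args": {"q": "docs"}}})
        save_step(conn, "run-1", 1, {"action": {"kind": "handoff", "target": "planner"}})
        print(load_steps(conn, "run-1"))
        conn.close()
```

WAL mode lets readers proceed while a write is in progress, but SQLite still allows only one writer at a time, so heavily parallel evaluations may still want the single-worker write queue.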

Business Angle

A SaaS tool that lets solo devs and small AI teams record, tag, replay, and diff agent trajectories to catch prompt-regression bugs without re-running expensive benchmarks.

Customer: Indie AI developer or solo founder who has shipped at least one LLM-powered agent to production (e.g. a coding assistant, research bot, or support agent) and is now iterating on prompts/tools weekly — and keeps accidentally breaking behavior they fixed two weeks ago.

Pricing: SaaS subscription. Target: about $800 MRR within 4 months (16 paying users at $49/mo = $784).
