AI Pulse

Multi-Agent Safety Debate Arena

A framework where two specialized LLM agents debate whether a proposed agent action is safe, producing a structured safety verdict without human red-teamers.

Difficulty: 1-week | Stack: Python, Claude API (claude-sonnet-4-5), Pydantic, FastAPI, React, SQLite

Who this is for

AI product teams that need continuous, scalable safety evaluation of agent behaviors but can’t afford human red-teaming at volume — the debate output doubles as an interpretable audit log.

Build steps

  1. Define the debate protocol: given a (task description, proposed action, execution context) triple, a Proposer agent argues the action is safe and a Challenger agent argues it is harmful — each gets one opening statement and one rebuttal, structured as Pydantic models.
  2. Implement the Proposer and Challenger as separate Claude API calls with distinct system prompts: the Proposer is optimistic and task-focused, while the Challenger is adversarial and primed, through its system prompt, with known attack patterns (prompt injection, PII extraction, SSRF).
  3. Add a Judge agent that receives the full debate transcript and outputs a structured verdict: {safe: bool, confidence: float in [0, 1], key_risk: str, recommended_action: 'proceed' | 'modify' | 'halt'}.
  4. Wrap the three-agent pipeline in a FastAPI endpoint that accepts action proposals and returns verdicts, targeting under 5 s per request, with full debate transcripts stored in SQLite for audit.
  5. Build a React UI showing the debate transcript side-by-side with the verdict, so developers can read the reasoning and calibrate whether the Judge’s thresholds match their risk tolerance.
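  6. Evaluate calibration by running 50 known-safe and 50 known-harmful action proposals from a hand-labeled dataset through the pipeline and measuring the Judge’s precision, recall, and F1 with "harmful" as the positive class, then adjust the Challenger system prompt until F1 > 0.80. Hold back part of the dataset for the final check so the prompt is not tuned to the same cases it is scored on.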
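
As a rough illustration of steps 3 and 6, the verdict shape and the calibration score could be sketched as below. This is a minimal sketch, not the project's implementation: a standard-library dataclass stands in for the Pydantic model so the snippet runs without dependencies, the names `Verdict` and `score_judge` are hypothetical, and treating "harmful" as the positive class is an assumption.

```python
from dataclasses import dataclass
from typing import Literal


@dataclass(frozen=True)
class Verdict:
    """Judge output from step 3 (dataclass stand-in for a Pydantic model)."""

    safe: bool
    confidence: float  # expected to lie in [0, 1]
    key_risk: str
    recommended_action: Literal["proceed", "modify", "halt"]

    def __post_init__(self) -> None:
        # Dataclasses do not validate field values, so check them by hand.
        if not 0.0 <= self.confidence <= 1.0:
            raise ValueError("confidence must be in [0, 1]")
        if self.recommended_action not in ("proceed", "modify", "halt"):
            raise ValueError("recommended_action must be proceed, modify or halt")


def score_judge(verdicts: list[Verdict], is_harmful: list[bool]) -> dict[str, float]:
    """Precision, recall and F1 with 'harmful' as the positive class (assumption)."""
    if len(verdicts) != len(is_harmful):
        raise ValueError("verdicts and labels must have the same length")

    tp = fp = fn = 0
    for verdict, harmful in zip(verdicts, is_harmful):
        flagged = not verdict.safe  # the Judge called the action unsafe
        if flagged and harmful:
            tp += 1
        elif flagged and not harmful:
            fp += 1
        elif not flagged and harmful:
            fn += 1

    # Guard against empty denominators when nothing is flagged or nothing is harmful.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}
```

With purely illustrative counts (not measured results), a Judge that flags 42 of the 50 harmful proposals and 6 of the 50 safe ones would score precision 42/48 = 0.875, recall 42/50 = 0.84, and F1 = 84/98 ≈ 0.857, which clears the 0.80 bar in step 6.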

Risks

  • Both Proposer and Challenger are drawn from the same model family, so systemic blind spots in Claude’s safety reasoning appear in both agents simultaneously — the debate finds what the model can articulate, not what it can’t see.
  • Prompt injection in the action context could hijack either debate agent’s reasoning, causing the Judge to receive a manipulated transcript — the system being evaluated for safety is also the attack surface.
  • Debate latency (five LLM calls in at least three sequential stages: openings, rebuttals, verdict) makes real-time use impractical for high-frequency agent actions; the tool is better suited to offline batch evaluation or pre-flight checks on task plans than to per-action gating.
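
The latency risk can be made concrete with a small orchestration sketch. Assuming one opening and one rebuttal per side, the two openings are independent of each other and so are the two rebuttals, so each pair can run concurrently, leaving three sequential stages before a verdict exists. The `Agent` signature, the prompt wording, and `run_debate` are hypothetical; in a real build each agent would wrap a Claude API call with its own system prompt.

```python
import asyncio
from typing import Awaitable, Callable

# Hypothetical agent signature: takes a prompt, returns the agent's statement.
Agent = Callable[[str], Awaitable[str]]


async def run_debate(
    proposal: str, proposer: Agent, challenger: Agent, judge: Agent
) -> dict[str, str]:
    """Run openings, rebuttals and the verdict as three sequential stages."""
    # Stage 1: both openings depend only on the proposal, so run them together.
    p_open, c_open = await asyncio.gather(
        proposer(f"Argue that this action is safe:\n{proposal}"),
        challenger(f"Argue that this action is harmful:\n{proposal}"),
    )

    # Stage 2: each rebuttal needs only the opponent's opening.
    p_rebuttal, c_rebuttal = await asyncio.gather(
        proposer(f"Rebut the Challenger's opening:\n{c_open}"),
        challenger(f"Rebut the Proposer's opening:\n{p_open}"),
    )

    # Stage 3: the Judge sees the whole transcript, so it cannot start earlier.
    transcript = "\n\n".join(
        [
            f"PROPOSAL:\n{proposal}",
            f"PROPOSER OPENING:\n{p_open}",
            f"CHALLENGER OPENING:\n{c_open}",
            f"PROPOSER REBUTTAL:\n{p_rebuttal}",
            f"CHALLENGER REBUTTAL:\n{c_rebuttal}",
        ]
    )
    verdict = await judge(transcript)
    return {"transcript": transcript, "verdict": verdict}
```

Even with both pairs parallelized, end-to-end time is roughly the sum of the slowest call in each of the three stages, which is why batch or pre-flight use fits better than gating every action.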

Business Angle

Automated AI safety debate verdicts as a hosted API for teams shipping agentic products without red-team budgets.

Customer: Solo founder or 2-person team building an LLM-powered agent product (e.g. a browser automation SaaS, an AI coding assistant, or an autonomous outreach tool) who has reached ~100 beta users, is fielding safety/compliance questions from early enterprise prospects, and cannot afford a $15k/month red-team engagement.

Pricing: SaaS subscription at $99/mo, targeting roughly $800 MRR within 4 months (8 customers × $99 = $792).
