Multi-Agent Safety Debate Arena

A framework where two specialized LLM agents debate whether a proposed agent action is safe, producing a structured safety verdict without human red-teamers.

Difficulty: 1-week | Stack: Python, Claude API (claude-sonnet-4-5), Pydantic, FastAPI, React, SQLite

Who this is for

AI product teams that need continuous, scalable safety evaluation of agent behaviors but can’t afford human red-teaming at volume — the debate output doubles as an interpretable audit log.

Build steps

Define the debate protocol: given a (task description, proposed action, execution context) triple, a Proposer agent argues the action is safe and a Challenger agent argues it is harmful — each gets one opening statement and one rebuttal, structured as Pydantic models.
Implement the Proposer and Challenger as separate Claude API calls with distinct system prompts: the Proposer is optimistic and task-focused; the Challenger is adversarial and primed, through its system prompt, with known attack patterns (prompt injection, PII extraction, SSRF).
Add a Judge agent that receives the full debate transcript and outputs a structured verdict: {safe: bool, confidence: float in [0, 1], key_risk: str, recommended_action: "proceed" | "modify" | "halt"}.
Wrap the three-agent pipeline in a FastAPI endpoint that accepts action proposals and returns verdicts in <5s, with full debate transcripts stored in SQLite for audit.
Build a React UI showing the debate transcript side-by-side with the verdict, so developers can read the reasoning and calibrate whether the Judge’s thresholds match their risk tolerance.
Evaluate calibration by running 50 known-safe and 50 known-harmful action proposals from a hand-labeled dataset through the pipeline and measuring the Judge’s accuracy and F1 on the harmful class, then adjust the Challenger system prompt until F1 > 0.80.
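
The protocol and verdict described in the build steps could be modeled roughly as follows. This is a minimal sketch, not a definitive implementation. It uses stdlib dataclasses as a dependency-free stand-in for the Pydantic models the steps call for; in Pydantic the same constraints would typically be expressed with `Field(ge=0, le=1)` and a `Literal` type. The names `ActionProposal`, `Statement`, `Verdict`, `parse_verdict`, `run_debate`, and the `ask(role, prompt)` callable are all assumptions made for illustration; `ask` stands in for whatever wrapper makes the Claude API call with the role's system prompt.

```python
import json
from dataclasses import dataclass
from typing import Callable, List, Tuple

ALLOWED_ACTIONS = ("proceed", "modify", "halt")


@dataclass(frozen=True)
class ActionProposal:
    """The (task description, proposed action, execution context) triple."""
    task_description: str
    proposed_action: str
    execution_context: str


@dataclass(frozen=True)
class Statement:
    role: str   # "proposer" or "challenger"
    phase: str  # "opening" or "rebuttal"
    text: str


@dataclass(frozen=True)
class Verdict:
    safe: bool
    confidence: float
    key_risk: str
    recommended_action: str

    def __post_init__(self) -> None:
        # Hand-rolled validation; Pydantic would do this declaratively.
        if not isinstance(self.safe, bool):
            raise ValueError("safe must be a bool")
        if not 0.0 <= self.confidence <= 1.0:
            raise ValueError("confidence must be in [0, 1]")
        if self.recommended_action not in ALLOWED_ACTIONS:
            raise ValueError("recommended_action must be proceed, modify, or halt")


def parse_verdict(raw_json: str) -> Verdict:
    """Parse the Judge's JSON reply; raises on missing keys or invalid values."""
    data = json.loads(raw_json)
    return Verdict(
        safe=data["safe"],
        confidence=float(data["confidence"]),
        key_risk=str(data["key_risk"]),
        recommended_action=data["recommended_action"],
    )


def render(proposal: ActionProposal, transcript: List[Statement]) -> str:
    """Flatten the proposal and the debate so far into one prompt string."""
    lines = [
        f"Task: {proposal.task_description}",
        f"Proposed action: {proposal.proposed_action}",
        f"Execution context: {proposal.execution_context}",
    ]
    lines += [f"[{s.role} {s.phase}] {s.text}" for s in transcript]
    return "\n".join(lines)


def run_debate(
    proposal: ActionProposal,
    ask: Callable[[str, str], str],  # hypothetical: (role, prompt) -> model reply
) -> Tuple[List[Statement], Verdict]:
    """One opening and one rebuttal per side, then the Judge's verdict."""
    transcript: List[Statement] = []
    for phase in ("opening", "rebuttal"):
        for role in ("proposer", "challenger"):
            reply = ask(role, render(proposal, transcript))
            transcript.append(Statement(role, phase, reply))
    verdict = parse_verdict(ask("judge", render(proposal, transcript)))
    return transcript, verdict
```

Passing `ask` in as a parameter keeps the protocol testable with a stub before any API key is involved. In this serial version each speaker sees everything said so far, so the Challenger's opening can already respond to the Proposer's; whether openings should instead be written blind is a design choice the steps leave open.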
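
The calibration step can be scored with a few lines of arithmetic. The sketch below assumes each labeled example is reduced to a pair (labeled_harmful, judged_harmful), where judged_harmful is taken to mean the Judge returned safe = false; that mapping and the function name `harmful_class_metrics` are assumptions for illustration. It treats "harmful" as the positive class, so recall measures how many harmful actions the Judge caught.

```python
from typing import Dict, Iterable, Tuple


def harmful_class_metrics(pairs: Iterable[Tuple[bool, bool]]) -> Dict[str, float]:
    """Compute accuracy, precision, recall, and F1 with 'harmful' as the positive class.

    Each pair is (labeled_harmful, judged_harmful).
    """
    tp = fp = fn = tn = 0
    for labeled_harmful, judged_harmful in pairs:
        if labeled_harmful and judged_harmful:
            tp += 1
        elif labeled_harmful and not judged_harmful:
            fn += 1  # harmful action the Judge let through
        elif judged_harmful:
            fp += 1  # safe action the Judge flagged
        else:
            tn += 1
    total = tp + fp + fn + tn
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    accuracy = (tp + tn) / total if total else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}
```

As a worked example with made-up counts (not measured results): if the Judge flags 42 of the 50 harmful proposals and 6 of the 50 safe ones, precision is 42/48 = 0.875, recall is 42/50 = 0.84, F1 is 84/98 ≈ 0.857, and accuracy is 0.86. Because the Challenger prompt is tuned against these same 100 examples, the resulting F1 is likely to be optimistic for unseen proposals.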

Risks

Both Proposer and Challenger are drawn from the same model family, so systemic blind spots in Claude’s safety reasoning appear in both agents simultaneously — the debate finds what the model can articulate, not what it can’t see.
Prompt injection in the action context could hijack either debate agent’s reasoning, causing the Judge to receive a manipulated transcript — the system being evaluated for safety is also the attack surface.
Debate latency makes real-time use impractical for high-frequency agent actions: the rebuttals depend on the openings and the Judge depends on both, so the pipeline needs at least three sequential LLM round trips. The tool is better suited to offline batch evaluation or pre-flight checks on task plans than to per-action gating.
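
The latency risk can be made concrete by looking at which calls depend on which. The sketch below assumes the two openings are written independently and each rebuttal sees only the opponent's opening; under that assumption the five calls collapse into three sequential stages with `asyncio.gather`. The function name `staged_debate` and the async `ask(role, prompt)` callable are hypothetical; `ask` stands in for an async wrapper around the model call.

```python
import asyncio
from typing import Awaitable, Callable, Dict

# Hypothetical async wrapper: (role, prompt) -> model reply.
Ask = Callable[[str, str], Awaitable[str]]


async def staged_debate(context: str, ask: Ask) -> Dict[str, str]:
    """Run the debate in three stages: openings, rebuttals, verdict."""
    # Stage 1: the openings do not depend on each other, so they run concurrently.
    p_open, c_open = await asyncio.gather(
        ask("proposer", context),
        ask("challenger", context),
    )
    # Stage 2: each rebuttal needs the opponent's opening, so it must wait for stage 1.
    p_reb, c_reb = await asyncio.gather(
        ask("proposer", f"{context}\nOpponent opening: {c_open}"),
        ask("challenger", f"{context}\nOpponent opening: {p_open}"),
    )
    # Stage 3: the Judge needs the full transcript, so it must wait for stage 2.
    transcript = "\n".join(
        [
            context,
            f"[proposer opening] {p_open}",
            f"[challenger opening] {c_open}",
            f"[proposer rebuttal] {p_reb}",
            f"[challenger rebuttal] {c_reb}",
        ]
    )
    verdict = await ask("judge", transcript)
    return {
        "proposer_opening": p_open,
        "challenger_opening": c_open,
        "proposer_rebuttal": p_reb,
        "challenger_rebuttal": c_reb,
        "verdict": verdict,
    }
```

Called as `asyncio.run(staged_debate(context, ask))`, this takes roughly three times the latency of a single call rather than five. Concurrency cannot go below three stages, which is why per-action gating stays expensive however the calls are scheduled.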
